feat(medcat-trainer): Startup Provisioning feature to initialize from config #358
… config - refactor load_examples & test
… config - add docs
… config - add docs
…m config - fix test
…m config - fix test
…m config - fix run.sh
…m config - fix test
Nothing wrong with it as far as I can tell.
But traditionally, projects have consisted of datasets and model packs (and now links to services that do the same thing). And the latter two have been independent of the projects. I.e., normally a trainer instance has a small number of model packs and a large number of datasets. And then each project uses a model pack (or now a URL) + a dataset.
Maybe there's a reason to tie them together for provisioning, but it just seems to go counter to my normal understanding of how the trainer is used.
EDIT:
For reference, normally "we" provide a set of model packs (or back in the day, CDBs), a clinician separately provides the dataset(s). And then the clinician (or someone else on their instruction) creates the project(s) to be annotated.
EDIT2:
But I'm not working with trainer super tightly so it's possible my understanding of these workflows is flawed.
)


class ProvisioningProjectSpec(BaseModel):
In principle, there is no real reason to tie these things together as far as I can tell.
In fact, normally you would upload one model pack and link it to multiple annotation projects (effectively datasets). And you can also (though I don't know if this is common) have multiple projects use the same dataset(s).
That makes a lot of sense! I think this will be good to have to make it much more useful. https://app.clickup.com/t/869cba2h9
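One way the decoupling discussed above could look, as a minimal sketch only: model packs, datasets, and projects are declared separately, and each project refers to a model pack and a dataset by name. All class and field names here (`ModelPackSpec`, `DatasetSpec`, `ProjectRef`, `ProvisioningSpec`, `source`) are hypothetical and are not the PR's actual schema. It uses plain dataclasses to stay dependency-free.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelPackSpec:  # hypothetical name
    name: str
    source: str  # assumption: a file path or URL to the model pack


@dataclass(frozen=True)
class DatasetSpec:  # hypothetical name
    name: str
    source: str  # assumption: a file path or URL to the dataset


@dataclass(frozen=True)
class ProjectRef:  # hypothetical name
    name: str
    model_pack: str  # must match a ModelPackSpec.name
    dataset: str     # must match a DatasetSpec.name


@dataclass
class ProvisioningSpec:  # hypothetical top-level config
    model_packs: list[ModelPackSpec] = field(default_factory=list)
    datasets: list[DatasetSpec] = field(default_factory=list)
    projects: list[ProjectRef] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if a project references an undeclared name."""
        pack_names = {pack.name for pack in self.model_packs}
        dataset_names = {dataset.name for dataset in self.datasets}
        for project in self.projects:
            if project.model_pack not in pack_names:
                raise ValueError(f"{project.name}: unknown model pack {project.model_pack!r}")
            if project.dataset not in dataset_names:
                raise ValueError(f"{project.name}: unknown dataset {project.dataset!r}")


# One model pack shared by two projects, each with its own dataset.
spec = ProvisioningSpec(
    model_packs=[ModelPackSpec("base-pack", "packs/base.zip")],
    datasets=[
        DatasetSpec("cardio", "data/cardio.csv"),
        DatasetSpec("neuro", "data/neuro.csv"),
    ],
    projects=[
        ProjectRef("cardio-annotation", model_pack="base-pack", dataset="cardio"),
        ProjectRef("neuro-annotation", model_pack="base-pack", dataset="neuro"),
    ],
)
spec.validate()
```

Referencing by name keeps the "few model packs, many datasets" pattern intact: a model pack is uploaded once and reused, and a dangling reference fails at startup rather than midway through provisioning.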
class ProjectSpec(BaseModel):
    """Project to create via project-annotate-entities/."""

    model_config = _common_config
I think this will work because pydantic does some magic and makes different instances of the common config.
However, from an initial view it would seem that this ties all the models to the same config instance (which would suggest changing one config changes them all).
Just a note, really.
Will fix this up shortly. The aim was essentially just to not repeat the to_camel part. https://app.clickup.com/t/869cba2m6
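The concern above can be checked with a small script. This is a sketch assuming pydantic v2, where to my understanding the model metaclass builds a fresh config dict for every class. The `DatasetSpec` model and both field names are hypothetical, used only to have two models sharing `_common_config`.

```python
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

# One shared config so the to_camel alias generator is declared only once.
_common_config = ConfigDict(alias_generator=to_camel)


class DatasetSpec(BaseModel):  # hypothetical model, for illustration only
    model_config = _common_config
    dataset_name: str  # hypothetical field


class ProjectSpec(BaseModel):
    model_config = _common_config
    project_name: str  # hypothetical field


# Does either model hold the shared object itself, or do they share one dict?
shares_object = ProjectSpec.model_config is _common_config
shares_between_models = ProjectSpec.model_config is DatasetSpec.model_config

# Mutate the shared dict after class creation and see whether it leaks.
_common_config["extra"] = "forbid"
leaked = "extra" in ProjectSpec.model_config

# The alias generator should still apply, so camelCase keys validate.
project = ProjectSpec.model_validate({"projectName": "demo"})
print(shares_object, shares_between_models, leaked, project.project_name)
```

If relying on that copying behaviour feels too implicit, a shared base class that declares `model_config` once is an alternative that also avoids repeating `to_camel`.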
What does this do
You can now provide a yaml file with details of projects to create on startup. "Provisioning" appears to be the term for this.
This is a reworking of the existing load_examples.py. Most of it is actually the same, just moved around.
By default it should all behave exactly the same way as it does today, pulling files from S3.
Why do this?
This allows trainer to be set up declaratively instead of needing manual steps. The primary aim for this is using it in Kubernetes: I can have a one-line install that runs trainer and sets up projects already linked to medcat services. The yaml for the projects will be defined in a values.yaml file.
Early days for this though. Next steps may be to also provision users, and to support all the parameters that projects etc. can use. The plan would be to repeat this pattern in other apps/services too.
If things get out of hand, making an "operator" is probably the end goal here - create yaml & it then magically would create projects.