feat(medcat-trainer): Startup Provisioning feature to initialize from config#358

Merged
alhendrickson merged 10 commits into main from
feat/medcat-trainer/startup-provisioning
Mar 3, 2026

Conversation

@alhendrickson
Collaborator

@alhendrickson alhendrickson commented Mar 2, 2026

What does this do

You can now provide a yaml file with details of projects to create on startup. "Provisioning" appears to be the term for this.

This is a reworking of the existing load_examples.py. Most of it is actually the same, just moved around.

By default it should all behave the exact same way as it does today, pulling files from s3.
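
As a rough illustration of the idea, the sketch below turns an already-parsed config mapping into project specs. All field names (`name`, `dataset_url`, `model_service_url`) and the `load_project_specs` helper are hypothetical and not taken from this PR. The PR itself uses pydantic models and a YAML file, whereas this sketch sticks to the standard library and uses a dict standing in for the parsed YAML.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ProjectSpec:
    """Hypothetical stand-in for one project entry in the provisioning config."""
    name: str
    dataset_url: str
    model_service_url: str


def load_project_specs(config: dict[str, Any]) -> list[ProjectSpec]:
    """Validate a parsed config mapping and return one spec per project."""
    required = {"name", "dataset_url", "model_service_url"}
    specs = []
    for i, entry in enumerate(config.get("projects", [])):
        missing = required - entry.keys()
        if missing:
            raise ValueError(f"projects[{i}] is missing keys: {sorted(missing)}")
        # Unknown keys raise TypeError from the dataclass constructor.
        specs.append(ProjectSpec(**entry))
    return specs


# A dict standing in for the result of parsing the YAML file (assumed layout).
example_config = {
    "projects": [
        {
            "name": "example-project",
            "dataset_url": "https://example.invalid/dataset.csv",
            "model_service_url": "https://example.invalid/medcat-service",
        }
    ]
}

specs = load_project_specs(example_config)
```

Validating the whole file before creating anything means a bad entry fails startup early instead of leaving a half-provisioned instance.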

Why do this?

This allows trainer to be set up declaratively instead of needing manual steps. The primary aim is to use it in Kubernetes: I can have a one-line install that runs trainer and sets up projects already linked to medcat services. The YAML for the projects will be defined in a values.yaml file.

Early days for this though. Next steps may be to also provision users, and to support all the parameters that projects etc. can use. The plan would be to repeat this pattern in other apps/services too.

If things get out of hand, making an "operator" is probably the end goal here: you create the YAML and it then magically creates the projects.

Collaborator

@mart-r mart-r left a comment

Nothing wrong with it as far as I can tell.

But traditionally, projects have consisted of datasets and model packs (and now links to services that do the same thing). And the latter two have been independent of the projects. I.e. normally a trainer instance has a small number of model packs and a large number of datasets. And then each project uses a model pack (or now a URL) + a dataset.
Maybe there's a reason to tie them together for provisioning, but it just seems to go counter to my normal understanding of how the trainer is used.

EDIT:
For reference, normally "we" provide a set of model packs (or back in the day, CDBs), a clinician separately provides the dataset(s). And then the clinician (or someone else on their instruction) creates the project(s) to be annotated.
EDIT2:
But I'm not working with trainer super tightly so it's possible my understanding of these workflows is flawed.

)


class ProvisioningProjectSpec(BaseModel):
Collaborator

In principle, there is no real reason to tie these things together as far as I can tell.
In fact, normally you would upload one model pack and link it to multiple annotation projects (effectively datasets). And you can also (though I don't know if this is common) have multiple projects use the same dataset(s).
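
One way to picture the decoupling described here: model packs and datasets are declared once, and each project refers to them by name. This is a sketch only. The `ProvisioningSpec` layout, the field names, the `s3://example-bucket/...` locations, and `resolve_projects` are assumptions for illustration, not the schema in this PR.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProjectRef:
    """A project that points at a shared model pack and dataset by name."""
    name: str
    model_pack: str
    dataset: str


@dataclass
class ProvisioningSpec:
    """Hypothetical top-level spec: resources are declared independently of projects."""
    model_packs: dict[str, str] = field(default_factory=dict)  # name -> location
    datasets: dict[str, str] = field(default_factory=dict)     # name -> location
    projects: list[ProjectRef] = field(default_factory=list)


def resolve_projects(spec: ProvisioningSpec) -> list[tuple[str, str, str]]:
    """Return (project name, model pack location, dataset location) per project.

    Raises KeyError if a project names a resource that was never declared.
    """
    resolved = []
    for project in spec.projects:
        if project.model_pack not in spec.model_packs:
            raise KeyError(f"{project.name}: unknown model pack {project.model_pack!r}")
        if project.dataset not in spec.datasets:
            raise KeyError(f"{project.name}: unknown dataset {project.dataset!r}")
        resolved.append(
            (project.name, spec.model_packs[project.model_pack], spec.datasets[project.dataset])
        )
    return resolved


# One model pack shared by two projects, each with its own dataset.
spec = ProvisioningSpec(
    model_packs={"pack-a": "s3://example-bucket/pack-a.zip"},
    datasets={
        "notes-1": "s3://example-bucket/notes-1.csv",
        "notes-2": "s3://example-bucket/notes-2.csv",
    },
    projects=[
        ProjectRef(name="project-1", model_pack="pack-a", dataset="notes-1"),
        ProjectRef(name="project-2", model_pack="pack-a", dataset="notes-2"),
    ],
)
resolved = resolve_projects(spec)
```

With this shape, the model pack is uploaded once and reused, which matches the "small number of model packs, large number of datasets" usage.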

Collaborator Author

That makes a lot of sense! I think this will be good to have to make it much more useful. https://app.clickup.com/t/869cba2h9

class ProjectSpec(BaseModel):
    """Project to create via project-annotate-entities/."""

    model_config = _common_config
Collaborator

I think this will work because pydantic does some magic and makes different instances of the common config.
However, at first glance it would seem that this ties all the models to the same config instance (which would suggest that changing one config changes them all).

Just a note, really.
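
The aliasing concern can be shown with plain Python, independent of whatever pydantic does internally. In the sketch below the classes are ordinary classes rather than `BaseModel` subclasses, the config key is chosen only for illustration, and `make_common_config` is a hypothetical helper. It shows how a small factory could avoid repeating the settings while still giving each class its own dict.

```python
# A single module-level dict, as in the snippet under review.
_common_config = {"populate_by_name": True}


class SharedA:
    model_config = _common_config  # same dict object as SharedB.model_config


class SharedB:
    model_config = _common_config


def make_common_config() -> dict:
    """Hypothetical factory: returns a fresh dict on every call."""
    return {"populate_by_name": True}


class SeparateA:
    model_config = make_common_config()


class SeparateB:
    model_config = make_common_config()


# For plain classes, a mutation through one class is visible through the other...
SharedA.model_config["extra"] = "forbid"
shared_leaks = "extra" in SharedB.model_config

# ...whereas factory-built configs stay independent.
SeparateA.model_config["extra"] = "forbid"
separate_leaks = "extra" in SeparateB.model_config
```

Whether the shared dict actually leaks between pydantic models depends on how pydantic processes `model_config` at class creation, so the factory is a defensive choice rather than a confirmed fix.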

Collaborator Author

Will fix this up shortly. The aim was essentially just to not repeat the to_camel part. https://app.clickup.com/t/869cba2m6

@alhendrickson alhendrickson merged commit be9825f into main Mar 3, 2026
10 checks passed
@alhendrickson alhendrickson deleted the feat/medcat-trainer/startup-provisioning branch March 3, 2026 13:42