fix: Fix the sync issue between DataMate & LabelStudio; Fix the tag update issue in DataMate #399
Pull request overview
This PR addresses tag-format inconsistencies and synchronization edge-cases between DataMate (DM) and Label Studio (LS), aiming to make tag updates and annotation sync more deterministic and compatible with the values-based internal tag schema.
Changes:
- Normalize/merge DM file tags more robustly during partial updates (handle value→values, type/value key normalization, semantic matching).
- Improve LS↔DM annotation sync orchestration by optionally syncing files first and adding payload-based forcing logic for equal timestamps.
- Align schemas/tests/utilities to use values consistently and improve key/type normalization.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| runtime/datamate-python/app/module/dataset/service/service.py | Enhances tag partial-update merge logic with normalization and semantic matching. |
| runtime/datamate-python/app/module/dataset/schema/dataset_file.py | Enables alias population and normalizes values/type lookup in tag parsing. |
| runtime/datamate-python/app/module/annotation/utils/test_tag_converter.py | Updates tests to assert values instead of value. |
| runtime/datamate-python/app/module/annotation/utils/tag_converter.py | Emits values wrapper and fixes simplified/full format detection. |
| runtime/datamate-python/app/module/annotation/service/sync.py | Adds deterministic LS payload comparison and pre-sync file sync hooks for annotation sync paths. |
| runtime/datamate-python/app/module/annotation/service/ls_annotation_sync.py | Normalizes LS result keys/types and includes to_name in extracted tags. |
| runtime/datamate-python/app/module/annotation/interface/task.py | Threads sync_files_first through API endpoints. |
| runtime/datamate-python/app/module/annotation/interface/project.py | Switches manual sync-db flow to the SyncService orchestrator. |
| runtime/datamate-python/app/module/annotation/interface/auto.py | Switches auto sync-db flow to file-diff sync + bidirectional annotation sync. |

        and ls_annotations
    ):
        latest_ls_result_payload = latest_ls_annotation.get("result", []) if latest_ls_annotation else []
        if not self._ls_result_payload_equal(ls_result, latest_ls_result_payload):

In the equal-timestamp conflict path, ls_result is referenced before it is constructed. The payload comparison calls _ls_result_payload_equal(ls_result, latest_ls_result_payload) before ls_result is assigned later in the function, which will raise UnboundLocalError when timestamps are equal and overwrite_ls is true. Build the DM→LS ls_result (or a normalized equivalent) before this comparison, or compare using the raw dm_tags or a helper that doesn't depend on ls_result being defined yet.

Suggested change:

    - if not self._ls_result_payload_equal(ls_result, latest_ls_result_payload):
    + if "ls_result" in locals() and not self._ls_result_payload_equal(ls_result, latest_ls_result_payload):

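
The first option in the comment above (build the payload before comparing) could look roughly like the sketch below. It is only an illustration under stated assumptions: `build_ls_result`, `payloads_equal`, and `should_force_overwrite` are hypothetical names, the DM→LS field mapping is simplified, and the real method in sync.py carries more context than the excerpt shows. A guard of the form `"ls_result" in locals()` does avoid the exception, but it is false whenever ls_result is still unbound, so the comparison is skipped in that case.

```python
import json
from typing import Any, Dict, List, Optional


def build_ls_result(dm_tags: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical DM -> LS conversion; a stand-in for the real converter."""
    return [
        {
            "id": tag.get("id"),
            "from_name": tag.get("from_name"),
            "to_name": tag.get("to_name"),
            "type": tag.get("type"),
            # Assumption: the DM `values` dict maps onto the LS `value` field.
            "value": tag.get("values", {}),
        }
        for tag in dm_tags
    ]


def payloads_equal(a: List[Dict[str, Any]], b: List[Dict[str, Any]]) -> bool:
    """Compare two result lists regardless of item order or key order."""

    def canonical(items: List[Dict[str, Any]]) -> List[str]:
        # Serialize each item with sorted keys, then sort the serialized items.
        return sorted(json.dumps(item, sort_keys=True, default=str) for item in items)

    return canonical(a) == canonical(b)


def should_force_overwrite(
    dm_tags: List[Dict[str, Any]],
    latest_ls_annotation: Optional[Dict[str, Any]],
) -> bool:
    """Return True when the DM payload differs from the latest LS payload."""
    # Build the DM -> LS payload first, so it is always bound before use.
    ls_result = build_ls_result(dm_tags)
    latest = latest_ls_annotation.get("result", []) if latest_ls_annotation else []
    return not payloads_equal(ls_result, latest)
```

Serializing with sorted keys keeps the comparison deterministic even when the two sides emit keys or items in a different order.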

    def _normalize_tag(raw_tag: Dict[str, Any]) -> Dict[str, Any]:
        normalized = dict(raw_tag)
        if "values" not in normalized and isinstance(normalized.get("value"), dict):
            normalized["values"] = normalized.get("value")
        normalized.pop("value", None)

_normalize_tag normalizes value→values and lower-cases type, but it doesn’t normalize camelCase field names (fromName/toName) into the canonical snake_case (from_name/to_name). Later merge/backfill logic only checks from_name/to_name, which can lead to tags being stored with both key variants (or missing the snake_case keys), making matching and downstream consumers inconsistent. Consider normalizing/remapping fromName→from_name and toName→to_name inside _normalize_tag (and dropping the camelCase keys).
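
A minimal sketch of the remapping suggested above, assuming only the two camelCase keys named in the comment need handling. The function name `normalize_tag`, the `_KEY_ALIASES` map, and the choice to always drop the legacy `value` key are illustrative assumptions, not the PR's actual code.

```python
from typing import Any, Dict

# Assumed alias map: only the two keys mentioned in the review comment.
_KEY_ALIASES = {"fromName": "from_name", "toName": "to_name"}


def normalize_tag(raw_tag: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the tag using `values` and snake_case keys."""
    normalized = dict(raw_tag)

    # Promote a legacy dict-shaped `value` to `values` when `values` is absent.
    if "values" not in normalized and isinstance(normalized.get("value"), dict):
        normalized["values"] = normalized["value"]
    # Assumption: the legacy key is always dropped afterwards.
    normalized.pop("value", None)

    # Lower-case the control type when it is a string.
    if isinstance(normalized.get("type"), str):
        normalized["type"] = normalized["type"].lower()

    # Remap camelCase keys to snake_case and drop the camelCase variants.
    for camel, snake in _KEY_ALIASES.items():
        if camel in normalized:
            camel_value = normalized.pop(camel)
            # Keep an existing snake_case value if one is already present.
            if normalized.get(snake) is None:
                normalized[snake] = camel_value

    return normalized
```

Preferring an existing snake_case value over the camelCase one is a design choice made for this sketch; the opposite precedence would work equally well as long as only one key variant survives.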

    full_tag = {
        "id": tag_id,
        "from_name": from_name,
        "to_name": to_name,
        "type": control_type,
        "values": {
            control_type: values
        }
        },
    }

The module-level docstring still describes the internal tag format as using a value wrapper, but the converter now emits/stores values (and the rest of the codebase expects values). Please update the docstring/examples to match the current internal schema to avoid misleading future readers.
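
For illustration, a docstring that matches the values-based schema might read like the sketch below. The helper name `build_full_tag` and the sample field values are hypothetical; only the key names come from the excerpt above.

```python
from typing import Any, Dict


def build_full_tag(
    tag_id: str,
    from_name: str,
    to_name: str,
    control_type: str,
    values: Any,
) -> Dict[str, Any]:
    """Build an internal tag in the ``values``-based format.

    Example result (sample data, for illustration only)::

        {
            "id": "tag-1",
            "from_name": "sentiment",
            "to_name": "text",
            "type": "choices",
            "values": {"choices": ["positive"]},
        }
    """
    return {
        "id": tag_id,
        "from_name": from_name,
        "to_name": to_name,
        "type": control_type,
        # The payload is wrapped under `values`, keyed by the control type.
        "values": {control_type: values},
    }
```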

    tag = {
        "id": r.get("id"),
        "type": r_type,
        "from_name": from_name,
        "to_name": to_name,
        "values": values,
    }

_extract_tags_from_results now includes to_name in the normalized tag payload, but the docstring above still lists the output fields as id/type/from_name/values only. Update the docstring to include to_name so callers and future maintainers have an accurate contract.

    dm_client = DatasetManagementService(db)
    sync_orchestrator = SyncService(dm_client, ls_client, mapping_service)
    file_sync_result = await sync_orchestrator.sync_files(mapping, batch_size=50)

This endpoint now uses SyncService for manual sync, so LSAnnotationSyncService appears unused in this module (imported at the top but not referenced). Please remove the unused import to avoid lint/type-check failures and keep dependencies clear.

    # 3. Run the file-diff sync first so that the LS project's task set matches the current auto-annotation task's file set
    dm_client = DatasetManagementService(db)
    sync_orchestrator = SyncService(dm_client, ls_client, mapping_service)

LSAnnotationSyncService is imported in this module but is no longer referenced after switching the /sync-db flow to SyncService orchestration. Please remove the unused import to avoid lint/type-check failures.
No description provided.