Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds a new include_clique_leaders parameter to the /get_normalized_nodes endpoints (both GET and POST) to support deconflation use cases. When enabled, the API returns detailed information about individual clique leaders for conflated identifiers, helping users understand which cliques are being combined during gene/protein and drug/chemical conflation.
Changes:
- Added
include_clique_leadersboolean parameter to normalization endpoints - Modified normalization logic to collect and output clique leader information when requested
- Updated several variable names for clarity (e.g.,
types→types_with_ancestors) - Added LLM-generated docstrings to the
get_eqids_and_typesfunction
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| node_normalizer/server.py | Added include_clique_leaders query parameter to GET endpoint and passed it through to normalization logic |
| node_normalizer/model/input.py | Added include_clique_leaders field to CurieList input model for POST endpoint |
| node_normalizer/set_id.py | Updated call to get_normalized_nodes() to explicitly pass include_clique_leaders=False |
| node_normalizer/normalizer.py | Core implementation: collects clique leaders when conflation is enabled, generates clique leader output with metadata (identifier, conflation type, label, description, taxa, type), and includes it in response |
Comments suppressed due to low confidence (1)
node_normalizer/normalizer.py:558
- The docstring for this function is incomplete and doesn't describe the parameters, including the new
include_clique_leadersparameter. Given that the codebase uses docstring conventions (as seen inget_eqids_and_typesand other functions), this function's docstring should be updated to document all parameters and their purposes, particularly the new optional parameters that control output formatting.
"""
Get value(s) for key(s) using redis MGET
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if clique_leaders: | ||
| for conflation_type in clique_leaders: | ||
| if canonical_id in clique_leaders[conflation_type] and eqid["i"] in clique_leaders[conflation_type][canonical_id]: | ||
| clique_leader_output = { | ||
| "identifier": eqid["i"], | ||
| "conflation": conflation_type, | ||
| } | ||
| if "label" in eq_item: | ||
| clique_leader_output["label"] = eq_item["label"] | ||
|
|
||
| # For description, taxa and type, we could read them from eq_item, but that | ||
| # is only set if the appropriate flag was turned on. For completeness, let's | ||
| # try picking them up if they've been passed to us at all. | ||
| if "d" in eqid and len(eqid["d"]) > 0: | ||
| clique_leader_output["description"] = eqid["d"] | ||
| if "t" in eqid and eqid["t"]: | ||
| clique_leader_output["taxa"] = eqid["t"] | ||
| if 'types' in eqid: | ||
| clique_leader_output["type"] = eqid['types'][-1] | ||
| clique_leaders_output.append(clique_leader_output) |
There was a problem hiding this comment.
The loop structure here could be optimized. Currently, for every equivalent identifier, the code checks all conflation types to see if it's a clique leader. This could be improved by pre-computing a set of clique leaders for faster lookup, especially since the print statement on line 861 will execute for every single equivalent identifier in the response, which could be hundreds or thousands of times for large queries. Consider moving the clique leader check logic outside the main loop or optimizing it with a set-based lookup.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR closes #320 by adding an
include_clique_leadersoption on normalization. This may be a way to fix #340. Also renames some variables and adds some LLM-generated function documentation.This PR adds a new flag
include_clique_leadersthat can be set on both GET and POST/get_normalized_nodesendpoints. Activating this endpoint adds aclique_leaderskey to each normalized identifier that includes a list of all the clique leaders in this clique, along with their name, type and taxa, and (if the appropriate flags are turned on) descriptions. This doesn't currently include all the identifiers in each clique, but that should be added without too much extra bother.WIP
clique_leadersreally the best thing to call this thing? These are all clique leader identifiers, but maybeconflation_leadersor something else would be better?Example
Example output for
NCBIGene:1756is included below. Note that the conflation type is included (e.g."conflation": "GeneProtein") and thatclique_leadersis a list of the cliques leaders ordered in their position in the normalization.