Add clique leaders option #350

Draft
gaurav wants to merge 37 commits into master from add-clique-leaders-option


@gaurav
Collaborator

@gaurav gaurav commented Dec 15, 2025

This PR closes #320 by adding an include_clique_leaders option to normalization, which may also be a way to fix #340. It also renames some variables and adds some LLM-generated function documentation.

This PR adds a new flag, include_clique_leaders, that can be set on both the GET and POST /get_normalized_nodes endpoints. Enabling this flag adds a clique_leaders key to each normalized identifier, containing a list of all the clique leaders in this clique along with their name, type and taxa, and (if the appropriate flags are turned on) descriptions. This doesn't currently include all the identifiers in each clique, but that could be added without much extra work.
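
A minimal sketch of how a client might set the flag on each endpoint. It only builds the request URL and JSON body and sends nothing. The host is a placeholder, and the `curie` query parameter and `curies` body field are assumptions about the existing API that this description does not spell out; only include_clique_leaders and the /get_normalized_nodes path come from this PR.

```python
import json
from urllib.parse import urlencode

# Placeholder host (assumption): substitute a real NodeNorm deployment.
BASE_URL = "https://nodenorm.example.org"


def build_get_url(curies, include_clique_leaders=True):
    """Build a GET URL for /get_normalized_nodes with the new flag set."""
    # `curie` as a repeated query parameter is an assumption about the API.
    params = [("curie", c) for c in curies]
    params.append(("include_clique_leaders", str(include_clique_leaders).lower()))
    return f"{BASE_URL}/get_normalized_nodes?{urlencode(params)}"


def build_post_body(curies, include_clique_leaders=True):
    """Build the JSON body for POST /get_normalized_nodes."""
    # `curies` as the list field name is an assumption about the input model.
    return json.dumps(
        {"curies": list(curies), "include_clique_leaders": include_clique_leaders}
    )


print(build_get_url(["NCBIGene:1756"]))
print(build_post_body(["NCBIGene:1756"]))
```

The flag defaults to on in these helpers only for illustration; the server-side default is not stated here.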

WIP

  • Would it be useful to include the list of identifiers for each clique leader? That will require some additional finagling with the code, but it shouldn't be too problematic.
  • Is clique_leaders really the best thing to call this thing? These are all clique leader identifiers, but maybe conflation_leaders or something else would be better?
  • Add tests to Babel Validator

Example

Example output for NCBIGene:1756 is included below. Note that the conflation type is included (e.g. "conflation": "GeneProtein") and that clique_leaders is a list of the clique leaders, ordered by their position in the normalized output.

{
  "NCBIGene:1756": {
    "id": {
      "identifier": "NCBIGene:1756",
      "label": "DMD",
      "description": "dystrophin"
    },
    "equivalent_identifiers": [
      {
        "identifier": "NCBIGene:1756",
        "label": "DMD",
        "description": "dystrophin",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Gene"
      },
      {
        "identifier": "ENSEMBL:ENSG00000198947",
        "type": "biolink:Gene"
      },
      {
        "identifier": "HGNC:2928",
        "label": "DMD",
        "type": "biolink:Gene"
      },
      {
        "identifier": "OMIM:300377",
        "type": "biolink:Gene"
      },
      {
        "identifier": "UMLS:C1414083",
        "label": "DMD gene",
        "type": "biolink:Gene"
      },
      {
        "identifier": "UniProtKB:A0A087WV90",
        "label": "A0A087WV90_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A0S2Z3B5",
        "label": "A0A0S2Z3B5_HUMAN Dystrophin isoform 2 (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A0S2Z3J7",
        "label": "A0A0S2Z3J7_HUMAN Dystrophin isoform 1 (Fragment) (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRP9",
        "label": "A0A5H1ZRP9_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRQ1",
        "label": "A0A5H1ZRQ1_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRQ8",
        "label": "A0A5H1ZRQ8_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRR9",
        "label": "A0A5H1ZRR9_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A804HKY9",
        "label": "A0A804HKY9_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A7E212",
        "label": "A7E212_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:P11532",
        "label": "DMD_HUMAN Dystrophin (sprot)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "PR:P11532",
        "label": "dystrophin (human)",
        "description": "A dystrophin that is encoded in the genome of human.",
        "type": "biolink:Protein"
      },
      {
        "identifier": "UMLS:C1437024",
        "label": "DMD protein, human",
        "type": "biolink:Protein"
      },
      {
        "identifier": "MESH:C484258",
        "label": "DMD protein, human",
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:Q16484",
        "label": "Q16484_HUMAN DMD protein (Fragment) (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:Q4G0X0",
        "label": "Q4G0X0_HUMAN DMD protein (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "ENSEMBL:ENSP00000288447",
        "type": "biolink:Protein"
      },
      {
        "identifier": "ENSEMBL:ENSP00000288447.4",
        "type": "biolink:Protein"
      }
    ],
    "descriptions": [
      "dystrophin",
      "A dystrophin that is encoded in the genome of human."
    ],
    "taxa": [
      "NCBITaxon:9606"
    ],
    "clique_leaders": [
      {
        "identifier": "NCBIGene:1756",
        "conflation": "GeneProtein",
        "label": "DMD",
        "description": [
          "dystrophin"
        ],
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Gene"
      },
      {
        "identifier": "UniProtKB:A0A087WV90",
        "conflation": "GeneProtein",
        "label": "A0A087WV90_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A0S2Z3B5",
        "conflation": "GeneProtein",
        "label": "A0A0S2Z3B5_HUMAN Dystrophin isoform 2 (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A0S2Z3J7",
        "conflation": "GeneProtein",
        "label": "A0A0S2Z3J7_HUMAN Dystrophin isoform 1 (Fragment) (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRP9",
        "conflation": "GeneProtein",
        "label": "A0A5H1ZRP9_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRQ1",
        "conflation": "GeneProtein",
        "label": "A0A5H1ZRQ1_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRQ8",
        "conflation": "GeneProtein",
        "label": "A0A5H1ZRQ8_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A5H1ZRR9",
        "conflation": "GeneProtein",
        "label": "A0A5H1ZRR9_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A0A804HKY9",
        "conflation": "GeneProtein",
        "label": "A0A804HKY9_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:A7E212",
        "conflation": "GeneProtein",
        "label": "A7E212_HUMAN Dystrophin (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:P11532",
        "conflation": "GeneProtein",
        "label": "DMD_HUMAN Dystrophin (sprot)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:Q16484",
        "conflation": "GeneProtein",
        "label": "Q16484_HUMAN DMD protein (Fragment) (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      },
      {
        "identifier": "UniProtKB:Q4G0X0",
        "conflation": "GeneProtein",
        "label": "Q4G0X0_HUMAN DMD protein (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ],
        "type": "biolink:Protein"
      }
    ],
    "type": [
      "biolink:Gene",
      "biolink:GeneOrGeneProduct",
      "biolink:GenomicEntity",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:PhysicalEssence",
      "biolink:OntologyClass",
      "biolink:BiologicalEntity",
      "biolink:ThingWithTaxon",
      "biolink:NamedThing",
      "biolink:PhysicalEssenceOrOccurrent",
      "biolink:MacromolecularMachineMixin",
      "biolink:Protein",
      "biolink:GeneProductMixin",
      "biolink:Polypeptide",
      "biolink:ChemicalEntityOrProteinOrPolypeptide"
    ],
    "information_content": 79.9
  }
}
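
One way a client might consume this output is to group the clique leaders by Biolink type, separating the gene clique from the protein cliques. The sketch below runs on a hand-trimmed copy of the response above (two of the thirteen leaders); the helper name `leaders_by_type` is hypothetical and not part of this PR.

```python
from collections import defaultdict

# Trimmed copy of the example response above: only two clique leaders kept.
response = {
    "NCBIGene:1756": {
        "clique_leaders": [
            {"identifier": "NCBIGene:1756", "conflation": "GeneProtein",
             "label": "DMD", "type": "biolink:Gene"},
            {"identifier": "UniProtKB:P11532", "conflation": "GeneProtein",
             "label": "DMD_HUMAN Dystrophin (sprot)", "type": "biolink:Protein"},
        ]
    }
}


def leaders_by_type(result):
    """Group clique leader identifiers by Biolink type, preserving order."""
    grouped = defaultdict(list)
    # .get() tolerates results produced without include_clique_leaders.
    for leader in result.get("clique_leaders", []):
        grouped[leader.get("type", "unknown")].append(leader["identifier"])
    return dict(grouped)


print(leaders_by_type(response["NCBIGene:1756"]))
```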

@gaurav gaurav moved this from Backlog to In progress in Babel sprints Feb 18, 2026
@gaurav gaurav changed the base branch from master to add-nodenorm-version-to-status February 20, 2026 00:24
Base automatically changed from add-nodenorm-version-to-status to master February 20, 2026 00:31
@gaurav gaurav marked this pull request as ready for review February 25, 2026 01:39
@gaurav gaurav requested a review from Copilot February 25, 2026 01:40
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new include_clique_leaders parameter to the /get_normalized_nodes endpoints (both GET and POST) to support deconflation use cases. When enabled, the API returns detailed information about individual clique leaders for conflated identifiers, helping users understand which cliques are being combined during gene/protein and drug/chemical conflation.

Changes:

  • Added include_clique_leaders boolean parameter to normalization endpoints
  • Modified normalization logic to collect and output clique leader information when requested
  • Updated several variable names for clarity (e.g., types → types_with_ancestors)
  • Added LLM-generated docstrings to the get_eqids_and_types function

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| node_normalizer/server.py | Added include_clique_leaders query parameter to GET endpoint and passed it through to normalization logic |
| node_normalizer/model/input.py | Added include_clique_leaders field to CurieList input model for POST endpoint |
| node_normalizer/set_id.py | Updated call to get_normalized_nodes() to explicitly pass include_clique_leaders=False |
| node_normalizer/normalizer.py | Core implementation: collects clique leaders when conflation is enabled, generates clique leader output with metadata (identifier, conflation type, label, description, taxa, type), and includes it in response |

Comments suppressed due to low confidence (1)

node_normalizer/normalizer.py:558

  • The docstring for this function is incomplete and doesn't describe the parameters, including the new include_clique_leaders parameter. Given that the codebase uses docstring conventions (as seen in get_eqids_and_types and other functions), this function's docstring should be updated to document all parameters and their purposes, particularly the new optional parameters that control output formatting.
    """
    Get value(s) for key(s) using redis MGET
    """

Comment on lines +862 to +881
if clique_leaders:
    for conflation_type in clique_leaders:
        if canonical_id in clique_leaders[conflation_type] and eqid["i"] in clique_leaders[conflation_type][canonical_id]:
            clique_leader_output = {
                "identifier": eqid["i"],
                "conflation": conflation_type,
            }
            if "label" in eq_item:
                clique_leader_output["label"] = eq_item["label"]

            # For description, taxa and type, we could read them from eq_item, but that
            # is only set if the appropriate flag was turned on. For completeness, let's
            # try picking them up if they've been passed to us at all.
            if "d" in eqid and len(eqid["d"]) > 0:
                clique_leader_output["description"] = eqid["d"]
            if "t" in eqid and eqid["t"]:
                clique_leader_output["taxa"] = eqid["t"]
            if 'types' in eqid:
                clique_leader_output["type"] = eqid['types'][-1]
            clique_leaders_output.append(clique_leader_output)

Copilot AI Feb 25, 2026

The loop structure here could be optimized. Currently, for every equivalent identifier, the code checks all conflation types to see if it's a clique leader. This could be improved by pre-computing a set of clique leaders for faster lookup, especially since the print statement on line 861 will execute for every single equivalent identifier in the response, which could be hundreds or thousands of times for large queries. Consider moving the clique leader check logic outside the main loop or optimizing it with a set-based lookup.
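
A rough sketch of the set-based lookup suggested here, assuming clique_leaders maps each conflation type to a dict from canonical ID to the leader identifiers of that clique (which is what the indexing in the diff implies). The function names and sample data are hypothetical, and only the identifier and conflation fields of the output are reproduced.

```python
def index_leaders(clique_leaders, canonical_id):
    """Map each leader identifier to the conflation types it appears under.

    Built once per canonical ID, so the per-identifier check inside the
    main loop becomes a single dict lookup.
    """
    index = {}
    for conflation_type, by_canonical in clique_leaders.items():
        for leader_id in by_canonical.get(canonical_id, ()):
            index.setdefault(leader_id, []).append(conflation_type)
    return index


def leader_outputs(eqids, clique_leaders, canonical_id):
    """Emit one entry per (leader, conflation type), in eqids order."""
    index = index_leaders(clique_leaders, canonical_id)
    outputs = []
    for eqid in eqids:
        for conflation_type in index.get(eqid["i"], ()):
            outputs.append({"identifier": eqid["i"], "conflation": conflation_type})
    return outputs


# Hypothetical sample data shaped like the structures in the diff.
clique_leaders = {
    "GeneProtein": {"NCBIGene:1756": ["NCBIGene:1756", "UniProtKB:P11532"]}
}
eqids = [{"i": "NCBIGene:1756"}, {"i": "HGNC:2928"}, {"i": "UniProtKB:P11532"}]
print(leader_outputs(eqids, clique_leaders, "NCBIGene:1756"))
```

Non-leaders such as HGNC:2928 miss the index and produce no output, and the output order still follows the equivalent identifiers, as in the original loop.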

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@gaurav gaurav marked this pull request as draft February 25, 2026 01:48
@gaurav gaurav mentioned this pull request Feb 25, 2026
Labels

None yet

Projects

Status: In progress
Status: Backlog

Development

Successfully merging this pull request may close these issues.

  • Deconflation endpoint
  • Add option to provide clique leaders in addition to the combined clique

2 participants