
Update version to 0.19.3 and enhance gene symbol handling#72

Merged
parashardhapola merged 1 commit into master from gene_symbol_handling
Mar 8, 2026

@parashardhapola parashardhapola commented Mar 8, 2026

  • Bump package version to 0.19.3.
  • Introduce materialization of a canonical gene symbols column in AnnData, improving gene symbol management.
  • Refactor CyteType class initialization to handle gene symbols more flexibly, including support for temporary columns.
  • Update save_features_matrix to conditionally include gene symbols metadata in output files.
  • Enhance tests to validate new gene symbol handling and ensure proper functionality.

@parashardhapola parashardhapola changed the title from "Update version to 0.19.3 and enhance gene symbol handling in CyteType" to "Update version to 0.19.3 and enhance gene symbol handling" on Mar 8, 2026
@qodo-code-review

Review Summary by Qodo

Enhance gene symbol handling with canonical column materialization

✨ Enhancement 🧪 Tests

Walkthroughs

Description
• Bump package version to 0.19.3
• Introduce materialization of canonical gene symbols column in AnnData for flexible gene symbol
  handling
• Refactor CyteType initialization to support temporary gene symbol columns with proper cleanup
• Update save_features_matrix to conditionally include gene symbols metadata in output files
• Add comprehensive tests validating new gene symbol handling and cleanup behavior
Diagram
flowchart LR
  A["Gene Symbol Resolution"] --> B["Materialize Canonical Column"]
  B --> C["Temporary Column in adata.var"]
  C --> D["CyteType Initialization"]
  D --> E["save_features_matrix"]
  E --> F["Store gene_symbols_column metadata"]
  D --> G["Cleanup on Error"]
  G --> H["Remove Temporary Column"]
  D --> I["cleanup Method"]
  I --> H

File Changes

1. cytetype/__init__.py ⚙️ Configuration changes +1/-1

Bump version to 0.19.3

• Update package version from 0.19.2 to 0.19.3


2. cytetype/core/artifacts.py ✨ Enhancement +5/-0

Add gene symbols metadata to artifact files

• Add gene_symbols_column parameter to _write_var_metadata function
• Store gene_symbols_column as HDF5 attribute in var group when provided
• Add gene_symbols_column parameter to save_features_matrix function
• Pass gene_symbols_column through to _write_var_metadata call


3. cytetype/main.py ✨ Enhancement +150/-120

Refactor gene symbol handling with temporary column support

• Remove import of clean_gene_names function
• Add import of materialize_canonical_gene_symbols_column from validation module
• Add instance variables for tracking temporary and original gene symbol columns
• Wrap initialization logic in try-except block with cleanup on failure
• Call materialize_canonical_gene_symbols_column to create temporary canonical column
• Remove clean_gene_names call and use materialized column directly
• Pass gene_symbols_column parameter to save_features_matrix
• Add _cleanup_temporary_gene_symbols_column method to remove temporary columns
• Update cleanup method to call temporary column cleanup
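
The lifecycle these bullets describe (create a temporary column, roll it back if initialization fails, remove it again in cleanup) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: a plain dict stands in for `adata.var`, and the `Annotator` class, its `_build_payload` step, and the `__tmp_gene_symbols` column name are all hypothetical.

```python
class Annotator:
    """Hypothetical stand-in for a class that owns a temporary column."""

    def __init__(self, var: dict, fail: bool = False):
        self.var = var
        self._temp_column = None
        try:
            # Materialize the temporary column before any step that reads it.
            self._temp_column = "__tmp_gene_symbols"
            self.var[self._temp_column] = [s.upper() for s in var["symbols"]]
            self._build_payload(fail)
        except Exception:
            # Roll back so a failed init leaves the caller's data untouched.
            self._cleanup_temporary_column()
            raise

    def _build_payload(self, fail: bool) -> None:
        # Placeholder for the remaining initialization work.
        if fail:
            raise ValueError("simulated initialization failure")

    def _cleanup_temporary_column(self) -> None:
        # Safe to call more than once: the column name is cleared after removal.
        if self._temp_column is not None:
            self.var.pop(self._temp_column, None)
            self._temp_column = None

    def cleanup(self) -> None:
        self._cleanup_temporary_column()
```

Keeping the removal logic in one idempotent helper means the error path and the public `cleanup` path cannot drift apart.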

4. cytetype/preprocessing/validation.py ✨ Enhancement +33/-0

Add canonical gene symbols column materialization

• Add _CANONICAL_GENE_SYMBOLS_COLUMN constant for temporary column naming
• Add _temporary_gene_symbols_column_name function to generate unique column names
• Add materialize_canonical_gene_symbols_column function to create canonical gene symbols column
• Function handles both explicit gene symbol columns and var_names as sources
• Applies clean_gene_names during materialization
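
One way a unique temporary column name could be generated is sketched below. The PR summary does not show the value of `_CANONICAL_GENE_SYMBOLS_COLUMN` or the naming scheme used by `_temporary_gene_symbols_column_name`, so the base name and the numeric-suffix strategy here are assumptions for illustration only.

```python
def temporary_column_name(existing, base="__canonical_gene_symbols"):
    """Return `base`, or `base_<n>` with the smallest n not already in `existing`.

    `existing` is any collection of column names already in use.
    """
    taken = set(existing)
    if base not in taken:
        return base
    suffix = 1
    while f"{base}_{suffix}" in taken:
        suffix += 1
    return f"{base}_{suffix}"
```

For example, `temporary_column_name(["gene_ids", "__canonical_gene_symbols"])` skips the occupied base name and returns a suffixed one, so an existing user column is never overwritten.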


5. tests/test_artifacts.py 🧪 Tests +39/-0

Test gene symbols metadata in artifacts

• Add gene_symbols_column parameter to existing test
• Add assertion to verify gene_symbols_column attribute is stored in HDF5
• Add test for omitting gene symbols attribute when not provided
• Add test for omitting gene symbols attribute when parameter is omitted


6. tests/test_cytetype_integration.py 🧪 Tests +85/-1

Test CyteType gene symbol materialization and cleanup

• Add scanpy import for test utilities
• Update initialization test to verify temporary canonical column creation
• Add test for materializing canonical column from composite gene symbols
• Add test for materializing canonical column from composite var_names
• Add test verifying cleanup removes temporary column and restores original
• Add test verifying rollback of temporary column on initialization failure

qodo-code-review bot commented Mar 8, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required

1. Gene symbol collisions overwrite 🐞 Bug ✓ Correctness
Description
Canonical gene-symbol materialization can produce empty strings (from NaNs) and can collapse
multiple IDs to the same symbol, but expression_percentages is keyed by gene name; duplicates will
silently overwrite earlier genes and drop expression data from the payload.
Code

cytetype/preprocessing/validation.py[R90-110]

+def materialize_canonical_gene_symbols_column(
+    adata: anndata.AnnData, gene_symbols_column: str | None
+) -> tuple[str, str | None]:
+    """Create a temporary canonical gene-symbol column in ``adata.var``."""
+    if gene_symbols_column is None:
+        source_values = adata.var_names.astype(str).tolist()
+        source_name = "adata.var_names"
+    else:
+        source_values = [
+            str(value)
+            for value in adata.var[gene_symbols_column].astype("string").fillna("")
+        ]
+        source_name = f"column '{gene_symbols_column}'"
+
+    canonical_column = _temporary_gene_symbols_column_name(adata)
+    adata.var[canonical_column] = clean_gene_names(source_values)
+    logger.info(
+        f"Materialized canonical gene symbols in temporary column '{canonical_column}' "
+        f"from {source_name}."
+    )
+    return canonical_column, gene_symbols_column
Evidence
materialize_canonical_gene_symbols_column() fills missing values with "" and then applies
clean_gene_names(), which can collapse composite identifiers to a single symbol. CyteType then uses
that column to build gene_names for aggregate_expression_percentages(), which stores results in a
dict keyed by gene name—so duplicate/empty names overwrite prior entries with no warning.

cytetype/preprocessing/validation.py[90-110]
cytetype/main.py[148-184]
cytetype/preprocessing/aggregation.py[46-52]
cytetype/preprocessing/validation.py[52-78]
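
The overwrite described in the Evidence above is ordinary dict behavior and can be reproduced without any of the project's code. The snippet below is only an illustration: the gene names and percentages are made up, and the real `aggregate_expression_percentages()` may structure its result differently.

```python
# Four features, but two share the symbol "TP53" and two are empty strings
# (for example, NaN values that were filled with "").
gene_names = ["TP53", "", "TP53", ""]
percentages = [10.0, 20.0, 30.0, 40.0]

by_gene = {}
for name, pct in zip(gene_names, percentages):
    # A later entry with the same key replaces the earlier one silently.
    by_gene[name] = pct

dropped = len(gene_names) - len(by_gene)
```

Here four input features collapse to two keys, and the values for the first `TP53` and the first empty symbol are lost without any error or warning.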

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

## Issue description
`materialize_canonical_gene_symbols_column()` can emit duplicate or empty canonical symbols (e.g., NaNs -> "" and composite IDs collapsing to the same gene symbol). `aggregate_expression_percentages()` stores results in a dict keyed by gene name, so duplicates silently overwrite earlier entries, losing expression data.

## Issue Context
- Canonical column is used as the source for `gene_names` in `CyteType.__init__`.
- Expression percentages are keyed by `gene_names`.

## Fix Focus Areas
- cytetype/preprocessing/validation.py[90-110]
- cytetype/main.py[157-184]
- cytetype/preprocessing/aggregation.py[46-52]

## Suggested direction
1. In `materialize_canonical_gene_symbols_column()`, after computing cleaned values:
  - Replace empty/whitespace-only symbols with a stable fallback (e.g., the corresponding `adata.var_names`).
  - Detect duplicates; either:
    - disambiguate (e.g., append a suffix based on position or var_name), or
    - aggregate duplicates in `aggregate_expression_percentages()` instead of overwriting (e.g., max/mean across colliding columns), depending on intended semantics.
2. Add a warning log if duplicates were present and how they were handled.
3. Add a targeted test covering the case where missing symbols and duplicated canonical symbols still lead to deterministic, non-lossy behavior.
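
A rough sketch of the first suggested direction (fall back to the feature's own name for empty symbols, then disambiguate duplicates and log a warning) might look like the following. The function name `make_unique_symbols` and the `_` separator are hypothetical, and the sketch assumes the fallback names (the equivalent of `adata.var_names`) are themselves unique.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def make_unique_symbols(symbols, var_names):
    """Fill empty symbols from var_names, then suffix duplicated symbols.

    Assumes var_names are unique. A suffixed name could still collide with an
    unrelated symbol that happens to equal "<symbol>_<var_name>"; a production
    version would need to check for that.
    """
    # Step 1: replace empty or whitespace-only symbols with the var_name.
    filled = [s if s.strip() else v for s, v in zip(symbols, var_names)]

    # Step 2: find symbols that occur more than once after filling.
    counts = Counter(filled)
    duplicated = {s for s, n in counts.items() if n > 1}
    if duplicated:
        logger.warning(
            "Disambiguating %d duplicated gene symbol(s) with var_names.",
            len(duplicated),
        )

    # Step 3: suffix only the colliding entries so unique symbols stay as-is.
    return [
        f"{s}_{v}" if s in duplicated else s
        for s, v in zip(filled, var_names)
    ]
```

With symbols `["TP53", "", "TP53", "ACTB"]` and four distinct var_names, every output name is distinct, so a dict keyed by these names keeps all four features. The alternative mentioned in the review, aggregating colliding columns instead of renaming them, changes the meaning of the output and is not shown here.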



Remediation recommended

2. Mixed composite mismatch 🐞 Bug ✓ Correctness
Description
Canonicalization uses clean_gene_names(), which only extracts symbols if >50% of values look
composite; marker-gene extraction always splits composites per value. In mixed datasets (<50%
composite), expressionData can keep composite names while markerGenes are simplified, creating
inconsistent gene naming within one payload.
Code

cytetype/preprocessing/validation.py[R90-110]

+def materialize_canonical_gene_symbols_column(
+    adata: anndata.AnnData, gene_symbols_column: str | None
+) -> tuple[str, str | None]:
+    """Create a temporary canonical gene-symbol column in ``adata.var``."""
+    if gene_symbols_column is None:
+        source_values = adata.var_names.astype(str).tolist()
+        source_name = "adata.var_names"
+    else:
+        source_values = [
+            str(value)
+            for value in adata.var[gene_symbols_column].astype("string").fillna("")
+        ]
+        source_name = f"column '{gene_symbols_column}'"
+
+    canonical_column = _temporary_gene_symbols_column_name(adata)
+    adata.var[canonical_column] = clean_gene_names(source_values)
+    logger.info(
+        f"Materialized canonical gene symbols in temporary column '{canonical_column}' "
+        f"from {source_name}."
+    )
+    return canonical_column, gene_symbols_column
Evidence
The canonical column is produced via clean_gene_names(), which is gated by a >50% composite
heuristic. However, when a gene_symbols_col is provided, extract_marker_genes always runs
_extract_symbol_from_composite for every value. Therefore, if composites exist but don’t exceed the
threshold, expression_percentages keys may remain composite strings while marker genes are
de-composited.

cytetype/preprocessing/validation.py[64-78]
cytetype/preprocessing/validation.py[40-49]
cytetype/preprocessing/validation.py[90-110]
cytetype/preprocessing/extraction.py[49-53]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

## Issue description
`clean_gene_names()` only extracts symbols when >50% of values appear composite, but marker extraction always applies per-value composite splitting. For mixed inputs, this can leave composite strings in the canonical column (hence expressionData keys) while markerGenes are simplified—creating inconsistent gene naming inside the same request.

## Issue Context
- Canonical column is created via `clean_gene_names(source_values)`.
- Marker extraction applies `_extract_symbol_from_composite()` unconditionally when `gene_symbols_col` is provided.

## Fix Focus Areas
- cytetype/preprocessing/validation.py[40-49]
- cytetype/preprocessing/validation.py[64-78]
- cytetype/preprocessing/validation.py[90-110]
- cytetype/preprocessing/extraction.py[49-53]

## Suggested direction
- In `materialize_canonical_gene_symbols_column()`, consider generating canonical values with per-value extraction:
 - `cleaned = [_extract_symbol_from_composite(v) for v in source_values]` (safe for non-composite values because the helper returns the original string when not composite).
 - Optionally keep the current threshold only for logging (e.g., log how many values changed), but avoid gating the actual transformation.
- Add/adjust tests for a mixed composite case (e.g., 20% composite) asserting that expressionData keys and markerGenes names are consistent.
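
The difference between threshold-gated and per-value extraction can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the review does not show the composite format or the bodies of `clean_gene_names()` and `_extract_symbol_from_composite()`, so the sketch assumes composites look like `ENSG..._SYMBOL` and uses its own helper names.

```python
def looks_composite(value):
    # Assumed format for illustration only: "<ENSG id>_<symbol>".
    head, sep, tail = value.partition("_")
    return bool(sep) and head.startswith("ENSG") and bool(tail)


def extract_symbol(value):
    # Returns the original string unchanged when it is not composite.
    return value.partition("_")[2] if looks_composite(value) else value


def clean_gated(values, threshold=0.5):
    # Transform the list only if the share of composites exceeds the threshold.
    if not values:
        return []
    share = sum(looks_composite(v) for v in values) / len(values)
    if share > threshold:
        return [extract_symbol(v) for v in values]
    return list(values)


def clean_per_value(values):
    # Transform every value independently, regardless of the overall share.
    return [extract_symbol(v) for v in values]


# Mixed input: 1 of 5 values (20%) is composite.
values = ["ENSG00000141510_TP53", "ACTB", "GAPDH", "CD3E", "MS4A1"]
gated = clean_gated(values)
per_value = clean_per_value(values)
```

Under these assumptions the gated version leaves the composite string in place while the per-value version simplifies it, which is the kind of naming mismatch the review describes. A test for the mixed case could assert that both code paths produce the same name for every feature.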


@parashardhapola parashardhapola merged commit e27bd80 into master Mar 8, 2026
1 check passed
@parashardhapola parashardhapola deleted the gene_symbol_handling branch March 8, 2026 22:33