Update version to 0.19.3 and enhance gene symbol handling by parashardhapola · Pull Request #72 · NygenAnalytics/CyteType

parashardhapola · 2026-03-08T21:46:17Z

Bump package version to 0.19.3.
Introduce materialization of a canonical gene symbols column in AnnData, improving gene symbol management.
Refactor CyteType class initialization to handle gene symbols more flexibly, including support for temporary columns.
Update save_features_matrix to conditionally include gene symbols metadata in output files.
Enhance tests to validate new gene symbol handling and ensure proper functionality.

qodo-code-review · 2026-03-08T21:46:33Z

Review Summary by Qodo

Enhance gene symbol handling with canonical column materialization

✨ Enhancement 🧪 Tests

Walkthroughs

Description

• Bump package version to 0.19.3
• Introduce materialization of canonical gene symbols column in AnnData for flexible gene symbol
  handling
• Refactor CyteType initialization to support temporary gene symbol columns with proper cleanup
• Update save_features_matrix to conditionally include gene symbols metadata in output files
• Add comprehensive tests validating new gene symbol handling and cleanup behavior

Diagram

flowchart LR
  A["Gene Symbol Resolution"] --> B["Materialize Canonical Column"]
  B --> C["Temporary Column in adata.var"]
  C --> D["CyteType Initialization"]
  D --> E["save_features_matrix"]
  E --> F["Store gene_symbols_column metadata"]
  D --> G["Cleanup on Error"]
  G --> H["Remove Temporary Column"]
  D --> I["cleanup Method"]
  I --> H

File Changes

1. cytetype/__init__.py ⚙️ Configuration changes +1/-1

Bump version to 0.19.3
• Update package version from 0.19.2 to 0.19.3
cytetype/__init__.py

2. cytetype/core/artifacts.py ✨ Enhancement +5/-0

Add gene symbols metadata to artifact files

• Add gene_symbols_column parameter to _write_var_metadata function
• Store gene_symbols_column as HDF5 attribute in var group when provided
• Add gene_symbols_column parameter to save_features_matrix function
• Pass gene_symbols_column through to _write_var_metadata call

cytetype/core/artifacts.py

3. cytetype/main.py ✨ Enhancement +150/-120

Refactor gene symbol handling with temporary column support

• Remove import of clean_gene_names function
• Add import of materialize_canonical_gene_symbols_column from validation module
• Add instance variables for tracking temporary and original gene symbol columns
• Wrap initialization logic in try-except block with cleanup on failure
• Call materialize_canonical_gene_symbols_column to create temporary canonical column
• Remove clean_gene_names call and use materialized column directly
• Pass gene_symbols_column parameter to save_features_matrix
• Add _cleanup_temporary_gene_symbols_column method to remove temporary columns
• Update cleanup method to call temporary column cleanup

cytetype/main.py
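
The rollback-on-failure pattern listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: a plain dict stands in for `adata.var`, and the class name `TempColumnOwner`, the column name, and the `fail` flag are all hypothetical.

```python
class TempColumnOwner:
    """Toy stand-in for an object that adds a temporary column during init."""

    TEMP_COLUMN = "_tmp_canonical_symbols"  # hypothetical column name

    def __init__(self, var: dict, fail: bool = False) -> None:
        self.var = var
        self._temp_column = None
        try:
            # Materialize the temporary column first ...
            var[self.TEMP_COLUMN] = [str(v).strip() for v in var["names"]]
            self._temp_column = self.TEMP_COLUMN
            # ... then run the rest of initialization, which may raise.
            if fail:
                raise ValueError("simulated initialization failure")
        except Exception:
            # Roll back so the caller's data is left untouched, then re-raise.
            self._cleanup_temporary_column()
            raise

    def _cleanup_temporary_column(self) -> None:
        # Safe to call more than once: a second call is a no-op.
        if self._temp_column is not None:
            self.var.pop(self._temp_column, None)
            self._temp_column = None

    def cleanup(self) -> None:
        self._cleanup_temporary_column()
```

Keeping the removal in one idempotent helper means the error path and the public `cleanup` method share the same logic.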

4. cytetype/preprocessing/validation.py ✨ Enhancement +33/-0

Add canonical gene symbols column materialization

• Add _CANONICAL_GENE_SYMBOLS_COLUMN constant for temporary column naming
• Add _temporary_gene_symbols_column_name function to generate unique column names
• Add materialize_canonical_gene_symbols_column function to create canonical gene symbols column
• Function handles both explicit gene symbol columns and var_names as sources
• Applies clean_gene_names during materialization

cytetype/preprocessing/validation.py
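
The PR summary does not show how `_temporary_gene_symbols_column_name` picks a name. One plausible approach, given purely as an assumption, is to start from a base name and add a numeric suffix until it no longer collides with an existing column. The function name, base string, and suffix scheme below are hypothetical.

```python
from __future__ import annotations


def temporary_column_name(
    existing: set[str], base: str = "_canonical_gene_symbols"
) -> str:
    """Return `base`, or `base_1`, `base_2`, ... if the name is already taken."""
    if base not in existing:
        return base
    suffix = 1
    while f"{base}_{suffix}" in existing:
        suffix += 1
    return f"{base}_{suffix}"
```

With an AnnData object, `existing` would be something like `set(adata.var.columns)`.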

5. tests/test_artifacts.py 🧪 Tests +39/-0

Test gene symbols metadata in artifacts

• Add gene_symbols_column parameter to existing test
• Add assertion to verify gene_symbols_column attribute is stored in HDF5
• Add test for omitting gene symbols attribute when not provided
• Add test for omitting gene symbols attribute when parameter is omitted

tests/test_artifacts.py

6. tests/test_cytetype_integration.py 🧪 Tests +85/-1

Test CyteType gene symbol materialization and cleanup

• Add scanpy import for test utilities
• Update initialization test to verify temporary canonical column creation
• Add test for materializing canonical column from composite gene symbols
• Add test for materializing canonical column from composite var_names
• Add test verifying cleanup removes temporary column and restores original
• Add test verifying rollback of temporary column on initialization failure

tests/test_cytetype_integration.py

qodo-code-review · 2026-03-08T21:46:34Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. Gene symbol collisions overwrite 🐞 Bug ✓ Correctness

Description

Canonical gene-symbol materialization can produce empty strings (from NaNs) and can collapse
multiple IDs to the same symbol, but expression_percentages is keyed by gene name; duplicates will
silently overwrite earlier genes and drop expression data from the payload.

Code

cytetype/preprocessing/validation.py[R90-110]

+def materialize_canonical_gene_symbols_column(
+    adata: anndata.AnnData, gene_symbols_column: str | None
+) -> tuple[str, str | None]:
+    """Create a temporary canonical gene-symbol column in ``adata.var``."""
+    if gene_symbols_column is None:
+        source_values = adata.var_names.astype(str).tolist()
+        source_name = "adata.var_names"
+    else:
+        source_values = [
+            str(value)
+            for value in adata.var[gene_symbols_column].astype("string").fillna("")
+        ]
+        source_name = f"column '{gene_symbols_column}'"
+
+    canonical_column = _temporary_gene_symbols_column_name(adata)
+    adata.var[canonical_column] = clean_gene_names(source_values)
+    logger.info(
+        f"Materialized canonical gene symbols in temporary column '{canonical_column}' "
+        f"from {source_name}."
+    )
+    return canonical_column, gene_symbols_column

Evidence
materialize_canonical_gene_symbols_column() fills missing values with "" and then applies
clean_gene_names(), which can collapse composite identifiers to a single symbol. CyteType then uses
that column to build gene_names for aggregate_expression_percentages(), which stores results in a
dict keyed by gene name—so duplicate/empty names overwrite prior entries with no warning.
cytetype/preprocessing/validation.py[90-110]
cytetype/main.py[148-184]
cytetype/preprocessing/aggregation.py[46-52]
cytetype/preprocessing/validation.py[52-78]
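
The overwrite described in the evidence can be reproduced with a plain dict. The body of `aggregate_expression_percentages()` is not shown here, so `percentages_by_gene` below is a hypothetical stand-in that only mirrors the "dict keyed by gene name" behavior.

```python
def percentages_by_gene(gene_names, values):
    """Build a mapping keyed by gene name; later duplicates replace earlier ones."""
    result = {}
    for name, value in zip(gene_names, values):
        result[name] = value
    return result


# Two genes collapse to "TP53" and two have empty symbols.
names = ["TP53", "", "TP53", ""]
values = [10.0, 20.0, 30.0, 40.0]
collapsed = percentages_by_gene(names, values)
print(collapsed)  # {'TP53': 30.0, '': 40.0}
```

Four input genes yield two entries, and nothing signals that the values 10.0 and 20.0 were dropped.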

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`materialize_canonical_gene_symbols_column()` can emit duplicate or empty canonical symbols (e.g., NaNs -> "" and composite IDs collapsing to the same gene symbol). `aggregate_expression_percentages()` stores results in a dict keyed by gene name, so duplicates silently overwrite earlier entries, losing expression data.

## Issue Context
- Canonical column is used as the source for `gene_names` in `CyteType.__init__`.
- Expression percentages are keyed by `gene_names`.

## Fix Focus Areas
- cytetype/preprocessing/validation.py[90-110]
- cytetype/main.py[157-184]
- cytetype/preprocessing/aggregation.py[46-52]

## Suggested direction
1. In `materialize_canonical_gene_symbols_column()`, after computing cleaned values:
  - Replace empty/whitespace-only symbols with a stable fallback (e.g., the corresponding `adata.var_names`).
  - Detect duplicates; either:
    - disambiguate (e.g., append a suffix based on position or var_name), or
    - aggregate duplicates in `aggregate_expression_percentages()` instead of overwriting (e.g., max/mean across colliding columns), depending on intended semantics.
2. Add a warning log if duplicates were present and how they were handled.
3. Add a targeted test covering: missing symbols + duplicated canonical symbols leads to deterministic, non-lossy behavior.
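
A rough sketch of the first suggested direction (fallback for blank symbols, then disambiguation by var_name) might look like this. It is not the fix that was merged. The function name and the `|` separator are hypothetical, and the sketch assumes the var_names themselves are unique.

```python
from collections import Counter


def make_unique_symbols(symbols, var_names):
    """Fill blank symbols from var_names, then suffix any repeated symbol."""
    # Step 1: empty or whitespace-only symbols fall back to the var_name.
    filled = [
        sym if sym.strip() else fallback
        for sym, fallback in zip(symbols, var_names)
    ]
    # Step 2: symbols that occur more than once get the var_name appended.
    counts = Counter(filled)
    unique = [
        f"{sym}|{var_name}" if counts[sym] > 1 else sym
        for sym, var_name in zip(filled, var_names)
    ]
    # Number of entries that had to be disambiguated, so the caller can warn.
    n_ambiguous = sum(1 for sym in filled if counts[sym] > 1)
    return unique, n_ambiguous


symbols = ["TP53", "", "TP53", "  "]
var_names = ["ENSG1", "ENSG2", "ENSG3", "ENSG4"]
unique, n_ambiguous = make_unique_symbols(symbols, var_names)
print(unique)  # ['TP53|ENSG1', 'ENSG2', 'TP53|ENSG3', 'ENSG4']
```

A caller could log a warning whenever `n_ambiguous` is non-zero, which covers the second suggestion above.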

2. Mixed composite mismatch 🐞 Bug ✓ Correctness

Description

Canonicalization uses clean_gene_names(), which only extracts symbols if >50% of values look
composite; marker-gene extraction always splits composites per value. In mixed datasets (<50%
composite), expressionData can keep composite names while markerGenes are simplified, creating
inconsistent gene naming within one payload.

Code

cytetype/preprocessing/validation.py[R90-110]

+def materialize_canonical_gene_symbols_column(
+    adata: anndata.AnnData, gene_symbols_column: str | None
+) -> tuple[str, str | None]:
+    """Create a temporary canonical gene-symbol column in ``adata.var``."""
+    if gene_symbols_column is None:
+        source_values = adata.var_names.astype(str).tolist()
+        source_name = "adata.var_names"
+    else:
+        source_values = [
+            str(value)
+            for value in adata.var[gene_symbols_column].astype("string").fillna("")
+        ]
+        source_name = f"column '{gene_symbols_column}'"
+
+    canonical_column = _temporary_gene_symbols_column_name(adata)
+    adata.var[canonical_column] = clean_gene_names(source_values)
+    logger.info(
+        f"Materialized canonical gene symbols in temporary column '{canonical_column}' "
+        f"from {source_name}."
+    )
+    return canonical_column, gene_symbols_column

Evidence
The canonical column is produced via clean_gene_names(), which is gated by a >50% composite
heuristic. However, when a gene_symbols_col is provided, extract_marker_genes always runs
_extract_symbol_from_composite for every value. Therefore, if composites exist but don’t exceed the
threshold, expression_percentages keys may remain composite strings while marker genes are
de-composited.
cytetype/preprocessing/validation.py[64-78]
cytetype/preprocessing/validation.py[40-49]
cytetype/preprocessing/validation.py[90-110]
cytetype/preprocessing/extraction.py[49-53]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`clean_gene_names()` only extracts symbols when >50% of values appear composite, but marker extraction always applies per-value composite splitting. For mixed inputs, this can leave composite strings in the canonical column (hence expressionData keys) while markerGenes are simplified, creating inconsistent gene naming inside the same request.

## Issue Context
- Canonical column is created via `clean_gene_names(source_values)`.
- Marker extraction applies `_extract_symbol_from_composite()` unconditionally when `gene_symbols_col` is provided.

## Fix Focus Areas
- cytetype/preprocessing/validation.py[40-49]
- cytetype/preprocessing/validation.py[64-78]
- cytetype/preprocessing/validation.py[90-110]
- cytetype/preprocessing/extraction.py[49-53]

## Suggested direction
- In `materialize_canonical_gene_symbols_column()`, consider generating canonical values with per-value extraction:
 - `cleaned = [_extract_symbol_from_composite(v) for v in source_values]` (safe for non-composite values because the helper returns the original string when not composite).
 - Optionally keep the current threshold only for logging (e.g., log how many values changed), but avoid gating the actual transformation.
- Add/adjust tests for a mixed composite case (e.g., 20% composite) asserting that expressionData keys and markerGenes names are consistent.
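
To make the mismatch concrete, the sketch below contrasts threshold-gated cleaning with per-value extraction on a 20% composite input. The composite shape used here (`<Ensembl ID>_<symbol>`) is an assumption for illustration only, and `extract_symbol`, `clean_gated`, and `clean_per_value` are hypothetical stand-ins rather than the library's `_extract_symbol_from_composite` or `clean_gene_names`.

```python
import re

# Assumed composite shape for illustration only: "<Ensembl ID>_<symbol>".
_COMPOSITE = re.compile(r"^ENS[A-Z]*\d+_(?P<symbol>.+)$")


def extract_symbol(value: str) -> str:
    """Return the symbol part of a composite value, else the value unchanged."""
    match = _COMPOSITE.match(value)
    return match.group("symbol") if match else value


def clean_gated(values, threshold=0.5):
    """Extract symbols only when more than `threshold` of values look composite."""
    n_composite = sum(1 for v in values if _COMPOSITE.match(v))
    if values and n_composite / len(values) > threshold:
        return [extract_symbol(v) for v in values]
    return list(values)


def clean_per_value(values):
    """Extract symbols from every value, regardless of how many are composite."""
    return [extract_symbol(v) for v in values]


mixed = ["ENSG00000141510_TP53", "CD3E", "CD4", "CD8A", "MS4A1"]  # 20% composite
print(clean_gated(mixed)[0])      # ENSG00000141510_TP53
print(clean_per_value(mixed)[0])  # TP53
```

Under these assumptions the gated path leaves the composite string in place while the per-value path simplifies it, which is the inconsistency the review describes.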

cytetype/preprocessing/validation.py

parashardhapola changed the title ~~Update version to 0.19.3 and enhance gene symbol handling in CyteType~~ Update version to 0.19.3 and enhance gene symbol handling Mar 8, 2026

qodo-code-review bot reviewed Mar 8, 2026

cytetype/preprocessing/validation.py (resolved)

parashardhapola merged commit e27bd80 into master Mar 8, 2026
1 check passed

parashardhapola deleted the gene_symbol_handling branch March 8, 2026 22:33

parashardhapola merged 1 commit into master from gene_symbol_handling
