Update version to 0.18.1 and enhance public API formatting by parashardhapola · Pull Request #68 · NygenAnalytics/CyteType

parashardhapola · 2026-03-03T20:09:25Z

Bump package version to 0.18.1.
Refactor __all__ in __init__.py for improved readability.
Update gene ID regex in validation.py to allow for versioned gene IDs.
Modify _id_like_percentage function to support random sampling with a seed for better statistical representation.


qodo-code-review · 2026-03-03T20:09:38Z

Review Summary by Qodo

Version 0.18.1: API formatting and gene ID validation improvements

✨ Enhancement

Walkthroughs

Description

• Bump package version to 0.18.1
• Refactor __all__ list formatting for improved readability
• Update gene ID regex to support versioned identifiers
• Enhance _id_like_percentage with random sampling and seed parameter

Diagram

flowchart LR
  A["Version Update<br/>0.18.0 → 0.18.1"] --> B["Public API<br/>Formatting"]
  A --> C["Gene ID<br/>Validation"]
  C --> D["Support Versioned<br/>Gene IDs"]
  C --> E["Random Sampling<br/>with Seed"]
  B --> F["Enhanced<br/>Readability"]

File Changes

1. cytetype/__init__.py ✨ Enhancement +7/-2

Version bump and public API list formatting
• Bumped version from 0.18.0 to 0.18.1
• Reformatted __all__ list with multi-line formatting for improved readability
cytetype/__init__.py

2. cytetype/preprocessing/validation.py ✨ Enhancement +10/-4

Gene ID validation and statistical sampling enhancements

• Added random module import for sampling functionality
• Updated gene ID regex pattern to support optional version suffixes (e.g., .1, .2)
• Enhanced _id_like_percentage function with configurable seed parameter for reproducible random sampling
• Increased sample size from 500 to 2000 for better statistical representation
• Implemented conditional random sampling when values exceed sample size threshold

cytetype/preprocessing/validation.py
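
To illustrate what "optional version suffixes" could mean in practice, here is a minimal sketch. The actual pattern in validation.py is not shown on this page, so the Ensembl-style regex and the name `looks_like_gene_id` are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern (not the one in validation.py): an Ensembl-style
# stable gene ID, optionally followed by a ".<digits>" version suffix.
ENSEMBL_LIKE = re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")


def looks_like_gene_id(value: str) -> bool:
    """Return True if the whole string is an (optionally versioned) ID."""
    return ENSEMBL_LIKE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["ENSG00000141510", "ENSG00000141510.12", "TP53"]:
        print(candidate, looks_like_gene_id(candidate))
```

Making the suffix an optional group means unversioned IDs keep matching, so a change of this shape should only widen what is treated as ID-like.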

qodo-code-review · 2026-03-03T20:09:39Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. Warning example mismatch 🐞 Bug ✧ Quality

Description

_id_like_percentage now computes pct from a random sample, but warnings still show example IDs from
the first 20 entries. This can produce confusing messages (e.g., pct>50 with empty/unrepresentative
examples).

Code

cytetype/preprocessing/validation.py[R80-89]

+def _id_like_percentage(values: list[str], seed: int = 42) -> float:
    if not values:
        return 100.0
-    n = min(500, len(values))
-    sample = values[:n]
+    n = min(2000, len(values))
+    if n < len(values):
+        rng = random.Random(seed)
+        sample = rng.sample(values, n)
+    else:
+        sample = values
    return sum(1 for v in sample if _is_gene_id_like(v)) / n * 100

Evidence

The percent is computed from rng.sample(values, n) (random subset), while the warning examples are
derived from values_list[:20] (a different, deterministic subset), so the warning can cite
examples that don’t correspond to the sampled population that produced pct.

cytetype/preprocessing/validation.py[80-89]
cytetype/preprocessing/validation.py[103-110]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

### Issue description
`pct` is computed from a sampled subset of values, but `example_ids` comes from the first 20 values. This can produce warnings where the % suggests many IDs but the examples list is empty or misleading.

### Issue Context
The mismatch was introduced when `_id_like_percentage` switched from `values[:n]` to random sampling. The warning content should stay consistent with how `pct` was computed.

### Fix Focus Areas
- cytetype/preprocessing/validation.py[80-113]

### Implementation notes
- Option A (best): Change `_id_like_percentage` to optionally return both `(pct, sample)` so callers can reuse the exact sample.
- Option B (localized): In `_validate_gene_symbols_column`, compute `example_ids` by scanning `values_list` (not just first 20) until 3 ID-like values are found.
- Option C: In `_validate_gene_symbols_column`, mirror `_id_like_percentage` sampling logic locally so `pct` and `example_ids` use the same `sample` list.


2. Higher validation CPU cost 🐞 Bug ➹ Performance

Description

_id_like_percentage now checks up to 2000 values instead of 500 and is called repeatedly during
gene-symbol auto-detection across many columns. This increases total regex work and could slow
validation on datasets with many candidate columns.

Code

cytetype/preprocessing/validation.py[R83-86]

+    n = min(2000, len(values))
+    if n < len(values):
+        rng = random.Random(seed)
+        sample = rng.sample(values, n)

Evidence

The per-call cap increased to 2000, and the function is invoked in multiple loops over adata.var
columns (known columns, var_names, and then all other columns), multiplying the added per-call cost
by the number of candidate columns.

cytetype/preprocessing/validation.py[80-89]
cytetype/preprocessing/validation.py[157-202]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

### Issue description
`_id_like_percentage` increased its sample cap (now up to 2000). Because `resolve_gene_symbols_column` calls it for many candidate columns, this raises total regex work and may slow validation for unusually wide `adata.var` tables.

### Issue Context
This is a trade-off: higher `n` reduces sampling variance but increases CPU time. The best fix depends on typical dataset sizes in your users’ workflows.
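
To put a rough number on the variance side of this trade-off, the sketch below uses the simple binomial approximation for the standard error of a sampled proportion. It ignores the finite-population correction that applies when sampling without replacement, so it slightly overstates the error; the function name is hypothetical.

```python
import math


def pct_standard_error(p: float, n: int) -> float:
    """Approximate standard error, in percentage points, of a proportion
    p estimated from n samples (binomial approximation)."""
    return math.sqrt(p * (1 - p) / n) * 100


if __name__ == "__main__":
    for n in (500, 2000):
        # Worst case for the error is p = 0.5.
        print(n, round(pct_standard_error(0.5, n), 2))
```

Under this approximation, quadrupling the sample from 500 to 2000 halves the standard error (roughly 2.2 to 1.1 percentage points at p = 0.5) at about four times the regex work per call.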

### Fix Focus Areas
- cytetype/preprocessing/validation.py[80-90]
- cytetype/preprocessing/validation.py[145-202]

### Implementation notes
- Add a module-level constant (e.g., `ID_LIKE_SAMPLE_CAP = 500|1000|2000`) or make it a parameter passed from `resolve_gene_symbols_column`.
- Precompile the regexes used by `_is_gene_id_like` (if profiling shows regex dominates).
- Optionally lower the cap when iterating many columns (e.g., `cap = min(2000, max(200, 2000 // n_candidate_columns))`).
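
The first and third notes could be combined along these lines. This is only a sketch: `MIN_SAMPLE_CAP` and `per_column_cap` are hypothetical names, and the floor of 200 is taken from the example formula above rather than from the codebase.

```python
# Module-level constants; the values are illustrative, not project settings.
ID_LIKE_SAMPLE_CAP = 2000
MIN_SAMPLE_CAP = 200


def per_column_cap(n_candidate_columns: int) -> int:
    """Shrink the per-column sample cap as the number of candidate columns
    grows, without ever dropping below MIN_SAMPLE_CAP."""
    n = max(1, n_candidate_columns)  # guard against division by zero
    return min(ID_LIKE_SAMPLE_CAP, max(MIN_SAMPLE_CAP, ID_LIKE_SAMPLE_CAP // n))


if __name__ == "__main__":
    for cols in (1, 4, 10, 50):
        print(cols, per_column_cap(cols))
```

With these values the total number of checked entries stays at about 2000 for up to 10 candidate columns, then grows by 200 per additional column once the floor applies.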


parashardhapola merged commit 78cc799 into master Mar 3, 2026
1 check passed

parashardhapola deleted the fixes branch March 3, 2026 20:13
