Update version to 0.18.1 and enhance public API formatting #68

Merged
parashardhapola merged 1 commit into master from fixes
Mar 3, 2026
@parashardhapola
Member

- Bump package version to 0.18.1.
- Refactor __all__ in __init__.py for improved readability.
- Update gene ID regex in validation.py to allow for versioned gene IDs.
- Modify _id_like_percentage function to support random sampling with a seed for better statistical representation.
@qodo-code-review

Review Summary by Qodo

Version 0.18.1: API formatting and gene ID validation improvements

✨ Enhancement

Walkthroughs

Description
• Bump package version to 0.18.1
• Refactor __all__ list formatting for improved readability
• Update gene ID regex to support versioned identifiers
• Enhance _id_like_percentage with random sampling and seed parameter
Diagram
flowchart LR
  A["Version Update<br/>0.18.0 → 0.18.1"] --> B["Public API<br/>Formatting"]
  A --> C["Gene ID<br/>Validation"]
  C --> D["Support Versioned<br/>Gene IDs"]
  C --> E["Random Sampling<br/>with Seed"]
  B --> F["Enhanced<br/>Readability"]

File Changes

1. cytetype/__init__.py ✨ Enhancement +7/-2

Version bump and public API list formatting

• Bumped version from 0.18.0 to 0.18.1
• Reformatted __all__ list with multi-line formatting for improved readability

cytetype/__init__.py


2. cytetype/preprocessing/validation.py ✨ Enhancement +10/-4

Gene ID validation and statistical sampling enhancements

• Added random module import for sampling functionality
• Updated gene ID regex pattern to support optional version suffixes (e.g., .1, .2)
• Enhanced _id_like_percentage function with configurable seed parameter for reproducible random sampling
• Increased sample size from 500 to 2000 for better statistical representation
• Implemented conditional random sampling when values exceed sample size threshold

cytetype/preprocessing/validation.py
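
To illustrate what "optional version suffixes" means in practice, here is a minimal sketch. The PR page does not show the actual regex in validation.py, so the Ensembl-style pattern `GENE_ID_RE` and the helper `is_gene_id_like` below are hypothetical stand-ins, not the project's code.

```python
import re

# Hypothetical pattern (not the project's actual regex): Ensembl-style gene IDs
# such as ENSG00000141510, with an optional ".<digits>" version suffix.
GENE_ID_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def is_gene_id_like(value: str) -> bool:
    """Return True if the value looks like a (possibly versioned) gene ID."""
    return GENE_ID_RE.match(value) is not None


print(is_gene_id_like("ENSG00000141510"))     # unversioned ID
print(is_gene_id_like("ENSG00000141510.17"))  # versioned ID
print(is_gene_id_like("TP53"))                # gene symbol, not an ID
```

Without the `(\.\d+)?` group, a versioned ID would fail an anchored match and be counted as a symbol.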


@qodo-code-review
qodo-code-review bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)


Remediation recommended

1. Warning example mismatch 🐞 Bug ✧ Quality
Description
_id_like_percentage now computes pct from a random sample, but warnings still show example IDs from
the first 20 entries. This can produce confusing messages (e.g., pct>50 with empty/unrepresentative
examples).
Code

cytetype/preprocessing/validation.py[R80-89]

+def _id_like_percentage(values: list[str], seed: int = 42) -> float:
    if not values:
        return 100.0
-    n = min(500, len(values))
-    sample = values[:n]
+    n = min(2000, len(values))
+    if n < len(values):
+        rng = random.Random(seed)
+        sample = rng.sample(values, n)
+    else:
+        sample = values
    return sum(1 for v in sample if _is_gene_id_like(v)) / n * 100
Evidence
The percent is computed from rng.sample(values, n) (random subset), while the warning examples are
derived from values_list[:20] (a different, deterministic subset), so the warning can cite
examples that don’t correspond to the sampled population that produced pct.

cytetype/preprocessing/validation.py[80-89]
cytetype/preprocessing/validation.py[103-110]
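
The mismatch can be reproduced with a small, self-contained sketch. The data is made up, `looks_like_id` is a hypothetical stand-in for `_is_gene_id_like`, and the `values[:20]` slice is assumed from the review text above rather than copied from the caller.

```python
import random


def looks_like_id(value: str) -> bool:
    """Hypothetical stand-in for the project's `_is_gene_id_like`."""
    return value.startswith("ENSG")


# Made-up column: 20 gene symbols first, then 4,980 ID-like values.
values = [f"GENE{i}" for i in range(20)] + [f"ENSG{i:011d}" for i in range(4980)]

# The percentage comes from a seeded random sample, as in the merged change.
sample = random.Random(42).sample(values, 2000)
pct = sum(1 for v in sample if looks_like_id(v)) / len(sample) * 100

# The examples come from the head of the list (assumed caller behaviour).
example_ids = [v for v in values[:20] if looks_like_id(v)][:3]

print(f"pct={pct:.1f}, example_ids={example_ids}")
```

Here `pct` is at least 99 (at most 20 of the 2,000 sampled values can be symbols), yet `example_ids` is empty because the first 20 entries contain no IDs.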

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`pct` is computed from a sampled subset of values, but `example_ids` comes from the first 20 values. This can produce warnings where the % suggests many IDs but the examples list is empty or misleading.

### Issue Context
The mismatch was introduced when `_id_like_percentage` switched from `values[:n]` to random sampling. The warning content should stay consistent with how `pct` was computed.

### Fix Focus Areas
- cytetype/preprocessing/validation.py[80-113]

### Implementation notes
- Option A (best): Change `_id_like_percentage` to optionally return both `(pct, sample)` so callers can reuse the exact sample.
- Option B (localized): In `_validate_gene_symbols_column`, compute `example_ids` by scanning `values_list` (not just first 20) until 3 ID-like values are found.
- Option C: In `_validate_gene_symbols_column`, mirror `_id_like_percentage` sampling logic locally so `pct` and `example_ids` use the same `sample` list.



Advisory comments

2. Higher validation CPU cost 🐞 Bug ➹ Performance
Description
_id_like_percentage now checks up to 2000 values instead of 500 and is called repeatedly during
gene-symbol auto-detection across many columns. This increases total regex work and could slow
validation on datasets with many candidate columns.
Code

cytetype/preprocessing/validation.py[R83-86]

+    n = min(2000, len(values))
+    if n < len(values):
+        rng = random.Random(seed)
+        sample = rng.sample(values, n)
Evidence
The per-call cap increased to 2000, and the function is invoked in multiple loops over adata.var
columns (known columns, var_names, and then all other columns), multiplying the added per-call cost
by the number of candidate columns.

cytetype/preprocessing/validation.py[80-89]
cytetype/preprocessing/validation.py[157-202]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`_id_like_percentage` increased its sample cap (now up to 2000). Because `resolve_gene_symbols_column` calls it for many candidate columns, this raises total regex work and may slow validation for unusually wide `adata.var` tables.

### Issue Context
This is a trade-off: higher `n` reduces sampling variance but increases CPU time. The best fix depends on typical dataset sizes in your users’ workflows.
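
To put rough numbers on the variance side of the trade-off, the sketch below computes the worst-case standard error of a sampled proportion (at p = 0.5) for the three caps under discussion. It assumes simple random sampling and ignores the finite-population correction, which would only shrink these values; `worst_case_se_pp` is a name introduced here for illustration.

```python
import math


def worst_case_se_pp(n: int) -> float:
    """Standard error, in percentage points, of a sampled proportion at p = 0.5."""
    return math.sqrt(0.25 / n) * 100


for n in (500, 1000, 2000):
    print(n, round(worst_case_se_pp(n), 2))
# 500 2.24
# 1000 1.58
# 2000 1.12
```

Quadrupling the sample from 500 to 2000 halves the standard error while quadrupling the regex checks per call.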

### Fix Focus Areas
- cytetype/preprocessing/validation.py[80-90]
- cytetype/preprocessing/validation.py[145-202]

### Implementation notes
- Add a module-level constant (e.g., `ID_LIKE_SAMPLE_CAP = 500|1000|2000`) or make it a parameter passed from `resolve_gene_symbols_column`.
- Precompile the regexes used by `_is_gene_id_like` (if profiling shows regex dominates).
- Optionally lower the cap when iterating many columns (e.g., `cap = min(2000, max(200, 2000 // n_candidate_columns))`).
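
The first and third notes can be combined in a short sketch. `ID_LIKE_SAMPLE_CAP` follows the name suggested above; `MIN_SAMPLE_CAP` and `sample_cap_for` are hypothetical names, and the guard against a zero column count is an added assumption.

```python
ID_LIKE_SAMPLE_CAP = 2000  # module-level cap, as suggested in the notes above
MIN_SAMPLE_CAP = 200       # hypothetical floor so each column still gets a usable sample


def sample_cap_for(n_candidate_columns: int) -> int:
    """Scale the per-column sample cap down as the number of candidate columns grows."""
    n_cols = max(1, n_candidate_columns)  # avoid division by zero
    return min(ID_LIKE_SAMPLE_CAP, max(MIN_SAMPLE_CAP, ID_LIKE_SAMPLE_CAP // n_cols))


for n_cols in (1, 4, 50):
    print(n_cols, sample_cap_for(n_cols))
# 1 2000
# 4 500
# 50 200
```

This bounds the total number of regex checks across columns to roughly the larger of 2,000 and 200 per column.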


@parashardhapola merged commit 78cc799 into master on Mar 3, 2026
1 check passed
@parashardhapola deleted the fixes branch on March 3, 2026 at 20:13