
Update version to 0.19.2 and enhance metadata handling in CyteType#71

Merged
parashardhapola merged 1 commit into master from oom_fixes
Mar 8, 2026
Conversation

@parashardhapola
Member

  • Bump package version to 0.19.2.
  • Introduce max_metadata_categories parameter to limit unique values in categorical obs columns during cluster metadata aggregation, improving memory efficiency.
  • Increase maximum upload size for obs_duckdb from 100MB to 2GB, accommodating larger datasets.
  • Refactor save_obs_duckdb function to ensure proper cleanup of temporary columns after processing.

@qodo-code-review

Review Summary by Qodo

Enhance metadata handling and increase upload limits for memory efficiency

✨ Enhancement 🐞 Bug fix


Walkthroughs

Description
• Bump package version to 0.19.2
• Introduce max_metadata_categories parameter to limit categorical columns in metadata
  aggregation, preventing memory exhaustion from high-cardinality columns
• Increase obs_duckdb upload limit from 100MB to 2GB for larger datasets
• Refactor save_obs_duckdb to properly cleanup temporary coordinate columns after processing
Diagram
flowchart LR
  A["Version Bump<br/>0.19.1 → 0.19.2"] --> B["Memory Optimization"]
  B --> C["max_metadata_categories<br/>parameter"]
  B --> D["Temporary column<br/>cleanup"]
  E["Upload Limit<br/>100MB → 2GB"] --> F["Support larger<br/>datasets"]
  C --> G["Skip high-cardinality<br/>columns"]
  D --> H["Prevent dataframe<br/>pollution"]

File Changes

1. cytetype/__init__.py ⚙️ Configuration changes +1/-1

Version bump to 0.19.2

• Updated package version from 0.19.1 to 0.19.2

cytetype/__init__.py


2. cytetype/api/client.py ✨ Enhancement +1/-1

Increase obs_duckdb upload limit to 2GB

• Increased obs_duckdb maximum upload size from 100MB to 2GB
• Allows handling of larger datasets without hitting upload constraints

cytetype/api/client.py


3. cytetype/core/artifacts.py 🐞 Bug fix +13/-6

Add cleanup for temporary coordinate columns

• Track temporary coordinate columns added to obs dataframe
• Implement try-finally block to ensure cleanup of temporary columns after DuckDB table creation
• Prevents temporary columns from persisting in the original dataframe

cytetype/core/artifacts.py


4. cytetype/main.py ✨ Enhancement +6/-0

Add max_metadata_categories parameter to CyteType

• Add max_metadata_categories parameter to __init__ method with default value of 500
• Pass parameter to aggregate_cluster_metadata function call
• Add comprehensive docstring documentation for the new parameter

cytetype/main.py


5. cytetype/preprocessing/aggregation.py ✨ Enhancement +15/-2

Implement max_categories filtering in metadata aggregation

• Import logger from config module
• Add max_categories parameter to aggregate_cluster_metadata function
• Implement cardinality check to skip columns exceeding the threshold
• Log debug message when columns are skipped due to high unique value count
• Update docstring to document the new parameter and its purpose

cytetype/preprocessing/aggregation.py



qodo-code-review bot commented Mar 8, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Positional args API break 🐞 Bug ✓ Correctness
Description
A new parameter was inserted in the middle of the CyteType.__init__ signature, which breaks any
downstream code passing api_url/auth_token positionally (arguments will silently shift). This is
especially risky in a patch release, because failures may be confusing (a wrong endpoint or auth
token rather than an immediate error).
Code

cytetype/main.py[R84-90]

        max_cells_per_group: int = 1000,
        vars_h5_path: str = "vars.h5",
        obs_duckdb_path: str = "obs.duckdb",
+        max_metadata_categories: int = 500,
        api_url: str = "https://prod.cytetype.nygen.io",
        auth_token: str | None = None,
    ) -> None:
Evidence
The constructor now places max_metadata_categories before api_url, so positional arguments
beyond obs_duckdb_path will bind to the wrong parameters.

cytetype/main.py[73-90]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`CyteType.__init__` added `max_metadata_categories` in the middle of the parameter list, which is a breaking change for any callers using positional args for `api_url`/`auth_token`.

## Issue Context
This is a library public API and the release is a patch bump; silent positional mis-binding can be very hard to debug.

## Fix Focus Areas
- cytetype/main.py[73-90]




Remediation recommended

2. In-place obs_df mutation 🐞 Bug ⛯ Reliability
Description
save_obs_duckdb now adds and later drops temporary __vis_coordinates_* columns directly on the
passed DataFrame; in the common path this is cleaned up, but it can still delete pre-existing
columns with the same names. Also, the coordinate-column assignments occur before the try/finally,
so failures during assignment can leave temporary columns behind with no cleanup.
Code

cytetype/core/artifacts.py[R494-513]

+    added_cols: list[str] = []
    if obsm_coordinates is not None and coordinates_key is not None:
-        obs_df = obs_df.copy()
-        obs_df[f"__vis_coordinates_{coordinates_key}_1"] = obsm_coordinates[:, 0]
-        obs_df[f"__vis_coordinates_{coordinates_key}_2"] = obsm_coordinates[:, 1]
+        col1 = f"__vis_coordinates_{coordinates_key}_1"
+        col2 = f"__vis_coordinates_{coordinates_key}_2"
+        obs_df[col1] = obsm_coordinates[:, 0]
+        obs_df[col2] = obsm_coordinates[:, 1]
+        added_cols = [col1, col2]

    dd_config: dict[str, Any] = {
        "threads": threads,
        "memory_limit": memory_limit,
        "temp_directory": temp_directory,
    }
-    with duckdb.connect(out_file, config=dd_config) as con:
-        con.register("obs_df", obs_df)
-        con.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM obs_df")
+    try:
+        with duckdb.connect(out_file, config=dd_config) as con:
+            con.register("obs_df", obs_df)
+            con.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM obs_df")
+    finally:
+        for col in added_cols:
+            obs_df.drop(columns=col, inplace=True, errors="ignore")
Evidence
The function mutates obs_df in-place, and CyteType passes self.adata.obs as the obs_df
argument, so this directly affects the AnnData object. The try/finally only wraps the DuckDB
write, not the earlier column assignments.

cytetype/core/artifacts.py[494-513]
cytetype/main.py[251-264]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`save_obs_duckdb` mutates `obs_df` in-place (which is typically `adata.obs`) by adding temporary visualization coordinate columns and then dropping them. This can (a) delete user pre-existing columns with the same reserved name, and (b) leave temporary columns behind if assignment fails because assignments happen before the `try/finally`.

## Issue Context
The PR aims to improve cleanup/memory behavior, but should not risk changing user-visible `adata.obs` schema.

## Fix Focus Areas
- cytetype/core/artifacts.py[478-513]
- cytetype/main.py[251-264]

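An alternative that sidesteps cleanup entirely is to never touch the caller's DataFrame: attach the coordinate columns to a shallow copy and hand that copy to DuckDB. A sketch under that assumption (hypothetical helper name; the real function signature may differ):

```python
import numpy as np
import pandas as pd


def attach_vis_coordinates(obs_df, obsm_coordinates=None, coordinates_key=None):
    # Hypothetical sketch: operate on a shallow copy so the caller's
    # DataFrame (typically adata.obs) is never mutated and no try/finally
    # cleanup is needed. Existing column data is shared, not duplicated;
    # only the frame object and the two new columns are created.
    df = obs_df.copy(deep=False)
    if obsm_coordinates is not None and coordinates_key is not None:
        df[f"__vis_coordinates_{coordinates_key}_1"] = obsm_coordinates[:, 0]
        df[f"__vis_coordinates_{coordinates_key}_2"] = obsm_coordinates[:, 1]
    return df  # register this copy with DuckDB instead of obs_df
```

This also removes the risk of dropping a pre-existing user column that happens to share the reserved `__vis_coordinates_*` name, since nothing is ever dropped.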


3. Category guard has gap 🐞 Bug ➹ Performance
Description
aggregate_cluster_metadata uses Series.nunique() to decide whether to skip a categorical column,
but for pandas categorical dtype this counts only observed values and may under-estimate the true
category space. This can reduce the effectiveness of the intended memory guard for categoricals with
many declared (unused) categories.
Code

cytetype/preprocessing/aggregation.py[R88-101]

        column_dtype = adata.obs[column_name].dtype
        if column_dtype in ["object", "category", "string"]:
-            # Calculate value counts for each group
+            n_unique = adata.obs[column_name].nunique()
+            if n_unique > max_categories:
+                logger.debug(
+                    "Skipping column '{}' ({} unique values > max_categories={}).",
+                    column_name,
+                    n_unique,
+                    max_categories,
+                )
+                continue
+
            value_counts_df = grouped_data[column_name].value_counts().unstack().T
Evidence
The function explicitly includes category dtype in the path, gates on nunique(), and then builds
a potentially wide intermediate via value_counts().unstack().T. If the dtype is categorical, using
category cardinality (or removing unused categories) is a more direct proxy for worst-case width.

cytetype/preprocessing/aggregation.py[81-105]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `max_categories` guard uses `.nunique()`, which can undercount for pandas categoricals with many declared but unused categories, weakening the memory-protection intent.

## Issue Context
This change is explicitly about preventing memory-expensive intermediate DataFrames; the guard should reflect the categorical cardinality behavior.

## Fix Focus Areas
- cytetype/preprocessing/aggregation.py[81-110]

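The gap can be illustrated with a small helper that measures worst-case width rather than observed uniques. This is a hedged sketch (hypothetical helper, not the library's code), assuming the wide intermediate can span the full declared category space:

```python
import pandas as pd


def effective_cardinality(s: pd.Series) -> int:
    # Hypothetical guard: for categorical dtype, count the declared
    # categories, since value_counts().unstack() can be as wide as the
    # full category space even when only a few values are observed.
    if isinstance(s.dtype, pd.CategoricalDtype):
        return len(s.cat.categories)
    # For object/string dtypes, observed uniques are the true width.
    return s.nunique()
```

For a categorical with many declared but unused categories, `nunique()` can report 1 while `effective_cardinality` reports the full declared count, which is what the memory guard should compare against `max_categories`.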



Advisory comments

4. vars_h5 limit comment wrong 🐞 Bug ✧ Quality
Description
The vars_h5 max upload bytes constant is 50GiB, but the inline comment says 10GB, which can
mislead maintainers/users reading the code. This is documentation-only but easy to fix.
Code

cytetype/api/client.py[R15-18]

MAX_UPLOAD_BYTES: dict[UploadFileKind, int] = {
-    "obs_duckdb": 100 * 1024 * 1024,  # 100MB
+    "obs_duckdb": 2 * 1024 * 1024 * 1024,  # 2GB
    "vars_h5": 50 * 1024 * 1024 * 1024,  # 10GB
}
Evidence
The numeric expression evaluates to 50 × 1024³ bytes (50 GiB), while the comment states 10GB.

cytetype/api/client.py[15-18]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Inline comment for `MAX_UPLOAD_BYTES['vars_h5']` does not match the actual configured size.

## Issue Context
Avoid confusion when reasoning about client-side upload limits.

## Fix Focus Areas
- cytetype/api/client.py[15-18]

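One way to keep such comments from drifting is to derive the limits from a named unit constant, so the human-readable size lives in the expression itself. A minimal sketch (not the library's actual code):

```python
# Hypothetical sketch: a named GiB constant makes each limit
# self-documenting, so no size comment can fall out of sync.
GIB = 1024 ** 3

MAX_UPLOAD_BYTES = {
    "obs_duckdb": 2 * GIB,   # 2 GiB
    "vars_h5": 50 * GIB,     # 50 GiB
}
```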



@parashardhapola parashardhapola merged commit 7a9b4a1 into master Mar 8, 2026
1 check passed
@parashardhapola parashardhapola deleted the oom_fixes branch March 8, 2026 10:32