
Update version to 0.19.2 and enhance metadata handling in CyteType#71

Merged
parashardhapola merged 1 commit into master from oom_fixes
Mar 8, 2026
Conversation

@parashardhapola
Member

  • Bump package version to 0.19.2.
  • Introduce max_metadata_categories parameter to limit unique values in categorical obs columns during cluster metadata aggregation, improving memory efficiency.
  • Increase maximum upload size for obs_duckdb from 100MB to 2GB, accommodating larger datasets.
  • Refactor save_obs_duckdb function to ensure proper cleanup of temporary columns after processing.

@qodo-code-review

Review Summary by Qodo

Enhance metadata handling and increase upload limits for memory efficiency

✨ Enhancement 🐞 Bug fix


Walkthroughs

Description
• Bump package version to 0.19.2
• Introduce max_metadata_categories parameter to limit categorical columns in metadata
  aggregation, preventing memory exhaustion from high-cardinality columns
• Increase obs_duckdb upload limit from 100MB to 2GB for larger datasets
• Refactor save_obs_duckdb to properly cleanup temporary coordinate columns after processing
Diagram
flowchart LR
  A["Version Bump<br/>0.19.1 → 0.19.2"] --> B["Memory Optimization"]
  B --> C["max_metadata_categories<br/>parameter"]
  B --> D["Temporary column<br/>cleanup"]
  E["Upload Limit<br/>100MB → 2GB"] --> F["Support larger<br/>datasets"]
  C --> G["Skip high-cardinality<br/>columns"]
  D --> H["Prevent dataframe<br/>pollution"]

File Changes

1. cytetype/__init__.py ⚙️ Configuration changes +1/-1

Version bump to 0.19.2

• Updated package version from 0.19.1 to 0.19.2

cytetype/__init__.py


2. cytetype/api/client.py ✨ Enhancement +1/-1

Increase obs_duckdb upload limit to 2GB

• Increased obs_duckdb maximum upload size from 100MB to 2GB
• Allows handling of larger datasets without hitting upload constraints

cytetype/api/client.py


3. cytetype/core/artifacts.py 🐞 Bug fix +13/-6

Add cleanup for temporary coordinate columns

• Track temporary coordinate columns added to obs dataframe
• Implement try-finally block to ensure cleanup of temporary columns after DuckDB table creation
• Prevents temporary columns from persisting in the original dataframe

cytetype/core/artifacts.py


4. cytetype/main.py ✨ Enhancement +6/-0

Add max_metadata_categories parameter to CyteType

• Add max_metadata_categories parameter to __init__ method with default value of 500
• Pass parameter to aggregate_cluster_metadata function call
• Add comprehensive docstring documentation for the new parameter

cytetype/main.py


5. cytetype/preprocessing/aggregation.py ✨ Enhancement +15/-2

Implement max_categories filtering in metadata aggregation

• Import logger from config module
• Add max_categories parameter to aggregate_cluster_metadata function
• Implement cardinality check to skip columns exceeding the threshold
• Log debug message when columns are skipped due to high unique value count
• Update docstring to document the new parameter and its purpose

cytetype/preprocessing/aggregation.py



qodo-code-review bot commented Mar 8, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Positional args API break 🐞 Bug ✓ Correctness
Description
A new parameter was inserted in the middle of the CyteType.__init__ signature, which breaks any
downstream code passing api_url/auth_token positionally (arguments will silently shift). This is
especially risky in a patch release, because failures may be confusing (a wrong endpoint or auth
token rather than an immediate error).
Code

cytetype/main.py[R84-90]

        max_cells_per_group: int = 1000,
        vars_h5_path: str = "vars.h5",
        obs_duckdb_path: str = "obs.duckdb",
+        max_metadata_categories: int = 500,
        api_url: str = "https://prod.cytetype.nygen.io",
        auth_token: str | None = None,
    ) -> None:
Evidence
The constructor now places max_metadata_categories before api_url, so positional arguments
beyond obs_duckdb_path will bind to the wrong parameters.

cytetype/main.py[73-90]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`CyteType.__init__` added `max_metadata_categories` in the middle of the parameter list, which is a breaking change for any callers using positional args for `api_url`/`auth_token`.

## Issue Context
This is a library public API and the release is a patch bump; silent positional mis-binding can be very hard to debug.

## Fix Focus Areas
- cytetype/main.py[73-90]




Remediation recommended

2. In-place obs_df mutation 🐞 Bug ⛯ Reliability
Description
save_obs_duckdb now adds and later drops temporary __vis_coordinates_* columns directly on the
passed DataFrame; in the common path this is cleaned up, but it can still delete pre-existing
columns with the same names. Also, the coordinate-column assignments occur before the try/finally,
so failures during assignment can leave temporary columns behind with no cleanup.
Code

cytetype/core/artifacts.py[R494-513]

+    added_cols: list[str] = []
    if obsm_coordinates is not None and coordinates_key is not None:
-        obs_df = obs_df.copy()
-        obs_df[f"__vis_coordinates_{coordinates_key}_1"] = obsm_coordinates[:, 0]
-        obs_df[f"__vis_coordinates_{coordinates_key}_2"] = obsm_coordinates[:, 1]
+        col1 = f"__vis_coordinates_{coordinates_key}_1"
+        col2 = f"__vis_coordinates_{coordinates_key}_2"
+        obs_df[col1] = obsm_coordinates[:, 0]
+        obs_df[col2] = obsm_coordinates[:, 1]
+        added_cols = [col1, col2]

    dd_config: dict[str, Any] = {
        "threads": threads,
        "memory_limit": memory_limit,
        "temp_directory": temp_directory,
    }
-    with duckdb.connect(out_file, config=dd_config) as con:
-        con.register("obs_df", obs_df)
-        con.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM obs_df")
+    try:
+        with duckdb.connect(out_file, config=dd_config) as con:
+            con.register("obs_df", obs_df)
+            con.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM obs_df")
+    finally:
+        for col in added_cols:
+            obs_df.drop(columns=col, inplace=True, errors="ignore")
Evidence
The function mutates obs_df in-place, and CyteType passes self.adata.obs as the obs_df
argument, so this directly affects the AnnData object. The try/finally only wraps the DuckDB
write, not the earlier column assignments.

cytetype/core/artifacts.py[494-513]
cytetype/main.py[251-264]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`save_obs_duckdb` mutates `obs_df` in-place (which is typically `adata.obs`) by adding temporary visualization coordinate columns and then dropping them. This can (a) delete user pre-existing columns with the same reserved name, and (b) leave temporary columns behind if assignment fails because assignments happen before the `try/finally`.

## Issue Context
The PR aims to improve cleanup/memory behavior, but should not risk changing user-visible `adata.obs` schema.

## Fix Focus Areas
- cytetype/core/artifacts.py[478-513]
- cytetype/main.py[251-264]

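An alternative that sidesteps cleanup entirely is to never touch the caller's DataFrame: attach the coordinate columns to a shallow copy and hand that copy to DuckDB. A sketch under that assumption (hypothetical helper name; the real function signature may differ):

```python
import numpy as np
import pandas as pd


def attach_vis_coordinates(obs_df, obsm_coordinates=None, coordinates_key=None):
    # Hypothetical sketch: operate on a shallow copy so the caller's
    # DataFrame (typically adata.obs) is never mutated and no try/finally
    # cleanup is needed. Existing column data is shared, not duplicated;
    # only the frame object and the two new columns are created.
    df = obs_df.copy(deep=False)
    if obsm_coordinates is not None and coordinates_key is not None:
        df[f"__vis_coordinates_{coordinates_key}_1"] = obsm_coordinates[:, 0]
        df[f"__vis_coordinates_{coordinates_key}_2"] = obsm_coordinates[:, 1]
    return df  # register this copy with DuckDB instead of obs_df
```

This also removes the risk of dropping a pre-existing user column that happens to share the reserved `__vis_coordinates_*` name, since nothing is ever dropped.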


3. Category guard has gap 🐞 Bug ➹ Performance
Description
aggregate_cluster_metadata uses Series.nunique() to decide whether to skip a categorical column,
but for pandas categorical dtype this counts only observed values and may under-estimate the true
category space. This can reduce the effectiveness of the intended memory guard for categoricals with
many declared (unused) categories.
Code

cytetype/preprocessing/aggregation.py[R88-101]

        column_dtype = adata.obs[column_name].dtype
        if column_dtype in ["object", "category", "string"]:
-            # Calculate value counts for each group
+            n_unique = adata.obs[column_name].nunique()
+            if n_unique > max_categories:
+                logger.debug(
+                    "Skipping column '{}' ({} unique values > max_categories={}).",
+                    column_name,
+                    n_unique,
+                    max_categories,
+                )
+                continue
+
            value_counts_df = grouped_data[column_name].value_counts().unstack().T
Evidence
The function explicitly includes category dtype in the path, gates on nunique(), and then builds
a potentially wide intermediate via value_counts().unstack().T. If the dtype is categorical, using
category cardinality (or removing unused categories) is a more direct proxy for worst-case width.

cytetype/preprocessing/aggregation.py[81-105]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `max_categories` guard uses `.nunique()`, which can undercount for pandas categoricals with many declared but unused categories, weakening the memory-protection intent.

## Issue Context
This change is explicitly about preventing memory-expensive intermediate DataFrames; the guard should reflect the categorical cardinality behavior.

## Fix Focus Areas
- cytetype/preprocessing/aggregation.py[81-110]

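The gap can be illustrated with a small helper that measures worst-case width rather than observed uniques. This is a hedged sketch (hypothetical helper, not the library's code), assuming the wide intermediate can span the full declared category space:

```python
import pandas as pd


def effective_cardinality(s: pd.Series) -> int:
    # Hypothetical guard: for categorical dtype, count the declared
    # categories, since value_counts().unstack() can be as wide as the
    # full category space even when only a few values are observed.
    if isinstance(s.dtype, pd.CategoricalDtype):
        return len(s.cat.categories)
    # For object/string dtypes, observed uniques are the true width.
    return s.nunique()
```

For a categorical with many declared but unused categories, `nunique()` can report 1 while `effective_cardinality` reports the full declared count, which is what the memory guard should compare against `max_categories`.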



Advisory comments

4. vars_h5 limit comment wrong 🐞 Bug ✧ Quality
Description
The vars_h5 max upload bytes constant is 50GiB, but the inline comment says 10GB, which can
mislead maintainers/users reading the code. This is documentation-only but easy to fix.
Code

cytetype/api/client.py[R15-18]

MAX_UPLOAD_BYTES: dict[UploadFileKind, int] = {
-    "obs_duckdb": 100 * 1024 * 1024,  # 100MB
+    "obs_duckdb": 2 * 1024 * 1024 * 1024,  # 2GB
    "vars_h5": 50 * 1024 * 1024 * 1024,  # 10GB
}
Evidence
The numeric expression evaluates to 50 × 1024³ bytes (50 GiB), while the comment states 10GB.

cytetype/api/client.py[15-18]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Inline comment for `MAX_UPLOAD_BYTES['vars_h5']` does not match the actual configured size.

## Issue Context
Avoid confusion when reasoning about client-side upload limits.

## Fix Focus Areas
- cytetype/api/client.py[15-18]

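One way to keep such comments from drifting is to derive the limits from a named unit constant, so the human-readable size lives in the expression itself. A minimal sketch (not the library's actual code):

```python
# Hypothetical sketch: a named GiB constant makes each limit
# self-documenting, so no size comment can fall out of sync.
GIB = 1024 ** 3

MAX_UPLOAD_BYTES = {
    "obs_duckdb": 2 * GIB,   # 2 GiB
    "vars_h5": 50 * GIB,     # 50 GiB
}
```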



@parashardhapola parashardhapola merged commit 7a9b4a1 into master Mar 8, 2026
1 check passed
@parashardhapola parashardhapola deleted the oom_fixes branch March 8, 2026 10:32