Raw counts support, backed-mode differential expression, and auto-detection of gene symbols#67

Merged
parashardhapola merged 19 commits into master from raw_counts on Mar 3, 2026

Conversation

@parashardhapola
Member

Summary

This release introduces raw counts embedding in artifacts, a memory-efficient rank_genes_groups_backed for on-disk AnnData, automatic gene symbol column detection, cell subsampling, and a category-grouped dotplot utility. Artifact build/upload is restructured for clarity, and progress reporting uses tqdm throughout.


✨ What's New

🧬 Raw Counts in vars.h5 Artifact

  • The save_features_matrix function now writes an optional raw group to the H5 artifact containing integer raw counts (LZ4-compressed CSR).
  • CyteType.__init__ auto-resolves raw counts from adata.layers['counts'], adata.raw.X, or adata.X (if integer-valued), and embeds them alongside normalized counts.
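
The resolution order above can be sketched in plain NumPy/SciPy. This is an illustrative stand-alone version, not the library code: the helper names `_resolve_raw_counts` and `_is_integer_valued` appear in the commits, but the bodies below are assumptions, and `SimpleNamespace` only mimics the AnnData attributes touched here.

```python
import numpy as np
import scipy.sparse as sp
from types import SimpleNamespace

def is_integer_valued(mat, sample_n_rows=200):
    """Heuristic: integer dtype, or all sampled leading-row values are whole numbers."""
    if hasattr(mat, "dtype") and np.issubdtype(mat.dtype, np.integer):
        return True
    chunk = mat[: min(sample_n_rows, mat.shape[0])]
    sample = chunk.data if sp.issparse(chunk) else np.asarray(chunk).ravel()
    if sample.size == 0:
        return True
    sample = sample.astype(np.float64, copy=False)
    return bool(np.all(np.isfinite(sample)) and np.all(sample == np.floor(sample)))

def resolve_raw_counts(adata):
    """Return the first plausible raw-counts source, or None."""
    if "counts" in getattr(adata, "layers", {}):
        return adata.layers["counts"]
    if getattr(adata, "raw", None) is not None:
        return adata.raw.X
    if is_integer_valued(adata.X):
        return adata.X
    return None

# Tiny stand-in for an AnnData object (the real code operates on anndata.AnnData).
adata = SimpleNamespace(
    X=np.log1p(np.array([[1.0, 2.0], [0.0, 3.0]])),            # normalized, non-integer
    layers={"counts": sp.csr_matrix(np.array([[1, 2], [0, 3]]))},
    raw=None,
)
raw = resolve_raw_counts(adata)  # picks layers['counts'] first
```

Note the sampling caveat the automated review raises below: checking only the leading rows is a heuristic, not a full validation.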

🚀 rank_genes_groups_backed — Memory-Efficient Differential Expression

  • New public function — a drop-in replacement for sc.tl.rank_genes_groups that works on backed/on-disk _CSRDataset matrices.
  • Streams cell chunks in a single pass, computes Welch's t-test (one-vs-rest) with BH or Bonferroni correction, and writes scanpy-compatible output to adata.uns.
  • Exported at cytetype.rank_genes_groups_backed.
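
The statistics behind that single pass can be illustrated without the package: from per-group running sums (exactly what one streamed pass over on-disk cell chunks can accumulate), Welch's t-test and BH correction follow directly. This is a sketch of the idea, not the library implementation:

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

def welch_one_vs_rest(sum_, sum_sq, n):
    """Per-gene Welch's t (each group vs the rest) from streamed statistics.

    sum_, sum_sq: (n_groups, n_genes) accumulated sums / sums of squares;
    n: (n_groups,) cell counts per group.
    """
    total_n, total_sum, total_sq = n.sum(), sum_.sum(axis=0), sum_sq.sum(axis=0)
    out = {}
    for g in range(len(n)):
        n1, n2 = n[g], total_n - n[g]
        m1 = sum_[g] / n1
        m2 = (total_sum - sum_[g]) / n2
        var1 = (sum_sq[g] - n1 * m1**2) / (n1 - 1)           # sample variances
        var2 = (total_sq - sum_sq[g] - n2 * m2**2) / (n2 - 1)
        out[g] = ttest_ind_from_stats(
            m1, np.sqrt(var1), n1, m2, np.sqrt(var2), n2, equal_var=False
        )
    return out

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty_like(adj)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# Two groups, two genes: accumulate the streaming statistics directly.
a = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]])
b = np.array([[10.0, 1.0], [11.0, 2.0], [12.0, 3.0], [13.0, 4.0]])
sums = np.vstack([a.sum(0), b.sum(0)])
sqs = np.vstack([(a**2).sum(0), (b**2).sum(0)])
res = welch_one_vs_rest(sums, sqs, np.array([3, 4]))
```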

✂️ subsample_by_group — Per-Group Cell Subsampling

  • New preprocessing utility that caps each cluster to a configurable maximum number of cells (max_cells_per_group), keeping smaller groups intact.
  • Works with both in-memory and backed AnnData objects.
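
A minimal index-level sketch of the capping logic, under the assumption that the real utility works on AnnData views and concatenates per-group subsets (`max_cells_per_group` is the documented parameter; the function body here is illustrative):

```python
import numpy as np

def subsample_by_group(labels, max_cells_per_group, rng=None):
    """Return sorted row indices keeping at most max_cells_per_group cells
    per label; groups at or under the cap are kept intact."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    keep = []
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        if idx.size > max_cells_per_group:
            idx = rng.choice(idx, size=max_cells_per_group, replace=False)
        keep.append(np.sort(idx))
    return np.concatenate(keep)

labels = ["T"] * 5 + ["B"] * 2
idx = subsample_by_group(labels, max_cells_per_group=3, rng=0)
# The "T" group (5 cells) is capped to 3; the "B" group (2 cells) passes through.
```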

🔍 Auto-Detection of Gene Symbols Column

  • gene_symbols_column now defaults to None and auto-detects by checking well-known column names (feature_name, gene_symbols, etc.), then adata.var_names, then a heuristic scan of all var columns.
  • Detects and skips composite gene values (e.g., TSPAN6_ENSG00000000003).
  • Candidates are scored by ID-like percentage, uniqueness ratio, and priority — the best non-ID column wins.
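
The scoring idea can be sketched with pandas; the weights, the Ensembl-ID regex, and the exact priority list below are assumptions for illustration (the commits name a `clean_gene_names` helper for composite values, approximated here):

```python
import re
import pandas as pd

ID_RE = r"^ENS[A-Z]*G\d{6,}"  # Ensembl-style gene IDs

def score_symbol_candidates(var, priority=("feature_name", "gene_symbols")):
    """Pick the var column that looks most like gene symbols: penalize
    ID-like values, reward uniqueness, and boost well-known column names."""
    best, best_score = None, -1.0
    for col in var.columns:
        vals = var[col].astype(str)
        id_like = vals.str.match(ID_RE).mean()        # fraction of ID-like values
        uniqueness = vals.nunique() / len(vals)
        score = (1.0 - id_like) + uniqueness + (1.0 if col in priority else 0.0)
        if score > best_score:
            best, best_score = col, score
    return best

def clean_gene_name(value):
    """Strip the Ensembl-ID part from composite values like 'TSPAN6_ENSG00000000003'."""
    kept = [p for p in value.split("_") if not re.match(ID_RE, p)]
    return "_".join(kept) or value

var = pd.DataFrame({
    "gene_ids": ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
    "feature_name": ["TSPAN6", "TNMD", "DPM1"],
})
best = score_symbol_candidates(var)  # the symbol-like column wins over the ID column
```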

🎨 marker_dotplot — Category-Grouped Dot Plot

  • New plotting module (cytetype.plotting) with marker_dotplot that reads stored CyteType results and creates a scanpy dotplot grouped by cluster categories with top supporting marker genes.
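
The inputs handed to scanpy's dotplot can be sketched without scanpy itself. The nested result fields below (`categoryName`, `clusterIds`, `keySupportingGenes`) follow the code quoted in the review further down; note this sketch de-duplicates category labels, which the review flags as a gap in the merged version:

```python
def build_dotplot_inputs(cluster_categories, raw_annotations, n_top_markers=3):
    """Assemble the {category: marker genes} mapping and the (de-duplicated)
    category order that would be passed on to sc.pl.dotplot."""
    markers, order = {}, []
    for cat in cluster_categories:
        genes = []
        for cluster_id in cat["clusterIds"]:
            ann = raw_annotations.get(cluster_id)
            if ann is None:
                continue  # cluster listed but not annotated
            cell_type = ann["latest"]["annotation"]["fullOutput"]["cellType"]
            if cell_type["label"] not in order:       # avoid duplicate levels
                order.append(cell_type["label"])
            genes.extend(cell_type["keySupportingGenes"][:n_top_markers])
        markers[cat["categoryName"]] = sorted(set(genes))
    return markers, order

cluster_categories = [
    {"categoryName": "Lymphoid", "clusterIds": ["0", "1"]},
    {"categoryName": "Myeloid", "clusterIds": ["2"]},
]
raw_annotations = {
    "0": {"latest": {"annotation": {"fullOutput": {"cellType": {
        "label": "T cell", "keySupportingGenes": ["CD3D", "CD3E", "TRAC"]}}}}},
    "1": {"latest": {"annotation": {"fullOutput": {"cellType": {
        "label": "T cell", "keySupportingGenes": ["CD3D", "CD2"]}}}}},
    "2": {"latest": {"annotation": {"fullOutput": {"cellType": {
        "label": "Monocyte", "keySupportingGenes": ["LYZ", "CD14"]}}}}},
}
markers, order = build_dotplot_inputs(cluster_categories, raw_annotations, n_top_markers=2)
```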

⚡ Improvements

🏗️ Artifact Pipeline Restructuring

  • Artifacts (vars.h5, obs.duckdb) are now built during __init__ and uploaded during run(), decoupling build from upload.
  • vars_h5_path and obs_duckdb_path moved from run() to __init__() parameters.
  • New cleanup() method replaces the removed cleanup_artifacts parameter on run().
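
The new lifecycle can be illustrated with a toy stand-in (class name and file contents are hypothetical; the real class is `cytetype.CyteType` and the artifacts are actual HDF5/DuckDB files):

```python
import tempfile
from pathlib import Path

class ArtifactLifecycle:
    """Toy stand-in for the restructured flow: artifacts built in __init__,
    uploaded in run(), removed only by an explicit cleanup()."""

    def __init__(self, vars_h5_path, obs_duckdb_path):
        self.vars_h5_path = Path(vars_h5_path)
        self.obs_duckdb_path = Path(obs_duckdb_path)
        self._build_artifacts()                      # build happens up front

    def _build_artifacts(self):
        # Stand-ins for save_features_matrix / save_obs_duckdb_file.
        self.vars_h5_path.write_bytes(b"h5-placeholder")
        self.obs_duckdb_path.write_bytes(b"duckdb-placeholder")

    def run(self):
        # The real run() uploads the pre-built artifacts; here we just
        # report which ones exist on disk.
        return [p.name for p in (self.vars_h5_path, self.obs_duckdb_path) if p.exists()]

    def cleanup(self):
        for p in (self.vars_h5_path, self.obs_duckdb_path):
            p.unlink(missing_ok=True)

tmp = Path(tempfile.mkdtemp())
job = ArtifactLifecycle(tmp / "vars.h5", tmp / "obs.duckdb")
uploaded = job.run()
job.cleanup()
```

The design point is that a failed or repeated `run()` no longer rebuilds artifacts, and nothing is deleted implicitly.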

💾 CSR-Backed Write Path for Normalized Counts

  • New two-pass column-group scatter algorithm (_write_csc_via_row_batches) converts CSR-backed data to CSC in the H5 file without loading the full matrix.
  • Configurable memory budget via WRITE_MEM_BUDGET (default 4 GB).
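
The grouping idea can be sketched in a few lines: pass 1 counts the nonzeros per column, then pass 2 converts one column group at a time, each group sized so its values fit the nnz budget (8 bytes per value assumed, echoing how WRITE_MEM_BUDGET appears to be used; the real function writes into an H5 file rather than assembling an in-memory result). The `max(1, ...)` floor also shows why a single very dense column can still exceed the budget, as the automated review notes below.

```python
import numpy as np
import scipy.sparse as sp

def csr_to_csc_in_col_groups(X_csr, mem_budget_bytes):
    """Convert CSR to CSC one column group at a time, each group sized
    so its nnz stays under an approximate memory budget."""
    n_cols = X_csr.shape[1]
    col_nnz = np.bincount(X_csr.indices, minlength=n_cols)   # pass 1
    max_nnz = max(1, mem_budget_bytes // 8)                  # 8 bytes per value
    blocks, c = [], 0
    while c < n_cols:
        cum = np.cumsum(col_nnz[c:])
        take = int(np.searchsorted(cum, max_nnz, side="right"))
        c_end = min(c + max(1, take), n_cols)                # at least one column
        blocks.append(sp.csc_matrix(X_csr[:, c:c_end]))      # pass 2: one group
        c = c_end
    return sp.hstack(blocks, format="csc")

rng = np.random.default_rng(0)
dense = (rng.random((20, 15)) < 0.2) * rng.integers(1, 5, (20, 15))
X = sp.csr_matrix(dense.astype(np.float32))
Y = csr_to_csc_in_col_groups(X, mem_budget_bytes=64)  # tiny budget -> many groups
```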

📊 Expression Percentage Calculation

  • Refactored to use single-pass row-batched accumulation (reuses _accumulate_group_stats) instead of gene-batched pandas groupby.
  • Default pcent_batch_size increased from 2000 to 5000.

☁️ Upload Enhancements

  • vars_h5 max upload size increased from 10 GB to 50 GB.
  • Upload progress now uses tqdm progress bars when available.
  • Default connect timeout increased from 30s to 60s.

📈 Progress Reporting

  • tqdm progress bars added throughout: rank_genes_groups, subsampling, raw counts writing, normalized counts writing, and chunk uploads.

🐛 Bug Fixes / Error Handling

  • 🆕 New ClientDisconnectedError exception for HTTP 499 / CLIENT_DISCONNECTED responses.
  • 🗑️ Removed stale hasattr(adata.var, gene_symbols_col) check (was always True for DataFrames).
  • 🛡️ Raw counts write failures are caught gracefully — the raw group is cleaned up and skipped with a warning.

⚠️ Breaking Changes

  • gene_symbols_column default changed from "gene_symbols" to None (auto-detect).
  • vars_h5_path and obs_duckdb_path moved from run() to CyteType.__init__().
  • cleanup_artifacts parameter removed from run(); use cleanup() method instead.
  • pcent_batch_size default changed from 2000 to 5000.
  • batch_size parameter in aggregate_expression_percentages renamed to cell_batch_size.

…ures_matrix

- Bump package version to 0.18.0.
- Introduce _resolve_raw_counts method in CyteType to improve raw counts extraction from AnnData.
- Add _is_integer_valued utility function to check if matrices contain integer values.
- Update save_features_matrix to handle raw counts and include them in the output HDF5 file.
- Enhance tests to cover new raw counts functionality and integer value checks.
…ing and uploading

- Introduced vars_h5_path and obs_duckdb_path parameters in CyteType for customizable artifact paths.
- Implemented caching of raw counts and improved error handling during artifact creation.
- Updated _upload_artifacts method to handle pre-built artifacts and log errors appropriately.
- Modified integration tests to accommodate new parameters and ensure proper artifact cleanup.
- Replaced the static method _cleanup_artifact_files with an instance method cleanup to manage artifact file deletion after run completion.
- Removed the cleanup_artifacts parameter from run method, simplifying the interface.
- Updated integration tests to verify that cleanup correctly deletes artifact files and clears associated paths.
- Introduced rank_genes_groups_backed in marker_detection.py for memory-efficient gene ranking on backed AnnData objects.
- Updated __init__.py files to include rank_genes_groups_backed in the public API of cytetype and preprocessing modules.
- Refactored code for improved readability in main.py, enhancing the formatting of artifact cleanup logic.
- Introduced resolve_gene_symbols_column function to auto-detect gene symbols in AnnData, improving flexibility in gene symbol management.
- Updated gene_symbols_column type to accept None, allowing for better handling of cases where gene symbols are not explicitly provided.
- Refactored aggregate_expression_percentages and extract_marker_genes functions to accommodate the new gene symbol resolution logic.
- Enhanced validation in _validate_gene_symbols_column to provide clearer warnings about potential gene ID misclassifications.
… aggregation logic

- Increased the default batch size for calculating expression percentages from 2000 to 5000 to optimize memory usage.
- Refactored the aggregate_expression_percentages function to utilize a single-pass row-batched accumulation method for improved performance.
- Introduced a new _accumulate_group_stats function to streamline the computation of per-group statistics, enhancing efficiency for large datasets.
- Updated related documentation to reflect changes in parameters and processing logic.
- Removed unnecessary logging statements for calculating expression percentages and extracting visualization coordinates to streamline output.
- Updated logging message for saving obs.duckdb artifact for clarity.
- Integrated progress reporting using tqdm for batch processing in save_features_matrix and extract_visualization_coordinates functions.
- Improved handling of warnings during batch processing to suppress FutureWarnings from tqdm.
- Adjusted progress descriptions for better user feedback during long-running operations.
- Introduced WRITE_MEM_BUDGET constant in config.py to define memory budget for writing artifacts.
- Updated logging messages in main.py for clarity during artifact saving processes.
- Enhanced progress reporting in artifact writing functions to improve user feedback.
- Refactored warning handling to suppress FutureWarnings from tqdm during batch processing.
- Added new functions in artifacts.py for improved handling of sparse matrix writing and progress tracking.
- Increased maximum upload size for vars_h5 from 10GB to 50GB to accommodate larger datasets.
- Introduced a new ClientDisconnectedError exception to handle client disconnection scenarios.
- Improved progress reporting during file uploads by integrating tqdm for better user feedback.
- Refactored upload logic to ensure consistent progress updates and error handling across different upload scenarios.
- Introduced a new `subsample_by_group` function in `subsampling.py` to limit the number of cells per group in an AnnData object.
- Updated `__init__.py` to include `subsample_by_group` in the public API of the preprocessing module.
- Enhanced error handling to check for the existence of the specified group key in the AnnData object.
- Added logging to report the results of the subsampling process.
…ng module

- Enhanced the `subsample_by_group` function to optimize performance and memory usage during subsampling.
- Improved logging to provide clearer insights into the subsampling process and results.
- Updated error handling to ensure robustness when dealing with edge cases in AnnData objects.
- Refactored related tests to validate the new subsampling logic and logging enhancements.
…nce in the preprocessing module

- Modified the `subsample_by_group` function to use `merge="first"` when concatenating subsampled subsets, ensuring that the first occurrence of each observation is retained.
- This change enhances the subsampling process by providing a more consistent output when merging groups.
@qodo-code-review

Review Summary by Qodo

v0.18.0: Raw counts support, backed-mode differential expression, and auto-detection of gene symbols

✨ Enhancement 🐞 Bug fix


Walkthroughs

Description
• Add raw counts support in vars.h5 artifact with LZ4 compression
• Implement memory-efficient rank_genes_groups_backed for backed AnnData
• Auto-detect gene symbols column with heuristic scoring algorithm
• Restructure artifact building to __init__ and uploading to run()
• Add tqdm progress bars throughout data processing pipeline
• Increase vars_h5 max upload size from 10GB to 50GB
• Implement subsample_by_group utility for per-cluster cell capping
• Add marker_dotplot plotting module for category-grouped visualization
Diagram
flowchart LR
  A["CyteType.__init__"] -->|resolve raw counts| B["_resolve_raw_counts"]
  A -->|build artifacts| C["save_features_matrix"]
  C -->|write normalized| D["_write_csc_via_row_batches<br/>or col_batches"]
  C -->|write raw| E["_write_raw_group"]
  A -->|build obs| F["save_obs_duckdb_file"]
  A -->|auto-detect symbols| G["resolve_gene_symbols_column"]
  H["CyteType.run"] -->|upload artifacts| I["_upload_artifacts"]
  H -->|cleanup| J["cleanup"]
  K["rank_genes_groups_backed"] -->|stream cells| L["_accumulate_group_stats"]
  L -->|compute t-test| M["ttest_ind_from_stats"]
  N["marker_dotplot"] -->|read results| O["load_local_results"]


File Changes

1. cytetype/__init__.py ✨ Enhancement +3/-2

Export rank_genes_groups_backed and bump version

cytetype/__init__.py


2. cytetype/api/client.py ✨ Enhancement +46/-20

Add tqdm progress bars and increase upload timeout

cytetype/api/client.py


3. cytetype/api/exceptions.py Error handling +7/-0

Add ClientDisconnectedError for HTTP 499 responses

cytetype/api/exceptions.py


4. cytetype/config.py ⚙️ Configuration changes +2/-0

Add WRITE_MEM_BUDGET constant for artifact writing

cytetype/config.py


5. cytetype/core/artifacts.py ✨ Enhancement +320/-44

Implement raw counts and CSR-backed matrix writing

cytetype/core/artifacts.py


6. cytetype/main.py ✨ Enhancement +207/-146

Restructure artifact building/uploading and add raw counts resolution

cytetype/main.py


7. cytetype/plotting/__init__.py ✨ Enhancement +5/-0

New plotting module with marker_dotplot export

cytetype/plotting/__init__.py


8. cytetype/plotting/dotplot.py ✨ Enhancement +105/-0

Implement marker_dotplot for category-grouped visualization

cytetype/plotting/dotplot.py


9. cytetype/preprocessing/__init__.py ✨ Enhancement +6/-1

Export rank_genes_groups_backed and subsample_by_group

cytetype/preprocessing/__init__.py


10. cytetype/preprocessing/aggregation.py ✨ Enhancement +34/-22

Refactor to single-pass row-batched expression percentage calculation

cytetype/preprocessing/aggregation.py


11. cytetype/preprocessing/extraction.py ✨ Enhancement +28/-14

Support None gene_symbols_col and add tqdm progress

cytetype/preprocessing/extraction.py


12. cytetype/preprocessing/marker_detection.py ✨ Enhancement +289/-0

Implement rank_genes_groups_backed and group statistics accumulation

cytetype/preprocessing/marker_detection.py


13. cytetype/preprocessing/subsampling.py ✨ Enhancement +79/-0

New subsample_by_group utility for per-cluster cell capping

cytetype/preprocessing/subsampling.py


14. cytetype/preprocessing/validation.py ✨ Enhancement +139/-61

Implement auto-detection of gene symbols column with heuristic scoring

cytetype/preprocessing/validation.py


15. tests/test_artifacts.py 🧪 Tests +161/-1

Add comprehensive tests for raw counts and backed matrix writing

tests/test_artifacts.py


16. tests/test_cytetype_integration.py 🧪 Tests +13/-6

Update integration tests for artifact cleanup refactoring

tests/test_cytetype_integration.py




qodo-code-review bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (5) 📘 Rule violations (0) 📎 Requirement gaps (0)



Remediation recommended

1. Budget not a hard cap 🐞 Bug ⛯ Reliability
Description
_write_csc_via_row_batches groups columns by WRITE_MEM_BUDGET, but forces at least one column per
group even if that single column exceeds the budget. This breaks the intended memory bound and can
lead to unexpectedly large allocations for certain datasets or when WRITE_MEM_BUDGET is tuned low.
Code

cytetype/core/artifacts.py[R271-289]

+    max_nnz_per_group = max(1, WRITE_MEM_BUDGET // 8)
+
+    c_start = 0
+    group_idx = 0
+    while c_start < n_cols:
+        cumulative = np.cumsum(col_counts[c_start:])
+        over_budget = np.searchsorted(cumulative, max_nnz_per_group, side="right")
+        c_end = c_start + max(1, int(over_budget))
+        c_end = min(c_end, n_cols)
+
+        group_nnz = int(indptr[c_end] - indptr[c_start])
+        if group_nnz == 0:
+            c_start = c_end
+            group_idx += 1
+            continue
+
+        grp_indices = np.empty(group_nnz, dtype=np.int32)
+        grp_data = np.empty(group_nnz, dtype=np.float32)
+
Evidence
The code computes a max nnz budget per group, but uses max(1, over_budget) which can select a group
where group_nnz > max_nnz_per_group; it then allocates grp_indices/grp_data sized by group_nnz
regardless of the budget.

cytetype/core/artifacts.py[271-292]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_write_csc_via_row_batches` is intended to respect `WRITE_MEM_BUDGET`, but it can still allocate temporary arrays larger than the budget when a single column exceeds the per-group nnz limit (because the algorithm forces at least one column per group).
### Issue Context
This affects the CSR-backed normalized-counts artifact write path and can lead to unexpectedly large allocations or crashes, especially if users reduce `WRITE_MEM_BUDGET` to run on smaller machines.
### Fix Focus Areas
- cytetype/core/artifacts.py[271-292]
- cytetype/core/artifacts.py[281-289]



2. Stats path densifies chunks 🐞 Bug ➹ Performance
Description
_accumulate_group_stats converts each row batch to a dense float64 array and performs dense matrix
multiplications. For sparse/backed scRNA matrices with many genes, the per-batch dense
materialization can be very large and undermines the PR’s “memory-efficient” claims unless users
manually tune cell_batch_size.
Code

cytetype/preprocessing/marker_detection.py[R59-77]

+    for start in chunk_iter:
+        end = min(start + cell_batch_size, n_cells)
+        chunk = X[start:end]
+        if hasattr(chunk, "toarray"):
+            chunk = chunk.toarray()
+        chunk = np.asarray(chunk, dtype=np.float64)
+        chunk_labels = cell_group_indices[start:end]
+
+        batch_len = end - start
+        indicator = np.zeros((n_groups, batch_len), dtype=np.float64)
+        indicator[chunk_labels, np.arange(batch_len)] = 1.0
+
+        n_ += indicator.sum(axis=1).astype(np.int64)
+        if sum_ is not None:
+            sum_ += indicator @ chunk
+        if sum_sq_ is not None:
+            sum_sq_ += indicator @ (chunk**2)
+        if nnz_ is not None:
+            nnz_ += (indicator @ (chunk != 0).astype(np.float64)).astype(np.int64)
Evidence
The accumulation routine explicitly densifies any chunk with .toarray() and casts to float64. Both
expression percentage calculation and rank_genes_groups_backed call this routine, so the
dense-per-batch behavior is on the hot path for the new features.

cytetype/preprocessing/marker_detection.py[59-77]
cytetype/preprocessing/aggregation.py[31-39]
cytetype/preprocessing/marker_detection.py[168-177]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_accumulate_group_stats` densifies `X[start:end]` (via `.toarray()`) and casts to float64, which can create very large temporary allocations per batch for wide scRNA matrices. This impacts both `aggregate_expression_percentages` and `rank_genes_groups_backed`.
### Issue Context
The PR positions `rank_genes_groups_backed` as memory-efficient for backed CSR data, but dense-per-batch materialization can still be large.
### Fix Focus Areas
- cytetype/preprocessing/marker_detection.py[59-77]
- cytetype/preprocessing/marker_detection.py[167-177]
- cytetype/preprocessing/aggregation.py[31-39]



3. Raw counts may truncate 🐞 Bug ✓ Correctness
Description
Raw counts selection is based on a heuristic that only samples the first rows, and raw values are
always cast to int32 when written. If the heuristic misclassifies a float matrix as integer-valued,
or if counts exceed int32 range, the raw artifact can be silently corrupted (truncation/overflow).
Code

cytetype/core/artifacts.py[R116-135]

+def _is_integer_valued(mat: Any, sample_n_rows: int = 200) -> bool:
+    if hasattr(mat, "dtype") and np.issubdtype(mat.dtype, np.integer):
+        return True
+
+    n_rows = mat.shape[0]
+    row_end = min(sample_n_rows, n_rows)
+    chunk = mat[:row_end]
+
+    if sp.issparse(chunk):
+        sample = chunk.data
+    elif hasattr(chunk, "toarray"):
+        sample = chunk.toarray().ravel()
+    else:
+        sample = np.asarray(chunk).ravel()
+
+    if sample.size == 0:
+        return True
+
+    sample = sample.astype(np.float64, copy=False)
+    return bool(np.all(np.isfinite(sample)) and np.all(sample == np.floor(sample)))
Evidence
_is_integer_valued bases its decision on only the first sample_n_rows rows, which is not a full
validation. When writing raw counts, values are cast to int32 without range checks. CyteType uses
_is_integer_valued to choose raw sources (layers['counts'], raw.X, or X).

cytetype/core/artifacts.py[116-135]
cytetype/core/artifacts.py[196-205]
cytetype/main.py[269-286]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Raw counts are detected via a limited sample and written as int32 without bounds checks. This can silently corrupt raw artifacts if misdetected or out-of-range.
### Issue Context
`CyteType.__init__` auto-resolves raw counts from multiple sources and then `save_features_matrix` writes them into the H5 artifact.
### Fix Focus Areas
- cytetype/core/artifacts.py[116-135]
- cytetype/core/artifacts.py[196-205]
- cytetype/main.py[269-286]



4. Dotplot ordering duplicates 🐞 Bug ✓ Correctness
Description
marker_dotplot appends one cell-type label per cluster into categories_order and passes it to
scanpy.dotplot without de-duplicating. If multiple clusters share the same label, categories_order
will contain duplicates and can cause plotting errors or unexpected ordering.
Code

cytetype/plotting/dotplot.py[R63-105]

+    markers: dict[str, list[str]] = {}
+    categories_order: list[str] = []
+
+    for category in cluster_categories:
+        category_name = category["categoryName"]
+        markers[category_name] = []
+
+        for cluster_id in category["clusterIds"]:
+            cluster_data = raw_annotations.get(cluster_id)
+            if cluster_data is None:
+                logger.warning(
+                    "Cluster '{}' listed in clusterCategories but missing from raw_annotations, skipping.",
+                    cluster_id,
+                )
+                continue
+
+            full_output = cluster_data["latest"]["annotation"]["fullOutput"]
+            cell_type = full_output["cellType"]
+
+            categories_order.append(cell_type["label"])
+            markers[category_name].extend(
+                cell_type["keySupportingGenes"][:n_top_markers]
+            )
+
+        markers[category_name] = sorted(set(markers[category_name]))
+
+    try:
+        import scanpy as sc
+    except ImportError:
+        raise ImportError(
+            "scanpy is required for plotting. Install it with: pip install scanpy"
+        ) from None
+
+    groupby = f"{results_prefix}_annotation_{group_key}"
+
+    return sc.pl.dotplot(
+        adata,
+        markers,
+        groupby=groupby,
+        gene_symbols=gene_symbols,
+        categories_order=categories_order,
+        **kwargs,
+    )
Evidence
categories_order is built by repeated append inside a loop over clusterIds. The groupby column is
created as a categorical annotation column in results storage, so categories_order should typically
be a unique list of category levels.

cytetype/plotting/dotplot.py[63-105]
cytetype/core/results.py[277-280]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`marker_dotplot` can generate `categories_order` with duplicates (one entry per cluster), which is not a robust ordering list for categorical plotting.
### Issue Context
The plot groups by `f"{results_prefix}_annotation_{group_key}"`, which is stored as a pandas categorical.
### Fix Focus Areas
- cytetype/plotting/dotplot.py[63-105]




Advisory comments

5. 60s connect timeout unused 🐞 Bug ⛯ Reliability
Description
_upload_file’s default timeout tuple was updated to start with a 60s connect timeout, but
CyteType._upload_artifacts still forces a (30s, read_timeout) tuple. As a result,
CyteType.run()-initiated uploads won’t benefit from the intended connect-timeout increase.
Code

cytetype/main.py[R299-313]

+        uploaded: dict[str, str] = {}
+        errors: list[tuple[str, Exception]] = list(self._artifact_build_errors)
+        timeout = (30.0, float(upload_timeout_seconds))

-    @staticmethod
-    def _cleanup_artifact_files(paths: list[str]) -> None:
-        for artifact_path in paths:
+        # --- vars.h5 upload ---
+        if self._vars_h5_path is not None:
           try:
-                Path(artifact_path).unlink(missing_ok=True)
-            except OSError as exc:
-                logger.warning(f"Failed to cleanup artifact {artifact_path}: {exc}")
+                logger.info("Uploading vars.h5 artifact...")
+                vars_upload = upload_vars_h5_file(
+                    self.api_url,
+                    self.auth_token,
+                    self._vars_h5_path,
+                    timeout=timeout,
+                    max_workers=upload_max_workers,
+                )
Evidence
The upload layer has a new default connect timeout, but the main integration path overrides it with
the old 30s value.

cytetype/api/client.py[37-44]
cytetype/main.py[299-313]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
CyteType overrides the new (60s, read_timeout) default connect timeout with (30s, read_timeout).
### Issue Context
PR intent mentions increasing the connect timeout; current code path in `CyteType.run()` doesn’t use it.
### Fix Focus Areas
- cytetype/main.py[299-313]




@parashardhapola parashardhapola requested a review from suu-yi March 3, 2026 14:44
@parashardhapola changed the title from "v0.18.0: Raw counts support, backed-mode differential expression, and auto-detection of gene symbols" to "Raw counts support, backed-mode differential expression, and auto-detection of gene symbols" on Mar 3, 2026
- Added `clean_gene_names` function to extract gene symbols from composite gene names, improving the handling of gene identifiers.
- Updated `extract_marker_genes` to utilize `clean_gene_names` for better gene name management.
- Integrated `clean_gene_names` into the `CyteType` class for consistent gene name processing across the module.
- Enhanced logging to provide insights when composite gene values are cleaned.
…detection

- Enhanced the `_accumulate_group_stats` function to handle both sparse and dense matrix inputs efficiently.
- Implemented conditional logic to process sparse matrices using CSR format, improving memory usage and performance.
- Maintained existing functionality for dense matrices, ensuring compatibility with previous implementations.
- Updated the timeout settings in both `main.py` and `client.py` from 30 seconds to 60 seconds to allow for longer upload durations, improving reliability for larger files.
…e row selection

- Updated the logic to select rows for sampling based on the number of rows in the input matrix.
- Implemented random sampling when the number of rows exceeds the specified sample size, ensuring a more representative subset.
- Maintained functionality for cases where the number of rows is less than or equal to the sample size.
…pling functions

- Added `marker_dotplot` and `subsample_by_group` to the `__all__` list, making them accessible for import.
- This change enhances the module's functionality by exposing additional features for users.
@parashardhapola parashardhapola merged commit 27ae521 into master Mar 3, 2026
1 check passed