Raw counts support, backed-mode differential expression, and auto-detection of gene symbols #67
parashardhapola merged 19 commits into master
…ures_matrix
- Bump package version to 0.18.0.
- Introduce _resolve_raw_counts method in CyteType to improve raw counts extraction from AnnData.
- Add _is_integer_valued utility function to check if matrices contain integer values.
- Update save_features_matrix to handle raw counts and include them in the output HDF5 file.
- Enhance tests to cover new raw counts functionality and integer value checks.

…ing and uploading
- Introduced vars_h5_path and obs_duckdb_path parameters in CyteType for customizable artifact paths.
- Implemented caching of raw counts and improved error handling during artifact creation.
- Updated _upload_artifacts method to handle pre-built artifacts and log errors appropriately.
- Modified integration tests to accommodate new parameters and ensure proper artifact cleanup.

- Replaced the static method _cleanup_artifact_files with an instance method cleanup to manage artifact file deletion after run completion.
- Removed the cleanup_artifacts parameter from run method, simplifying the interface.
- Updated integration tests to verify that cleanup correctly deletes artifact files and clears associated paths.

- Introduced rank_genes_groups_backed in marker_detection.py for memory-efficient gene ranking on backed AnnData objects.
- Updated __init__.py files to include rank_genes_groups_backed in the public API of cytetype and preprocessing modules.
- Refactored code for improved readability in main.py, enhancing the formatting of artifact cleanup logic.

- Introduced resolve_gene_symbols_column function to auto-detect gene symbols in AnnData, improving flexibility in gene symbol management.
- Updated gene_symbols_column type to accept None, allowing for better handling of cases where gene symbols are not explicitly provided.
- Refactored aggregate_expression_percentages and extract_marker_genes functions to accommodate the new gene symbol resolution logic.
- Enhanced validation in _validate_gene_symbols_column to provide clearer warnings about potential gene ID misclassifications.

… aggregation logic
- Increased the default batch size for calculating expression percentages from 2000 to 5000 to optimize memory usage.
- Refactored the aggregate_expression_percentages function to utilize a single-pass row-batched accumulation method for improved performance.
- Introduced a new _accumulate_group_stats function to streamline the computation of per-group statistics, enhancing efficiency for large datasets.
- Updated related documentation to reflect changes in parameters and processing logic.

- Removed unnecessary logging statements for calculating expression percentages and extracting visualization coordinates to streamline output.
- Updated logging message for saving obs.duckdb artifact for clarity.
- Integrated progress reporting using tqdm for batch processing in save_features_matrix and extract_visualization_coordinates functions.
- Improved handling of warnings during batch processing to suppress FutureWarnings from tqdm.
- Adjusted progress descriptions for better user feedback during long-running operations.

- Introduced WRITE_MEM_BUDGET constant in config.py to define memory budget for writing artifacts.
- Updated logging messages in main.py for clarity during artifact saving processes.
- Enhanced progress reporting in artifact writing functions to improve user feedback.
- Refactored warning handling to suppress FutureWarnings from tqdm during batch processing.
- Added new functions in artifacts.py for improved handling of sparse matrix writing and progress tracking.

- Increased maximum upload size for vars_h5 from 10GB to 50GB to accommodate larger datasets.
- Introduced a new ClientDisconnectedError exception to handle client disconnection scenarios.
- Improved progress reporting during file uploads by integrating tqdm for better user feedback.
- Refactored upload logic to ensure consistent progress updates and error handling across different upload scenarios.

- Introduced a new `subsample_by_group` function in `subsampling.py` to limit the number of cells per group in an AnnData object.
- Updated `__init__.py` to include `subsample_by_group` in the public API of the preprocessing module.
- Enhanced error handling to check for the existence of the specified group key in the AnnData object.
- Added logging to report the results of the subsampling process.

…ng module
- Enhanced the `subsample_by_group` function to optimize performance and memory usage during subsampling.
- Improved logging to provide clearer insights into the subsampling process and results.
- Updated error handling to ensure robustness when dealing with edge cases in AnnData objects.
- Refactored related tests to validate the new subsampling logic and logging enhancements.

…nce in the preprocessing module
- Modified the `subsample_by_group` function to use `merge="first"` when concatenating subsampled subsets, ensuring that the first occurrence of each observation is retained.
- This change enhances the subsampling process by providing a more consistent output when merging groups.
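The per-group capping that the `subsample_by_group` commits describe can be sketched on plain label lists. This is a minimal illustration of the index-selection step only, not the merged implementation: the helper name `subsample_indices_by_group`, its `seed` argument, and the use of the standard-library `random` module are assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_indices_by_group(labels, max_cells_per_group, seed=0):
    """Return sorted row indices, keeping at most `max_cells_per_group` per label.

    Hypothetical helper for illustration; not the cytetype implementation.
    """
    rng = random.Random(seed)

    # Collect the row indices that belong to each group label.
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)

    kept = []
    for indices in by_group.values():
        if len(indices) > max_cells_per_group:
            # Large group: draw a random subset without replacement.
            indices = rng.sample(indices, max_cells_per_group)
        # Groups at or below the cap are kept intact.
        kept.extend(indices)
    return sorted(kept)


labels = ["T"] * 5 + ["B"] * 2 + ["NK"] * 1
kept = subsample_indices_by_group(labels, max_cells_per_group=3)
```

With an AnnData object, indices like these could be applied as `adata[kept]`. According to the commit message above, the merged function instead concatenates per-group subsets with `merge="first"`.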
Review Summary by Qodo
v0.18.0: Raw counts support, backed-mode differential expression, and auto-detection of gene symbols
Walkthroughs
Description
• Add raw counts support in vars.h5 artifact with LZ4 compression
• Implement memory-efficient rank_genes_groups_backed for backed AnnData
• Auto-detect gene symbols column with heuristic scoring algorithm
• Restructure artifact building to __init__ and uploading to run()
• Add tqdm progress bars throughout data processing pipeline
• Increase vars_h5 max upload size from 10GB to 50GB
• Implement subsample_by_group utility for per-cluster cell capping
• Add marker_dotplot plotting module for category-grouped visualization
Diagram
flowchart LR
A["CyteType.__init__"] -->|resolve raw counts| B["_resolve_raw_counts"]
A -->|build artifacts| C["save_features_matrix"]
C -->|write normalized| D["_write_csc_via_row_batches<br/>or col_batches"]
C -->|write raw| E["_write_raw_group"]
A -->|build obs| F["save_obs_duckdb_file"]
A -->|auto-detect symbols| G["resolve_gene_symbols_column"]
H["CyteType.run"] -->|upload artifacts| I["_upload_artifacts"]
H -->|cleanup| J["cleanup"]
K["rank_genes_groups_backed"] -->|stream cells| L["_accumulate_group_stats"]
L -->|compute t-test| M["ttest_ind_from_stats"]
N["marker_dotplot"] -->|read results| O["load_local_results"]
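The `rank_genes_groups_backed` branch of the diagram (stream cells, accumulate per-group statistics, then run a t-test from those statistics) can be sketched as follows. This is a simplified single-gene illustration under stated assumptions: the helper names are hypothetical, the comparison is one group against all remaining cells, and the Welch t statistic is computed by hand from the same summary inputs (mean, variance, count) that SciPy's `ttest_ind_from_stats` accepts.

```python
import math


def accumulate(batches):
    """Accumulate per-group count, sum, and sum of squares over row batches.

    `batches` yields lists of (group_label, value) pairs for a single gene.
    Hypothetical stand-in for a streaming accumulator; not the cytetype code.
    """
    stats = {}
    for batch in batches:
        for label, value in batch:
            n, s, ss = stats.get(label, (0, 0.0, 0.0))
            stats[label] = (n + 1, s + value, ss + value * value)
    return stats


def mean_var(n, s, ss):
    """Mean and unbiased sample variance from count, sum, and sum of squares."""
    mean = s / n
    return mean, (ss - n * mean * mean) / (n - 1)


def welch_t(stats, group):
    """Welch t statistic for `group` versus all other cells pooled."""
    n1, s1, ss1 = stats[group]
    rest = [v for k, v in stats.items() if k != group]
    n2 = sum(v[0] for v in rest)
    s2 = sum(v[1] for v in rest)
    ss2 = sum(v[2] for v in rest)
    m1, v1 = mean_var(n1, s1, ss1)
    m2, v2 = mean_var(n2, s2, ss2)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)


# Two batches of cells; the full matrix is never held in memory at once.
batches = [
    [("A", 1.0), ("A", 2.0), ("B", 5.0)],
    [("A", 3.0), ("B", 7.0), ("B", 9.0)],
]
stats = accumulate(batches)
t_stat = welch_t(stats, "A")
```

Only three running totals per group need to be kept, which is what makes the approach suitable for backed data. A production version would likely need a numerically more stable variance update than the raw sum of squares used here.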
File Changes
1. cytetype/__init__.py
Code Review by Qodo
1. Budget not a hard cap
- Added `clean_gene_names` function to extract gene symbols from composite gene names, improving the handling of gene identifiers.
- Updated `extract_marker_genes` to utilize `clean_gene_names` for better gene name management.
- Integrated `clean_gene_names` into the `CyteType` class for consistent gene name processing across the module.
- Enhanced logging to provide insights when composite gene values are cleaned.

…detection
- Enhanced the `_accumulate_group_stats` function to handle both sparse and dense matrix inputs efficiently.
- Implemented conditional logic to process sparse matrices using CSR format, improving memory usage and performance.
- Maintained existing functionality for dense matrices, ensuring compatibility with previous implementations.

- Updated the timeout settings in both `main.py` and `client.py` from 30 seconds to 60 seconds to allow for longer upload durations, improving reliability for larger files.

…e row selection
- Updated the logic to select rows for sampling based on the number of rows in the input matrix.
- Implemented random sampling when the number of rows exceeds the specified sample size, ensuring a more representative subset.
- Maintained functionality for cases where the number of rows is less than or equal to the sample size.

…pling functions
- Added `marker_dotplot` and `subsample_by_group` to the `__all__` list, making them accessible for import.
- This change enhances the module's functionality by exposing additional features for users.
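The composite gene name cleaning mentioned in the `clean_gene_names` commit above might look roughly like the sketch below. The separators (`_` and `|`), the Ensembl ID pattern, and the function name `clean_gene_name` are assumptions for illustration; the actual rules in the PR are not shown on this page.

```python
import re

# Assumed pattern for Ensembl gene IDs, e.g. ENSG00000000003 or ENSMUSG00000000001.
_ENSEMBL_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def clean_gene_name(name):
    """Return the symbol part of a composite such as 'TSPAN6_ENSG00000000003'.

    Hypothetical helper; names without an embedded Ensembl ID are returned unchanged.
    """
    parts = re.split(r"[_|]", name)
    symbols = [p for p in parts if p and not _ENSEMBL_ID.match(p)]
    # Rewrite only when an ID was stripped and exactly one token remains,
    # so symbols that legitimately contain a separator are left alone.
    if len(symbols) == 1 and len(parts) > len(symbols):
        return symbols[0]
    return name


cleaned = [
    clean_gene_name(n)
    for n in ["TSPAN6_ENSG00000000003", "ENSG00000000003|TSPAN6", "MT_CO1", "CD3E"]
]
```

The conservative condition is a design choice for the sketch: a name is changed only when it clearly pairs one symbol with one identifier.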
Summary
This release introduces raw counts embedding in artifacts, a memory-efficient `rank_genes_groups_backed` for on-disk AnnData, automatic gene symbol column detection, cell subsampling, and a category-grouped dotplot utility. Artifact build/upload is restructured for clarity, and progress reporting uses tqdm throughout.

✨ What's New

🧬 Raw Counts in `vars.h5` Artifact
- The `save_features_matrix` function now writes an optional `raw` group to the H5 artifact containing integer raw counts (LZ4-compressed CSR).
- `CyteType.__init__` auto-resolves raw counts from `adata.layers['counts']`, `adata.raw.X`, or `adata.X` (if integer-valued), and embeds them alongside normalized counts.

🚀 `rank_genes_groups_backed`: Memory-Efficient Differential Expression
- A counterpart to `sc.tl.rank_genes_groups` that works on backed/on-disk `_CSRDataset` matrices.
- Writes its results to `adata.uns`.
- Available as `cytetype.rank_genes_groups_backed`.

✂️ `subsample_by_group`: Per-Group Cell Subsampling
- Limits the number of cells per group (`max_cells_per_group`), keeping smaller groups intact.

🔍 Auto-Detection of Gene Symbols Column
- `gene_symbols_column` now defaults to `None` and auto-detects by checking well-known column names (`feature_name`, `gene_symbols`, etc.), then `adata.var_names`, then a heuristic scan of all var columns.
- Composite gene names are cleaned to extract the gene symbol (e.g. `TSPAN6_ENSG00000000003`).

🎨 `marker_dotplot`: Category-Grouped Dot Plot
- New plotting module (`cytetype.plotting`) with `marker_dotplot` that reads stored CyteType results and creates a scanpy dotplot grouped by cluster categories with top supporting marker genes.

⚡ Improvements

🏗️ Artifact Pipeline Restructuring
- Artifacts (`vars.h5`, `obs.duckdb`) are now built during `__init__` and uploaded during `run()`, decoupling build from upload.
- `vars_h5_path` and `obs_duckdb_path` moved from `run()` to `__init__()` parameters.
- New `cleanup()` method replaces the removed `cleanup_artifacts` parameter on `run()`.

💾 CSR-Backed Write Path for Normalized Counts
- A row-batched writer (`_write_csc_via_row_batches`) converts CSR-backed data to CSC in the H5 file without loading the full matrix.
- The memory budget for writing is set by `WRITE_MEM_BUDGET` (default 4 GB).

📊 Expression Percentage Calculation
- Uses single-pass row-batched accumulation (`_accumulate_group_stats`) instead of gene-batched pandas groupby.
- Default `pcent_batch_size` increased from 2000 to 5000.

☁️ Upload Enhancements
- `vars_h5` max upload size increased from 10 GB to 50 GB.
- Uploads show `tqdm` progress bars when available.

📈 Progress Reporting

🐛 Bug Fixes / Error Handling
- New `ClientDisconnectedError` exception for HTTP 499 / `CLIENT_DISCONNECTED` responses.
- Fixed the `hasattr(adata.var, gene_symbols_col)` check (was always True for DataFrames).
- If writing raw counts fails, the `raw` group is cleaned up and skipped with a warning.
- `gene_symbols_column` default changed from `"gene_symbols"` to `None` (auto-detect).
- `vars_h5_path` and `obs_duckdb_path` moved from `run()` to `CyteType.__init__()`.
- `cleanup_artifacts` parameter removed from `run()`; use the `cleanup()` method instead.
- `pcent_batch_size` default changed from 2000 to 5000.
- `batch_size` parameter in `aggregate_expression_percentages` renamed to `cell_batch_size`.
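
The raw counts resolution order described in the summary (`adata.layers['counts']`, then `adata.raw.X`, then `adata.X` if integer-valued) can be illustrated with a small stand-in. This is a sketch under assumptions: `SimpleNamespace` objects and nested lists replace AnnData and its matrices, the function names are hypothetical, and the integer check is applied only to `X`, following the wording of the summary rather than the actual `_resolve_raw_counts` code.

```python
from types import SimpleNamespace


def is_integer_valued(rows):
    """True if every value in a nested list of numbers is a whole number."""
    return all(float(v).is_integer() for row in rows for v in row)


def resolve_raw_counts(adata):
    """Return (source_name, matrix), or None when no raw counts are found.

    Hypothetical sketch of the priority order; not the cytetype implementation.
    """
    if "counts" in adata.layers:
        return "layers['counts']", adata.layers["counts"]
    if adata.raw is not None:
        return "raw.X", adata.raw.X
    if is_integer_valued(adata.X):
        return "X", adata.X
    return None


# Stand-ins for AnnData objects (assumption: only these three attributes matter).
with_layer = SimpleNamespace(layers={"counts": [[3, 0]]}, raw=None, X=[[1.1, 0.0]])
integer_x = SimpleNamespace(layers={}, raw=None, X=[[0.0, 2.0], [1.0, 0.0]])
normalized_only = SimpleNamespace(layers={}, raw=None, X=[[0.5, 1.2]])

sources = [
    resolve_raw_counts(a) for a in (with_layer, integer_x, normalized_only)
]
```

When nothing qualifies, the sketch returns `None`, which corresponds to the `raw` group being optional in the artifact.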