Memory-efficient streaming analysis of large-scale CRISPR and Perturb-seq screens on disk-backed AnnData files
crispyx
Motivation
Genome-wide CRISPR screens routinely produce datasets with hundreds of thousands of cells and tens of thousands of genes. Standard single-cell analysis toolkits (Scanpy, Pertpy) load the entire count matrix into memory, which can require 30–100+ GB of RAM and makes many screens impractical to analyse on commodity hardware or shared HPC nodes with per-job memory limits.
crispyx solves this by streaming data directly from on-disk AnnData (.h5ad) files. Quality control, normalisation, pseudo-bulk aggregation, and differential expression all operate without materialising the full matrix in memory, so even the largest screens can be processed with modest resources.
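The streaming idea can be illustrated without crispyx itself. The block below is a minimal sketch, not crispyx's implementation: it assumes the count matrix arrives as blocks of rows (the hypothetical `iter_chunks` generator stands in for chunked reads from an .h5ad file) and keeps only per-gene running totals, so memory use scales with the chunk size rather than with the number of cells. All function and variable names are illustrative assumptions.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def iter_chunks(
    rows: Sequence[Sequence[float]], chunk_size: int
) -> Iterator[List[Sequence[float]]]:
    """Yield consecutive blocks of rows; stands in for chunked reads from disk."""
    for start in range(0, len(rows), chunk_size):
        yield list(rows[start:start + chunk_size])


def streaming_gene_stats(
    chunks: Iterable[List[Sequence[float]]], n_genes: int
) -> Tuple[List[float], List[float], List[int]]:
    """Accumulate per-gene mean, variance, and detection counts chunk by chunk."""
    n_cells = 0
    total = [0.0] * n_genes      # running sum of counts per gene
    total_sq = [0.0] * n_genes   # running sum of squared counts per gene
    detected = [0] * n_genes     # number of cells with a non-zero count
    for chunk in chunks:
        for row in chunk:
            n_cells += 1
            for g, value in enumerate(row):
                total[g] += value
                total_sq[g] += value * value
                if value > 0:
                    detected[g] += 1
    if n_cells == 0:
        raise ValueError("no cells were streamed")
    mean = [s / n_cells for s in total]
    # Population variance from the two running sums: E[x^2] - (E[x])^2.
    var = [sq / n_cells - m * m for sq, m in zip(total_sq, mean)]
    return mean, var, detected


# Toy matrix: 4 cells x 3 genes, processed two cells at a time.
counts = [[0, 2, 1], [4, 0, 1], [0, 2, 1], [4, 0, 1]]
mean, var, detected = streaming_gene_stats(iter_chunks(counts, chunk_size=2), n_genes=3)
print(mean, var, detected)
```

The sum-of-squares formula keeps the sketch short. For real count data, a numerically stabler update such as Welford's algorithm would be a safer choice.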
Features
- Streaming QC & preprocessing – Filter cells, perturbations, and genes; normalise and log-transform; all without loading the full matrix into memory
- Pseudo-bulk aggregation – Average log expression and pseudo-bulk count matrices for effect size estimation
- Differential expression – t-test, Wilcoxon rank-sum, and negative binomial GLM with apeGLM LFC shrinkage; multi-core support and adaptive memory management; per-condition low-expression filtering to exclude genes that are near-zero in both groups
- Dimension reduction – Memory-efficient PCA and KNN graph construction on backed data
- Scanpy-compatible API & plotting – Familiar cx.pp, cx.pb, cx.tl, and cx.pl namespaces; Scanpy-style rank genes plots, volcano, MA, PCA, UMAP, QC summaries, and overlap heatmaps
- Data preparation utilities – Edit backed metadata without loading X; standardise gene names; normalise perturbation labels; auto-detect metadata columns
- HPC-ready – Resume/checkpoint for long-running jobs; configurable memory_limit_gb; Docker and Singularity support
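As a rough illustration of the pseudo-bulk feature, the sketch below accumulates summed counts and average log expression per perturbation label while reading one cell at a time. It is an assumption-laden toy, not the crispyx code path: the `pseudobulk` function, the `(label, counts)` input format, and the use of `log1p` without library-size normalisation are all simplifications chosen for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def pseudobulk(
    cells: Iterable[Tuple[str, Sequence[float]]], n_genes: int
) -> Tuple[Dict[str, List[float]], Dict[str, List[float]]]:
    """Return (summed counts, mean log1p expression) per perturbation label."""
    count_sum: Dict[str, List[float]] = defaultdict(lambda: [0.0] * n_genes)
    log_sum: Dict[str, List[float]] = defaultdict(lambda: [0.0] * n_genes)
    n_cells: Dict[str, int] = defaultdict(int)
    for label, row in cells:  # one cell at a time; only group totals are retained
        n_cells[label] += 1
        for g, value in enumerate(row):
            count_sum[label][g] += value
            log_sum[label][g] += math.log1p(value)
    mean_log = {
        label: [s / n_cells[label] for s in sums] for label, sums in log_sum.items()
    }
    return dict(count_sum), mean_log


# Toy input: (perturbation label, counts for 2 genes) per cell.
cells = [
    ("control", [0, 3]),
    ("geneA", [1, 0]),
    ("control", [0, 1]),
    ("geneA", [3, 0]),
]
counts_by_group, mean_log_by_group = pseudobulk(cells, n_genes=2)

# A simple effect size: difference in average log expression versus control.
effect = [
    a - c for a, c in zip(mean_log_by_group["geneA"], mean_log_by_group["control"])
]
print(counts_by_group, effect)
```

Because only one accumulator per group and gene is stored, memory depends on the number of perturbations and genes, not on the number of cells.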
Quick Start
import crispyx as cx
# Open dataset without loading into memory
adata = cx.read_h5ad_ondisk("data/demo_benchmark.h5ad")
# Quality control with adaptive thresholds
adata = cx.pp.qc_summary(
    adata,
    perturbation_column="perturbation",
    min_genes=5,
    min_cells_per_perturbation=5,
)
# Differential expression
adata = cx.tl.rank_genes_groups(
    adata,
    perturbation_column="perturbation",
    method="wilcoxon",  # or "t-test", "nb_glm"
)
# Access results
print(adata.uns["rank_genes_groups"])
de_results = adata.uns["rank_genes_groups"].load()
For the full workflow (normalisation, PCA, pseudo-bulk, NB-GLM, LFC shrinkage, plotting, data preparation utilities), see the Usage Guide and the tutorial notebook.
Performance
Benchmarked across 12 CRISPR screen datasets (21k–1.97M cells), crispyx consistently outperforms Scanpy, Pertpy/PyDESeq2, and edgeR in both speed and memory:
| Metric | crispyx vs Scanpy | crispyx vs Pertpy/PyDESeq2 |
|---|---|---|
| t-test | 2–11× faster | — |
| Wilcoxon | 2–43× faster | — |
| NB-GLM | — | 2× faster, completes where Pertpy OOMs |
| Peak memory | 2–6× lower | Runs within 64 GB where Pertpy exceeds 120 GB |
| Accuracy | Pearson r > 0.999 vs Scanpy | Pearson r > 0.97 vs PyDESeq2 |
crispyx succeeds on all 12 datasets, while Scanpy times out or OOMs on the largest screens and Pertpy/edgeR fail on most genome-wide datasets.
See benchmarking/ for full results and reproduction scripts.
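The accuracy row reports Pearson correlation against a reference tool. The sketch below shows how such a concordance value can be computed for two sets of per-gene effect estimates. It is not taken from the benchmarking scripts: `pearson_r` is a hypothetical helper, and the input values are made up purely for illustration and are not benchmark results.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(ss_x * ss_y)


# Made-up per-gene log fold changes from a reference tool and a second tool.
reference_lfc = [-1.2, -0.3, 0.0, 0.8, 2.1]
candidate_lfc = [-1.1, -0.35, 0.05, 0.75, 2.0]
r = pearson_r(reference_lfc, candidate_lfc)
print(f"Pearson r = {r:.4f}")
```

On Python 3.10 or later, `statistics.correlation` from the standard library computes the same quantity.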
Installation
pip install crispyx
For development (editable install with all extras):
git clone https://github.com/jaydu1/crispyx.git
cd crispyx
pip install -e ".[test,benchmark,docs]"
Benchmarking
cd benchmarking
./run_benchmark.sh config/Adamson.yaml # single dataset
./run_benchmark.sh config/*.yaml # all datasets
See benchmarking/README.md for configuration options and output structure.
Testing
pytest
Documentation
sphinx-build docs docs/_build
Acknowledgements
crispyx builds on the foundational work of Scanpy (Wolf et al., 2018), Pertpy, PyDESeq2 (Muzellec et al., 2023), and AnnData (Virshup et al., 2024). We gratefully acknowledge these projects for establishing the single-cell analysis ecosystem in Python; crispyx extends their APIs and algorithmic designs to enable memory-efficient, streaming computation for large-scale CRISPR screen datasets.
Contributing
Suggestions, bug reports, and contributions are welcome! Please open an issue or submit a pull request.