Pure-Python port of the R package scCDC — entropy-based, gene-specific ambient-RNA contamination detection and correction for scRNA-seq / snRNA-seq.
pysccdc
A pure-Python re-implementation of scCDC (Wang et al., Genome Biology 2024) for entropy-based, gene-specific ambient-RNA contamination detection and correction in single-cell / single-nucleus RNA-seq data.
- AnnData-native: drop-in for the scanpy / omicverse ecosystem
- No rpy2, no R install: the Shannon-entropy core, the bootstrapped smoothing-spline curve fit, the normal-tail FDR, the AUROC and the Youden-index thresholding are all implemented directly in NumPy/SciPy
- Same function surface as the R workflow (ContaminationDetection → ContaminationQuantification → ContaminationCorrection)
- Bit-for-bit reproducibility against the R reference for the deterministic core: per-gene/per-cluster entropy and the corrected count matrix match scCDC exactly (see tests/test_r_parity.py)
Unlike DecontX, SoupX, CellBender or scAR — which correct every gene — scCDC detects the small set of Global Contamination-causing Genes (GCGs) and corrects only those, avoiding the over-correction of lowly / non-contaminating genes (many of which are real cell-type markers). It needs no empty-droplet data.
This is a standalone mirror of the canonical implementation that lives in omicverse. All algorithmic work is developed upstream in omicverse and synced here for users who want scCDC without the full omicverse stack.
Install
pip install pysccdc
or, from a checkout:
pip install -e .
Dependencies: numpy, scipy, pandas, anndata, scikit-learn. No R, no rpy2.
How it works
- Observed entropy: for every gene in every cell cluster, compute the Shannon entropy (base 2) of its count distribution across droplets. A gene smeared across many droplets at a near-constant low ambient level has a concentrated count distribution → low entropy.
- Expected entropy curve: fit the expected entropy as a smooth function of log1p(mean expression) with a bootstrapped, outlier-trimmed smoothing spline learnt from presumed-clean genes.
- Entropy divergence = expected − observed entropy. A gene with significant positive divergence (normal-tail p, FDR ≤ 0.05) in more than restriction_factor of clusters, and expressed in enough cells in every cluster, is flagged a GCG.
- Gene-specific correction: for each GCG, rank clusters by log-normalized expression, compute per-cluster AUROC vs the lowest-expressing cluster, split into eGCG-positive / -negative, then take the Youden-index count threshold on the pooled count distributions and subtract round(threshold) (floored at zero). Non-GCG genes are left untouched, which is scCDC's anti-over-correction design.
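As a rough illustration of the first step, the sketch below computes the base-2 Shannon entropy of a gene's count distribution with numpy.bincount. The helper name count_entropy and the toy count vectors are hypothetical and not part of pysccdc; the package's own vector_entropy / matrix_entropy may differ in details.

```python
import numpy as np


def count_entropy(counts):
    """Base-2 Shannon entropy of the distribution of count values.

    Hypothetical helper for illustration only (not the pysccdc API).
    """
    counts = np.asarray(counts, dtype=np.int64)
    # Number of droplets showing each count value 0, 1, 2, ...
    freq = np.bincount(counts)
    p = freq[freq > 0] / counts.size
    # sum p * log2(1/p) is the same quantity as -sum p * log2(p)
    return float(np.sum(p * np.log2(1.0 / p)))


# Toy cluster of 8 droplets (made-up numbers).
ambient_like = [1, 1, 1, 1, 1, 1, 1, 1]  # same low count in every droplet
variable = [0, 1, 2, 3, 0, 1, 2, 3]      # counts spread over four values

h_ambient = count_entropy(ambient_like)  # 0.0
h_variable = count_entropy(variable)     # 2.0
```

The uniformly low gene has the lower entropy. In the workflow described above, that value would then be compared with the expected entropy at the same expression level rather than judged on its own.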
Quick-start
import pysccdc as cd
# bundled synthetic dataset: 4 clusters x 200 cells, 120 genes,
# 4 deliberately-spiked contaminating genes
adata = cd.datasets.simulate_contaminated(random_state=0)
# 1) detect GCGs
detection = cd.ContaminationDetection(adata, cluster_key="cluster")
detection # degree-of-contamination table (GCGs)
detection.attrs["GCGs"] # the GCG list
# 2) quantify dataset-level contamination
ratio = cd.ContaminationQuantification(adata, detection,
                                       cluster_key="cluster")
# 3) correct only the GCGs
corrected = cd.ContaminationCorrection(adata, detection,
                                       cluster_key="cluster")
corrected.layers["Corrected"] # decontaminated count matrix
corrected.uns["sccdc"]["thresholds"] # per-GCG subtraction thresholds
scCDC works on a filtered, clustered count matrix; any AnnData with raw integer counts in .X (or a named layer) and a categorical cluster label in .obs works. See examples/tutorial_standalone.py for an end-to-end run on the bundled clustered PBMC 3k dataset (data/pbmc3k_clustered.h5ad).
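Because the input must hold raw integer counts, a quick pre-flight check can catch a normalized matrix before it reaches the pipeline. The helper below is a hypothetical sketch (looks_like_raw_counts is not a pysccdc function) and works on a dense NumPy array; for a SciPy sparse matrix, one option is to pass its .data array.

```python
import numpy as np


def looks_like_raw_counts(x):
    """Return True if every entry is a non-negative whole number.

    Hypothetical pre-flight check, not part of pysccdc.
    """
    x = np.asarray(x)
    return bool(np.all(x >= 0) and np.all(x == np.round(x)))


raw = np.array([[0, 3, 1], [2, 0, 5]])  # toy cells x genes counts
normalized = np.log1p(raw)              # log-transformed values, no longer integers

ok_raw = looks_like_raw_counts(raw)
ok_norm = looks_like_raw_counts(normalized)
```

Here ok_raw is True and ok_norm is False, so the check would flag a matrix that has already been log-transformed.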
Low-level functional API (mirrors R one-to-one)
from pysccdc import (
ContaminationDetection, ContaminationQuantification, ContaminationCorrection,
generate_curve, vector_entropy, matrix_entropy,
SmoothSpline, smooth_spline, simple_roc, youden_threshold,
)
# Shannon entropy of each gene's count distribution (genes x cells input)
matrix_entropy(counts_genes_by_cells)  # one entropy per gene
# Fit one cluster's entropy-vs-expression curve directly
generate_curve(df_with_Gene_meanexpr_entropy, spar=1.0)
# AUROC and the Youden-index cut point
simple_roc(expr, cls)
youden_threshold(neg_counts, pos_counts)
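To make the last two calls more concrete, here is a minimal NumPy sketch of a rank-based AUROC and a Youden-index cut point on toy counts. It is not the pysccdc implementation: the function names auroc and youden_cut, the toy data, and the convention that a count is called positive when it is strictly greater than the threshold are all assumptions made for illustration.

```python
import numpy as np


def auroc(scores, is_positive):
    """Rank-based AUROC (Mann-Whitney U); tied scores share an average rank."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    _, inverse, tie_sizes = np.unique(
        scores, return_inverse=True, return_counts=True
    )
    # 1-based average rank of each group of tied values
    avg_rank = np.cumsum(tie_sizes) - (tie_sizes - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = int(is_positive.sum())
    n_neg = is_positive.size - n_pos
    u = ranks[is_positive].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def youden_cut(neg_counts, pos_counts):
    """Count threshold t maximizing J(t) = P(pos > t) - P(neg > t).

    Assumed convention: a count is 'positive' when it exceeds t.
    """
    neg = np.asarray(neg_counts)
    pos = np.asarray(pos_counts)
    candidates = np.unique(np.concatenate([neg, pos]))
    j = np.array([(pos > t).mean() - (neg > t).mean() for t in candidates])
    return candidates[int(np.argmax(j))]  # first maximum if several tie


# Toy counts of one gene (made-up numbers).
neg = np.array([0, 1, 1, 2, 1, 2])  # clusters assumed to carry only ambient signal
pos = np.array([5, 6, 2, 7, 8, 6])  # clusters assumed to truly express the gene

labels = np.array([False] * 6 + [True] * 6)
auc = auroc(np.concatenate([neg, pos]), labels)

t = youden_cut(neg, pos)
shift = int(round(float(t)))
corrected_pos = np.maximum(pos - shift, 0)  # subtract, floored at zero
corrected_neg = np.maximum(neg - shift, 0)
```

On these toy numbers the cut point is 2, so the ambient-only counts drop to zero while the expressing cluster keeps most of its signal.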
What's included
| Python | R counterpart | Purpose |
|---|---|---|
| ContaminationDetection | ContaminationDetection | detect GCGs; per-cluster entropy divergence table |
| ContaminationQuantification | ContaminationQuantification | dataset-level contamination ratio from the GCGs |
| ContaminationCorrection | ContaminationCorrection | Youden-threshold correction of the GCGs only |
| generate_curve | generate_curve | fit one cluster's entropy-vs-expression curve |
| vector_entropy / matrix_entropy | VectorToEntropy / MatrixToEntropy | Shannon entropy of count distributions |
| SmoothSpline / smooth_spline | smooth.spline | penalized cubic B-spline |
| simple_roc / youden_threshold | simple_roc / Cal_thres | AUROC and Youden-index cut point |
| datasets.simulate_contaminated | (none) | synthetic clustered counts with spiked GCGs |
Reproducing R results exactly
tests/ runs the same synthetic dataset through the R package scCDC 1.4 (tests/r_reference_driver.R) and pysccdc, and asserts agreement:
- per-gene / per-cluster Shannon entropy: bit-exact (the Rcpp MatrixToEntropy reduces to a deterministic numpy.bincount);
- detected GCG list: identical on the deliberately-spiked synthetic dataset;
- corrected count matrix: bit-exact (the Youden-threshold path is fully deterministic);
- contamination ratio: bit-exact;
- per-gene entropy divergence: Pearson r > 0.99.
Unavoidable difference. The entropy-vs-expression curve is fit by a bootstrapped smoothing spline (10 rounds, 80% gene resampling). Two things differ from R: (i) R's sample() (Mersenne-Twister) and NumPy's PCG64 draw different bootstrap subsets, and (ii) R's smooth.spline uses an internal knot-thinning heuristic and GCV machinery that the scipy penalized cubic B-spline reproduces only up to ~1e-3 in entropy units. These propagate into the entropy divergence (hence r > 0.99 rather than bit-exact), and on a real noisy dataset can move a few borderline genes across the FDR cutoff in the GCG list — but not into the corrected matrix, which matches exactly given the same GCG list. Fix random_state for reproducible Python runs. The examples/compare_R_vs_Python.ipynb notebook demonstrates this on real PBMC 3k data.
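The effect of fixing the seed can be sketched with NumPy alone. The snippet below draws 10 bootstrap rounds of 80% of the genes from a seeded Generator (PCG64 by default). The function name bootstrap_gene_subsets is hypothetical, and subsampling without replacement is an assumption; how pysccdc consumes random_state internally may differ.

```python
import numpy as np


def bootstrap_gene_subsets(n_genes, random_state, n_rounds=10, fraction=0.8):
    """Draw the gene indices used in each bootstrap round.

    Assumption: each round subsamples without replacement; the real
    implementation may resample differently.
    """
    rng = np.random.default_rng(random_state)  # PCG64 bit generator
    size = int(round(fraction * n_genes))
    return [
        np.sort(rng.choice(n_genes, size=size, replace=False))
        for _ in range(n_rounds)
    ]


run_a = bootstrap_gene_subsets(100, random_state=0)
run_b = bootstrap_gene_subsets(100, random_state=0)
run_c = bootstrap_gene_subsets(100, random_state=1)

# Same seed: identical subsets in every round.
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(run_a, run_b))
# Different seed: the rounds no longer line up.
other_seed_matches = all(np.array_equal(a, c) for a, c in zip(run_a, run_c))
```

Two runs with the same seed see the same genes in every round, which is what makes the fitted curve repeatable within Python even though it cannot match R's draws.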
Examples
examples/ mirrors the reference layout:
- r_driver_sccdc.R: drives R scCDC end-to-end, dumps entropy / GCG / distance / corrected-matrix outputs
- compare_R_vs_Python.ipynb (+ .executed.ipynb): runs R scCDC via Rscript and pysccdc on the bundled clustered PBMC 3k dataset and visualizes the agreement (entropy bit-exact, divergence correlation, GCG-set Venn, bit-exact corrected matrix) via omicverse.pl.*
- tutorial_standalone.py: minimal end-to-end pysccdc pipeline
- benchmark.py: head-to-head speed comparison
Relationship to omicverse
Developed upstream in omicverse:
- Canonical implementation: omicverse single-cell decontamination
- Standalone mirror (this repo): same code, same API, minus the omicverse packaging
Citation
If you use this package, please cite the original scCDC paper:
Wang, W. et al. scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data. Genome Biology 25, 122 (2024).
and acknowledge omicverse / this repo for the Python port.
License
Apache-2.0. The upstream R package scCDC is GPL (≥ 2); pysccdc is an independent re-implementation from the published algorithm and the scCDC source.