Pure-Python port of the R package scCDC — entropy-based, gene-specific ambient-RNA contamination detection and correction for scRNA-seq / snRNA-seq.

Maintainers

Starlitnightly

pysccdc

A pure-Python re-implementation of scCDC (Wang et al., Genome Biology 2024) for entropy-based, gene-specific ambient-RNA contamination detection and correction in single-cell / single-nucleus RNA-seq data.

- AnnData-native: drop-in for the scanpy / omicverse ecosystem
- No rpy2, no R install: the Shannon-entropy core, the bootstrapped smoothing-spline curve fit, the normal-tail FDR, the AUROC and the Youden-index thresholding are all implemented directly in NumPy/SciPy
- Same function surface as the R workflow (ContaminationDetection → ContaminationQuantification → ContaminationCorrection)
- Bit-for-bit reproducibility against the R reference for the deterministic core: per-gene/per-cluster entropy and the corrected count matrix match scCDC exactly (see tests/test_r_parity.py)

Unlike DecontX, SoupX, CellBender or scAR — which correct every gene — scCDC detects the small set of Global Contamination-causing Genes (GCGs) and corrects only those, avoiding the over-correction of lowly / non-contaminating genes (many of which are real cell-type markers). It needs no empty-droplet data.

This is a standalone mirror of the canonical implementation that lives in omicverse. All algorithmic work is developed upstream in omicverse and synced here for users who want scCDC without the full omicverse stack.

Install

pip install pysccdc

or, from a checkout:

pip install -e .

Dependencies: numpy, scipy, pandas, anndata, scikit-learn. No R, no rpy2.

How it works

1. Observed entropy: for every gene in every cell cluster, compute the Shannon entropy (base 2) of its count distribution across droplets. A gene smeared across many droplets at a near-constant low ambient level has a concentrated count distribution and therefore low entropy.
2. Expected entropy curve: fit the expected entropy as a smooth function of log1p(mean expression), using a bootstrapped, outlier-trimmed smoothing spline learned from presumed-clean genes.
3. Entropy divergence: expected entropy minus observed entropy. A gene is flagged as a GCG when it shows significant positive divergence (normal-tail p-value, FDR ≤ 0.05) in more than a restriction_factor fraction of the clusters and is expressed in enough cells in every cluster.
4. Gene-specific correction: for each GCG, rank clusters by log-normalized expression, compute the per-cluster AUROC against the lowest-expressing cluster, split the clusters into eGCG-positive and eGCG-negative groups, take the Youden-index count threshold on the pooled count distributions, and subtract round(threshold) from the counts (floored at zero). Non-GCG genes are left untouched, which is how scCDC avoids over-correction.
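Steps 1 and 3 can be illustrated with a small dependency-free sketch. This is not the pysccdc code: the helper names, the toy counts, the single expected-entropy value (the real curve depends on mean expression) and the noise scale are all made up for illustration, and the README does not say which FDR procedure is used, so Benjamini-Hochberg is assumed here.

```python
import math
from collections import Counter


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of the empirical distribution of count values."""
    n = len(counts)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(counts).values())


def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Step 1: observed entropy for two toy genes in one cluster of 8 droplets.
ambient_like = [1, 1, 1, 1, 1, 1, 1, 2]      # near-constant low level
marker_like = [0, 0, 1, 2, 4, 8, 16, 32]     # widely varying counts
h_ambient = shannon_entropy_bits(ambient_like)
h_marker = shannon_entropy_bits(marker_like)

# Step 3: divergence = expected - observed. The expected value and the noise
# scale are hypothetical stand-ins for what a fitted curve would supply.
expected_entropy = 2.6   # hypothetical
noise_sd = 0.5           # hypothetical
divergence = [expected_entropy - h_ambient, expected_entropy - h_marker]
p_values = [upper_tail_p(d / noise_sd) for d in divergence]
fdr = benjamini_hochberg(p_values)
flagged = [q <= 0.05 for q in fdr]
```

In this toy cluster the ambient-like gene has much lower entropy than the marker-like gene, so only the former ends up with a large positive divergence and passes the 0.05 cutoff.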

Quick-start

import pysccdc as cd

# bundled synthetic dataset: 4 clusters x 200 cells, 120 genes,
# 4 deliberately-spiked contaminating genes
adata = cd.datasets.simulate_contaminated(random_state=0)

# 1) detect GCGs
detection = cd.ContaminationDetection(adata, cluster_key="cluster")
detection                       # degree-of-contamination table (GCGs)
detection.attrs["GCGs"]         # the GCG list

# 2) quantify dataset-level contamination
ratio = cd.ContaminationQuantification(adata, detection,
                                       cluster_key="cluster")

# 3) correct only the GCGs
corrected = cd.ContaminationCorrection(adata, detection,
                                       cluster_key="cluster")
corrected.layers["Corrected"]          # decontaminated count matrix
corrected.uns["sccdc"]["thresholds"]   # per-GCG subtraction thresholds

scCDC works on a filtered, clustered count matrix; any AnnData with raw integer counts in .X (or a named layer) and a categorical cluster label in .obs works. See examples/tutorial_standalone.py for an end-to-end run on the bundled clustered PBMC 3k dataset (data/pbmc3k_clustered.h5ad).

Low-level functional API (mirrors R one-to-one)

from pysccdc import (
    ContaminationDetection, ContaminationQuantification, ContaminationCorrection,
    generate_curve, vector_entropy, matrix_entropy,
    SmoothSpline, smooth_spline, simple_roc, youden_threshold,
)

# Shannon entropy of each gene's count distribution (genes x cells input)
matrix_entropy(counts_genes_by_cells)        # one entropy per gene

# Fit one cluster's entropy-vs-expression curve directly
generate_curve(df_with_Gene_meanexpr_entropy, spar=1.0)

# AUROC and the Youden-index cut point
simple_roc(expr, cls)
youden_threshold(neg_counts, pos_counts)
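For readers unfamiliar with the two statistics named above, here is a minimal pure-Python sketch of an AUROC and a Youden-index cut point, followed by the floored subtraction described in "How it works". The functions auroc and youden_cut are hypothetical illustrations, not the package's simple_roc or youden_threshold, and the convention that a value strictly above the cut counts as positive is an assumption.

```python
def auroc(neg, pos):
    """Probability that a random positive exceeds a random negative (ties count 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cut(neg, pos):
    """Cut c maximizing J = TPR - FPR for the rule "value > c is positive"."""
    best_cut, best_j = None, -1.0
    for c in sorted(set(neg) | set(pos)):
        tpr = sum(v > c for v in pos) / len(pos)
        fpr = sum(v > c for v in neg) / len(neg)
        j = tpr - fpr
        if j > best_j:               # keep the first (lowest) best cut
            best_cut, best_j = c, j
    return best_cut, best_j


# Toy counts for one gene (made up for illustration).
neg_counts = [0, 1, 1, 2, 1, 0, 2, 1]      # cells that only carry ambient reads
pos_counts = [6, 9, 2, 12, 7, 8, 10, 5]    # cells that express the gene

area = auroc(neg_counts, pos_counts)
cut, j = youden_cut(neg_counts, pos_counts)

# Subtract the rounded threshold from every count, floored at zero.
corrected = [max(v - round(cut), 0) for v in neg_counts + pos_counts]
```

With these toy values the cut lands at 2 counts, so every ambient-only cell is corrected to zero while the expressing cells keep most of their signal.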

What's included

Python	R counterpart	Purpose
`ContaminationDetection`	`ContaminationDetection`	detect GCGs; per-cluster entropy divergence table
`ContaminationQuantification`	`ContaminationQuantification`	dataset-level contamination ratio from the GCGs
`ContaminationCorrection`	`ContaminationCorrection`	Youden-threshold correction of the GCGs only
`generate_curve`	`generate_curve`	fit one cluster's entropy-vs-expression curve
`vector_entropy` / `matrix_entropy`	`VectorToEntropy` / `MatrixToEntropy`	Shannon entropy of count distributions
`SmoothSpline` / `smooth_spline`	`smooth.spline`	penalized cubic B-spline
`simple_roc` / `youden_threshold`	`simple_roc` / `Cal_thres`	AUROC and Youden-index cut point
`datasets.simulate_contaminated`	—	synthetic clustered counts with spiked GCGs

Reproducing R results exactly

tests/ runs the same synthetic dataset through the R package scCDC 1.4 (tests/r_reference_driver.R) and pysccdc, and asserts agreement:

- per-gene / per-cluster Shannon entropy: bit-exact (the Rcpp MatrixToEntropy reduces to a deterministic numpy.bincount);
- detected GCG list: identical on the deliberately-spiked synthetic dataset;
- corrected count matrix: bit-exact (the Youden-threshold path is fully deterministic);
- contamination ratio: bit-exact;
- per-gene entropy divergence: Pearson r > 0.99.
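The two tiers of agreement listed above (exact equality versus a correlation bound) can be sketched as follows. This is not the contents of tests/test_r_parity.py; the numbers are made-up stand-ins for values exported from R and computed in Python, and pearson_r is a hypothetical helper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up stand-ins for the R reference and the Python output.
entropy_r = [0.0, 0.5, 2.75, 1.0]
entropy_py = [0.0, 0.5, 2.75, 1.0]
divergence_r = [0.10, -0.05, 1.20, 0.02, 0.85]
divergence_py = [0.101, -0.049, 1.199, 0.021, 0.851]

# Deterministic quantities: require exact equality, no tolerance.
entropy_bit_exact = entropy_r == entropy_py

# Spline-dependent quantities: require only a high correlation.
r = pearson_r(divergence_r, divergence_py)
divergence_ok = r > 0.99
```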

Unavoidable difference. The entropy-vs-expression curve is fit by a bootstrapped smoothing spline (10 rounds, 80% gene resampling), and two things differ from R: (i) R's sample() (Mersenne-Twister) and NumPy's PCG64 draw different bootstrap subsets, and (ii) R's smooth.spline uses an internal knot-thinning heuristic and GCV machinery that the SciPy penalized cubic B-spline reproduces only up to ~1e-3 in entropy units. Both effects propagate into the entropy divergence (hence r > 0.99 rather than bit-exact) and, on a real noisy dataset, can move a few borderline genes across the FDR cutoff and so change the GCG list. They do not affect the correction step itself: given the same GCG list, the corrected matrix matches exactly. Fix random_state for reproducible Python runs. The examples/compare_R_vs_Python.ipynb notebook demonstrates this on real PBMC 3k data.
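The role of the seed can be shown with a small sketch of the bootstrap draw. It uses Python's standard-library random module rather than NumPy's PCG64, and it assumes that each round samples 80% of the genes without replacement; the README only says "80% gene resampling", so both choices are illustrative. bootstrap_subsets is a hypothetical helper, not part of pysccdc.

```python
import random


def bootstrap_subsets(genes, rounds=10, frac=0.8, seed=0):
    """Draw `rounds` random subsets, each holding `frac` of the genes."""
    rng = random.Random(seed)            # private generator with a fixed seed
    k = round(frac * len(genes))
    return [sorted(rng.sample(genes, k)) for _ in range(rounds)]


genes = [f"gene{i}" for i in range(100)]
run_a = bootstrap_subsets(genes, seed=0)
run_b = bootstrap_subsets(genes, seed=0)
run_c = bootstrap_subsets(genes, seed=1)

same_seed_matches = run_a == run_b       # same seed, same subsets
other_seed_differs = run_a != run_c      # different seed, different subsets
```

The same logic explains the R/Python gap: two different generators behave like two different seeds, so the fitted curves agree only approximately even when every other step is identical.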

Examples

examples/ mirrors the reference layout:

- r_driver_sccdc.R: drives R scCDC end-to-end, dumps entropy / GCG / distance / corrected-matrix outputs
- compare_R_vs_Python.ipynb (+ .executed.ipynb): runs R scCDC via Rscript and pysccdc on the bundled clustered PBMC 3k dataset and visualizes the agreement (entropy bit-exact, divergence correlation, GCG-set Venn, bit-exact corrected matrix) via omicverse.pl.*
- tutorial_standalone.py: minimal end-to-end pysccdc pipeline
- benchmark.py: head-to-head speed comparison

Relationship to omicverse

Developed upstream in omicverse:

- Canonical implementation: omicverse single-cell decontamination
- Standalone mirror (this repo): same code, same API, minus the omicverse packaging

Citation

If you use this package, please cite the original scCDC paper:

Wang, W. et al. scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data. Genome Biology 25, 122 (2024).

and acknowledge omicverse / this repo for the Python port.

License

Apache-2.0. The upstream R package scCDC is GPL (≥ 2); pysccdc is an independent re-implementation from the published algorithm and the scCDC source.

Release history

This version

0.1.0

May 22, 2026

Download files

Source Distribution

pysccdc-0.1.0.tar.gz (33.3 kB)

Uploaded May 22, 2026 Source

Built Distribution

pysccdc-0.1.0-py3-none-any.whl (28.5 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file pysccdc-0.1.0.tar.gz.

File metadata

Download URL: pysccdc-0.1.0.tar.gz
Upload date: May 22, 2026
Size: 33.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pysccdc-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`eb4fe27a4ba760469e44714a85a381a2a2d6ab020cfde35f849ea0765a044f67`
MD5	`bbcceb4be1b891b8e8954b4aa0d8dfc3`
BLAKE2b-256	`4899d4fc53d2938b6b449252219657951b8e0bad64023e3dcdbfd23f0835e287`

Provenance

The following attestation bundles were made for pysccdc-0.1.0.tar.gz:

Publisher: publish.yml on omicverse/py-sccdc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pysccdc-0.1.0.tar.gz
- Subject digest: eb4fe27a4ba760469e44714a85a381a2a2d6ab020cfde35f849ea0765a044f67
- Sigstore transparency entry: 1599237789
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: omicverse/py-sccdc@61e83e82893f8cbdb506f28f69fefafcb82dc1d5
- Branch / Tag: refs/heads/master
- Owner: https://github.com/omicverse
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@61e83e82893f8cbdb506f28f69fefafcb82dc1d5
- Trigger Event: workflow_dispatch

File details

Details for the file pysccdc-0.1.0-py3-none-any.whl.

File metadata

Download URL: pysccdc-0.1.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pysccdc-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`06154fc40bda6b3073b9039b9023141d5c0b98e7049a09bd52503a3cafa3ef56`
MD5	`b9be129bbf4448661de2154b251f29f7`
BLAKE2b-256	`374e52d4e077dd4cdcf1d28e2a7f038851da87cfa52636fea297044d4f915eaa`

Provenance

The following attestation bundles were made for pysccdc-0.1.0-py3-none-any.whl:

Publisher: publish.yml on omicverse/py-sccdc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pysccdc-0.1.0-py3-none-any.whl
- Subject digest: 06154fc40bda6b3073b9039b9023141d5c0b98e7049a09bd52503a3cafa3ef56
- Sigstore transparency entry: 1599237872
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: omicverse/py-sccdc@61e83e82893f8cbdb506f28f69fefafcb82dc1d5
- Branch / Tag: refs/heads/master
- Owner: https://github.com/omicverse
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@61e83e82893f8cbdb506f28f69fefafcb82dc1d5
- Trigger Event: workflow_dispatch
