Pure-Python port of the Bioconductor decontX package — decontamination of ambient RNA in single-cell RNA-seq via a Bayesian two-component multinomial mixture.
pydecontx
A pure-Python re-implementation of DecontX (Yang et al., Genome Biology 2020) for decontamination of ambient / cross-contaminating RNA in droplet single-cell RNA-seq data.
- AnnData-native: drop-in for the scanpy ecosystem
- No rpy2, no R install: the variational EM, the native/contamination multinomial mixture, and the Minka fixed-point Dirichlet MLE are all implemented directly in NumPy/SciPy
- Same function surface as the R workflow (decontX → per-cell contamination fraction + decontaminated count matrix)
- Numerical reproducibility against the Bioconductor reference: contamination fraction, decontaminated matrix and per-cell theta agree at Pearson r > 0.99 (see tests/test_r_parity.py)
This is a standalone mirror of the canonical implementation that lives in omicverse. All algorithmic work is developed upstream in omicverse and synced here for users who want DecontX without the full omicverse stack.
Install
pip install pydecontx
Dependencies: numpy, scipy, pandas, anndata.
Quick-start (function API)
import pydecontx as dx
# 1) a synthetic contaminated dataset (genes x cells) with ground truth
sim = dx.simulate_contamination(C=300, G=100, K=3, delta=(1, 10))
# 2) run DecontX with cluster labels
res = dx.decontx(sim["observed_counts"], z=sim["z"])
res.contamination # per-cell contamination fraction (0-1)
res.decontx_counts # decontaminated sparse count matrix
res.decontaminated_counts() # ... integer-rounded
res.to_dataframe() # per-cell summary table
Quick-start (AnnData)
import scanpy as sc
import pydecontx as dx
adata = sc.read_10x_h5("filtered.h5") # cells × genes, raw counts in .X
# broad clustering that supplies the z labels
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
# write results back into the AnnData
adata = dx.decontx(adata, z="leiden", copy=True)
adata.obs["decontX_contamination"] # per-cell contamination
adata.layers["decontX_counts"] # decontaminated counts
x may be an AnnData (cells × genes) or, in genes × cells orientation, a pandas.DataFrame, a NumPy array or a SciPy sparse matrix. z is required: DecontX needs a broad clustering of cell types, and when x is an AnnData, z may name a column of .obs. Empty droplets are not required, but an optional background matrix of raw empty droplets can be supplied to anchor the contamination distribution.
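Because the AnnData convention (cells × genes) and the matrix convention (genes × cells) differ, orientation is easy to get wrong. The following is a minimal sketch using only NumPy and SciPy on made-up toy counts; the names `cells_by_genes`, `genes_by_cells` and `z` are illustrative and not part of pydecontx.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_cells, n_genes = 6, 4

# Toy raw counts in the AnnData convention: rows are cells, columns are genes.
cells_by_genes = sparse.csr_matrix(rng.poisson(2.0, size=(n_cells, n_genes)))

# Plain matrix inputs are described as genes x cells, so transpose first.
genes_by_cells = cells_by_genes.T.tocsc()

# One cluster label per cell, i.e. one per column of the genes x cells matrix.
z = np.array([0, 0, 1, 1, 2, 2])

assert genes_by_cells.shape == (n_genes, n_cells)
assert z.shape[0] == genes_by_cells.shape[1]
```

Checking that the number of labels matches the number of columns before running the model is a cheap way to catch a transposed input.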
The model
DecontX models each cell's observed counts as a Bayesian two-component multinomial mixture:
- a native distribution phi_k: the gene probabilities of the cell's own population k;
- a contamination distribution eta_k: a weighted blend of every other population's native distribution;
- a per-cell latent theta_j ~ Beta giving the proportion of native counts, with a Bernoulli native/contaminant label per transcript.
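The three components above can be read as a generative recipe. The sketch below samples toy counts from such a mixture in plain NumPy. It is not the package's simulate_contamination: the equal-weight blend used for eta, the Beta(10, 1) prior and the sequencing depths are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_pops, n_cells = 50, 3, 12

# Native gene distributions phi_k, one row per population.
phi = rng.dirichlet(np.ones(n_genes), size=n_pops)          # shape (K, G)

# Contamination distribution eta_k: a blend of the other populations' phi.
# Equal weights are an assumption; the model only says "weighted blend".
eta = np.empty_like(phi)
for k in range(n_pops):
    others = [j for j in range(n_pops) if j != k]
    eta[k] = phi[others].mean(axis=0)

z = rng.integers(0, n_pops, size=n_cells)      # population of each cell
theta = rng.beta(10.0, 1.0, size=n_cells)      # native proportion per cell
depth = rng.integers(200, 400, size=n_cells)   # transcripts per cell

counts = np.zeros((n_genes, n_cells), dtype=np.int64)
for j in range(n_cells):
    # Summing per-transcript Bernoulli(theta_j) labels gives a binomial count.
    n_native = rng.binomial(depth[j], theta[j])
    native = rng.multinomial(n_native, phi[z[j]])
    contam = rng.multinomial(depth[j] - n_native, eta[z[j]])
    counts[:, j] = native + contam
```

Each column of `counts` is the sum of a native and a contaminating multinomial draw, which is the quantity the inference step has to pull apart again.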
Inference is variational EM: variational distributions over theta and the transcript labels are updated to maximise the ELBO, and phi, eta and the Dirichlet hyper-parameter delta are re-estimated each iteration. delta is fitted by a Minka fixed-point Dirichlet MLE, a port of MCMCprecision::fit_dirichlet. The output is a per-cell contamination fraction and a decontaminated (native) count matrix.
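To make the shape of one iteration concrete, here is a simplified sketch of a single update on a dense genes × cells matrix. It is an illustration under stated assumptions, not the code in pydecontx or the R package: the function name `em_step_sketch`, the treatment of delta as fixed Beta pseudo-counts, and the small `eps` smoothing are all hypothetical, and the re-estimation of delta is left out.

```python
import numpy as np
from scipy.special import expit

def em_step_sketch(counts, z, theta, phi, eta, delta=(10.0, 10.0), eps=1e-20):
    """One simplified update. counts: (G, C); z: (C,) ints; phi, eta: (G, K)."""
    n_pops = phi.shape[1]
    # E-step: probability that a transcript of gene g in cell j is native.
    log_native = np.log(phi[:, z] + eps) + np.log(theta + eps)
    log_contam = np.log(eta[:, z] + eps) + np.log(1.0 - theta + eps)
    p_native = expit(log_native - log_contam)
    native = counts * p_native                   # expected native counts

    # theta: native share of each cell, smoothed by the delta pseudo-counts.
    theta_new = (native.sum(axis=0) + delta[0]) / (counts.sum(axis=0) + sum(delta))

    # phi from native counts pooled per population, eta from all other populations.
    by_pop = np.stack([native[:, z == k].sum(axis=1) for k in range(n_pops)], axis=1)
    other = by_pop.sum(axis=1, keepdims=True) - by_pop
    phi_new = (by_pop + eps) / (by_pop + eps).sum(axis=0)
    eta_new = (other + eps) / (other + eps).sum(axis=0)
    return theta_new, phi_new, eta_new, native

# Toy usage with random starting values.
rng = np.random.default_rng(0)
G, C, K = 20, 9, 3
counts = rng.poisson(3.0, size=(G, C)).astype(float)
z = np.repeat(np.arange(K), C // K)
theta0 = np.full(C, 0.9)
phi0 = rng.dirichlet(np.ones(G), size=K).T
eta0 = rng.dirichlet(np.ones(G), size=K).T
theta1, phi1, eta1, native = em_step_sketch(counts, z, theta0, phi0, eta0)
contamination = 1.0 - theta1                     # per-cell contamination fraction
```

Repeating this update until theta stops changing gives the two outputs described above: `1 - theta` per cell and the expected native count matrix.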
Low-level functional API (mirrors R one-to-one)
from pydecontx import (
decontx, DecontXResult,
decontx_initialize, decontx_em, decontx_loglik,
calculate_native_matrix, fit_dirichlet,
)
# variational-EM building blocks
init = decontx_initialize(counts, theta, z) # phi / eta initialisation
step = decontx_em(counts, colsums, theta, eta, phi, z)
ll = decontx_loglik(counts, theta, eta, phi, z)
# decontaminated-matrix computation
native = calculate_native_matrix(counts, theta, eta, phi, z)
# Minka fixed-point Dirichlet MLE (= MCMCprecision::fit_dirichlet)
alpha = fit_dirichlet(proportions)["alpha"]
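For readers unfamiliar with the fixed-point scheme, the sketch below shows the textbook form of Minka's iteration: each step solves digamma(alpha_k) = digamma(sum(alpha)) + mean(log p_k) for alpha_k using a Newton-based inverse digamma. The names `inverse_digamma` and `fit_dirichlet_sketch` are hypothetical, and the details (starting point, tolerance, return type) are assumptions that may differ from the package's fit_dirichlet.

```python
import numpy as np
from scipy.special import digamma, polygamma

def inverse_digamma(y, n_iter=5):
    """Invert digamma by Newton's method, with the initial guess from Minka's note."""
    y = np.asarray(y, dtype=float)
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_iter):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def fit_dirichlet_sketch(p, n_iter=1000, tol=1e-8):
    """Dirichlet MLE from rows of proportions p, shape (N, K), all entries > 0."""
    mean_log_p = np.log(p).mean(axis=0)          # sufficient statistics
    alpha = np.ones(p.shape[1])
    for _ in range(n_iter):
        alpha_new = inverse_digamma(digamma(alpha.sum()) + mean_log_p)
        converged = np.max(np.abs(alpha_new - alpha)) < tol
        alpha = alpha_new
        if converged:
            break
    return alpha

# Toy check: recover a known alpha from simulated proportions.
rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 5.0, 3.0])
proportions = rng.dirichlet(true_alpha, size=5000)
alpha_hat = fit_dirichlet_sketch(proportions)
```

The iteration converges linearly, so a generous iteration cap is used here; only the column means of log p are needed, which keeps each step cheap.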
What's included
| Python | R counterpart | Purpose |
|---|---|---|
| decontx | decontX | run DecontX on a count matrix / AnnData |
| DecontXResult | n/a | result object (contamination, decontx_counts, estimates, …) |
| simulate_contamination | simulateContamination | simulate a contaminated dataset with ground truth |
| decontx_initialize | internal | phi / eta initialisation |
| decontx_em | decontXEM (C++) | one variational-EM step |
| decontx_loglik | decontXLogLik (C++) | ELBO / log-likelihood |
| calculate_native_matrix | calculateNativeMatrix (C++) | decontaminated-matrix computation |
| fit_dirichlet | MCMCprecision::fit_dirichlet | Minka fixed-point Dirichlet MLE |
Reproducing R results
examples/compare_R_vs_Python.ipynb runs Bioconductor decontX (via Rscript) and pydecontx on the same real PBMC 3k dataset with identical cluster labels, and shows that the per-cell contamination fraction, the decontaminated count matrix and the per-cell theta all agree at Pearson r > 0.99.
tests/test_r_parity.py runs the same synthetic two/three-population contaminated count matrix — with identical cluster labels — through Bioconductor decontX (R) and pydecontx, and asserts agreement on the contamination fraction, the decontaminated matrix, the final theta, and the background-anchored mode.
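A parity check of this kind boils down to a Pearson correlation between two vectors. The sketch below shows the shape of such an assertion; the two arrays are synthetic stand-ins generated on the spot, not outputs of either implementation, and `pearson_r` is a hypothetical helper rather than something taken from the test suite.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two arrays, flattened to 1-D."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)

# Stand-in for a per-cell contamination vector from one implementation.
contamination_a = rng.beta(2.0, 10.0, size=500)
# Stand-in for the other implementation: same values plus small noise.
noise = rng.normal(0.0, 0.005, size=500)
contamination_b = np.clip(contamination_a + noise, 0.0, 1.0)

r = pearson_r(contamination_a, contamination_b)
assert r > 0.99, f"parity check failed: r = {r:.4f}"
```

The same helper works for the decontaminated matrices, since both are flattened before the correlation is computed.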
Unavoidable difference. The variational EM is deterministic given its initial theta, but theta is seeded by a Beta draw: R uses its Mersenne-Twister RNG (stats::rbeta under withr::with_seed), NumPy uses PCG64. The two initialisations differ, so the converged estimates agree to high correlation rather than bit-exactly. R also runs the EM inner loops in C++; pydecontx reproduces them in vectorised NumPy.
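The effect of the bit generator can be seen inside NumPy alone. In this sketch the same integer seed drives a Mersenne-Twister generator and a PCG64 generator, and the resulting Beta draws differ. The seed and the Beta(10, 10) parameters are arbitrary choices for illustration, and NumPy's MT19937 stream is not expected to reproduce R's rbeta output either, since seeding and sampling differ between the two.

```python
import numpy as np

seed = 12345
n_cells = 5

# Same seed, two different bit generators.
mt = np.random.Generator(np.random.MT19937(seed))
pcg = np.random.Generator(np.random.PCG64(seed))

theta_mt = mt.beta(10.0, 10.0, size=n_cells)
theta_pcg = pcg.beta(10.0, 10.0, size=n_cells)

# Re-creating the same generator with the same seed reproduces its draws.
theta_pcg_again = np.random.Generator(np.random.PCG64(seed)).beta(10.0, 10.0, size=n_cells)

print(theta_mt)
print(theta_pcg)
assert not np.allclose(theta_mt, theta_pcg)
assert np.array_equal(theta_pcg, theta_pcg_again)
```

Each generator is reproducible on its own, which is why runs within one language are repeatable while cross-language results match only in correlation.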
Relationship to omicverse
Developed upstream in omicverse:
- Canonical implementation: synced to omicverse
- Standalone mirror (this repo): same code, same API, minus the omicverse packaging
Citation
If you use this package, please cite the original DecontX paper:
Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biology 21, 57 (2020).
and acknowledge omicverse / this repo for the Python port.
License
Apache-2.0. The upstream Bioconductor decontX / celda packages are MIT-licensed; pydecontx is an independent re-implementation from the published algorithm and the decontX source.