Pure-Python scDblFinder — fast doublet detection in scRNA-seq via artificial-doublet xgboost classification, AnnData-native.
pyscdblfinder
A pure-Python port of scDblFinder (Germain et al., F1000Research 2022) for fast, classifier-based doublet detection in single-cell RNA-seq data.
- AnnData-native — drop-in for the scanpy ecosystem
- No rpy2, no R install — the full pipeline (artificial doublets → cxds → kNN features → xgboost iterative scoring → thresholding) is implemented in NumPy/SciPy/xgboost
- Same function surface as the R scDblFinder() call
- Tests cover each primitive (artificial doublet synthesis, cxds, kNN features, xgboost loop, thresholding) plus an end-to-end smoke test on a synthetic mixture
This is a standalone mirror of the canonical implementation that lives in omicverse (omicverse.pp will expose a doublets_method='scdblfinder' option once this package is published). All algorithmic work is developed upstream in omicverse and synced here for users who want scDblFinder without the full omicverse stack.
Install
```bash
pip install pyscdblfinder
```
Quick-start (class API)
```python
import anndata as ad
from pyscdblfinder import ScDblFinder

adata = ad.read_h5ad("mydata.h5ad")  # cells × genes, raw counts in .X
sdf = ScDblFinder(adata, random_state=0)
sdf.run(dbr=0.07)  # 7% expected doublet rate

adata.obs[['scDblFinder_score', 'scDblFinder_class']].head()
```
Low-level functional API (mirrors R one-to-one)
```python
from pyscdblfinder import sc_dbl_finder

# counts must be genes × cells (Seurat orientation)
result = sc_dbl_finder(
    counts,
    clusters=None,             # or a per-cell cluster label array for inter-cluster doublets
    artificial_doublets=3000,
    dbr=0.07,
    dims=20,
    k=None,                    # auto-chosen from n_cells
    include_pcs=19,
)
result.table            # per-cell DataFrame — features + score + class
result.score_threshold
```
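The counts orientation is the easiest thing to get wrong: AnnData stores cells × genes in `.X`, while `sc_dbl_finder` expects genes × cells. A minimal sketch of the flip, using a random sparse matrix as a stand-in for `.X` (the CSC format choice is an assumption; any SciPy sparse or dense matrix should work the same way):

```python
import numpy as np
from scipy import sparse

# AnnData convention: .X is cells × genes (here a random stand-in matrix)
X_cells_by_genes = sparse.random(200, 500, density=0.05, format="csr", random_state=0)

# sc_dbl_finder wants genes × cells (Seurat orientation):
counts = sparse.csc_matrix(X_cells_by_genes.T)

print(counts.shape)  # (500, 200) — genes × cells
```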
What's included
| Python | R counterpart | Purpose |
|---|---|---|
| `ScDblFinder` class | — | AnnData-native lifecycle wrapper (like DoubletFinder, Milo, Monocle) |
| `sc_dbl_finder` | `scDblFinder()` | single-sample pipeline entry point |
| `get_artificial_doublets` | `getArtificialDoublets` | pair-based doublet synthesis with size adjustments |
| `cxds_score` | `cxds2` | co-expression-based doublet score |
| `evaluate_knn` | `.evaluateKNN` | per-cell kNN features for the classifier |
| `scDbl_score` | `.scDblscore` | iterative xgboost classifier loop |
| `doublet_thresholding` | `doubletThresholding` | score → class thresholding |
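The co-expression (cxds) idea behind `cxds_score` can be sketched in a few lines: binarize gene detection, find gene pairs co-detected less often than chance expects, and score each cell by how strongly it co-expresses those mutually exclusive pairs (doublets do, singlets mostly don't). This is a deliberately simplified illustration of the principle, not the exact weighting used in `cxds.py`:

```python
import numpy as np

# Toy binarized expression: genes × cells (1.0 = gene detected in that cell)
rng = np.random.default_rng(0)
B = (rng.random((50, 300)) < 0.2).astype(float)

p = B.mean(axis=1)                       # per-gene detection rate
expected = np.outer(p, p) * B.shape[1]   # expected co-detection counts under independence
observed = B @ B.T                       # observed co-detection counts per gene pair

# Pairs co-detected *less* often than chance look mutually exclusive;
# cells that nonetheless express both members of such a pair look doublet-like.
deficit = np.clip(expected - observed, 0, None)
np.fill_diagonal(deficit, 0.0)

# Per-cell score: sum of deficits over all gene pairs the cell co-expresses
score = np.einsum("gc,hc,gh->c", B, B, deficit)
print(score.shape)  # (300,)
```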
What's not (yet) ported
Follow-up work from the R package not yet on the Python side:
- Multi-sample dispatch (`samples=`, `multiSampleMode`) — only single-sample is supported
- ATAC-seq mode (`aggregateFeatures=TRUE`, `atacProcessing`)
- Known doublets (`knownDoublets`, `knownUse`)
- Cluster-correlation features (`clustCor`)
- `recoverDoublets` / `findDoubletClusters` / `computeDoubletDensity`
The single-sample RNA path ports the whole classifier loop, kNN feature
extraction, and thresholding — i.e. everything needed for ~95% of real
scDblFinder() calls.
Relationship to R scDblFinder — what matches, what can't
pyscdblfinder ports the full single-sample pipeline of R scDblFinder. Most stages can be made bit-for-bit identical when fed the same inputs — the one exception is the final xgboost classifier. Here's the breakdown:
Fully reproducible given matching inputs
Given the same artificial-doublet cell pairs and the same PCA embedding, these features match R to atol=1e-12:
| Step | Python counterpart | Reproducible vs R? |
|---|---|---|
| Library sizes, nfeatures, nAbove2 | `core.py` | ✅ `colSums` |
| cxds (co-expression score) | `cxds.py` | ✅ pure arithmetic |
| kNN ratios `ratio.k{k}` | `knn_features.py` | ✅ integer counts |
| Distance-weighted score `weighted` | `knn_features.py` | ✅ |
| `distanceToNearest` / `*Doublet` / `*Real` | `knn_features.py` | ✅ |
| Initial score `(cxds + ratio/max)/2` | `classifier.py` | ✅ |
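The last row's `(cxds + ratio/max)/2` can be read as averaging a [0, 1]-scaled cxds score with the max-normalized kNN doublet-neighbour ratio. A sketch under that reading (the exact scaling in `classifier.py` may differ; the input vectors here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
cxds = rng.random(100)         # assume already scaled to [0, 1]
ratio = rng.random(100) * 0.6  # fraction of artificial-doublet neighbours per cell

# Average the two signals after normalizing the ratio by its maximum
initial_score = (cxds + ratio / ratio.max()) / 2
print(initial_score.min() >= 0 and initial_score.max() <= 1)  # True
```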
xgboost classifier — intentionally stochastic, can't align
The final scoring step trains a gradient-boosted tree classifier with:
- `subsample=0.75` — each boosting round samples 75% of training rows at random
- `colsample_*` — random column subsets
- 3-iteration training — each iteration excludes likely-doublet real cells from the next round's training set
Even with identical seeds, R's and Python's xgboost bindings diverge because:
- DMatrix row order differs between R's `dgCMatrix`/matrix ingestion (column-major transpose) and Python's `numpy`/`scipy.csr` ingestion (row-major). xgboost's internal PRNG does generate identical draws `{0.12, 0.87, 0.43, ...}` on both sides, but the subsampling mask lands on different physical cells.
- Different xgboost package versions (R ships 1.7.x on CRAN, Python ships 3.0.x on PyPI) come with different default `tree_method` settings, pruning strategies, and regularizer scaling.
- Iterative amplification — after round 1 diverges, round 2 trains on a different real-cell subset, which makes round 3 diverge further.
- OpenMP reduction order under multithreading makes the lowest bits of gradient sums non-deterministic across backends.
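The row-order point can be demonstrated without xgboost at all: apply the same uniform draws as a 75% subsampling mask to two different physical row orders, and the selected cells differ. A toy sketch (the draws extend the `{0.12, 0.87, 0.43, ...}` stream quoted above with made-up values; the reversed order is just a stand-in for R's column-major ingestion):

```python
import numpy as np

n_cells = 8
cells = np.arange(n_cells)

# The "identical PRNG stream" both bindings would draw
u = np.array([0.12, 0.87, 0.43, 0.91, 0.05, 0.66, 0.78, 0.31])
mask = u < 0.75  # subsample=0.75-style row selection

# Hypothetical ingestion orders: Python keeps row order;
# R's transpose hands the rows over in a different physical order.
python_order = cells
r_order = cells[::-1]

picked_python = set(python_order[mask])
picked_r = set(r_order[mask])
print(picked_python == picked_r)  # False — same mask, different cells
```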
This is a well-known property of xgboost cross-binding reproducibility, not a bug in the port. See e.g. dmlc/xgboost#2936.
What the tests actually check
tests/test_r_parity.py runs the real R package on mockDoubletSCE inside the CMAP conda env and compares:
| Check | Threshold | Observed (mockDoubletSCE) | Observed (pbmc3k) |
|---|---|---|---|
| Classification overlap (py == R) | ≥ 70% | 97.0% | 96.2% |
| Score Spearman rank correlation | ≥ 0.2 | 0.30 | (higher on larger datasets) |
| py recall vs planted doublets | — | 100% (34/34) | — |
| R recall vs planted doublets | — | 56% (19/34) | — |
Practical takeaway: cells whose call matters — the high-score outliers — agree across implementations. Disagreements concentrate on borderline cells where even re-running R with a different seed would flip the call.
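Both headline metrics are cheap to recompute on your own data; a sketch with toy score vectors, using a rank-based Spearman so SciPy isn't needed (the 0.8/0.65 cutoffs and the synthetic "R" scores are invented for illustration):

```python
import numpy as np

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of ranks (no tie handling)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
py_score = rng.random(500)
r_score = 0.5 * py_score + 0.5 * rng.random(500)  # toy "R" scores, correlated by construction

py_class = py_score > 0.8    # invented thresholds, for illustration only
r_class = r_score > 0.65

overlap = (py_class == r_class).mean()  # classification overlap (py == R)
rho = spearman(py_score, r_score)       # score rank correlation
print(round(overlap, 3), round(rho, 3))
```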
Citation
Germain, P.-L., Lun, A.T.L., Garcia Meixide, C., Macnair, W. & Robinson, M.D. Doublet identification in single-cell sequencing data using scDblFinder. F1000Research 10:979 (2022).
License
GPL-3 — matches the upstream R package.