Pure-Python scDblFinder — fast doublet detection in scRNA-seq via artificial-doublet xgboost classification, AnnData-native.

pyscdblfinder

A pure-Python port of scDblFinder (Germain et al., F1000Research 2022) for fast, classifier-based doublet detection in single-cell RNA-seq data.

  • AnnData-native — drop-in for the scanpy ecosystem
  • No rpy2, no R install — the full pipeline (artificial doublets → cxds → kNN features → xgboost iterative scoring → thresholding) is implemented in NumPy/SciPy/xgboost
  • Same function surface as the R scDblFinder() call
  • Tests cover each primitive (artificial doublet synthesis, cxds, kNN features, xgboost loop, thresholding) plus an end-to-end smoke test on a synthetic mixture

This is a standalone mirror of the canonical implementation that lives in omicverse (omicverse.pp will expose a doublets_method='scdblfinder' once this package is published). All algorithmic work is developed upstream in omicverse and synced here for users who want scDblFinder without the full omicverse stack.

Install

pip install pyscdblfinder

Quick-start (class API)

import anndata as ad
from pyscdblfinder import ScDblFinder

adata = ad.read_h5ad("mydata.h5ad")          # cells × genes, raw counts in .X

sdf = ScDblFinder(adata, random_state=0)
sdf.run(dbr=0.07)                            # 7% expected doublet rate

adata.obs[['scDblFinder_score', 'scDblFinder_class']].head()
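The two columns written to `adata.obs` are ordinary pandas data, so the usual next step, keeping only singlets, needs no extra API. A minimal sketch on a hypothetical obs table (toy values; with AnnData you would subset `adata` by the same mask):

```python
import pandas as pd

# Hypothetical obs table with the two columns the run writes
obs = pd.DataFrame({
    "scDblFinder_score": [0.05, 0.92, 0.40, 0.88],
    "scDblFinder_class": ["singlet", "doublet", "singlet", "doublet"],
})

# Keep only cells called singlet (with AnnData: adata = adata[mask].copy())
mask = obs["scDblFinder_class"] == "singlet"
n_singlets = int(mask.sum())
```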

Low-level functional API (mirrors R one-to-one)

from pyscdblfinder import sc_dbl_finder

# counts must be genes × cells (Seurat orientation)
result = sc_dbl_finder(
    counts,
    clusters=None,            # or a per-cell cluster label array for inter-cluster doublets
    artificial_doublets=3000,
    dbr=0.07,
    dims=20,
    k=None,                   # auto-chosen from n_cells
    include_pcs=19,
)
result.table        # per-cell DataFrame — features + score + class
result.score_threshold
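If you don't have a measured `dbr` for your run, a common heuristic (which, as far as I know, matches the R package's default) is roughly 1% doublets per 1000 cells captured. A hedged sketch with an assumed helper name:

```python
def expected_dbr(n_cells: int) -> float:
    """Heuristic expected doublet rate: ~1% per 1000 cells captured."""
    return 0.01 * n_cells / 1000.0

dbr = expected_dbr(7000)   # 0.07, the rate used in the quick-start above
```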

What's included

| Python | R counterpart | Purpose |
|---|---|---|
| `ScDblFinder` (class) | — | AnnData-native lifecycle wrapper (like DoubletFinder, Milo, Monocle) |
| `sc_dbl_finder` | `scDblFinder()` | single-sample pipeline entry point |
| `get_artificial_doublets` | `getArtificialDoublets` | pair-based doublet synthesis with size adjustments |
| `cxds_score` | `cxds2` | co-expression-based doublet score |
| `evaluate_knn` | `.evaluateKNN` | per-cell kNN features for the classifier |
| `scDbl_score` | `.scDblscore` | iterative xgboost classifier loop |
| `doublet_thresholding` | `doubletThresholding` | score → class thresholding |
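To make the first primitive concrete, here is a minimal numpy sketch of the idea behind pair-based doublet synthesis: sum the count vectors of randomly paired cells. This is toy code with no size adjustment; the real `get_artificial_doublets` also applies the library-size corrections mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 30))    # toy genes x cells count matrix

# Pick random cell pairs and add their counts to fake a doublet transcriptome
n_doublets = 10
i = rng.integers(0, counts.shape[1], n_doublets)
j = rng.integers(0, counts.shape[1], n_doublets)
doublets = counts[:, i] + counts[:, j]      # genes x n_doublets
```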

What's not (yet) ported

Follow-up work from the R package not yet on the Python side:

  • Multi-sample dispatch (samples=, multiSampleMode) — only single-sample supported
  • ATAC-seq mode (aggregateFeatures=TRUE, atacProcessing)
  • Known doublets (knownDoublets, knownUse)
  • Cluster-correlation features (clustCor)
  • recoverDoublets / findDoubletClusters / computeDoubletDensity

The single-sample RNA path ports the whole classifier loop, kNN feature extraction, and thresholding — i.e. everything needed for ~95% of real scDblFinder() calls.

Relationship to R scDblFinder — what matches, what can't

pyscdblfinder ports the full single-sample pipeline of R scDblFinder. Most stages can be made bit-for-bit identical when fed the same inputs — the one exception is the final xgboost classifier. Here's the breakdown:

Fully reproducible given matching inputs

Given the same artificial-doublet cell pairs and the same PCA embedding, these features match R to atol=1e-12:

| Step | Python counterpart | Reproducible vs R? |
|---|---|---|
| Library sizes, `nfeatures`, `nAbove2` | `core.py` | ✅ (colSums) |
| cxds (co-expression score) | `cxds.py` | ✅ (pure arithmetic) |
| kNN ratios `ratio.k{k}` | `knn_features.py` | ✅ (integer counts) |
| Distance-weighted score `weighted` | `knn_features.py` | ✅ |
| `distanceToNearest` / `*Doublet` / `*Real` | `knn_features.py` | ✅ |
| Initial score `(cxds + ratio/max)/2` | `classifier.py` | ✅ |
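For intuition, here is a simplified numpy sketch of the co-expression idea behind cxds (not the exact `cxds2` arithmetic): gene pairs detected together less often than independence predicts are treated as mutually exclusive, and a cell expressing both members of many such pairs accumulates a high doublet score.

```python
import numpy as np

rng = np.random.default_rng(1)
B = (rng.random((20, 100)) < 0.3).astype(float)   # binarized genes x cells

n_cells = B.shape[1]
co = B @ B.T                                  # observed co-detection counts
p = B.sum(axis=1, keepdims=True) / n_cells    # per-gene detection rate
expected = (p @ p.T) * n_cells                # co-detection under independence
pair_score = np.maximum(expected - co, 0.0)   # high for "mutually exclusive" pairs
np.fill_diagonal(pair_score, 0.0)

# Per-cell score: sum of pair scores over gene pairs the cell co-expresses
cell_score = np.einsum("gc,gh,hc->c", B, pair_score, B)
```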

xgboost classifier — intentionally stochastic, can't align

The final scoring step trains a gradient-boosted tree classifier with:

  • subsample=0.75 — each boosting round samples 75% of training rows at random
  • colsample_* — random column subsets
  • 3-iteration training — each iteration excludes likely-doublet real cells from the next round's training set
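The shape of the 3-iteration loop can be sketched without xgboost at all, using a trivial stand-in scorer (a threshold fit on the currently trusted training set; the real loop trains xgboost on the full kNN/cxds feature table instead):

```python
import numpy as np

rng = np.random.default_rng(0)
n_real, n_art = 200, 100
# Toy 1-D "feature": artificial doublets score higher on average
x = np.concatenate([rng.normal(0.0, 1.0, n_real), rng.normal(2.0, 1.0, n_art)])
is_art = np.concatenate([np.zeros(n_real, bool), np.ones(n_art, bool)])

train = np.ones_like(is_art)               # round 1 trains on everything
for _ in range(3):
    thr = x[train & is_art].mean() - 1.0   # "fit" on this round's training set
    score = x - thr                        # stand-in for predicted probability
    # exclude likely-doublet real cells from the next round's training set
    train = is_art | (score <= 0.0)
```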

Even with identical seeds, R's and Python's xgboost bindings diverge because:

  1. DMatrix row order differs between R's dgCMatrix/matrix ingestion (column-major, transposed on load) and Python's numpy/scipy CSR ingestion (row-major). Even if xgboost's internal PRNG produced an identical stream {0.12, 0.87, 0.43, ...} on both sides, the subsample mask would land on different physical cells.
  2. Different xgboost package versions (R ships 1.7.x on CRAN, Python ships 3.0.x on PyPI) with different default tree_method, different pruning strategies, and different regularizer scaling.
  3. Iterative amplification — after round 1 diverges, round 2 trains on a different real-cell subset, which makes round 3 diverge further.
  4. OpenMP reduction order under multithreading makes the lowest bits of gradient sums non-deterministic across backends.
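Point 1 is easy to demonstrate without xgboost: identical draws, applied to rows stored in a different physical order, keep different cells. The orderings below are hypothetical and purely illustrative.

```python
import numpy as np

draws = np.array([0.12, 0.87, 0.43, 0.91, 0.33, 0.77])  # same stream on both sides
keep = draws < 0.75                                     # identical subsample mask

cells = np.array(["c0", "c1", "c2", "c3", "c4", "c5"])
order_py = np.arange(6)                  # row-major ingestion (numpy/scipy CSR)
order_r = np.array([0, 3, 1, 4, 2, 5])   # hypothetical column-major ingestion

picked_py = set(cells[order_py][keep])   # {'c0', 'c2', 'c4'}
picked_r = set(cells[order_r][keep])     # {'c0', 'c1', 'c2'}
```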

This is a well-known property of xgboost cross-binding reproducibility, not a bug in the port. See e.g. dmlc/xgboost#2936.

What the tests actually check

tests/test_r_parity.py runs the real R package on mockDoubletSCE inside the CMAP conda env and compares:

| Check | Threshold | Observed (mockDoubletSCE) | Observed (pbmc3k) |
|---|---|---|---|
| Classification overlap (py == R) | ≥ 70% | 97.0% | 96.2% |
| Score Spearman rank correlation | ≥ 0.2 | 0.30 | higher on larger datasets |
| py recall vs planted doublets | — | 100% (34/34) | — |
| R recall vs planted doublets | — | 56% (19/34) | — |

Practical takeaway: cells whose call matters — the high-score outliers — agree across implementations. Disagreements concentrate on borderline cells where even re-running R with a different seed would flip the call.

Citation

Germain, P.-L., Lun, A.T.L., Garcia Meixide, C., Macnair, W. & Robinson, M.D. Doublet identification in single-cell sequencing data using scDblFinder. F1000Research 10:979 (2022).

License

GPL-3 — matches the upstream R package.
