Python port of R DoubletFinder for scRNA-seq doublet detection
pyDoubletFinder
Faithful Python port of the R DoubletFinder algorithm for scRNA-seq doublet detection
pyDoubletFinder is a line-by-line Python port of the R DoubletFinder algorithm, designed as a drop-in replacement for projects that use scanpy / AnnData and do not want to maintain an R environment. It replicates the exact Seurat preprocessing pipeline, including LogNormalize, VST, ScaleData, the full Euclidean distance matrix, and pANN scoring.
Features
- Line-by-line port - replicates the exact R DoubletFinder algorithm
- Native VST - reimplementation of Seurat v3's FindVariableFeatures(method="vst") on raw counts
- R-matching loess - uses scikit-misc (degree=2) to match R's stats::loess exactly
- Full preprocessing pipeline - LogNormalize, VST, ScaleData, PCA, distance matrix, pANN
- 94.3% classification agreement with R on matched data (4926 cells)
- 99.5% HVG overlap confirms faithful VST reproduction
- Parameter sweep - param_sweep_and_summarize() for automatic pK selection via bimodality coefficient
- SCTransform approximation - experimental support via Pearson residuals
- scanpy / AnnData native - no R dependencies required
Installation
pip install doubletfinder-py
For exact R-matching loess (recommended):
pip install "doubletfinder-py[loess]"
This installs scikit-misc which provides skmisc.loess — a degree-2 loess matching R's stats::loess. Without it, the library falls back to statsmodels.lowess (degree-1, local linear), which is a close but not identical approximation.
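To see which of the two loess backends is available in the current environment, the packages can be probed before running. The sketch below is illustrative only: available_loess_backend is a hypothetical helper, not part of pyDoubletFinder, and it does not claim to mirror how the library itself selects a backend.

```python
from importlib.util import find_spec


def available_loess_backend() -> str:
    """Report which loess backend described above is importable.

    Hypothetical helper. Only top-level packages are probed, so nothing
    heavy is imported.
    """
    if find_spec("skmisc") is not None:
        return "skmisc.loess (degree 2, matches R's stats::loess)"
    if find_spec("statsmodels") is not None:
        return "statsmodels lowess (degree 1 approximation)"
    return "none installed"


print(available_loess_backend())
```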
Quick Start
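import scanpy as sc
from pydoubletfinder import doublet_finder, model_homotypic

# Load raw counts and apply basic QC filters
adata = sc.read_10x_h5("sample.h5")
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Keep the raw counts where doublet_finder looks for them
adata.layers["counts"] = adata.X.copy()

pK = 0.09                        # neighbourhood proportion
nExp = int(0.075 * adata.n_obs)  # assumes a 7.5% doublet rate

# Optional: adjust for homotypic doublets.
# Requires cell-type labels in adata.obs["cell_type"]; add them first,
# since read_10x_h5 does not create this column.
homo_prop = model_homotypic(adata.obs["cell_type"].values)
nExp = int(nExp * (1 - homo_prop))

adata = doublet_finder(adata, PCs=10, pK=pK, nExp=nExp)

# Result columns are named after pN (default 0.25), pK and nExp
col_class = f"DF.classifications_0.25_{pK}_{nExp}"
print(adata.obs[col_class].value_counts())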
For pK tuning, annotations, reuse and sparse data see docs/usage.md.
Gallery
Figures: pANN Distribution, PC Selection, Multi-Sample Batch.
Examples
10 runnable scripts covering all features — see docs/examples.md for the full list with previews.
cd examples && python generate_all.py
Automatic pK selection
from pydoubletfinder import param_sweep_and_summarize
sweep_df = param_sweep_and_summarize(adata, PCs=10)
best_pK = float(sweep_df.loc[sweep_df["BCreal"].idxmax(), "pK"])
Note: the parameter sweep is computationally expensive. For most datasets, a fixed pK=0.09 is a reasonable starting point.
Benchmark vs R
Tested on snRNA-seq mouse EAM data (sample42, D0, 4926 cells) using identical doublet pairs (same random seed exported from R):
| Metric | Value |
|---|---|
| Classification agreement | 94.32% |
| pANN Pearson r | 0.8236 |
| pANN Spearman r | 0.8477 |
| HVG overlap (VST) | 1990 / 2000 (99.5%) |
| Cohen's κ | 0.5899 |
Where the ~6% discrepancy comes from
| Source | Impact | Details |
|---|---|---|
| PCA solver | ~5.5% | R uses irlba (Seurat), Python uses ARPACK (scanpy.tl.pca) |
| HVG selection (VST) | ~0.5% | 10 different genes out of 2000 — negligible |
The ~6% discrepancy is a fundamental property of the port — R's irlba and Python's SVD solvers use different numerical paths. All 280 cells classified differently (140 in each direction of the confusion matrix) have pANN values within ~0.01 of the decision threshold. No solver swap can reliably fix this without reimplementing irlba line-for-line in Python.
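The agreement and κ figures can be related to each other with a short worked example. The sketch below rebuilds a 2x2 confusion matrix from the numbers quoted above (4926 cells, 140 discordant cells in each direction). One value is an assumption, not stated in the benchmark: that each implementation labels int(0.075 × 4926) = 369 cells as doublets, following the 7.5% rate used in Quick Start.

```python
# Rebuild the 2x2 confusion matrix from the figures quoted above.
n_cells = 4926
disagree_each_way = 140          # 280 discordant cells, split evenly
n_exp = int(0.075 * n_cells)     # ASSUMPTION: 369 doublets called per tool

both_doublet = n_exp - disagree_each_way
both_singlet = n_cells - both_doublet - 2 * disagree_each_way

# Observed agreement: fraction of cells given the same label by both tools.
observed = (both_doublet + both_singlet) / n_cells

# Chance agreement: both tools call the same fraction q of cells doublets.
q = n_exp / n_cells
expected = q * q + (1 - q) * (1 - q)

# Cohen's kappa rescales observed agreement by what chance alone would give.
kappa = (observed - expected) / (1 - expected)

print(f"agreement = {observed:.2%}")
print(f"expected  = {expected:.2%}")
print(f"kappa     = {kappa:.4f}")
```

Under that assumption the sketch gives values consistent with the table (94.32% agreement, κ ≈ 0.59). It also shows why κ is only moderate despite high raw agreement: with so few doublets, two tools labelling cells at random at the same rate would already agree on roughly 86% of cells.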
API
doublet_finder(adata, PCs, pK, nExp, pN=0.25, ...)
Core doublet prediction function. Adds two columns to adata.obs:
- pANN_{pN}_{pK}_{nExp} - doublet score (proportion of artificial nearest neighbours)
- DF.classifications_{pN}_{pK}_{nExp} - "Singlet" or "Doublet"
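How the two columns relate can be shown with a toy version of pANN scoring. This is a minimal pure-Python sketch, not the library's code: pann_scores and classify are hypothetical names, the neighbourhood size k = round(n_total × pK) is an assumption based on the R implementation, and real data would use PC embeddings rather than hand-written 2-D points.

```python
import math


def pann_scores(real, artificial, pK):
    """pANN for each real cell from a full Euclidean distance ranking.

    real, artificial: lists of coordinate tuples (e.g. PC embeddings).
    ASSUMPTION: k = round(n_total * pK), where n_total counts real and
    artificial cells together.
    """
    cells = list(real) + list(artificial)
    n_real = len(real)
    k = max(1, round(len(cells) * pK))
    scores = []
    for i in range(n_real):
        # Rank every other cell by distance to real cell i (self excluded).
        order = sorted(
            (j for j in range(len(cells)) if j != i),
            key=lambda j: math.dist(cells[i], cells[j]),
        )
        neighbours = order[:k]
        # Artificial doublets occupy the indices after the real cells.
        n_artificial = sum(1 for j in neighbours if j >= n_real)
        scores.append(n_artificial / k)
    return scores


def classify(scores, nExp):
    """Label the nExp highest-scoring cells as doublets."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:nExp]
    labels = ["Singlet"] * len(scores)
    for i in top:
        labels[i] = "Doublet"
    return labels


# Three ordinary cells plus one real cell sitting among the artificial doublets.
real = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (2.5, 2.6)]
artificial = [(2.5, 2.5), (2.4, 2.5)]

scores = pann_scores(real, artificial, pK=0.34)
print(scores)
print(classify(scores, nExp=1))
```

The last real cell sits next to both artificial doublets, so all of its k nearest neighbours are artificial and it receives the highest score.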
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| adata | AnnData | required | Input object. Raw counts in adata.layers["counts"], adata.raw.X, or adata.X. |
| PCs | int or list[int] | required | Number of PCs or list of 1-based PC indices. |
| pK | float | required | Neighbourhood proportion for pANN computation. |
| nExp | int | required | Expected number of doublets (classification threshold). |
| pN | float | 0.25 | Proportion of artificial doublets to generate. |
| reuse_pANN | str or None | None | Existing adata.obs column with precomputed pANN; skips the heavy computation. |
| sct | bool | False | Use SCTransform-like normalisation (experimental). |
| annotations | array or None | None | Cell-type labels. Adds DF.doublet.contributors_* columns. |
| scale_factor | float | 1e4 | Target sum for normalisation. |
| n_top_genes | int | 2000 | Number of HVGs for VST. |
| loess_span | float | 0.3 | Span for loess in VST. |
| scale_max | float | 10 | Clip value for ScaleData. |
| random_state | int | 0 | PCA seed. |
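As a worked example of the pN parameter, the sketch below computes how many artificial doublets are needed so that they make up a fraction pN of the merged real-plus-artificial data. The formula is an assumption taken from the R implementation, and n_artificial_doublets is a hypothetical helper rather than part of the package API.

```python
def n_artificial_doublets(n_real: int, pN: float) -> int:
    """Artificial doublets needed so they form a fraction pN of the merged data.

    ASSUMPTION: n_art = round(n_real / (1 - pN) - n_real).
    """
    if not 0 < pN < 1:
        raise ValueError("pN must be strictly between 0 and 1")
    return round(n_real / (1 - pN) - n_real)


n_real = 4926                      # cell count from the benchmark sample
n_art = n_artificial_doublets(n_real, pN=0.25)
fraction = n_art / (n_real + n_art)
print(n_art, fraction)
```

With the default pN of 0.25 and 4926 real cells, this gives 1642 artificial doublets out of 6568 cells in total.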
model_homotypic(annotations)
Estimates the proportion of homotypic doublets from cell type annotations. Returns sum(p_i^2) where p_i is the proportion of cell type i. Replicates R's modelHomotypic.
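The sum of squared proportions can be illustrated in a few lines. The sketch below is a standalone illustration of that formula, not the package's own code. homotypic_proportion is a hypothetical name and the cell-type labels are made up.

```python
from collections import Counter


def homotypic_proportion(annotations):
    """Sum of squared cell-type frequencies, sum(p_i^2)."""
    counts = Counter(annotations)
    n = len(annotations)
    return sum((c / n) ** 2 for c in counts.values())


# Made-up annotations: 50% T cells, 30% B cells, 20% monocytes.
labels = ["T"] * 50 + ["B"] * 30 + ["Mono"] * 20
homo_prop = homotypic_proportion(labels)        # 0.25 + 0.09 + 0.04

n_exp = int(0.075 * 4926)                       # expected doublets at a 7.5% rate
n_exp_adj = int(n_exp * (1 - homo_prop))        # keep the heterotypic share only
print(round(homo_prop, 2), n_exp, n_exp_adj)
```

The more a dataset is dominated by one cell type, the closer sum(p_i^2) gets to 1 and the more the expected doublet count is reduced.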
param_sweep_and_summarize(adata, PCs, ...)
Runs a pN–pK parameter sweep and returns a DataFrame with columns pN, pK, BCreal (bimodality coefficient). Select the pK that maximises BCreal.
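For intuition about what BCreal measures, the sketch below computes the standard sample bimodality coefficient of a list of values. It is a plain-Python illustration under stated assumptions: bimodality_coefficient is a hypothetical helper, it uses bias-corrected skewness and excess kurtosis, and the exact estimator and input that param_sweep_and_summarize uses may differ.

```python
def bimodality_coefficient(x):
    """Sample bimodality coefficient: (g^2 + 1) / (k + 3(n-1)^2 / ((n-2)(n-3))).

    g is the bias-corrected sample skewness, k the bias-corrected sample
    excess kurtosis. ASSUMPTION: not necessarily the library's estimator.
    """
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    if m2 == 0:
        raise ValueError("values must not all be identical")
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    g1 = m3 / m2 ** 1.5                      # population skewness
    g2 = m4 / m2 ** 2 - 3                    # population excess kurtosis
    skew = g1 * (n * (n - 1)) ** 0.5 / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Two well-separated modes versus a single bell-shaped peak.
bimodal = [0.0] * 50 + [1.0] * 50
bell = [v for v, w in enumerate([1, 6, 15, 20, 15, 6, 1]) for _ in range(w)]

print(bimodality_coefficient(bimodal) > bimodality_coefficient(bell))
```

Values above 5/9, the coefficient of a uniform distribution, are conventionally read as a sign of bimodality, which is why a higher BCreal suggests a pK that separates singlets from doublets more cleanly.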
Differences from R DoubletFinder
| Aspect | R | Python |
|---|---|---|
| Normalisation | NormalizeData (Seurat) | sc.pp.normalize_total + sc.pp.log1p |
| HVG selection | FindVariableFeatures(method="vst") | Native reimplementation (_seurat_vst) |
| Scaling | ScaleData (Seurat) | sc.pp.scale |
| PCA | irlba via RunPCA | ARPACK via sc.tl.pca |
| Distance matrix | fields::rdist | scipy.spatial.distance.cdist |
| Loess (VST) | stats::loess (degree=2) | skmisc.loess (degree=2) or statsmodels.lowess fallback |
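The normalisation row pairs two routes to the same per-cell transform: scale each cell's counts to a common total, then apply log1p. The sketch below spells that transform out for a single cell in plain Python. log_normalize is a hypothetical helper for illustration, and the default of 1e4 is taken from the scale_factor parameter documented above.

```python
import math


def log_normalize(counts, scale_factor=1e4):
    """Per-cell LogNormalize: log1p(count / total_counts * scale_factor)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


cell = [0, 3, 1, 6]          # made-up raw UMI counts for four genes in one cell
print(log_normalize(cell))
```

Zero counts stay at exactly zero after the transform, which is what keeps the normalised matrix sparse.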
Benchmarks
To reproduce the benchmark comparing this implementation against R DoubletFinder:
SAMPLE_H5=/path/to/sample.h5 bash benchmarks/benchmark.sh
Requires Docker. On first run, builds an image with R 4.4 + Seurat + Python (~10 min). Subsequent runs reuse the cached image.
Results are written to benchmarks/results/:
- comparison_report.txt - full metrics summary
- plots/pann_scatter.png - pANN correlation scatter
- plots/pann_hist.png - pANN distribution overlay
- plots/confusion.png - classification confusion matrix
- plots/hvg_overlap.png - HVG overlap bar chart
Citation
If you use pyDoubletFinder in a publication, please cite both this package and the original DoubletFinder paper:
APA:
dam2452. (2026). pyDoubletFinder: Python port of the R DoubletFinder algorithm (Version 1.0.0). https://github.com/dam2452/pydoubletfinder
McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. (2019). DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems, 8, 329–337.e4. https://doi.org/10.1016/j.cels.2019.03.003
BibTeX:
@software{pydoubletfinder2026,
title = {pyDoubletFinder: Python port of the R DoubletFinder algorithm},
author = {dam2452},
year = {2026},
version = {1.0.0},
url = {https://github.com/dam2452/pydoubletfinder}
}
@article{mcginnis2019doubletfinder,
title = {{DoubletFinder}: Doublet Detection in Single-Cell {RNA} Sequencing Data Using Artificial Nearest Neighbors},
author = {McGinnis, Christopher S. and Murrow, Lydia M. and Gartner, Zev J.},
journal = {Cell Systems},
volume = {8},
number = {4},
pages = {329--337.e4},
year = {2019},
doi = {10.1016/j.cels.2019.03.003}
}
Contributing
Contributions are welcome! Here's how you can help:
- Bug reports - Open an issue with a minimal reproducible example
- Feature requests - Open an issue describing the use case
- Code contributions - Fork, create a feature branch, and open a pull request
Development setup
git clone https://github.com/dam2452/pydoubletfinder.git
cd pydoubletfinder
pip install -e ".[dev]"
pytest tests/
License
This project is licensed under the MIT License - see LICENSE for full details.
Reference
McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems 8, 329–337.e4 (2019). https://doi.org/10.1016/j.cels.2019.03.003