Python port of R DoubletFinder for scRNA-seq doublet detection

pyDoubletFinder

Faithful Python port of the R DoubletFinder algorithm for scRNA-seq doublet detection

Python 3.10+

pyDoubletFinder is a line-by-line Python port of the R DoubletFinder algorithm, designed as a drop-in replacement for projects that use scanpy / AnnData and want to avoid an R environment. It replicates the Seurat preprocessing pipeline step by step: LogNormalize, VST, ScaleData, PCA, the full Euclidean distance matrix, and pANN scoring.

Features

Line-by-line port - replicates the exact R DoubletFinder algorithm
Native VST - reimplementation of Seurat v3's FindVariableFeatures(method="vst") on raw counts
R-matching loess - uses scikit-misc (degree=2) to match R's stats::loess exactly
Full preprocessing pipeline - LogNormalize, VST, ScaleData, PCA, distance matrix, pANN
94.3% classification agreement with R on matched data (4926 cells)
99.5% HVG overlap confirms faithful VST reproduction
Parameter sweep - param_sweep_and_summarize() for automatic pK selection via bimodality coefficient
SCTransform approximation - experimental support via Pearson residuals
scanpy / AnnData native - no R dependencies required

Installation

pip install doubletfinder-py

For exact R-matching loess (recommended):

pip install "doubletfinder-py[loess]"

This installs scikit-misc, which provides skmisc.loess, a degree-2 loess matching R's stats::loess. Without it, the library falls back to the lowess smoother from statsmodels (statsmodels.nonparametric.smoothers_lowess.lowess; degree-1, local linear), which is a close but not identical approximation.
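To see which loess backend would be available in a given environment, a small check along these lines can help. This is only a sketch: `loess_backend` is a hypothetical helper written for illustration, not part of the package, and the library's own fallback logic may differ.

```python
from importlib.util import find_spec


def loess_backend() -> str:
    """Report which loess implementation is importable here (hypothetical helper)."""
    # scikit-misc ships the `skmisc` package, which provides the degree-2 loess.
    if find_spec("skmisc") is not None:
        return "skmisc.loess (degree 2)"
    # statsmodels provides the degree-1 lowess used as the fallback.
    if find_spec("statsmodels") is not None:
        return "statsmodels lowess (degree 1)"
    return "none installed"


print(loess_backend())
```

If the result is not the degree-2 backend and close agreement with R matters, installing the `[loess]` extra is the simplest fix.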

Quick Start

import scanpy as sc
from pydoubletfinder import doublet_finder, model_homotypic

adata = sc.read_10x_h5("sample.h5")
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()

pK   = 0.09
nExp = int(0.075 * adata.n_obs)

# Optional: adjust for homotypic doublets.
# Assumes cell-type labels are already stored in adata.obs["cell_type"].
homo_prop = model_homotypic(adata.obs["cell_type"].values)
nExp = int(nExp * (1 - homo_prop))

adata = doublet_finder(adata, PCs=10, pK=pK, nExp=nExp)

col_class = f"DF.classifications_0.25_{pK}_{nExp}"
print(adata.obs[col_class].value_counts())

For pK tuning, annotations, reuse and sparse data see docs/usage.md.

Gallery

Figures: pANN Distribution, PC Selection, Multi-Sample Batch.

Examples

10 runnable scripts covering all features — see docs/examples.md for the full list with previews.

cd examples && python generate_all.py

Automatic pK selection

from pydoubletfinder import param_sweep_and_summarize

sweep_df = param_sweep_and_summarize(adata, PCs=10)
best_pK  = float(sweep_df.loc[sweep_df["BCreal"].idxmax(), "pK"])

Note: the parameter sweep is computationally expensive. For most datasets, a fixed pK=0.09 is a reasonable starting point.

Benchmark vs R

Tested on snRNA-seq mouse EAM data (sample42, D0, 4926 cells). Both implementations used identical artificial doublet pairs, exported from the seeded R run:

Metric	Value
Classification agreement	94.32%
pANN Pearson r	0.8236
pANN Spearman r	0.8477
HVG overlap (VST)	1990 / 2000 (99.5%)
Cohen's κ	0.5899

Where the ~6% discrepancy comes from

Source	Impact	Details
PCA solver	~5.5%	R uses `irlba` (Seurat), Python uses ARPACK (`scanpy.tl.pca`)
HVG selection (VST)	~0.5%	10 different genes out of 2000 — negligible

The ~6% discrepancy (280 of 4926 cells) is inherent to the port: R's irlba and Python's ARPACK-based PCA follow different numerical paths, so the principal components differ slightly. All 280 discordant cells (140 in each off-diagonal cell of the confusion matrix) have pANN values within ~0.01 of the decision threshold. No solver swap can reliably fix this without reimplementing irlba line-for-line in Python.
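The agreement and Cohen's κ figures can be reproduced from the numbers above with plain arithmetic. The sketch below makes one assumption that the benchmark text does not state: each implementation called `int(0.075 * 4926) = 369` doublets, the rate used in Quick Start. With that assumption the reported values fall out exactly.

```python
n_cells = 4926   # cells in the benchmark sample
only_r = 140     # "Doublet" in R, "Singlet" in Python
only_py = 140    # "Doublet" in Python, "Singlet" in R

# ASSUMPTION (not stated in the benchmark): 7.5% expected doublets per run.
n_exp = int(0.075 * n_cells)

both_doublet = n_exp - only_r
both_singlet = n_cells - both_doublet - only_r - only_py

# Observed agreement: fraction of cells with the same label in both runs.
observed = (both_doublet + both_singlet) / n_cells

# Chance agreement from the marginal doublet rates of each run.
p_r = (both_doublet + only_r) / n_cells
p_py = (both_doublet + only_py) / n_cells
expected = p_r * p_py + (1 - p_r) * (1 - p_py)

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2%}, kappa = {kappa:.4f}")
# agreement = 94.32%, kappa = 0.5899
```

κ is much lower than the raw agreement because doublets are rare: two runs that each label about 7.5% of cells as doublets would already agree on roughly 86% of cells by chance.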

API

`doublet_finder(adata, PCs, pK, nExp, pN=0.25, ...)`

Core doublet prediction function. Adds two columns to adata.obs:

pANN_{pN}_{pK}_{nExp} — doublet score (proportion of artificial nearest neighbours)
DF.classifications_{pN}_{pK}_{nExp} — "Singlet" or "Doublet"

Parameters:

Parameter	Type	Default	Description
`adata`	`AnnData`	—	Input object. Raw counts in `adata.layers["counts"]`, `adata.raw.X`, or `adata.X`.
`PCs`	`int` or `list[int]`	—	Number of PCs or list of 1-based PC indices.
`pK`	`float`	—	Neighbourhood proportion for pANN computation.
`nExp`	`int`	—	Expected number of doublets (classification threshold).
`pN`	`float`	`0.25`	Proportion of artificial doublets to generate.
`reuse_pANN`	`str` or `None`	`None`	Existing `adata.obs` column with precomputed pANN — skips heavy computation.
`sct`	`bool`	`False`	Use SCTransform-like normalisation (experimental).
`annotations`	`array` or `None`	`None`	Cell-type labels. Adds `DF.doublet.contributors_*` columns.
`scale_factor`	`float`	`1e4`	Target sum for normalisation.
`n_top_genes`	`int`	`2000`	Number of HVGs for VST.
`loess_span`	`float`	`0.3`	Span for loess in VST.
`scale_max`	`float`	`10`	Clip value for ScaleData.
`random_state`	`int`	`0`	PCA seed.
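The roles of pN, pK, and nExp are easiest to see on a toy example. The sketch below follows the published DoubletFinder procedure in two dimensions instead of PC space; it is an illustration under stated assumptions, not this package's code. The helper names are hypothetical, and the doublet pairs are hand-picked rather than randomly sampled.

```python
import math


def average_pair(a, b):
    """Artificial doublet: the mean profile of two real cells."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def pann_scores(real, artificial, pK):
    """Proportion of artificial cells among each real cell's k nearest neighbours."""
    merged = real + artificial
    n_real = len(real)
    # k scales with the size of the merged (real + artificial) data set.
    k = max(1, round(len(merged) * pK))
    scores = []
    for i, cell in enumerate(real):
        by_distance = sorted(
            (math.dist(cell, other), j)
            for j, other in enumerate(merged)
            if j != i  # a cell is not its own neighbour
        )
        nearest = [j for _, j in by_distance[:k]]
        # Indices >= n_real belong to artificial doublets.
        scores.append(sum(j >= n_real for j in nearest) / k)
    return scores


def classify(scores, n_exp):
    """Label the n_exp highest-scoring cells as doublets."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = ["Singlet"] * len(scores)
    for i in ranked[:n_exp]:
        labels[i] = "Doublet"
    return labels


# Two clean clusters plus one cell sitting between them.
real = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (5.0, 5.5)]

pN = 0.25
n_doublets = round(len(real) / (1 - pN) - len(real))  # artificial cells to add
pairs = [(0, 4), (1, 5), (2, 6)]                      # hand-picked for the demo
artificial = [average_pair(real[a], real[b]) for a, b in pairs]

scores = pann_scores(real, artificial, pK=0.25)
labels = classify(scores, n_exp=1)
print(list(zip(scores, labels)))
```

Only the in-between cell has artificial doublets as neighbours, so it receives the highest pANN and is the single cell labelled "Doublet".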

`model_homotypic(annotations)`

Estimates the proportion of homotypic doublets from cell type annotations. Returns sum(p_i^2) where p_i is the proportion of cell type i. Replicates R's modelHomotypic.
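As a worked illustration of the sum(p_i^2) formula, the sketch below computes the homotypic proportion for a made-up annotation vector and applies it to nExp as in Quick Start. `homotypic_proportion` is a stand-in written for this example, and the labels and cell counts are invented.

```python
from collections import Counter


def homotypic_proportion(annotations):
    """Sum of squared cell-type proportions (stand-in for model_homotypic)."""
    counts = Counter(annotations)
    n = len(annotations)
    return sum((c / n) ** 2 for c in counts.values())


# Invented annotations: 50% / 30% / 20% split across three cell types.
labels = ["T cell"] * 50 + ["B cell"] * 30 + ["Myeloid"] * 20

homo_prop = homotypic_proportion(labels)      # 0.25 + 0.09 + 0.04
n_exp = int(0.075 * 4926)                     # unadjusted expectation
n_exp_adj = int(n_exp * (1 - homo_prop))      # heterotypic doublets only

print(round(homo_prop, 2), n_exp, n_exp_adj)
```

The more a sample is dominated by one cell type, the larger the homotypic proportion and the fewer doublets the neighbour-based score can be expected to detect.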

`param_sweep_and_summarize(adata, PCs, ...)`

Runs a pN–pK parameter sweep and returns a DataFrame with columns pN, pK, BCreal (bimodality coefficient). Select the pK that maximises BCreal.
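For intuition about what BCreal measures, here is a sketch of the textbook sample bimodality coefficient, (skewness² + 1) divided by (excess kurtosis + 3(n-1)² / ((n-2)(n-3))). It is an assumption that the package follows this form; exactly which values it feeds into the coefficient is not shown here. The two toy distributions and their pK labels are invented.

```python
import math


def bimodality_coefficient(values):
    """Sample bimodality coefficient; values above ~0.555 suggest bimodality."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g1 = m3 / m2 ** 1.5            # population skewness
    g2 = m4 / m2 ** 2 - 3          # population excess kurtosis
    # Small-sample corrections for skewness and excess kurtosis.
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Invented score distributions for two candidate pK values.
unimodal = [k / 10 for k in range(11) for _ in range(math.comb(10, k))]
bimodal = [0.05] * 50 + [0.95] * 50
sweep = {0.01: unimodal, 0.09: bimodal}

bc = {pk: bimodality_coefficient(vals) for pk, vals in sweep.items()}
best_pk = max(bc, key=bc.get)
print(bc, best_pk)
```

A pK whose scores separate into two clear modes (singlet-like and doublet-like) gets the higher coefficient, which is why maximising it is used as the selection rule.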

Differences from R DoubletFinder

Aspect	R	Python
Normalisation	`NormalizeData` (Seurat)	`sc.pp.normalize_total` + `sc.pp.log1p`
HVG selection	`FindVariableFeatures(method="vst")`	Native reimplementation (`_seurat_vst`)
Scaling	`ScaleData` (Seurat)	`sc.pp.scale`
PCA	`irlba` via `RunPCA`	ARPACK via `sc.tl.pca`
Distance matrix	`fields::rdist`	`scipy.spatial.distance.cdist`
Loess (VST)	`stats::loess` (degree=2)	`skmisc.loess` (degree=2) or `statsmodels.lowess` fallback
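The first and third rows of the table describe simple per-cell and per-gene transforms. The sketch below spells them out on plain lists as an illustration only; the package itself calls the scanpy functions listed above. Clipping only the upper tail and zeroing constant genes are assumptions made here.

```python
import math
import statistics


def log_normalize(cell_counts, scale_factor=1e4):
    """LogNormalize for one cell: scale counts to scale_factor, then log1p."""
    total = sum(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def scale_gene(values, scale_max=10.0):
    """Z-score one gene across cells and clip large values at scale_max."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)       # sample standard deviation (n - 1)
    if sd == 0:
        return [0.0] * len(values)      # ASSUMPTION: constant genes become 0
    # ASSUMPTION: only the upper tail is clipped in this sketch.
    return [min((v - mean) / sd, scale_max) for v in values]


cell = [1, 1, 2]                        # raw counts for three genes
print(log_normalize(cell))              # log1p of 2500, 2500, 5000

gene = [1.0, 2.0, 3.0]                  # one gene across three cells
print(scale_gene(gene))                 # [-1.0, 0.0, 1.0]
```

These two steps are deterministic, which is consistent with the benchmark attributing most of the R/Python discrepancy to the PCA solver rather than to normalisation or scaling.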

Benchmarks

To reproduce the benchmark comparing this implementation against R DoubletFinder:

SAMPLE_H5=/path/to/sample.h5 bash benchmarks/benchmark.sh

Requires Docker. On first run, builds an image with R 4.4 + Seurat + Python (~10 min). Subsequent runs reuse the cached image.

Results are written to benchmarks/results/:

comparison_report.txt — full metrics summary
plots/pann_scatter.png — pANN correlation scatter
plots/pann_hist.png — pANN distribution overlay
plots/confusion.png — classification confusion matrix
plots/hvg_overlap.png — HVG overlap bar chart

Citation

If you use pyDoubletFinder in a publication, please cite both this package and the original DoubletFinder paper:

APA:

dam2452. (2026). pyDoubletFinder: Python port of the R DoubletFinder algorithm (Version 1.0.0). https://github.com/dam2452/pydoubletfinder

McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. (2019). DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems, 8, 329–337.e4. https://doi.org/10.1016/j.cels.2019.03.003

BibTeX:

@software{pydoubletfinder2026,
  title   = {pyDoubletFinder: Python port of the R DoubletFinder algorithm},
  author  = {dam2452},
  year    = {2026},
  version = {1.0.0},
  url     = {https://github.com/dam2452/pydoubletfinder}
}

@article{mcginnis2019doubletfinder,
  title     = {{DoubletFinder}: Doublet Detection in Single-Cell {RNA} Sequencing Data Using Artificial Nearest Neighbors},
  author    = {McGinnis, Christopher S. and Murrow, Lydia M. and Gartner, Zev J.},
  journal   = {Cell Systems},
  volume    = {8},
  number    = {4},
  pages     = {329--337.e4},
  year      = {2019},
  doi       = {10.1016/j.cels.2019.03.003}
}

Contributing

Contributions are welcome! Here's how you can help:

Bug reports - Open an issue with a minimal reproducible example
Feature requests - Open an issue describing the use case
Code contributions - Fork, create a feature branch, and open a pull request

Development setup

git clone https://github.com/dam2452/pydoubletfinder.git
cd pydoubletfinder
pip install -e ".[dev]"
pytest tests/

License

This project is licensed under the MIT License - see LICENSE for full details.

Reference

McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems 8, 329–337.e4 (2019). https://doi.org/10.1016/j.cels.2019.03.003

Project details

Version 1.1.0, uploaded Apr 30, 2026.

