Python port of R DoubletFinder for scRNA-seq doublet detection

pyDoubletFinder

Faithful Python port of the R DoubletFinder algorithm for scRNA-seq doublet detection

PyPI License Python 3.10+


pyDoubletFinder is a line-by-line Python port of the R DoubletFinder algorithm, designed as a drop-in replacement for projects that use scanpy / AnnData and do not want to maintain an R environment. It replicates the Seurat preprocessing pipeline step by step: LogNormalize, VST, ScaleData, PCA, a full Euclidean distance matrix, and pANN scoring.

Features

  • Line-by-line port - replicates the exact R DoubletFinder algorithm
  • Native VST - reimplementation of Seurat v3's FindVariableFeatures(method="vst") on raw counts
  • R-matching loess - uses scikit-misc (degree=2) to match R's stats::loess exactly
  • Full preprocessing pipeline - LogNormalize, VST, ScaleData, PCA, distance matrix, pANN
  • 94.3% classification agreement with R on matched data (4926 cells)
  • 99.5% HVG overlap confirms faithful VST reproduction
  • Parameter sweep - param_sweep_and_summarize() for automatic pK selection via bimodality coefficient
  • SCTransform approximation - experimental support via Pearson residuals
  • scanpy / AnnData native - no R dependencies required

Installation

pip install doubletfinder-py

For exact R-matching loess (recommended):

pip install "doubletfinder-py[loess]"

This installs scikit-misc which provides skmisc.loess — a degree-2 loess matching R's stats::loess. Without it, the library falls back to statsmodels.lowess (degree-1, local linear), which is a close but not identical approximation.
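
To see which of the two loess backends an environment could offer, one can probe for the optional dependency. This is a minimal sketch, not the library's own selection code; `available_loess_backend` is a hypothetical helper introduced here for illustration.

```python
from importlib.util import find_spec


def available_loess_backend() -> str:
    """Report which loess backend is importable in this environment.

    Hypothetical helper for illustration only; pyDoubletFinder's internal
    fallback logic may differ from this check.
    """
    if find_spec("skmisc") is not None:        # provided by the [loess] extra
        return "skmisc.loess (degree 2)"
    if find_spec("statsmodels") is not None:   # documented fallback
        return "statsmodels lowess (degree 1)"
    return "none"


print(available_loess_backend())
```

If the result is not the `skmisc` backend, installing the `[loess]` extra should switch to the degree-2 fit described above.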

Quick Start

import scanpy as sc
from pydoubletfinder import doublet_finder, model_homotypic

adata = sc.read_10x_h5("sample.h5")
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()

pK   = 0.09
nExp = int(0.075 * adata.n_obs)

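# Optional: adjust for homotypic doublets.
# Assumes adata.obs["cell_type"] already holds cell-type labels
# (for example from a previous clustering / annotation step).
homo_prop = model_homotypic(adata.obs["cell_type"].values)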
nExp = int(nExp * (1 - homo_prop))

adata = doublet_finder(adata, PCs=10, pK=pK, nExp=nExp)

col_class = f"DF.classifications_0.25_{pK}_{nExp}"
print(adata.obs[col_class].value_counts())

For pK tuning, annotations, reuse and sparse data see docs/usage.md.

Gallery

[Figures: pANN Distribution, PC Selection, Multi-Sample Batch]

Examples

10 runnable scripts covering all features — see docs/examples.md for the full list with previews.

cd examples && python generate_all.py

Automatic pK selection

from pydoubletfinder import param_sweep_and_summarize

sweep_df = param_sweep_and_summarize(adata, PCs=10)
best_pK  = float(sweep_df.loc[sweep_df["BCreal"].idxmax(), "pK"])

Note: the parameter sweep is computationally expensive. For most datasets, a fixed pK=0.09 is a reasonable starting point.

Benchmark vs R

Tested on snRNA-seq mouse EAM data (sample42, D0, 4926 cells) using identical doublet pairs (same random seed exported from R):

| Metric | Value |
| --- | --- |
| Classification agreement | 94.32% |
| pANN Pearson r | 0.8236 |
| pANN Spearman r | 0.8477 |
| HVG overlap (VST) | 1990 / 2000 (99.5%) |
| Cohen's κ | 0.5899 |
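
The gap between 94% agreement and a κ of about 0.59 follows from how rare doublets are: most of the agreement is expected by chance. The sketch below computes both metrics from a 2×2 confusion matrix. The benchmark's confusion matrix is not listed in this README, so the counts used here are an assumption: they are reconstructed from 4926 cells, 140 disagreements in each direction, and an assumed 369 doublet calls per tool (`int(0.075 * 4926)`).

```python
def agreement_and_kappa(both_doublet, only_a, only_b, both_singlet):
    """Raw agreement and Cohen's kappa for two binary classifications."""
    n = both_doublet + only_a + only_b + both_singlet
    p_observed = (both_doublet + both_singlet) / n
    # Marginal doublet rates of each classifier.
    p_doublet_a = (both_doublet + only_a) / n
    p_doublet_b = (both_doublet + only_b) / n
    # Agreement expected if the two classifiers were independent.
    p_expected = (p_doublet_a * p_doublet_b
                  + (1 - p_doublet_a) * (1 - p_doublet_b))
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa


# ASSUMED counts (not taken from the benchmark report):
# 229 shared doublets, 140 + 140 disagreements, 4417 shared singlets.
agreement, kappa = agreement_and_kappa(229, 140, 140, 4417)
print(f"agreement = {agreement:.2%}, kappa = {kappa:.4f}")
```

Under these assumed counts the chance-expected agreement is already about 86%, which is why κ is much lower than the raw agreement.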

Where the ~6% discrepancy comes from

| Source | Impact | Details |
| --- | --- | --- |
| PCA solver | ~5.5% | R uses irlba (Seurat), Python uses ARPACK (scanpy.tl.pca) |
| HVG selection (VST) | ~0.5% | 10 different genes out of 2000 (negligible) |

The ~6% discrepancy is a fundamental property of the port — R's irlba and Python's SVD solvers use different numerical paths. All 280 cells classified differently (140 in each direction of the confusion matrix) have pANN values within ~0.01 of the decision threshold. No solver swap can reliably fix this without reimplementing irlba line-for-line in Python.

API

doublet_finder(adata, PCs, pK, nExp, pN=0.25, ...)

Core doublet prediction function. Adds two columns to adata.obs:

  • pANN_{pN}_{pK}_{nExp} - doublet score (proportion of artificial nearest neighbours)
  • DF.classifications_{pN}_{pK}_{nExp} - "Singlet" or "Doublet"
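
To make the two outputs concrete, here is a toy sketch of pANN scoring and thresholding in plain Python. It follows the general DoubletFinder idea (the fraction of each real cell's k nearest neighbours that are artificial doublets, with k derived from pK, then the top nExp scores called doublets). It is not the package's implementation: the function names and the 2-D points are made up for illustration, and the real pipeline works in PCA space.

```python
import math


def pann_scores(real, artificial, pK):
    """pANN for each real cell: share of its k nearest neighbours
    (self excluded) that are artificial doublets."""
    points = list(real) + list(artificial)
    n_real = len(real)
    k = max(1, round(len(points) * pK))  # neighbourhood size from pK
    scores = []
    for i in range(n_real):
        # Rank every other point by Euclidean distance to cell i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbours = others[:k]
        # Indices >= n_real belong to artificial doublets.
        scores.append(sum(j >= n_real for j in neighbours) / k)
    return scores


def classify(scores, n_exp):
    """Label the n_exp highest-scoring cells as doublets."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    doublets = set(order[:n_exp])
    return ["Doublet" if i in doublets else "Singlet"
            for i in range(len(scores))]


# Toy data: two tight clusters plus one real cell sitting between them.
real = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (2.5, 2.5)]
artificial = [(2.4, 2.5), (2.6, 2.5)]  # simulated doublets

scores = pann_scores(real, artificial, pK=0.25)
labels = classify(scores, n_exp=1)
print(scores)
print(labels)
```

The cell lying between the clusters is surrounded by artificial doublets, so it receives the highest score and is the one labelled "Doublet".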

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| adata | AnnData | (required) | Input object. Raw counts in adata.layers["counts"], adata.raw.X, or adata.X. |
| PCs | int or list[int] | (required) | Number of PCs or list of 1-based PC indices. |
| pK | float | (required) | Neighbourhood proportion for pANN computation. |
| nExp | int | (required) | Expected number of doublets (classification threshold). |
| pN | float | 0.25 | Proportion of artificial doublets to generate. |
| reuse_pANN | str or None | None | Existing adata.obs column with precomputed pANN; skips the heavy computation. |
| sct | bool | False | Use SCTransform-like normalisation (experimental). |
| annotations | array or None | None | Cell-type labels. Adds DF.doublet.contributors_* columns. |
| scale_factor | float | 1e4 | Target sum for normalisation. |
| n_top_genes | int | 2000 | Number of HVGs for VST. |
| loess_span | float | 0.3 | Span for loess in VST. |
| scale_max | float | 10 | Clip value for ScaleData. |
| random_state | int | 0 | PCA seed. |
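
As a rough illustration of what `scale_factor` and `scale_max` control, the sketch below applies a LogNormalize-style transform to one cell and a ScaleData-style z-score with clipping to one gene. It is a simplified stand-in, not the package's code: the helper names are hypothetical, and it assumes the sample standard deviation and clipping on the upper side only.

```python
import math
import statistics


def log_normalize(cell_counts, scale_factor=1e4):
    """Per-cell: divide by the cell total, multiply by scale_factor, log1p."""
    total = sum(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def scale_and_clip(gene_values, scale_max=10.0):
    """Per-gene: z-score across cells, then cap values at scale_max.

    Assumption: sample standard deviation, upper-side clipping only.
    """
    mean = statistics.fmean(gene_values)
    sd = statistics.stdev(gene_values)
    return [min((v - mean) / sd, scale_max) for v in gene_values]


normalized = log_normalize([1, 3])           # one cell, two genes
scaled = scale_and_clip([0.0] * 199 + [1.0])  # one gene, one outlier cell
print(normalized)
print(max(scaled))
```

The single outlier cell would have a z-score of about 14 without clipping, so `scale_max=10` caps it; this keeps a few extreme cells from dominating the PCA.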

model_homotypic(annotations)

Estimates the proportion of homotypic doublets from cell type annotations. Returns sum(p_i^2) where p_i is the proportion of cell type i. Replicates R's modelHomotypic.
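
The formula can be checked by hand. The sketch below is a standalone re-statement of sum(p_i^2) in plain Python, not the package function itself, followed by the nExp adjustment from the Quick Start. The label proportions are invented; the cell count of 4926 is borrowed from the benchmark purely as an example.

```python
from collections import Counter


def homotypic_proportion(annotations):
    """Sum of squared cell-type proportions: sum(p_i ** 2)."""
    counts = Counter(annotations)
    n = len(annotations)
    return sum((c / n) ** 2 for c in counts.values())


# Illustrative labels: 50% T, 30% B, 20% NK.
labels = ["T"] * 50 + ["B"] * 30 + ["NK"] * 20
homo_prop = homotypic_proportion(labels)   # 0.25 + 0.09 + 0.04

n_exp = int(0.075 * 4926)                  # expected doublets before adjustment
n_exp_adj = int(n_exp * (1 - homo_prop))   # heterotypic doublets only
print(homo_prop, n_exp, n_exp_adj)
```

The fewer and more unbalanced the cell types, the larger the homotypic proportion, and the more nExp shrinks after adjustment.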

param_sweep_and_summarize(adata, PCs, ...)

Runs a pN–pK parameter sweep and returns a DataFrame with columns pN, pK, BCreal (bimodality coefficient). Select the pK that maximises BCreal.
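
For intuition about what BCreal measures, here is a sketch of the standard sample bimodality coefficient, (skewness^2 + 1) divided by (excess kurtosis + a finite-sample term). It is applied directly to raw values; how the package prepares the pANN distribution before computing the coefficient (for example, any density smoothing) is not covered here, so treat this as an illustration of the formula only.

```python
import math
import random


def bimodality_coefficient(values):
    """Sample bimodality coefficient with finite-sample corrections."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g1 = m3 / m2 ** 1.5          # biased skewness
    g2 = m4 / m2 ** 2 - 3.0      # biased excess kurtosis
    # Bias-corrected sample skewness and excess kurtosis.
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


rng = random.Random(0)
unimodal = [rng.gauss(0, 1) for _ in range(2000)]
bimodal = ([rng.gauss(-3, 1) for _ in range(1000)]
           + [rng.gauss(3, 1) for _ in range(1000)])

bc_unimodal = bimodality_coefficient(unimodal)
bc_bimodal = bimodality_coefficient(bimodal)
print(bc_unimodal, bc_bimodal)
```

Values above roughly 5/9 (the coefficient of a uniform distribution) are conventionally read as a sign of bimodality, which is why a higher BCreal indicates a pK that separates doublets from singlets more cleanly.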

Differences from R DoubletFinder

| Aspect | R | Python |
| --- | --- | --- |
| Normalisation | NormalizeData (Seurat) | sc.pp.normalize_total + sc.pp.log1p |
| HVG selection | FindVariableFeatures(method="vst") | Native reimplementation (_seurat_vst) |
| Scaling | ScaleData (Seurat) | sc.pp.scale |
| PCA | irlba via RunPCA | ARPACK via sc.tl.pca |
| Distance matrix | fields::rdist | scipy.spatial.distance.cdist |
| Loess (VST) | stats::loess (degree=2) | skmisc.loess (degree=2) or statsmodels.lowess fallback |

Benchmarks

To reproduce the benchmark comparing this implementation against R DoubletFinder:

SAMPLE_H5=/path/to/sample.h5 bash benchmarks/benchmark.sh

Requires Docker. On first run, builds an image with R 4.4 + Seurat + Python (~10 min). Subsequent runs reuse the cached image.

Results are written to benchmarks/results/:

  • comparison_report.txt — full metrics summary
  • plots/pann_scatter.png — pANN correlation scatter
  • plots/pann_hist.png — pANN distribution overlay
  • plots/confusion.png — classification confusion matrix
  • plots/hvg_overlap.png — HVG overlap bar chart

Citation

If you use pyDoubletFinder in a publication, please cite both this package and the original DoubletFinder paper:

APA:

dam2452. (2026). pyDoubletFinder: Python port of the R DoubletFinder algorithm (Version 1.0.0). https://github.com/dam2452/pydoubletfinder

McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. (2019). DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems, 8, 329–337.e4. https://doi.org/10.1016/j.cels.2019.03.003

BibTeX:

@software{pydoubletfinder2026,
  title   = {pyDoubletFinder: Python port of the R DoubletFinder algorithm},
  author  = {dam2452},
  year    = {2026},
  version = {1.0.0},
  url     = {https://github.com/dam2452/pydoubletfinder}
}

@article{mcginnis2019doubletfinder,
  title     = {{DoubletFinder}: Doublet Detection in Single-Cell {RNA} Sequencing Data Using Artificial Nearest Neighbors},
  author    = {McGinnis, Christopher S. and Murrow, Lydia M. and Gartner, Zev J.},
  journal   = {Cell Systems},
  volume    = {8},
  number    = {4},
  pages     = {329--337.e4},
  year      = {2019},
  doi       = {10.1016/j.cels.2019.03.003}
}

Contributing

Contributions are welcome! Here's how you can help:

  1. Bug reports - Open an issue with a minimal reproducible example
  2. Feature requests - Open an issue describing the use case
  3. Code contributions - Fork, create a feature branch, and open a pull request

Development setup

git clone https://github.com/dam2452/pydoubletfinder.git
cd pydoubletfinder
pip install -e ".[dev]"
pytest tests/

License

This project is licensed under the MIT License - see LICENSE for full details.
