Python implementation of SoupX for removing ambient RNA contamination from droplet-based single-cell RNA sequencing data

These details have not been verified by PyPI

Project links

Project description

SoupX Python

A Python implementation of SoupX for removing ambient RNA contamination from droplet-based single-cell RNA sequencing data.

Overview

Droplet-based single-cell RNA sequencing (scRNA-seq) experiments contain ambient RNA contamination from cell-free mRNAs present in the input solution. This "soup" of background contamination can significantly confound biological interpretation, particularly in complex tissues where contamination rates can exceed 20%.

SoupX addresses this by:

Estimating the ambient RNA expression profile from empty droplets
Quantifying contamination fraction in each cell using marker genes
Correcting cell expression profiles by removing estimated background

This Python implementation maintains full compatibility with the original R package interface while integrating seamlessly with the Python scRNA-seq ecosystem (scanpy, anndata).

Background & Citation

This implementation is based on the method described in:

Young, M.D., Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience 9, giaa151 (2020). https://doi.org/10.1093/gigascience/giaa151

Please cite the original paper if you use this implementation in your research.

Installation

From PyPI (Recommended)

pip install soupx-python

From Source

git clone https://github.com/yourusername/soupx-python.git
cd soupx-python
pip install -e .

Dependencies

Python ≥3.8
numpy ≥1.19.0
pandas ≥1.2.0
scipy ≥1.6.0
statsmodels ≥0.12.0
scanpy ≥1.7.0 (optional, for integration examples)

Quick Start

Basic Usage (R-compatible interface)

import soupx

# Load 10X data (cellranger output directory)
sc = soupx.load10X("path/to/cellranger/outs/")

# Automatically estimate contamination
sc = soupx.autoEstCont(sc)

# Generate corrected count matrix
corrected_counts = soupx.adjustCounts(sc)

Integration with scanpy

import scanpy as sc
import soupx
import pandas as pd

# Load raw 10X data with both filtered and raw counts
adata_raw = sc.read_10x_mtx("path/to/raw_feature_bc_matrix/", cache=True)
adata_filtered = sc.read_10x_mtx("path/to/filtered_feature_bc_matrix/", cache=True)

# Create SoupChannel
soup_channel = soupx.SoupChannel(
    tod=adata_raw.X.T.tocsr(),    # raw counts (genes × droplets)
    toc=adata_filtered.X.T.tocsr(), # filtered counts (genes × cells)
    metaData=pd.DataFrame(index=adata_filtered.obs_names)
)

# Add clustering information (essential for good results)
sc.tl.leiden(adata_filtered, resolution=0.5)
soup_channel.setClusters(adata_filtered.obs['leiden'].values)

# Estimate and remove contamination
soup_channel = soupx.autoEstCont(soup_channel, verbose=True)
corrected_matrix = soupx.adjustCounts(soup_channel)

# Replace counts in AnnData object
adata_corrected = adata_filtered.copy()
adata_corrected.X = corrected_matrix.T  # Convert back to cells × genes

# Continue with standard scanpy workflow
sc.pp.highly_variable_genes(adata_corrected)
sc.tl.pca(adata_corrected)
# ... further analysis

Advanced Usage

Manual Contamination Estimation

For experiments where automatic estimation fails or when you have prior biological knowledge:

# Manually specify contamination fraction
soup_channel.set_contamination_fraction(0.10)  # 10% contamination

# Or use specific marker genes (e.g., hemoglobin genes for tissue samples)
hemoglobin_genes = ['HBA1', 'HBA2', 'HBB', 'HBD', 'HBG1', 'HBG2']
non_expressing = soupx.estimateNonExpressingCells(
    soup_channel, 
    hemoglobin_genes,
    clusters=soup_channel.metaData['clusters'].values
)

# Calculate contamination using marker genes
soup_channel = soupx.calculateContaminationFraction(
    soup_channel, 
    {'HB': hemoglobin_genes}, 
    non_expressing
)

Method Selection

# Different correction methods available:

# 1. Subtraction (default, fastest)
corrected = soupx.adjustCounts(soup_channel, method="subtraction")

# 2. Multinomial (most accurate, slower)
corrected = soupx.adjustCounts(soup_channel, method="multinomial")

# 3. SoupOnly (removes only confidently contaminated genes)
corrected = soupx.adjustCounts(soup_channel, method="soupOnly")

# Round to integers (some downstream tools require this)
corrected = soupx.adjustCounts(soup_channel, roundToInt=True)

API Reference

Core Classes

`SoupChannel`

Main container for scRNA-seq data and contamination analysis.

Parameters:

tod: Raw count matrix (genes × droplets, sparse)
toc: Filtered count matrix (genes × cells, sparse)
metaData: Cell metadata DataFrame
calcSoupProfile: Whether to estimate soup profile automatically (default: True)

Key Methods

`autoEstCont(sc, **kwargs)`

Automatically estimate contamination fraction using marker genes.

Parameters:

tfidfMin: Minimum tf-idf for marker genes (default: 1.0)
soupQuantile: Quantile threshold for soup genes (default: 0.9)
verbose: Print progress information (default: True)

`adjustCounts(sc, **kwargs)`

Remove contamination and return corrected count matrix.

Parameters:

method: Correction method ("subtraction", "multinomial", "soupOnly")
roundToInt: Round results to integers (default: False)
clusters: Cluster assignments (improves accuracy)

Utility Functions

`load10X(dataDir)`

Load 10X CellRanger output directory.

`quickMarkers(toc, clusters, N=10)`

Identify cluster marker genes using tf-idf.

Validation & Benchmarking

This implementation has been validated against the original R version using:

Species-mixing experiments: Cross-species contamination quantification
PBMC datasets: Standard benchmark with known marker genes
Complex tissue samples: Kidney tumor and fetal liver data

Key validation results:

Contamination estimates: R² > 0.95 correlation with R implementation
Correction accuracy: >90% reduction in cross-species contamination
Marker gene specificity: Consistent improvement in fold-change ratios

Performance Considerations

Memory usage: Sparse matrices used throughout to minimize memory footprint
Clustering improves results: Always provide cluster information when possible
Method selection: Use "subtraction" for speed, "multinomial" for accuracy
Large datasets: Consider using method="soupOnly" for >100k cells

Troubleshooting

Common Issues

Low marker gene detection:

# Reduce stringency for marker detection
sc = soupx.autoEstCont(sc, tfidfMin=0.5, soupQuantile=0.8)

High contamination estimates (>50%):

# Force acceptance of high contamination or manually set
sc.set_contamination_fraction(0.20, forceAccept=True)

No clustering information:

# SoupX works without clustering but results are less accurate
corrected = soupx.adjustCounts(sc, clusters=False)

Comparison with Other Methods

Method	Speed	Accuracy	Requires Empty Droplets	Requires Clustering
SoupX	Fast	High	Yes	Recommended
CellBender	Slow	High	No	No
DecontX	Medium	Medium	No	Yes

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/yourusername/soupx-python.git
cd soupx-python
pip install -e ".[dev]"
pytest tests/

License

This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.

Changelog

v0.3.0 (Current)

Full R compatibility
Automated contamination estimation
Integration with scanpy ecosystem
Comprehensive validation suite

v0.2.0

Core correction algorithms
Manual contamination setting
Basic 10X data loading

v0.1.0

Initial implementation
Basic SoupChannel functionality

Support

Issues: GitHub Issues
Questions: GitHub Discussions
Citation: Please cite the original SoupX paper (Young & Behjati, 2020)

Acknowledgments

Original SoupX developers: Matthew D. Young and Sam Behjati
R package maintainers and contributors
Python single-cell community (scanpy, anndata developers)

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.0

Sep 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soupx_python-0.3.0.tar.gz (22.7 kB view details)

Uploaded Sep 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

soupx_python-0.3.0-py3-none-any.whl (21.1 kB view details)

Uploaded Sep 3, 2025 Python 3

File details

Details for the file soupx_python-0.3.0.tar.gz.

File metadata

Download URL: soupx_python-0.3.0.tar.gz
Upload date: Sep 3, 2025
Size: 22.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for soupx_python-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`585f263e5b5d53591c00dc1034563f68d1c82d3c47b683b059169f376b56c6e0`
MD5	`a45d22e31e384fabac85e4d2ee06f7e8`
BLAKE2b-256	`e137eae4c584d84ec542423d4b3860f535504673524ec0fd4801c89cfbcc488b`

See more details on using hashes here.

File details

Details for the file soupx_python-0.3.0-py3-none-any.whl.

File metadata

Download URL: soupx_python-0.3.0-py3-none-any.whl
Upload date: Sep 3, 2025
Size: 21.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for soupx_python-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a936abeec88f7b2605c59c1f32d6728443817faadd667dd6645be1980a65a6f2`
MD5	`a0d0909e7578b01711351ede02471d22`
BLAKE2b-256	`3863b21fee05ab88861ea2db29aa539ae516c55ff1773defd830ddf71b8ad22d`

See more details on using hashes here.

soupx-python 0.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SoupX Python

Overview

Background & Citation

Installation

From PyPI (Recommended)

From Source

Dependencies

Quick Start

Basic Usage (R-compatible interface)

Integration with scanpy

Advanced Usage

Manual Contamination Estimation

Method Selection

API Reference

Core Classes

SoupChannel

Key Methods

autoEstCont(sc, **kwargs)

adjustCounts(sc, **kwargs)

Utility Functions

load10X(dataDir)

quickMarkers(toc, clusters, N=10)

Validation & Benchmarking

Performance Considerations

Troubleshooting

Common Issues

Comparison with Other Methods

Contributing

Development Setup

License

Changelog

v0.3.0 (Current)

v0.2.0

v0.1.0

Support

Acknowledgments

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`SoupChannel`

`autoEstCont(sc, **kwargs)`

`adjustCounts(sc, **kwargs)`

`load10X(dataDir)`

`quickMarkers(toc, clusters, N=10)`