Python implementation of SoupX for removing ambient RNA contamination from droplet-based single-cell RNA sequencing data
Project description
SoupX Python
A Python implementation of SoupX for removing ambient RNA contamination from droplet-based single-cell RNA sequencing data.
Overview
Droplet-based single-cell RNA sequencing (scRNA-seq) experiments contain ambient RNA contamination from cell-free mRNAs present in the input solution. This "soup" of background contamination can significantly confound biological interpretation, particularly in complex tissues where contamination rates can exceed 20%.
SoupX addresses this by:
- Estimating the ambient RNA expression profile from empty droplets
- Quantifying contamination fraction in each cell using marker genes
- Correcting cell expression profiles by removing estimated background
This Python implementation maintains full compatibility with the original R package interface while integrating seamlessly with the Python scRNA-seq ecosystem (scanpy, anndata).
Background & Citation
This implementation is based on the method described in:
Young, M.D., Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience 9, giaa151 (2020). https://doi.org/10.1093/gigascience/giaa151
Please cite the original paper if you use this implementation in your research.
Installation
From PyPI (Recommended)
pip install soupx-python
From Source
git clone https://github.com/yourusername/soupx-python.git
cd soupx-python
pip install -e .
Dependencies
- Python ≥3.8
- numpy ≥1.19.0
- pandas ≥1.2.0
- scipy ≥1.6.0
- statsmodels ≥0.12.0
- scanpy ≥1.7.0 (optional, for integration examples)
Quick Start
Basic Usage (R-compatible interface)
import soupx
# Load 10X data (cellranger output directory)
sc = soupx.load10X("path/to/cellranger/outs/")
# Automatically estimate contamination
sc = soupx.autoEstCont(sc)
# Generate corrected count matrix
corrected_counts = soupx.adjustCounts(sc)
Integration with scanpy
import scanpy as sc
import soupx
import pandas as pd
# Load raw 10X data with both filtered and raw counts
adata_raw = sc.read_10x_mtx("path/to/raw_feature_bc_matrix/", cache=True)
adata_filtered = sc.read_10x_mtx("path/to/filtered_feature_bc_matrix/", cache=True)
# Create SoupChannel
soup_channel = soupx.SoupChannel(
tod=adata_raw.X.T.tocsr(), # raw counts (genes × droplets)
toc=adata_filtered.X.T.tocsr(), # filtered counts (genes × cells)
metaData=pd.DataFrame(index=adata_filtered.obs_names)
)
# Add clustering information (essential for good results)
sc.tl.leiden(adata_filtered, resolution=0.5)
soup_channel.setClusters(adata_filtered.obs['leiden'].values)
# Estimate and remove contamination
soup_channel = soupx.autoEstCont(soup_channel, verbose=True)
corrected_matrix = soupx.adjustCounts(soup_channel)
# Replace counts in AnnData object
adata_corrected = adata_filtered.copy()
adata_corrected.X = corrected_matrix.T # Convert back to cells × genes
# Continue with standard scanpy workflow
sc.pp.highly_variable_genes(adata_corrected)
sc.tl.pca(adata_corrected)
# ... further analysis
Advanced Usage
Manual Contamination Estimation
For experiments where automatic estimation fails or when you have prior biological knowledge:
# Manually specify contamination fraction
soup_channel.set_contamination_fraction(0.10) # 10% contamination
# Or use specific marker genes (e.g., hemoglobin genes for tissue samples)
hemoglobin_genes = ['HBA1', 'HBA2', 'HBB', 'HBD', 'HBG1', 'HBG2']
non_expressing = soupx.estimateNonExpressingCells(
soup_channel,
hemoglobin_genes,
clusters=soup_channel.metaData['clusters'].values
)
# Calculate contamination using marker genes
soup_channel = soupx.calculateContaminationFraction(
soup_channel,
{'HB': hemoglobin_genes},
non_expressing
)
Method Selection
# Different correction methods available:
# 1. Subtraction (default, fastest)
corrected = soupx.adjustCounts(soup_channel, method="subtraction")
# 2. Multinomial (most accurate, slower)
corrected = soupx.adjustCounts(soup_channel, method="multinomial")
# 3. SoupOnly (removes only confidently contaminated genes)
corrected = soupx.adjustCounts(soup_channel, method="soupOnly")
# Round to integers (some downstream tools require this)
corrected = soupx.adjustCounts(soup_channel, roundToInt=True)
API Reference
Core Classes
SoupChannel
Main container for scRNA-seq data and contamination analysis.
Parameters:
tod: Raw count matrix (genes × droplets, sparse)toc: Filtered count matrix (genes × cells, sparse)metaData: Cell metadata DataFramecalcSoupProfile: Whether to estimate soup profile automatically (default: True)
Key Methods
autoEstCont(sc, **kwargs)
Automatically estimate contamination fraction using marker genes.
Parameters:
tfidfMin: Minimum tf-idf for marker genes (default: 1.0)soupQuantile: Quantile threshold for soup genes (default: 0.9)verbose: Print progress information (default: True)
adjustCounts(sc, **kwargs)
Remove contamination and return corrected count matrix.
Parameters:
method: Correction method ("subtraction", "multinomial", "soupOnly")roundToInt: Round results to integers (default: False)clusters: Cluster assignments (improves accuracy)
Utility Functions
load10X(dataDir)
Load 10X CellRanger output directory.
quickMarkers(toc, clusters, N=10)
Identify cluster marker genes using tf-idf.
Validation & Benchmarking
This implementation has been validated against the original R version using:
- Species-mixing experiments: Cross-species contamination quantification
- PBMC datasets: Standard benchmark with known marker genes
- Complex tissue samples: Kidney tumor and fetal liver data
Key validation results:
- Contamination estimates: R² > 0.95 correlation with R implementation
- Correction accuracy: >90% reduction in cross-species contamination
- Marker gene specificity: Consistent improvement in fold-change ratios
Performance Considerations
- Memory usage: Sparse matrices used throughout to minimize memory footprint
- Clustering improves results: Always provide cluster information when possible
- Method selection: Use "subtraction" for speed, "multinomial" for accuracy
- Large datasets: Consider using
method="soupOnly"for >100k cells
Troubleshooting
Common Issues
Low marker gene detection:
# Reduce stringency for marker detection
sc = soupx.autoEstCont(sc, tfidfMin=0.5, soupQuantile=0.8)
High contamination estimates (>50%):
# Force acceptance of high contamination or manually set
sc.set_contamination_fraction(0.20, forceAccept=True)
No clustering information:
# SoupX works without clustering but results are less accurate
corrected = soupx.adjustCounts(sc, clusters=False)
Comparison with Other Methods
| Method | Speed | Accuracy | Requires Empty Droplets | Requires Clustering |
|---|---|---|---|---|
| SoupX | Fast | High | Yes | Recommended |
| CellBender | Slow | High | No | No |
| DecontX | Medium | Medium | No | Yes |
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Setup
git clone https://github.com/yourusername/soupx-python.git
cd soupx-python
pip install -e ".[dev]"
pytest tests/
License
This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.
Changelog
v0.3.0 (Current)
- Full R compatibility
- Automated contamination estimation
- Integration with scanpy ecosystem
- Comprehensive validation suite
v0.2.0
- Core correction algorithms
- Manual contamination setting
- Basic 10X data loading
v0.1.0
- Initial implementation
- Basic SoupChannel functionality
Support
- Issues: GitHub Issues
- Questions: GitHub Discussions
- Citation: Please cite the original SoupX paper (Young & Behjati, 2020)
Acknowledgments
- Original SoupX developers: Matthew D. Young and Sam Behjati
- R package maintainers and contributors
- Python single-cell community (scanpy, anndata developers)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file soupx_python-0.3.0.tar.gz.
File metadata
- Download URL: soupx_python-0.3.0.tar.gz
- Upload date:
- Size: 22.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
585f263e5b5d53591c00dc1034563f68d1c82d3c47b683b059169f376b56c6e0
|
|
| MD5 |
a45d22e31e384fabac85e4d2ee06f7e8
|
|
| BLAKE2b-256 |
e137eae4c584d84ec542423d4b3860f535504673524ec0fd4801c89cfbcc488b
|
File details
Details for the file soupx_python-0.3.0-py3-none-any.whl.
File metadata
- Download URL: soupx_python-0.3.0-py3-none-any.whl
- Upload date:
- Size: 21.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a936abeec88f7b2605c59c1f32d6728443817faadd667dd6645be1980a65a6f2
|
|
| MD5 |
a0d0909e7578b01711351ede02471d22
|
|
| BLAKE2b-256 |
3863b21fee05ab88861ea2db29aa539ae516c55ff1773defd830ddf71b8ad22d
|