
Pure-Python port of the R package SoupX — removal of ambient (soup) RNA contamination from droplet-based single-cell RNA-seq data.


pysoupx

A pure-Python re-implementation of SoupX (Young & Behjati, GigaScience 2020, 9(12):giaa151) for removing ambient ("soup") mRNA contamination from droplet-based single-cell RNA-seq data.

  • AnnData-native — drop-in for the scanpy ecosystem (load_10x, SoupChannel.from_anndata, to_anndata)
  • No rpy2, no R install — soup-profile estimation, the tf-idf marker search, the autoEstCont posterior, and the constrained adjustCounts subtraction are all implemented directly in NumPy/SciPy
  • Same function surface as the R workflow (estimateSoup → setClusters → autoEstCont → adjustCounts)
  • Bit-for-bit reproducibility against the R reference on the deterministic kernels (see tests/test_r_parity.py)

This is a standalone mirror of the canonical implementation that lives in omicverse. All algorithmic work is developed upstream in omicverse and synced here for users who want SoupX without the full omicverse stack.

Install

pip install pysoupx

Dependencies: numpy, scipy, pandas, anndata, statsmodels (and matplotlib for the optional diagnostic plot).

Quick-start

SoupX needs the raw unfiltered droplet matrix — the soup profile is estimated from the empty droplets — plus the filtered cell matrix.

import pysoupx as soup

# --- from a 10x CellRanger output folder -------------------------
sc = soup.load_10x("path/to/cellranger/outs")   # raw + filtered

# --- or from AnnData objects -------------------------------------
# filtered = cells x genes ; raw = droplets x genes
sc = soup.SoupChannel.from_anndata(filtered, raw=raw, cluster_key="leiden")

# 1) soup profile is estimated automatically on construction
sc.soup_profile.head()

# 2) clusters + automatic contamination estimate
sc = soup.set_clusters(sc, cell_to_cluster)      # dict or sequence
sc = soup.auto_est_cont(sc)                      # sets meta_data['rho']

# 3) corrected count matrix (genes x cells, scipy sparse)
corrected = soup.adjust_counts(sc, round_to_int=True)

adata_corrected = soup.to_anndata(sc, corrected=corrected)
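
To make the role of the empty droplets concrete, here is a minimal, dependency-free sketch of how a soup profile can be estimated: pool the counts of all droplets whose total UMI count is low enough to be treated as empty, then normalise to per-gene fractions. The function name `estimate_soup_profile` and the `soup_range` parameter are hypothetical and are not the pysoupx API; the `(0, 100)` default is assumed from the R package's `soupRange` default.

```python
def estimate_soup_profile(droplets, soup_range=(0, 100)):
    """Per-gene soup fractions from a droplets x genes list of count rows.

    Hypothetical helper for illustration only. A droplet is treated as
    'empty' when its total count lies strictly inside soup_range.
    """
    lo, hi = soup_range
    n_genes = len(droplets[0])
    gene_totals = [0] * n_genes
    for counts in droplets:
        total = sum(counts)
        if lo < total < hi:  # empty droplet: contributes to the soup
            for g, c in enumerate(counts):
                gene_totals[g] += c
    grand_total = sum(gene_totals)
    if grand_total == 0:
        raise ValueError("no droplets fall inside soup_range")
    # Fractions sum to 1 across genes.
    return [t / grand_total for t in gene_totals]


# Three near-empty droplets and one real cell (total 550, excluded).
droplets = [
    [2, 0, 1],
    [1, 1, 0],
    [300, 50, 200],
    [0, 3, 2],
]
profile = estimate_soup_profile(droplets)
print(profile)
```

The real cell is excluded by the range filter, so only the background counts shape the profile. This is why the unfiltered matrix is required.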

Low-level functional API (mirrors R one-to-one)

from pysoupx import (
    estimate_soup, set_soup_profile, set_clusters,
    set_contamination_fraction, quick_markers,
    estimate_non_expressing_cells, calculate_contamination_fraction,
    auto_est_cont, adjust_counts, alloc, expand_clusters,
)

# Manual contamination fraction instead of autoEstCont
sc = set_contamination_fraction(sc, 0.10)

# Estimate rho from a user-supplied non-expressed gene set
ute = estimate_non_expressing_cells(sc, gene_set)
calculate_contamination_fraction(sc, gene_set, ute)
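
As a rough illustration of what estimating rho from a non-expressed gene set means, the sketch below divides the counts observed for those genes, in cells assumed not to express them, by the counts the soup alone would be expected to contribute. This ratio is the closed-form solution of an intercept-only Poisson fit with the expected soup counts as offset. Whether pysoupx computes it exactly this way is an assumption, and `rho_from_gene_set` and its arguments are hypothetical names.

```python
def rho_from_gene_set(cells, soup_profile, gene_idx, use_to_est):
    """Point estimate of the contamination fraction rho.

    cells        : cells x genes list of count rows
    soup_profile : per-gene soup fractions (sum to 1)
    gene_idx     : indices of genes assumed absent from the selected cells
    use_to_est   : one boolean per cell; True if the cell is used
    Hypothetical helper for illustration only.
    """
    soup_frac = sum(soup_profile[g] for g in gene_idx)
    observed = 0
    expected = 0.0
    for counts, use in zip(cells, use_to_est):
        if not use:
            continue
        observed += sum(counts[g] for g in gene_idx)
        # Counts the gene set would receive if the whole cell were soup.
        expected += soup_frac * sum(counts)
    if expected == 0:
        raise ValueError("no usable cells or gene set has zero soup fraction")
    return observed / expected


soup_profile = [0.5, 0.3, 0.2]
cells = [
    [5, 60, 35],   # does not express gene 0: its 5 counts are soup
    [10, 40, 50],  # same
    [80, 10, 10],  # genuinely expresses gene 0: excluded
]
rho = rho_from_gene_set(cells, soup_profile, gene_idx=[0],
                        use_to_est=[True, True, False])
print(rho)
```

Including the third cell would inflate the estimate, which is why the non-expressing cells are identified before the contamination fraction is calculated.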

What's included

| Python | R counterpart | Purpose |
| --- | --- | --- |
| SoupChannel / SoupChannel.from_anndata | SoupChannel | bundles droplets / counts / soup profile / metadata |
| estimate_soup | estimateSoup | per-gene soup fraction from empty droplets |
| set_soup_profile | setSoupProfile | set a soup profile manually |
| set_clusters | setClusters | attach a cell→cluster mapping |
| set_contamination_fraction | setContaminationFraction | set rho manually |
| quick_markers | quickMarkers | tf-idf cluster-marker genes |
| estimate_non_expressing_cells | estimateNonExpressingCells | which cells truly lack a gene set |
| calculate_contamination_fraction | calculateContaminationFraction | rho from non-expressed gene sets |
| auto_est_cont | autoEstCont | fully automatic rho estimate |
| adjust_counts | adjustCounts | soup-subtracted corrected matrix |
| alloc / expand_clusters | alloc / expandClusters | the constrained redistribution primitives |
| load_10x | load10X | read a 10x CellRanger folder |
| to_anndata / make_soup_channel | (AnnData helpers) | round-trip with the scanpy ecosystem |
| plot_contamination_fraction | autoEstCont(doPlot=TRUE) | diagnostic posterior plot |

adjust_counts supports all three SoupX methods: subtraction (default), soupOnly and multinomial.
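
The idea behind a constrained subtraction can be sketched as follows: the expected number of soup counts in a cell (rho times its total UMIs) is split across genes in proportion to the soup profile, but no gene may lose more counts than it has, so any excess is redistributed among the remaining genes. This is a simplified sketch of the technique, not the package's `alloc` implementation; the function `allocate` and its signature are hypothetical.

```python
def allocate(target, limits, weights):
    """Split `target` across buckets proportionally to `weights`,
    never exceeding each bucket's entry in `limits`.
    Hypothetical helper for illustration only.
    """
    n = len(limits)
    alloc = [0.0] * n
    open_idx = [i for i in range(n) if limits[i] > 0 and weights[i] > 0]
    if target > sum(limits[i] for i in open_idx):
        raise ValueError("target exceeds the total capacity")
    remaining = float(target)
    while remaining > 1e-12 and open_idx:
        wsum = sum(weights[i] for i in open_idx)
        shares = {i: remaining * weights[i] / wsum for i in open_idx}
        full = [i for i in open_idx if shares[i] >= limits[i]]
        if not full:
            # Every proportional share fits: hand them out and stop.
            for i in open_idx:
                alloc[i] = shares[i]
            remaining = 0.0
        else:
            # Cap the overflowing buckets, then re-split what is left.
            for i in full:
                alloc[i] = float(limits[i])
                remaining -= limits[i]
            open_idx = [i for i in open_idx if i not in full]
    return alloc


counts = [1, 40, 59]            # observed counts for one cell
soup_profile = [0.5, 0.3, 0.2]  # per-gene soup fractions
rho = 0.10
expected_soup = rho * sum(counts)  # 10 counts to remove

removed = allocate(expected_soup, counts, soup_profile)
corrected = [c - r for c, r in zip(counts, removed)]
print(removed, corrected)
```

Gene 0 would be assigned 5 soup counts but only has 1, so it is emptied and the other 9 counts are shared between genes 1 and 2 in a 3:2 ratio. The corrected values are fractional, which is where optional integer rounding comes in.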

Reproducing R results exactly

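SoupX's core kernels are deterministic, so feeding the R and Python implementations identical raw + filtered matrices yields results that agree to within floating-point rounding. The maximum absolute differences observed are: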

| Quantity | Result |
| --- | --- |
| Soup profile (estimateSoup) | bit-exact (max abs diff ~1e-16) |
| quickMarkers tf-idf / qvals / idf | bit-exact (max abs diff ~1e-16) |
| adjustCounts cluster-level, fixed rho | bit-exact (max abs diff 0) |
| adjustCounts cell-level, fixed rho | bit-exact (max abs diff ~1e-13) |
| autoEstCont rho | exact (identical posterior mode) |

tests/test_r_parity.py runs the R reference (r_reference_driver.R) inside the CMAP R env on the same synthetic raw + filtered matrices the Python side uses, and checks the soup profile, marker table, rho and corrected matrices match. examples/compare_R_vs_Python.ipynb does the same on the real SoupX toyData 10x dataset and visualises the agreement with omicverse.
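
For readers who want to run a similar comparison on their own outputs, the "max abs diff" metric is simply the largest element-wise absolute difference between two matrices. The sketch below is a generic illustration on nested lists; `max_abs_diff` is a hypothetical helper and is not taken from tests/test_r_parity.py.

```python
def max_abs_diff(a, b):
    """Largest element-wise |a - b| between two equally shaped
    matrices given as lists of rows. Hypothetical helper.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return max(
        (abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
        default=0.0,
    )


# Two results that differ only by floating-point rounding.
python_result = [[0.1 + 0.2, 1.0], [2.0, 3.0]]
r_result = [[0.3, 1.0], [2.0, 3.0]]

diff = max_abs_diff(python_result, r_result)
print(diff)
assert diff < 1e-12  # parity within a chosen tolerance
```

A difference of exactly 0 means the two results are identical. A tiny nonzero value usually reflects a different order of floating-point operations rather than an algorithmic discrepancy.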

The only intrinsically stochastic steps are the optional integer rounding (round_to_int) and the multinomial method's tie-breaking, both seedable via the seed argument.
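
Why rounding is stochastic, and why a seed makes it repeatable, can be sketched like this: each fractional value is rounded down, then bumped up by one with probability equal to its fractional part, so the expected value is preserved. This is a sketch of the general technique using Python's standard `random` module. The random number generator pysoupx uses internally is not stated here, and `stochastic_round` is a hypothetical name.

```python
import math
import random


def stochastic_round(values, seed=None):
    """Round each value to floor(v) or floor(v) + 1, choosing the
    upper value with probability equal to the fractional part.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # private generator: same seed, same draws
    rounded = []
    for v in values:
        base = math.floor(v)
        frac = v - base
        rounded.append(base + (1 if rng.random() < frac else 0))
    return rounded


corrected = [0.0, 34.6, 55.4, 7.0]
first = stochastic_round(corrected, seed=0)
second = stochastic_round(corrected, seed=0)
print(first, first == second)
```

Values that are already whole numbers are never changed, and repeating the call with the same seed reproduces the same integers.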

Relationship to omicverse

Developed upstream in omicverse:

  • Canonical implementation lives in omicverse
  • Standalone mirror (this repo): same code, same API, minus the omicverse packaging

Citation

If you use this package, please cite the original SoupX paper:

Young, M.D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience 9, giaa151 (2020).

and acknowledge omicverse / this repo for the Python port.

License

Apache-2.0.

