Pure-Python port of Bioconductor proDA — probabilistic dropout model for mass-spec proteomics differential abundance testing.
Project description
pyproda
A pure-Python port of Bioconductor proDA (Ahlmann-Eltze & Anders 2020) — a probabilistic dropout model for differential abundance analysis of mass-spec proteomics data with MNAR missing values.
- 100% R proDA function coverage — every exported R function is ported (see table below)
- Per-protein censored maximum-likelihood fit (
pd_lm) reimplemented in NumPy +scipy.optimize.L-BFGS-B - Probit dropout curve
P(observed | y) = invprobit((y - rho)/zeta)matched bit-exactly to RproDA::invprobit - Per-sample dropout-curve hyperparameter MLE (
rho_s,zeta_s) by Nelder-Mead optim - Empirical-Bayes
moderate_variance(Smyth-style σ² shrinkage) andmoderate_location(Cauchy β prior) n_subsamplechunked outer EM for large matrices- Wald test for any user-specified contrast (
test_diff); conveniencepd_row_t_test/pd_row_f_test - Probabilistic distance
dist_approx,predict,median_normalization, and the full S4 accessor set - No
rpy2, no R install — all algorithmic code is reimplemented directly - Wald
t_statisticandpvalagree with R proDA at Pearson r > 0.99 on the synthetic 300x12 benchmark - AnnData-friendly: accepts
np.ndarrayorpd.DataFrame(rows = proteins, columns = samples)
This is a standalone mirror of the canonical implementation that lives in
omicverse. All algorithmic work is developed upstream in omicverse and synced here.
Install
pip install pyproda
Quick-start
import numpy as np
import pandas as pd
from pyproda import proda, test_diff
# M: log2 protein intensity matrix (proteins x samples, NaN = missing)
M = pd.read_csv("proteinGroups_log2.tsv", sep="\t", index_col=0)
# Per-sample group labels (or a (n_samples, p) design matrix).
groups = ["control"] * 3 + ["treatment"] * 3
fit = proda(M, design=groups)
# Wald test: "is the treatment vs control coefficient != 0?"
res = test_diff(fit, np.array([0.0, 1.0]))
res.head()
# name pval adj_pval diff t_statistic se df avg_abundance n_obs
# prot_0042 3.2e-7 2.1e-4 2.13 8.4 0.25 10 18.91 6
# ...
Low-level API (mirrors R one-to-one)
from pyproda import (
invprobit, # proDA::invprobit
fit_one_protein, # proDA::pd_lm.fit
fit_dropout_curves, # proDA::dropout_curves
proda, # proDA::proDA
test_diff, # proDA::test_diff
pd_row_t_test, # proDA::pd_row_t_test
pd_row_f_test, # proDA::pd_row_f_test
dist_approx, # proDA::dist_approx
median_normalization, # proDA::median_normalization
predict, # proDA::predict
# S4 accessors:
abundances, coefficients, coefficient_variance_matrices,
convergence, design, feature_parameters, hyper_parameters,
reference_level, result_names,
)
p_obs = invprobit(y, rho=18.0, zeta=0.5) # probit sigmoid
hp = fit_dropout_curves(M, Pred) # rho_s, zeta_s
fit_i = fit_one_protein(M[0], X, rho=hp.rho, zeta=hp.zeta) # censored MLE
# R-style functional accessors on a fitted object:
fit = proda(M, design=groups)
coefficients(fit) # (n_proteins x p) DataFrame
feature_parameters(fit) # n_approx / df / s2 / n_obs DataFrame
hyper_parameters(fit) # rho / zeta + location & variance priors
predict(fit) # per-protein x sample linear predictor
dist_approx(fit) # probabilistic protein-protein distances
# Convenience two-/multi-group tests on raw matrices:
pd_row_t_test(X, Y) # row-wise censored t
pd_row_f_test(X, Y, Z) # multi-group F
pd_row_f_test(M, groups=["a","a","b","b"])
The ProDAFit container exposes every field the R proDAFit S4 object does — coefficients, coef_variance_matrices, s2, df, n_obs, hyper_parameters (with rho, zeta, location/variance priors), design, convergence. The same names are also available as module-level functions (coefficients(fit) etc.) to mirror the R generic-function API one-to-one.
R-parity status
| Step | R counterpart | Match |
|---|---|---|
invprobit(x, rho, zeta) |
proDA::invprobit |
bit-exact (atol 1e-12) |
inv_mills_ratio(...) |
proDA:::inv_mills_ratio |
bit-exact |
Per-protein coefficients (censored MLE) |
proDA:::pd_lm.fit |
Pearson r > 0.999 |
Per-protein s2, df (un-moderated) |
proDA:::calculate_sigma2_parameters |
within ~5% |
Per-sample (rho, zeta) |
proDA:::dropout_curves |
median |
Wald t_statistic |
proDA:::run_wald_parameter_test |
Pearson r > 0.99 |
Wald pval |
proDA::test_diff |
Pearson r > 0.99 |
pd_row_t_test t-statistic |
proDA::pd_row_t_test |
Pearson r = 0.9997 |
dist_approx (blind) |
proDA::dist_approx |
Pearson r = 0.99997 |
moderate_variance per-protein s2 |
proDA::proDA(moderate_variance=TRUE) |
Pearson r = 0.996 |
| Top-DE protein agreement | proDA::test_diff |
> 95% |
The remaining gap (relative to bit-exact) is dominated by the outer EM for the dropout-curve hyperparameters — R proDA runs up to 20 iterations of (rho, zeta) <-> coefs with both Pred and Pred_var updates; v0.1 runs 3-5 fixed-point iterations and only feeds Pred_var from the censored fit. The per-protein coefficient at fixed (rho, zeta) agrees with R to ~7 decimal places.
Benchmark
300 proteins x 12 samples, 12% MNAR-censored, 15% planted DE, 5 outer iterations:
| Wall time | |
|---|---|
| R proDA pipeline (no Rscript startup) | ~3.3 s |
| Python pipeline | ~3.4 s |
| R total (including Rscript startup) | ~9.1 s |
| Python total (no JVM/R startup) | ~3.4 s |
scipy.optimize.L-BFGS-B is well-matched to R's C-compiled nlminb per protein — neither has a clear edge on raw inner-loop speed. The end-to-end win is the absence of Rscript boot time (~5-6 s on the CMAP env).
Run it yourself:
python examples/benchmark.py --runs 2 --n-proteins 300 --n-per-group 6
Reproducing R results exactly
tests/r_reference_driver.R runs the R proDA pipeline (proDA(...) -> test_diff(...)) on the same TSV input the Python side dumps. tests/test_r_parity.py (skipped if no CMAP / proDA env) checks every numerical field — see the R-parity table above.
# Run the R-parity tests
pytest tests/test_r_parity.py -v
R function coverage
Every function exported by R proDA is now ported (pyproda 0.1.1):
| R function / feature | Status |
|---|---|
invprobit / inv_mills_ratio |
OK |
Per-protein censored MLE (pd_lm.fit) |
OK |
Per-sample dropout curves (dropout_curves) |
OK |
proDA() outer EM |
OK |
Wald test_diff(fit, contrast) |
OK |
moderate_variance (empirical-Bayes shrinkage on sigma^2) |
OK |
moderate_location (location prior on beta) |
OK |
n_subsample / chunked fitting for large matrices |
OK |
pd_row_t_test (two-group convenience test) |
OK |
pd_row_f_test (multi-group F-test) |
OK |
dist_approx(fit, by_sample=) |
OK |
predict(fit, type=) |
OK |
median_normalization(X) |
OK |
Accessors: abundances, coefficients, coefficient_variance_matrices, convergence, design, feature_parameters, hyper_parameters, reference_level, result_names |
OK |
generate_synthetic_data |
not ported (test-fixture helper; synthetic data is generated inline in tests/) |
Notes on approximations: the nested-model pd_row_f_test uses an
observed-sample SSR ratio rather than a full re-fit of the censored
likelihood under both models — this agrees with R to a few percent on
well-determined rows. moderate_variance applies post-hoc Smyth
shrinkage (s2_post = (df0 s0² + df s²)/(df0+df)) rather than embedding
the inverse-chi-square prior inside the per-protein objective, which is
why the σ² parity is r ≈ 0.996 rather than bit-exact.
Relationship to omicverse
Developed upstream in omicverse:
- Canonical implementation:
omicverse.protein.tl.de(adata, method='proda') - Standalone mirror (this repo): same code, same API, minus the omicverse packaging
Citation
If you use this package, please cite the original proDA paper:
Ahlmann-Eltze C., Anders S. proDA: Probabilistic Dropout Analysis for Identifying Differentially Abundant Proteins in Label-Free Quantitative Mass Spectrometry. bioRxiv (2020). DOI: 10.1101/661496
...and acknowledge omicverse / this repo for the Python port.
License
GPL-3 — matches the upstream Bioconductor package.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyproda-0.1.1.tar.gz.
File metadata
- Download URL: pyproda-0.1.1.tar.gz
- Upload date:
- Size: 36.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d65a6247a42a71730986f9359279799d8a73a81aa9a56338e15f40ea78e52b18
|
|
| MD5 |
231b0bdbe0e6ffbc695be4c1496acad6
|
|
| BLAKE2b-256 |
206b9a4c63f8e3e9ca2255b6000a514b0018df8452e212fd8c6d4c68d48ab44d
|
File details
Details for the file pyproda-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pyproda-0.1.1-py3-none-any.whl
- Upload date:
- Size: 31.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
06ca0ff9de6a2d0d98041a8d7d2cf628297e29331d1432a2fb084c6f5b16fe39
|
|
| MD5 |
fdcf393348c8b5cb6d42bfb99c6a11d9
|
|
| BLAKE2b-256 |
ed273c0c9cffbe494a348d2ef98aaa8acd72f789187ed02102abac39fe272b8d
|