Pure-Python port of Bioconductor MSstats — vendor converters, pre-processing, normalization, summarization, group comparison, sample-size design, and QC plotting for mass-spec proteomics.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Starlitnightly

These details have not been verified by PyPI

Project links

Project description

pymsstats

A pure-Python port of Bioconductor MSstats (Choi et al., Bioinformatics 2014) — vendor converters, pre-processing, normalization, peptide-to-protein summarization, per-protein LMM-based group comparison, sample-size design, and QC plotting for mass-spec proteomics.

Full pipeline mirroring R: vendor → dataProcess (log2 + normalize + TMP/linear) → groupComparison → designSampleSize / plots
No rpy2, no R install — every algorithm reimplemented directly in NumPy / SciPy / statsmodels / pandas
100% coverage of the R MSstats 4.14.2 public API — all 50 exported names (46 functions ported, 4 example datasets replaced by a synthetic generator)
Bit-for-bit reproduction of TMP LogIntensities (5e-14), log2FC (8e-15), linear summarization (9e-14), designSampleSize numSample (exact), MSstatsContrastMatrix, checkRepeatedDesign and SDRFtoAnnotation against R MSstats 4.14.2
2.5×–3.9× faster than R MSstats on the 500-protein × 8-run benchmark (Pearson r = 1.0 on every output)
Nine vendor converters (DIA-NN, Spectronaut, FragPipe, OpenMS, Skyline, Proteome Discoverer, Progenesis, OpenSWATH, DIA-Umpire) on top of MaxQuant
Modular worker API (MSstatsPrepareFor*, MSstatsSummarize*, MSstatsGroupComparison*) + SDRF metadata helpers
AnnData / pandas-friendly: accepts MSstats-format long DataFrame (10 canonical columns)

This is a standalone mirror of the canonical implementation that lives in omicverse. All algorithmic work is developed upstream in omicverse and synced here.

Install

pip install pymsstats

Depends only on numpy, scipy, pandas, and statsmodels.

Quick-start

import numpy as np
import pandas as pd
from pymsstats import data_process, group_comparison

# msstats_df is a long-format DataFrame with the 10 canonical MSstats columns:
#   ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge,
#   IsotopeLabelType, Condition, BioReplicate, Run, Intensity
msstats_df = pd.read_csv("msstats_long.tsv", sep="\t")

# Pre-process: log2 → equalize-medians → drop low-obs features → TMP summarize.
processed = data_process(
    msstats_df,
    normalization="equalizeMedians",
    summary_method="TMP",
    log_base=2,
)
# processed columns: Protein, RUN, LogIntensities, GROUP, SUBJECT, n_features, n_obs

# Two-group test: group1 vs group0.
result = group_comparison(
    processed,
    contrast_matrix=np.array([[-1.0, 1.0]]),
    groups=["group0", "group1"],
)
# result columns: Protein, Label, log2FC, SE, df, t, pvalue, adj.pvalue, issue

Vendor converters

Six converters turn raw vendor output into MSstats long-format:

from pymsstats import (
    maxquant_to_msstats, diann_to_msstats, spectronaut_to_msstats,
    fragpipe_to_msstats, openms_to_msstats, skyline_to_msstats,
)

ann = pd.read_csv("annotation.csv")   # columns: Run, Condition, BioReplicate

# MaxQuant
pg = pd.read_csv("proteinGroups.txt", sep="\t")
ev = pd.read_csv("evidence.txt", sep="\t")
msstats_df = maxquant_to_msstats(pg, ev, ann)

# DIA-NN
report = pd.read_csv("report.tsv", sep="\t")
msstats_df = diann_to_msstats(report, ann, qvalue_cutoff=0.01)

# Spectronaut / FragPipe / OpenMS / Skyline
msstats_df = spectronaut_to_msstats(pd.read_csv("spectronaut.tsv", sep="\t"))
msstats_df = skyline_to_msstats(pd.read_csv("skyline.csv"), ann)

Sample-size design, quantification, plotting

from pymsstats import design_sample_size, quantification, group_comparison_plots

# how many replicates for 80% power at FC 1.25–1.5?
design = design_sample_size(processed, desired_fc=(1.25, 1.5), power=0.8)

# per-protein abundance matrix
quant = quantification(processed, type="Sample")

# volcano / heatmap on the comparison result
ax = group_comparison_plots(result, type="VolcanoPlot")

Low-level API

from pymsstats import equalize_medians, tmp_summarize, medpolish

# Each stage is exposed for users who want to slot one piece of the
# pipeline into a custom workflow.
norm_df = equalize_medians(feature_df, abundance_col="ABUNDANCE")
prot_df = tmp_summarize(norm_df)        # per (protein, run) LogIntensities
mp = medpolish(X)                       # raw R-stats::medpolish equivalent

R-parity status

The MSstats dataProcess → groupComparison pipeline is tested against R MSstats 4.14.2 on the canonical LFQ design (globally-unique BioReplicate IDs):

Step	R counterpart	Match
`equalize_medians`	`MSstats:::.normalizeMedian`	bit-exact (post-norm run medians flat to 1e-9)
`medpolish`	`stats::medpolish` / `MSstats:::median_polish_summary`	row+col effects bit-exact (1e-9) on noiseless tables
`tmp_summarize` LogIntensities	`MSstats::dataProcess$ProteinLevelData`	bit-exact, max abs diff = 5e-14
`group_comparison` log2FC	`MSstats::groupComparison`	bit-exact, max abs diff = 8e-15
`group_comparison` SE	`MSstats::groupComparison`	bit-exact for OLS path (1e-15); ≤ 5% on LMM-fallback path
`group_comparison` pvalue	`MSstats::groupComparison`	bit-exact for OLS path; Pearson r ≥ 0.99 for LMM path
`group_comparison` adj.pvalue	`MSstats::groupComparison`	same as pvalue

The OLS path (auto mode when SUBJECT is nested in GROUP) is bit-identical to R. The mixed path (paired/repeated-measures design where the same BioReplicate appears across groups) uses statsmodels.MixedLM; when its solver hits a singular Hessian (common for small balanced data) we fall back to a fixed-effects lm(ABUNDANCE ~ GROUP + SUBJECT), which gives slightly different SEs from R's lme4::lmer (the ranking — Pearson r ≥ 0.99 on pvalue — is unaffected).

Benchmark

500 proteins × 8 runs × 3 features, label-free DDA design:

Step	R (ms)	Python (ms)	Speedup
`dataProcess`	3754	2393	1.57×
`groupComparison`	3880	626	6.20×
Pipeline total	7634	3019	2.53×
Including `Rscript` startup	11807	3019	3.91×

Run it yourself:

python examples/benchmark.py --runs 3 --n-proteins 500

Reproducing R results exactly

tests/r_reference_driver.R runs MSstats::dataProcess + MSstats::groupComparison on the same TSV input the Python side dumps. tests/test_r_parity.py (skipped if no CMAP / MSstats env) checks every numerical field — see the R-parity table above.

# Run only the R-parity tests
pytest tests/test_r_parity.py -v

# Run everything (smoke + R-parity)
pytest tests/ -v

MSstats public API coverage (v0.3 — 100%)

All 50 exported names of R MSstats 4.14.2 are covered. The 46 functions are ported to Python; the 4 bundled example datasets are copyrighted Bioconductor data objects (not redistributed) and are replaced by a synthetic generator.

Core pipeline

R function	Python	Status
`dataProcess` (log2 + equalizeMedians + TMP)	`data_process`	✅ ported
`groupComparison` (OLS / LMM auto-dispatch + BH)	`group_comparison`	✅ ported

Vendor converters

R function	Python	Status
`MaxQtoMSstatsFormat`	`maxquant_to_msstats`	✅ ported
`DIANNtoMSstatsFormat`	`diann_to_msstats`	✅ ported
`SpectronauttoMSstatsFormat`	`spectronaut_to_msstats`	✅ ported
`FragPipetoMSstatsFormat`	`fragpipe_to_msstats`	✅ ported
`OpenMStoMSstatsFormat`	`openms_to_msstats`	✅ ported
`SkylinetoMSstatsFormat`	`skyline_to_msstats`	✅ ported
`PDtoMSstatsFormat`	`pd_to_msstats`	✅ ported
`ProgenesistoMSstatsFormat`	`progenesis_to_msstats`	✅ ported
`OpenSWATHtoMSstatsFormat`	`openswath_to_msstats`	✅ ported
`DIAUmpiretoMSstatsFormat`	`diaumpire_to_msstats`	✅ ported

Statistical helpers

R function	Python	Status
`validateAnnotation`	`validate_annotation`	✅ ported
`MSstatsMergeFractions`	`merge_fractions`	✅ ported
`MSstatsContrastMatrix`	`msstats_contrast_matrix`	✅ ported (R-exact)
`designSampleSize`	`design_sample_size`	✅ ported (R-exact `numSample`)
`MSstatsNormalize` (equalizeMedians/quantile/globalStandards)	`msstats_normalize`	✅ ported
`MSstatsSummarize` (TMP / linear)	`msstats_summarize`	✅ ported (R-exact)
`MSstatsSelectFeatures` (all / topN)	`select_features`	✅ ported
`MSstatsHandleMissing`	`msstats_handle_missing`	✅ ported
`checkRepeatedDesign`	`check_repeated_design`	✅ ported (R-exact)
`makePeptidesDictionary`	`make_peptides_dictionary`	✅ ported
`quantification`	`quantification`	✅ ported
`getProcessed` / `getSamplesInfo` / `getSelectedProteins`	`get_processed` / `get_samples_info` / `get_selected_proteins`	✅ ported

Modular worker API

R function	Python	Status
`MSstatsPrepareForDataProcess`	`prepare_for_data_process`	✅ ported
`MSstatsPrepareForSummarization`	`prepare_for_summarization`	✅ ported
`MSstatsPrepareForGroupComparison`	`prepare_for_group_comparison`	✅ ported
`MSstatsSummarize` (modular)	`msstats_summarize_modular`	✅ ported
`MSstatsSummarizationOutput`	`summarization_output`	✅ ported
`MSstatsSummarizeWithSingleCore`	`summarize_single_core`	✅ ported
`MSstatsSummarizeSingleTMP`	`summarize_single_tmp`	✅ ported
`MSstatsSummarizeSingleLinear`	`summarize_single_linear`	✅ ported
`MSstatsGroupComparison`	`msstats_group_comparison`	✅ ported
`MSstatsGroupComparisonSingleProtein`	`group_comparison_single_protein`	✅ ported
`MSstatsGroupComparisonOutput`	`group_comparison_output`	✅ ported

SDRF helpers

R function	Python	Status
`extractSDRF`	`extract_sdrf`	✅ ported
`SDRFtoAnnotation`	`sdrf_to_annotation`	✅ ported (R-exact)
`example_SDRF`	`example_sdrf`	✅ ported (synthetic stand-in)

Plotting

R function	Python	Status
`dataProcessPlots` (Profile/QC/Condition)	`data_process_plots`	✅ ported
`groupComparisonPlots` (Volcano/Heatmap/Comparison)	`group_comparison_plots`	✅ ported
`groupComparisonQCPlots`	`group_comparison_qc_plots`	✅ ported
`modelBasedQCPlots`	`model_based_qc_plots`	✅ ported
`designSampleSizePlots`	`design_sample_size_plots`	✅ ported
`theme_msstats`	`theme_msstats`	✅ ported
`savePlot`	`save_plot`	✅ ported

Bundled example datasets

R object	Python	Status
`DDARawData`	`load_dda_example()`	⚠️ synthetic stand-in — original R data not redistributed
`DDARawData.Skyline`	`load_dda_example()`	⚠️ synthetic stand-in
`DIARawData`	`load_dia_example()`	⚠️ synthetic stand-in
`SRMRawData`	`load_srm_example()`	⚠️ synthetic stand-in

The four R example datasets are copyrighted Bioconductor data objects; pymsstats provides make_example_dataset() / load_dda_example() / load_dia_example() / load_srm_example() which return a synthetic MSstats long-format DataFrame with the same 10 canonical columns so the pipeline can be exercised end-to-end without the original data.

Known limitations (within ported functions)

The mixed-effects (lme4::lmer) path uses statsmodels.MixedLM with an OLS fallback on singular-Hessian cases (SE differs ≤ 5%; ranking unaffected — see R-parity table above).
feature_subset='highQuality' in select_features falls back to 'all' with a warning (the leverage-based outlier rule is rarely exercised).
TMT / labeled (heavy-light) workflows are out of scope (MSstatsTMT is a separate Bioconductor package).

Relationship to omicverse

Developed upstream in omicverse. This repo is a standalone mirror with the same code, same API, minus the omicverse packaging.

Citation

If you use this package, please cite the MSstats paper:

Choi, M., Chang, C.-Y., Clough, T., Broudy, D., Killeen, T., MacLean, B., & Vitek, O. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 30 (17), 2524-2526 (2014). DOI: 10.1093/bioinformatics/btu305

…and acknowledge omicverse / this repo for the Python port.

License

Artistic-2.0 — matches the upstream Bioconductor package.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Starlitnightly

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.0

May 20, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymsstats-0.3.0.tar.gz (66.7 kB view details)

Uploaded May 20, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pymsstats-0.3.0-py3-none-any.whl (57.2 kB view details)

Uploaded May 20, 2026 Python 3

File details

Details for the file pymsstats-0.3.0.tar.gz.

File metadata

Download URL: pymsstats-0.3.0.tar.gz
Upload date: May 20, 2026
Size: 66.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pymsstats-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`611ad66ae91b7aaf9347b5d3252a66786a93a3bd2a3e549944c46b12f453229d`
MD5	`6864570977046fe2998b2492a35a48e7`
BLAKE2b-256	`34b4eb23570b1e6b81f2cccef117f70a9b19971cc0b659b1dab32e0fff0d5478`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pymsstats-0.3.0.tar.gz:

Publisher: publish.yml on omicverse/py-MSstats

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pymsstats-0.3.0.tar.gz
- Subject digest: 611ad66ae91b7aaf9347b5d3252a66786a93a3bd2a3e549944c46b12f453229d
- Sigstore transparency entry: 1579030070
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: omicverse/py-MSstats@c1d086579aa23b92777466f111b96183bd9dea89
- Branch / Tag: refs/heads/main
- Owner: https://github.com/omicverse
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c1d086579aa23b92777466f111b96183bd9dea89
- Trigger Event: workflow_dispatch

File details

Details for the file pymsstats-0.3.0-py3-none-any.whl.

File metadata

Download URL: pymsstats-0.3.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 57.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pymsstats-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9450037ece97b718863b3a7aa83e5ccd0b7b1ffc8da7ee013d3a85c86e362047`
MD5	`06cd61c339f3294ba6ca61586cf04ece`
BLAKE2b-256	`135cb3bb34299864839ed35f33189c332362e8b1de3a762e1d093c6702c86269`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pymsstats-0.3.0-py3-none-any.whl:

Publisher: publish.yml on omicverse/py-MSstats

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pymsstats-0.3.0-py3-none-any.whl
- Subject digest: 9450037ece97b718863b3a7aa83e5ccd0b7b1ffc8da7ee013d3a85c86e362047
- Sigstore transparency entry: 1579030248
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: omicverse/py-MSstats@c1d086579aa23b92777466f111b96183bd9dea89
- Branch / Tag: refs/heads/main
- Owner: https://github.com/omicverse
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c1d086579aa23b92777466f111b96183bd9dea89
- Trigger Event: workflow_dispatch

pymsstats 0.3.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

pymsstats

Install

Quick-start

Vendor converters

Sample-size design, quantification, plotting

Low-level API

R-parity status

Benchmark

Reproducing R results exactly

MSstats public API coverage (v0.3 — 100%)

Core pipeline

Vendor converters

Statistical helpers

Modular worker API

SDRF helpers

Plotting

Bundled example datasets

Known limitations (within ported functions)

Relationship to omicverse

Citation

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance