mztabwriter

A minimal, dependency-free Python library for writing mzTab 1.0 proteomics files.

mzTab specification (1.0 Proteomics Release) · Format examples · Russian README


Features

  • Generates mzTab 1.0 files (proteomics mode)
  • No mandatory runtime dependencies — pure Python 3.10+
  • Supports both Complete and Summary modes
  • Supports both Quantification and Identification types
  • Handles label-free, iTRAQ, and SILAC experiments
  • Full metadata coverage: instruments, contacts, publications, samples, URIs
  • Optional pandas integration for bulk loading from DataFrames
  • to_string() and to_file() output methods

Installation

pip install mztabwriter

With optional pandas support:

pip install mztabwriter[pandas]

mzTab 1.0 File Structure (Proteomics)

An mzTab file consists of tab-separated lines grouped into sections. Every line starts with a three-letter prefix that identifies its row type:

Prefix  Section         Description
MTD     Metadata        Experiment description, instruments, software, ms_runs, assays, modifications
PRH     Protein Header  Column names for the protein table
PRT     Protein         One row per identified protein
PSH     PSM Header      Column names for the PSM table
PSM     PSM             One row per peptide-spectrum match
COM     Comment         Ignored by parsers, human-readable notes
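
To make the row-type prefixes concrete, here is a minimal sketch in plain Python (not part of mztabwriter) that groups the lines of an mzTab text by prefix. The function name `group_by_prefix` and the sample fragment are illustrative assumptions.

```python
from collections import defaultdict


def group_by_prefix(text: str) -> dict:
    """Group tab-separated mzTab lines by their three-letter row-type prefix."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, they carry no data
        prefix, *fields = line.split("\t")
        sections[prefix].append(fields)
    return dict(sections)


# Illustrative fragment only, not a complete or valid mzTab file.
sample = "\n".join([
    "COM\tillustrative fragment",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tmzTab-mode\tComplete",
    "",
    "PRH\taccession\tdescription",
    "PRT\tP63017\tHeat shock cognate 71 kDa protein",
])

sections = group_by_prefix(sample)
print(sorted(sections))  # the prefixes found in the fragment
```

A real reader would also need to handle quoting rules and null values; this only shows how the prefix separates the sections.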

MTD — Metadata (required)

Key metadata fields:

Key                              Description                            Example
mzTab-version                    Format version                         1.0.0
mzTab-mode                       Complete or Summary                    Complete
mzTab-type                       Quantification or Identification       Quantification
description                      Free-text experiment description
ms_run[N]-location               URI of raw data file                   file:///data/run1.mzML
assay[N]-quantification_reagent  CV param of label/reagent              [MS, MS:1002038, unlabeled sample, ]
assay[N]-ms_run_ref              Which ms_run this assay uses           ms_run[1]
study_variable[N]-assay_refs     Assays grouped by condition            assay[1],assay[2]
study_variable[N]-description    Condition description                  heat shock control
fixed_mod[N]                     Fixed search modification (UNIMOD CV)  [UNIMOD, UNIMOD:4, Carbamidomethyl, ]
variable_mod[N]                  Variable search modification           [UNIMOD, UNIMOD:35, Oxidation, ]
protein_search_engine_score[N]   Score type for proteins                [MS, MS:1001171, Mascot:score, ]
psm_search_engine_score[N]       Score type for PSMs                    [MS, MS:1001171, Mascot:score, ]
quantification_method            Quantification strategy                [MS, MS:1001835, SILAC, ]

Optional:

Key                                          Description
title                                        Experiment title
mzTab-ID                                     Repository identifier
instrument[N]-name/source/analyzer/detector  MS instrument details
software[N]                                  Analysis software
publication[N]                               pubmed:XXXXXXX or doi:...
contact[N]-name/affiliation/email            Contact person
uri[N]                                       Link to data repository
sample[N]-species/cell_type/disease/tissue   Sample description
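
The `[N]` in these keys is a 1-based index. As a rough illustration of how such metadata lines are laid out, the sketch below builds `MTD` rows by hand. `indexed_key` and `mtd_line` are hypothetical helpers written for this example, not mztabwriter functions, and the library may assemble its output differently.

```python
def indexed_key(element: str, index: int, prop: str = "") -> str:
    """Build a key such as 'ms_run[1]-location' or 'fixed_mod[1]'."""
    key = f"{element}[{index}]"
    return f"{key}-{prop}" if prop else key


def mtd_line(key: str, value: str) -> str:
    """Join prefix, key and value with tabs, as in the MTD section."""
    return "\t".join(("MTD", key, value))


# Example locations only; indices start at 1, not 0.
run_locations = ["file:///data/run1.mzML", "file:///data/run2.mzML"]
lines = [
    mtd_line(indexed_key("ms_run", i, "location"), location)
    for i, location in enumerate(run_locations, start=1)
]

for line in lines:
    print(line)
```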

PRT — Protein rows

Each protein row contains:

Column                                         Type          Description
accession                                      str           Database identifier (e.g. P63017)
description                                    str|null      Protein description
taxid                                          int|null      NCBI Taxonomy ID
species                                        str|null      Species name
database                                       str|null      Database name (e.g. UniProtKB)
database_version                               str|null      Database version
search_engine                                  CvParam|null  Search engine
best_search_engine_score[1]                    float|null    Best score across all runs
search_engine_score[1]_ms_run[N]               float|null    Score per run
num_psms_ms_run[N]                             int|null      Number of PSMs per run
num_peptides_distinct_ms_run[N]                int|null      Distinct peptides per run
num_peptides_unique_ms_run[N]                  int|null      Unique peptides per run
ambiguity_members                              str|null      Comma-separated accessions of ambiguity group
modifications                                  str|null      Detected modifications (e.g. 12-UNIMOD:35)
protein_coverage                               float|null    Sequence coverage fraction (0.0–1.0)
protein_abundance_assay[N]                     float|null    Abundance per assay
protein_abundance_study_variable[N]            float|null    Mean abundance per condition
protein_abundance_stdev_study_variable[N]      float|null    Std deviation per condition
protein_abundance_std_error_study_variable[N]  float|null    Std error per condition
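
The three study-variable columns summarise the per-assay abundances of one condition. A minimal sketch of how they could be computed with the standard library is shown here. It assumes the sample (n - 1) standard deviation and the usual standard error of the mean; which estimator a given pipeline reports is not stated in this README, so treat that choice as an assumption.

```python
import math
import statistics

# Per-assay abundances for one condition (three replicate assays).
assay_abundances = [34.3, 40.4, 41.1]

mean = statistics.mean(assay_abundances)
stdev = statistics.stdev(assay_abundances)  # sample standard deviation (n - 1)
std_error = stdev / math.sqrt(len(assay_abundances))  # standard error of the mean

print(f"mean={mean:.1f} stdev={stdev:.2f} std_error={std_error:.2f}")
```

The results would then be passed as the study-variable values of a protein row.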

PSM — Peptide-Spectrum Match rows

Column                  Type          Description
sequence                str           Peptide amino acid sequence
PSM_ID                  int           Unique PSM identifier within the file
accession               str           Protein accession
unique                  0|1|null      1 if peptide is unique to this protein
database                str|null      Database name
database_version        str|null      Database version
search_engine           CvParam|null  Search engine
search_engine_score[1]  float|null    Score
modifications           str|null      Modifications (e.g. 0-UNIMOD:214, 9-UNIMOD:4)
spectra_ref             str|null      Spectrum reference, e.g. ms_run[1]:scan=1296
retention_time          float|null    Retention time in seconds
charge                  int|null      Precursor charge state
exp_mass_to_charge      float|null    Experimental m/z
calc_mass_to_charge     float|null    Theoretical m/z
pre                     str|null      Amino acid before the peptide N-terminus (- = protein N-term)
post                    str|null      Amino acid after the peptide C-terminus
start                   int|null      1-based start position in protein
end                     int|null      1-based end position in protein
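
Two of these columns lend themselves to a quick sanity check before writing: `spectra_ref` has a fixed `ms_run[N]:...` shape, and `start`/`end` should span exactly the peptide length. The sketch below assumes that `end` is inclusive, which is consistent with the 16-residue example peptide at positions 424 to 439 used later in this README. Both helper names are hypothetical and not part of mztabwriter.

```python
import re

SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:(.+)$")


def parse_spectra_ref(ref: str) -> tuple:
    """Split 'ms_run[1]:scan=1296' into (run index, native spectrum id)."""
    match = SPECTRA_REF.match(ref)
    if match is None:
        raise ValueError(f"not a spectra_ref: {ref!r}")
    return int(match.group(1)), match.group(2)


def span_matches_sequence(sequence: str, start: int, end: int) -> bool:
    """With 1-based inclusive positions the span covers end - start + 1 residues."""
    return end - start + 1 == len(sequence)


run_index, native_id = parse_spectra_ref("ms_run[1]:scan=1296")
span_ok = span_matches_sequence("QTQTFTTYSDNQPGVL", start=424, end=439)
print(run_index, native_id, span_ok)
```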

API Reference

CvParam(cv_label, accession, name, value="")

A Controlled Vocabulary parameter — the basic annotation unit in mzTab.

from mztabwriter import CvParam

CvParam("MS", "MS:1001207", "Mascot")
# → [MS, MS:1001207, Mascot, ]

CvParam("UNIMOD", "UNIMOD:4", "Carbamidomethyl")
# → [UNIMOD, UNIMOD:4, Carbamidomethyl, ]

CvParam("PRIDE", "PRIDE:0000131", "Instrument model", "Micromass Q-TOF I")
# → [PRIDE, PRIDE:0000131, Instrument model, Micromass Q-TOF I]
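
The bracketed form shown in the comments above has four comma-separated fields, with the last one left empty when no value is given. A minimal stand-in that reproduces this rendering is sketched below. `CvParamSketch` is a hypothetical name, and nothing here is taken from how mztabwriter's own `CvParam` is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvParamSketch:
    """Illustrative stand-in for a CV parameter, not the library class."""
    cv_label: str
    accession: str
    name: str
    value: str = ""

    def __str__(self) -> str:
        # Four fields in square brackets; an empty value leaves a trailing ", ".
        return f"[{self.cv_label}, {self.accession}, {self.name}, {self.value}]"


mascot = CvParamSketch("MS", "MS:1001207", "Mascot")
instrument = CvParamSketch("PRIDE", "PRIDE:0000131", "Instrument model", "Micromass Q-TOF I")
print(mascot)
print(instrument)
```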

Modification(position, cv_accession)

A peptide/protein modification at a specific position.

from mztabwriter import Modification

Modification(0, "UNIMOD:214")    # → 0-UNIMOD:214
Modification(12, "UNIMOD:35")   # → 12-UNIMOD:35
Modification(None, "UNIMOD:4")  # → -UNIMOD:4
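
Going by the three renderings above, the string form is `position-accession`, with the position left empty when it is `None`. The sketch below formats and parses that shape. `format_modification` and `parse_modification` are hypothetical helpers for illustration, not part of the mztabwriter API.

```python
from typing import Optional


def format_modification(position: Optional[int], accession: str) -> str:
    """Render 'position-accession'; an unknown position is left empty."""
    prefix = "" if position is None else str(position)
    return f"{prefix}-{accession}"


def parse_modification(text: str) -> tuple:
    """Inverse of format_modification: split on the first hyphen."""
    position, _, accession = text.partition("-")
    return (int(position) if position else None), accession


print(format_modification(12, "UNIMOD:35"))
print(parse_modification("-UNIMOD:4"))
```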

MzTabDocument(mode, type_, version, title, description, mztab_id)

The main document class.

Parameter    Type                                 Default           Description
mode         "Complete" | "Summary"               "Complete"        File mode
type_        "Quantification" | "Identification"  "Quantification"  Data type
version      str                                  "1.0.0"           mzTab format version
title        str | None                           None              Experiment title
description  str | None                           None              Experiment description
mztab_id     str | None                           None              Repository ID

Metadata methods

Method                                                                            Returns             Description
add_ms_run(location, format=None, id_format=None)                                 MsRun               Add a raw data file reference
add_sample(description, species, cell_type, disease, tissue, custom)              Sample              Add sample description
add_assay(ms_run, quantification_reagent, sample=None, quantification_mods=None)  Assay               Add assay (run + label)
add_study_variable(description, assays)                                           StudyVariable       Group assays into a condition
set_quantification_method(cv)                                                     None                Set experiment-level quantification method
set_protein_quantification_unit(cv)                                               None                Set abundance unit
add_software(cv)                                                                  None                Add analysis software
add_publication(ref)                                                              None                Add pubmed:XXXXXXX or doi:...
add_contact(name, affiliation=None, email=None)                                   None                Add contact person
add_uri(uri)                                                                      None                Add data repository URI
add_instrument(name, source, analyzer, detector)                                  None                Add MS instrument description
add_fixed_mod(cv, site=None, position=None)                                       SearchModification  Add fixed search modification
add_variable_mod(cv, site=None, position=None)                                    SearchModification  Add variable search modification
add_protein_search_engine_score(cv)                                               SearchEngineScore   Register protein score type
add_psm_search_engine_score(cv)                                                   SearchEngineScore   Register PSM score type

Data methods

Method                                     Returns     Description
add_protein(accession, ...)                ProteinRow  Add a protein row
add_psm(sequence, psm_id, accession, ...)  PsmRow      Add a PSM row
add_proteins_from_dataframe(df)            None        Bulk-load proteins from pandas DataFrame
add_psms_from_dataframe(df)                None        Bulk-load PSMs from pandas DataFrame

Output methods

Method         Returns  Description
to_string()    str      Return the complete mzTab document as a string
to_file(path)  None     Write the document to a file (UTF-8)
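
Since `to_string()` returns plain text, its result can be inspected with ordinary string handling. The sketch below is one possible sanity check, written against a hand-made fragment rather than real library output: it verifies that every `PRT` and `PSM` row has as many fields as its header row. `check_column_counts` is a hypothetical helper, not something mztabwriter provides.

```python
def check_column_counts(text: str) -> list:
    """Return a message for every data row whose width differs from its header."""
    header_for = {"PRT": "PRH", "PSM": "PSH"}
    widths = {}
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split("\t")
        prefix = fields[0]
        if prefix in header_for.values():
            widths[prefix] = len(fields)  # remember the header width
        elif prefix in header_for:
            header = header_for[prefix]
            expected = widths.get(header)
            if expected is None:
                problems.append(f"line {number}: {prefix} row before its {header} header")
            elif len(fields) != expected:
                problems.append(f"line {number}: expected {expected} fields, found {len(fields)}")
    return problems


# Hand-made fragments; "PLACEHOLDER" is not a real accession.
good = "PRH\taccession\tdescription\nPRT\tP63017\tHeat shock cognate 71 kDa protein\n"
bad = good + "PRT\tPLACEHOLDER\n"

print(check_column_counts(good))
print(check_column_counts(bad))
```

In practice the `text` argument would be the string returned by `doc.to_string()`.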

Examples

Label-free quantification (2 conditions × 3 replicates)

from mztabwriter import MzTabDocument, CvParam, Modification

doc = MzTabDocument(
    mode="Complete",
    type_="Quantification",
    title="LFQ heat shock experiment",
    description="Label-free quantification of heat shock proteins, 2 conditions",
)

# Raw data files
r1 = doc.add_ms_run("file:///data/ctrl_rep1.mzML")
r2 = doc.add_ms_run("file:///data/ctrl_rep2.mzML")
r3 = doc.add_ms_run("file:///data/ctrl_rep3.mzML")
r4 = doc.add_ms_run("file:///data/treat_rep1.mzML")
r5 = doc.add_ms_run("file:///data/treat_rep2.mzML")
r6 = doc.add_ms_run("file:///data/treat_rep3.mzML")

# Assays: one unlabeled (label-free) assay per run
reagent = CvParam("MS", "MS:1002038", "unlabeled sample")
a1 = doc.add_assay(r1, reagent)
a2 = doc.add_assay(r2, reagent)
a3 = doc.add_assay(r3, reagent)
a4 = doc.add_assay(r4, reagent)
a5 = doc.add_assay(r5, reagent)
a6 = doc.add_assay(r6, reagent)

# Study variables: group the replicate assays of each condition
doc.add_study_variable("control", [a1, a2, a3])
doc.add_study_variable("heat shock treatment", [a4, a5, a6])

# Scores and modifications
doc.add_protein_search_engine_score(CvParam("MS", "MS:1001171", "Mascot:score"))
doc.add_psm_search_engine_score(CvParam("MS", "MS:1001171", "Mascot:score"))
doc.add_fixed_mod(CvParam("UNIMOD", "UNIMOD:4", "Carbamidomethyl"), site="C", position="Anywhere")
doc.add_variable_mod(CvParam("UNIMOD", "UNIMOD:35", "Oxidation"), site="M", position="Anywhere")
doc.set_quantification_method(CvParam("MS", "MS:1001834", "LC-MS label-free quantitation analysis"))
doc.set_protein_quantification_unit(CvParam("PRIDE", "PRIDE:0000393", "Relative quantification unit"))

# Proteins
doc.add_protein(
    accession="P63017",
    description="Heat shock cognate 71 kDa protein",
    taxid=10090,
    species="Mus musculus",
    database="UniProtKB",
    database_version="2013_08",
    search_engine=CvParam("MS", "MS:1001207", "Mascot"),
    best_search_engine_score=46.0,
    search_engine_scores={"ms_run[1]": 46, "ms_run[2]": 26, "ms_run[3]": 36,
                          "ms_run[4]": -3, "ms_run[5]": -1, "ms_run[6]": None},
    num_psms={"ms_run[1]": 1, "ms_run[2]": 1, "ms_run[3]": 1,
              "ms_run[4]": 1, "ms_run[5]": 1, "ms_run[6]": 0},
    num_peptides_distinct={"ms_run[1]": 1, "ms_run[2]": 1, "ms_run[3]": 1,
                           "ms_run[4]": 1, "ms_run[5]": 1, "ms_run[6]": 0},
    num_peptides_unique={"ms_run[1]": 1, "ms_run[2]": 1, "ms_run[3]": 1,
                         "ms_run[4]": 1, "ms_run[5]": 1, "ms_run[6]": 0},
    protein_coverage=0.34,
    protein_abundance_assay={
        "assay[1]": 34.3, "assay[2]": 40.4, "assay[3]": 41.1,
        "assay[4]": 267.0, "assay[5]": 234.4, "assay[6]": 271.0,
    },
    protein_abundance_study_variable={"study_variable[1]": 38.6, "study_variable[2]": 257.5},
    protein_abundance_stdev_study_variable={"study_variable[1]": 3.8, "study_variable[2]": 20.1},
    protein_abundance_std_error_study_variable={"study_variable[1]": 2.2, "study_variable[2]": 11.6},
)

# PSMs
doc.add_psm(
    sequence="QTQTFTTYSDNQPGVL",
    psm_id=1,
    accession="P63017",
    unique=1,
    database="UniProtKB",
    database_version="2013_08",
    search_engine=CvParam("MS", "MS:1001207", "Mascot"),
    search_engine_score=46.0,
    modifications=[Modification(0, "UNIMOD:214")],
    spectra_ref="ms_run[1]:scan=1296",
    retention_time=1336.62,
    charge=3,
    exp_mass_to_charge=600.6218923,
    calc_mass_to_charge=600.6197,
    pre="K",
    post="I",
    start=424,
    end=439,
)

print(doc.to_string())
doc.to_file("lfq_experiment.mzTab")

iTRAQ quantification

from mztabwriter import MzTabDocument, CvParam

doc = MzTabDocument(mode="Complete", type_="Quantification")

run = doc.add_ms_run("file:///data/itraq_run1.mzML")

a1 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000114", "iTRAQ reagent 114"))
a2 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000115", "iTRAQ reagent 115"))
a3 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000116", "iTRAQ reagent 116"))
a4 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000117", "iTRAQ reagent 117"))

doc.add_study_variable("t=0", [a1])
doc.add_study_variable("t=1", [a2])
doc.add_study_variable("t=2", [a3])
doc.add_study_variable("t=3", [a4])

doc.set_quantification_method(CvParam("PRIDE", "PRIDE:0000313", "iTRAQ"))
doc.add_fixed_mod(CvParam("UNIMOD", "UNIMOD:214", "iTRAQ4plex"), site="K", position="Anywhere")
doc.add_fixed_mod(CvParam("UNIMOD", "UNIMOD:214", "iTRAQ4plex"), site="N-term", position="Any N-term")

SILAC quantification

from mztabwriter import MzTabDocument, CvParam

doc = MzTabDocument(mode="Complete", type_="Quantification")

run = doc.add_ms_run("file:///data/silac.mzML")
light = CvParam("PRIDE", "PRIDE:0000326", "SILAC light")
heavy = CvParam("PRIDE", "PRIDE:0000325", "SILAC heavy")

heavy_mods = [
    CvParam("UNIMOD", "UNIMOD:267", "Label:13C(6)15N(4)"),
    CvParam("UNIMOD", "UNIMOD:259", "Label:13C(6)15N(2)"),
]
a_light = doc.add_assay(run, light)
a_heavy = doc.add_assay(run, heavy, quantification_mods=heavy_mods)

doc.add_study_variable("control", [a_light])
doc.add_study_variable("treatment", [a_heavy])
doc.set_quantification_method(CvParam("MS", "MS:1001835", "SILAC"))

Loading from pandas DataFrame

import pandas as pd
from mztabwriter import MzTabDocument, CvParam

doc = MzTabDocument(mode="Complete", type_="Quantification")
# ... (add ms_runs, assays, study_variables, scores first) ...

df_proteins = pd.DataFrame([
    {
        "accession": "P63017",
        "description": "Heat shock cognate 71 kDa protein",
        "taxid": 10090,
        "species": "Mus musculus",
        "database": "UniProtKB",
        "database_version": "2013_08",
        "search_engine": CvParam("MS", "MS:1001207", "Mascot"),
        "best_search_engine_score": 46.0,
        "protein_coverage": 0.34,
        "protein_abundance_assay[1]": 34.3,
        "protein_abundance_assay[2]": 266.9,
        "protein_abundance_study_variable[1]": 34.3,
        "protein_abundance_study_variable[2]": 266.9,
        "protein_abundance_stdev_study_variable[1]": 3.8,
        "protein_abundance_stdev_study_variable[2]": 20.1,
        "protein_abundance_std_error_study_variable[1]": 2.2,
        "protein_abundance_std_error_study_variable[2]": 11.6,
    },
])

doc.add_proteins_from_dataframe(df_proteins)
doc.to_file("output.mzTab")
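
The DataFrame columns above carry the index in the column name (`protein_abundance_assay[1]`), while `add_protein` in the label-free example takes one dict per column family (`protein_abundance_assay={"assay[1]": ...}`). The sketch below shows how the two shapes relate, using a plain dict in place of a DataFrame row. It is only an illustration of the naming pattern: `nest_indexed_columns` is a hypothetical helper, it covers just the assay and study-variable columns, and it is not how the library is known to process DataFrames.

```python
import re

# Matches e.g. 'protein_abundance_assay[1]' or 'protein_abundance_stdev_study_variable[2]'.
INDEXED = re.compile(r"^(?P<name>\w+_(?P<kind>assay|study_variable))\[(?P<index>\d+)\]$")


def nest_indexed_columns(row: dict) -> dict:
    """Collect flat indexed columns into one dict per column family."""
    nested = {}
    for column, value in row.items():
        match = INDEXED.match(column)
        if match:
            ref = f"{match['kind']}[{match['index']}]"
            nested.setdefault(match["name"], {})[ref] = value
        else:
            nested[column] = value  # plain columns pass through unchanged
    return nested


row = {
    "accession": "P63017",
    "protein_abundance_assay[1]": 34.3,
    "protein_abundance_assay[2]": 266.9,
    "protein_abundance_stdev_study_variable[1]": 3.8,
}
nested = nest_indexed_columns(row)
print(nested)
```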

File Structure Summary

MTD   mzTab-version   1.0.0
MTD   mzTab-mode      Complete
MTD   mzTab-type      Quantification
MTD   description     ...
MTD   ms_run[1]-location   file:///data/run1.mzML
MTD   assay[1]-quantification_reagent   [MS, MS:1002038, unlabeled sample, ]
MTD   assay[1]-ms_run_ref   ms_run[1]
MTD   study_variable[1]-assay_refs   assay[1],assay[2],assay[3]
MTD   study_variable[1]-description  control
...
PRH   accession   description   ...   protein_abundance_assay[1]   ...
PRT   P63017      Heat shock…   ...   34.3                         ...
...
PSH   sequence   PSM_ID   accession   ...   spectra_ref   ...
PSM   QTQTFTT…   1        P63017      ...   ms_run[1]:scan=1296   ...

License

MIT
