MS²PIP: MS² Peak Intensity Prediction
MS²PIP: MS² Peak Intensity Prediction - Fast and accurate peptide fragmentation spectrum prediction for multiple fragmentation methods, instruments and labeling techniques.
Introduction
MS²PIP is a tool to predict MS² signal peak intensities from peptide sequences. It employs the XGBoost machine learning algorithm and is written in Python.
You can install MS²PIP on your machine by following the instructions below. For a more user-friendly experience, go to the MS²PIP web server. There, you can upload a list of peptide sequences and download the corresponding predicted MS² spectra in multiple file formats. The web server can also be accessed through its RESTful API.
To generate a predicted spectral library starting from a FASTA file, we developed a pipeline called fasta2speclib. Usage of this pipeline is described on the fasta2speclib wiki page. Fasta2speclib was developed in collaboration with the ProGenTomics group for the MS²PIP for DIA project.
If you use MS²PIP for your research, please cite the following articles:
- Gabriels, R., Martens, L., & Degroeve, S. (2019). Updated MS²PIP web server delivers fast and accurate MS² peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques. Nucleic Acids Research. doi:10.1093/nar/gkz299
- Degroeve, S., Maddelein, D., & Martens, L. (2015). MS²PIP prediction server: compute and visualize MS² peak intensity predictions for CID and HCD fragmentation. Nucleic Acids Research, 43(W1), W326–W330. doi:10.1093/nar/gkv542
- Degroeve, S., & Martens, L. (2013). MS²PIP: a tool for MS/MS peak intensity prediction. Bioinformatics (Oxford, England), 29(24), 3199–203. doi:10.1093/bioinformatics/btt544
Please also note and mention the MS²PIP version you used.
Installation
Install with pip
pip install ms2pip
We recommend using a conda or venv virtual environment.
For development
Clone the repository and use pip to install an editable version:
pip install --editable .
Usage
MS²PIP comes with pre-trained models for a variety of fragmentation methods and modifications. These models can easily be applied by configuring MS²PIP in the config file and providing a list of peptides in the form of a PEPREC file. Optionally, MS²PIP predictions can be compared to spectra in an MGF file.
Command line interface
usage: ms2pip [-h] -c CONFIG_FILE [-s MGF_FILE] [-w FEATURE_VECTOR_OUTPUT]
[-r] [-x] [-t] [-n NUM_CPU]
<PEPREC file>
positional arguments:
<PEPREC file> list of peptides
optional arguments:
-h, --help show this help message and exit
-c CONFIG_FILE, --config-file CONFIG_FILE
config file
-s MGF_FILE, --spectrum-file MGF_FILE
.mgf MS2 spectrum file (optional)
-w FEATURE_VECTOR_OUTPUT, --vector-file FEATURE_VECTOR_OUTPUT
write feature vectors to FILE.{pkl,h5} (optional)
-r, --retention-time add retention time predictions (requires DeepLC python
package)
-x, --correlations calculate correlations (if MGF is given)
-t, --tableau create Tableau Reader file
-n NUM_CPU, --num-cpu NUM_CPU
number of CPUs to use (default: all available)
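As a rough illustration of how the flags above combine, the sketch below assembles an `ms2pip` command line from Python. The helper `build_ms2pip_command` and the file names (`peptides.peprec`, `config.txt`, `spectra.mgf`) are hypothetical; only the flags come from the usage text.

```python
import shlex


def build_ms2pip_command(peprec, config, mgf=None, correlations=False, num_cpu=None):
    """Assemble an ms2pip argument list from the flags shown in the usage text.

    Hypothetical helper for illustration; it is not part of MS2PIP.
    """
    cmd = ["ms2pip", "-c", config]  # -c CONFIG_FILE is required
    if mgf is not None:
        cmd += ["-s", mgf]
        if correlations:
            cmd.append("-x")  # correlations are only calculated if an MGF is given
    if num_cpu is not None:
        cmd += ["-n", str(num_cpu)]
    cmd.append(peprec)  # the positional <PEPREC file> goes last
    return cmd


# Hypothetical file names
cmd = build_ms2pip_command(
    "peptides.peprec", "config.txt", mgf="spectra.mgf", correlations=True, num_cpu=4
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file paths contain spaces.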
Input files
Config file
Several MS²PIP options need to be set in this config file.
- `model=X` where X is one of the currently supported MS²PIP models (see Specialized prediction models).
- `frag_error=X` where X is the fragmentation spectrum mass tolerance in Da (only relevant if an MGF file is passed).
- `out=X` where X is a comma-separated list of a selection of the currently supported output file formats: `csv`, `mgf`, `msp`, `spectronaut`, or `bibliospec` (SSL/MS2, also for Skyline). For example: `out=csv,msp`.
- `ptm=X,Y,opt,Z` for every peptide modification where:
  - `X` is the PTM name, which needs to match the names that are used in the PEPREC file. If the `--retention-time` option is used, PTM names must match the PSI-MOD/Unimod names embedded in DeepLC (see DeepLC documentation).
  - `Y` is the mass shift in Da associated with the PTM.
  - `Z` is the one-letter code of the amino acid that is modified by the PTM. For N- and C-terminal modifications, `Z` should be `N-term` or `C-term`, respectively.
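To make the `key=value` layout concrete, here is a minimal sketch of reading such a config into a Python dictionary. The function `parse_config` is a hypothetical helper written for this illustration, not MS²PIP's own parser, and it assumes one option per line with exactly four comma-separated fields per `ptm` entry.

```python
def parse_config(text):
    """Parse MS2PIP-style key=value lines into a dict (illustrative helper only)."""
    config = {"ptm": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("=")
        if key == "ptm":
            # ptm=X,Y,opt,Z -> name, mass shift in Da, the literal 'opt', modified site
            name, mass_shift, _opt, site = value.split(",")
            config["ptm"].append(
                {"name": name, "mass_shift": float(mass_shift), "site": site}
            )
        elif key == "out":
            config["out"] = value.split(",")
        elif key == "frag_error":
            config["frag_error"] = float(value)
        else:
            config[key] = value
    return config


example = """\
model=HCD
frag_error=0.02
out=csv,msp
ptm=Carbamidomethyl,57.02146,opt,C
ptm=Acetyl,42.010565,opt,N-term
"""

config = parse_config(example)
print(config["model"], config["frag_error"], config["out"])
print([(p["name"], p["site"]) for p in config["ptm"]])
```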
PEPREC file
To apply the pre-trained models you need to pass only a `<PEPREC file>` to MS²PIP. This file contains the peptide sequences for which you want to predict peak intensities. The file is space-separated and contains at least the following four columns:
- `spec_id`: unique id (string) for the peptide/spectrum. This must match the TITLE field in the corresponding MGF file, if given.
- `modifications`: Amino acid modifications for the given peptide. Every modification is listed as `location|name`, separated by a pipe (`|`) between the location, the name, and other modifications. `location` is an integer counted starting at `1` for the first AA. `0` is reserved for N-terminal modifications, `-1` for C-terminal modifications. `name` has to correspond to a modification listed in the Config file. Unmodified peptides are marked with a hyphen (`-`).
- `peptide`: the unmodified amino acid sequence.
- `charge`: precursor charge state as an integer (without `+`).
Peptides must be strictly longer than 2 and shorter than 100 amino acids and cannot contain the following amino acid one-letter codes: B, J, O, U, X or Z. Peptides not fulfilling these requirements will be filtered out and will not be reported in the output.
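If you want to know in advance which peptides would be dropped, the rules above can be checked with a few lines of Python. This is a sketch of the stated rules, not MS²PIP's internal filter; `is_supported_peptide` is a hypothetical name, and it assumes both length bounds are exclusive.

```python
# One-letter codes that the text above lists as unsupported
INVALID_RESIDUES = set("BJOUXZ")


def is_supported_peptide(peptide):
    """Return True if the peptide meets the length and residue rules stated above."""
    return 2 < len(peptide) < 100 and not (set(peptide) & INVALID_RESIDUES)


peptides = ["ACDEK", "AC", "ACDEXK", "A" * 100]
kept = [p for p in peptides if is_supported_peptide(p)]
print(kept)  # ['ACDEK']
```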
In the conversion_tools folder, we provide a host of Python scripts to convert common search engine output files to a PEPREC file.
To start from a FASTA file, see fasta2speclib.
MGF file (optional)
Optionally, an MGF file with measured spectra can be passed to MS²PIP. In this case, MS²PIP will calculate correlations between the measured and predicted peak intensities. Make sure that the PEPREC `spec_id` matches the MGF `TITLE` field. Spectra present in the MGF file, but missing in the PEPREC file (and vice versa) will be skipped.
Examples
Suppose the config file contains the following lines:
model=HCD
frag_error=0.02
out=csv,mgf,msp
ptm=Carbamidomethyl,57.02146,opt,C
ptm=Acetyl,42.010565,opt,N-term
ptm=Glyloss,-58.005479,opt,C-term
then the PEPREC file could look like this:
spec_id modifications peptide charge
peptide1 - ACDEK 2
peptide2 2|Carbamidomethyl ACDEFGR 3
peptide3 0|Acetyl|2|Carbamidomethyl ACDEFGHIK 2
In this example, `peptide3` is N-terminally acetylated and carries a carbamidomethyl on its second amino acid.
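The `location|name` encoding can be unpacked programmatically. The following sketch splits a `modifications` field into pairs and describes where each modification sits; `parse_modifications` and `describe` are hypothetical helpers for illustration, not part of MS²PIP.

```python
def parse_modifications(mod_string):
    """Split a PEPREC modifications field into (location, name) pairs."""
    if mod_string == "-":
        return []  # a hyphen marks an unmodified peptide
    parts = mod_string.split("|")
    if len(parts) % 2 != 0:
        raise ValueError(f"Expected location|name pairs, got: {mod_string!r}")
    return [(int(parts[i]), parts[i + 1]) for i in range(0, len(parts), 2)]


def describe(peptide, mod_string):
    """Describe each modification in words, using the location conventions above."""
    descriptions = []
    for location, name in parse_modifications(mod_string):
        if location == 0:
            descriptions.append(f"{name} on the N-terminus")
        elif location == -1:
            descriptions.append(f"{name} on the C-terminus")
        else:
            # Locations are 1-based, Python indices are 0-based
            descriptions.append(f"{name} on {peptide[location - 1]}{location}")
    return descriptions


print(describe("ACDEFGHIK", "0|Acetyl|2|Carbamidomethyl"))
# ['Acetyl on the N-terminus', 'Carbamidomethyl on C2']
```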
The corresponding (optional) MGF file can contain the following spectrum:
BEGIN IONS
TITLE=peptide1
PEPMASS=283.11849750978325
CHARGE=2+
72.04434967 0.00419513
147.11276245 0.17418982
175.05354309 0.03652963
...
END IONS
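To check up front which spectra will be matched, you can compare the MGF `TITLE` values with the PEPREC `spec_id` values. The reader below is a minimal sketch that only handles the fields shown in the example above; `read_mgf` is a hypothetical helper, and a dedicated MGF parser is likely a better choice for real data.

```python
def read_mgf(text):
    """Parse MGF text into {title: [(mz, intensity), ...]} (minimal sketch)."""
    spectra, title, peaks = {}, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            title, peaks = None, []
        elif line == "END IONS":
            if title is not None:
                spectra[title] = peaks
        elif line.startswith("TITLE="):
            title = line[len("TITLE="):]
        elif line and line[0].isdigit():
            # Peak lines hold m/z and intensity separated by whitespace
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))
    return spectra


mgf_text = """\
BEGIN IONS
TITLE=peptide1
PEPMASS=283.11849750978325
CHARGE=2+
72.04434967 0.00419513
147.11276245 0.17418982
175.05354309 0.03652963
END IONS
"""

spectra = read_mgf(mgf_text)
peprec_ids = {"peptide1", "peptide2", "peptide3"}  # spec_id column of the PEPREC example
matched = sorted(peprec_ids & set(spectra))
print(matched)  # ['peptide1']
```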
Output
The predictions are saved in the output file(s) specified in the config file. Note that the normalization of intensities depends on the output file format. In the CSV file output, intensities are log2-transformed. To "unlog" the intensities, use the following formula: `intensity = (2 ** log2_intensity) - 0.001`.
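Applied in Python, the formula looks as follows. The function names and the sample values are made up for illustration; `relog_intensity` is simply the mathematical inverse of the formula above.

```python
import math

OFFSET = 0.001  # constant taken from the formula above


def unlog_intensity(log2_intensity):
    """Convert a log2-transformed CSV intensity back to linear scale."""
    return (2 ** log2_intensity) - OFFSET


def relog_intensity(intensity):
    """Mathematical inverse of unlog_intensity: log2(intensity + OFFSET)."""
    return math.log2(intensity + OFFSET)


# Hypothetical log2 values, as they might appear in a CSV prediction column
log2_values = [-9.96578, -3.0, 0.0]
linear = [unlog_intensity(v) for v in log2_values]
print([round(v, 4) for v in linear])
```

A log2 value close to `log2(0.001)` (about -9.966) therefore corresponds to a linear intensity of roughly zero.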
Specialized prediction models
MS²PIP contains multiple specialized prediction models, fit for peptide spectra with different properties. These properties include fragmentation method, instrument, labeling techniques and modifications. As all of these properties can influence fragmentation patterns, it is important to match the MS²PIP model to the properties of your experimental dataset.
Currently the following models are supported in MS²PIP: `HCD`, `CID`, `iTRAQ`, `iTRAQphospho`, `TMT`, `TTOF5600`, `HCDch2` and `CIDch2`. The last two "ch2" models also include predictions for doubly charged fragment ions (b++ and y++), next to the predictions for singly charged b- and y-ions.
MS² acquisition information and peptide properties of the models' training datasets
Model | Fragmentation method | MS² mass analyzer | Peptide properties |
---|---|---|---|
HCD | HCD | Orbitrap | Tryptic digest |
CID | CID | Linear ion trap | Tryptic digest |
iTRAQ | HCD | Orbitrap | Tryptic digest, iTRAQ-labeled |
iTRAQphospho | HCD | Orbitrap | Tryptic digest, iTRAQ-labeled, enriched for phosphorylation |
TMT | HCD | Orbitrap | Tryptic digest, TMT-labeled |
TTOF5600 | CID | Quadrupole Time-of-Flight | Tryptic digest |
HCDch2 | HCD | Orbitrap | Tryptic digest |
CIDch2 | CID | Linear ion trap | Tryptic digest |
Models, version numbers, and the train and test datasets used to create each model
Model | Current version | Train-test dataset (unique peptides) | Evaluation dataset (unique peptides) | Median Pearson correlation on evaluation dataset |
---|---|---|---|---|
HCD | v20190107 | MassIVE-KB (1 623 712) | PXD008034 (35 269) | 0.903786 |
CID | v20190107 | NIST CID Human (340 356) | NIST CID Yeast (92 609) | 0.904947 |
iTRAQ | v20190107 | NIST iTRAQ (704 041) | PXD001189 (41 502) | 0.905870 |
iTRAQphospho | v20190107 | NIST iTRAQ phospho (183 383) | PXD001189 (9 088) | 0.843898 |
TMT | v20190107 | Peng Lab TMT Spectral Library (1 185 547) | PXD009495 (36 137) | 0.950460 |
TTOF5600 | v20190107 | PXD000954 (215 713) | PXD001587 (15 111) | 0.746823 |
HCDch2 | v20190107 | MassIVE-KB (1 623 712) | PXD008034 (35 269) | 0.903786 (+) and 0.644162 (++) |
CIDch2 | v20190107 | NIST CID Human (340 356) | NIST CID Yeast (92 609) | 0.904947 (+) and 0.813342 (++) |
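For readers who want to reproduce a figure like the median Pearson correlation on their own data, the sketch below computes one Pearson correlation per spectrum and then takes the median. It uses made-up toy intensities and makes no claim about how MS²PIP preprocesses intensities before correlating; the function name `pearson` is an assumption of this example.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / math.sqrt(var_x * var_y)


# Toy (measured, predicted) intensity vectors, one pair per spectrum
pairs = [
    ([0.1, 0.5, 0.9], [0.2, 0.6, 1.0]),
    ([0.1, 0.5, 0.9], [0.9, 0.5, 0.1]),
    ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
]
correlations = [pearson(measured, predicted) for measured, predicted in pairs]
median_r = median(correlations)
print(round(median_r, 4))
```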
To train custom MS²PIP models, please refer to Training new MS²PIP models on our Wiki pages.