A proteomics search engine for LC-MS1 spectra.
ms1searchpy - a DirectMS1 proteomics search engine for LC-MS1 spectra
ms1searchpy consumes LC-MS data (mzML) or peptide features (tsv) and performs protein identification and quantitation.
Basic usage
Basic command for protein identification:
ms1searchpy *.mzML -d path_to.FASTA
or
ms1searchpy *_peptideFeatures.tsv -d path_to.FASTA
Read further for detailed info, including quantitative analysis.
Citing ms1searchpy
Ivanov et al. Boosting MS1-only Proteomics with Machine Learning Allows 2000 Protein Identifications in Single-Shot Human Proteome Analysis Using 5 min HPLC Gradient. https://doi.org/10.1021/acs.jproteome.0c00863
Ivanov et al. DirectMS1: MS/MS-free identification of 1000 proteins of cellular proteomes in 5 minutes. https://doi.org/10.1021/acs.analchem.9b05095
Installation
Using pip:
pip install ms1searchpy
It is recommended to additionally install DeepLC; you may also want to install diffacto:
pip install deeplc diffacto
This should work on recent versions of Python (3.8-3.10).
Usage tutorial: protein identification
The script used for protein identification is called ms1searchpy. It needs input files (mzML or tsv) and a FASTA database.
Input files
If mzML files are provided, ms1searchpy will invoke biosaur2 to generate the feature table. You can also use other software such as Dinosaur or Biosaur, but biosaur2 is recommended. You can also build the table yourself; it must contain the columns 'massCalib', 'rtApex', 'charge' and 'nIsotopes'.
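If you build the feature table yourself, a quick check of the header can catch a missing column before the search starts. The following is a minimal sketch using only the Python standard library; the function name `missing_feature_columns` is hypothetical and not part of ms1searchpy, and the example values are made up.

```python
import csv
import io

# Column names taken from the requirement stated above.
REQUIRED_COLUMNS = ("massCalib", "rtApex", "charge", "nIsotopes")


def missing_feature_columns(handle):
    """Return the required columns absent from a tab-separated feature table.

    `handle` is any open text file (or file-like object) positioned at the
    header line. This helper is illustrative and not part of ms1searchpy.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Tiny in-memory table with made-up values, for illustration only.
example = io.StringIO(
    "massCalib\trtApex\tcharge\tnIsotopes\n"
    "1000.5\t12.3\t2\t4\n"
)
print(missing_feature_columns(example))
```

For a real file, open it with `open(path, newline="")` and pass the handle to the function; an empty list means all four columns are present.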
How to get mzML files
To get mzML from RAW files, you can use Proteowizard MSConvert...
msconvert path_to_file.raw -o path_to_output_folder --mzML --filter "peakPicking true 1-" --filter "MS2Deisotope" --filter "zeroSamples removeExtra" --filter "threshold absolute 1 most-intense"
...or compomics ThermoRawFileParser, which produces suitable files with default parameters.
RT predictor
For protein identification, ms1searchpy
needs a retention time prediction model. The recommended one is DeepLC,
but you can also use the Elude predictor from Percolator or the built-in additive model (default).
Examples
ms1searchpy test.mzML -d sprot_human.fasta -deeplc deeplc -ad 1
This command will run ms1searchpy with the DeepLC RT predictor available as deeplc (this should work if you install DeepLC alongside ms1searchpy). -ad 1 creates a shuffled decoy database for FDR estimation. You should use it only once and then reuse the created database for other searches.
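To illustrate what a shuffled decoy database is in general, here is a minimal sketch. It is not ms1searchpy's implementation: the `DECOY_` prefix, the function names, and the toy sequences are assumptions made for the example. Each decoy keeps the amino acid composition of its target protein but scrambles the order of residues.

```python
import random


def shuffled_decoy(sequence, rng):
    """Return `sequence` with its residues in random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def add_decoys(proteins, prefix="DECOY_", seed=0):
    """Return a new dict holding every target plus one shuffled decoy each.

    `proteins` maps protein names to sequences. The prefix is a hypothetical
    label for this sketch, not necessarily the one ms1searchpy writes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    combined = dict(proteins)
    for name, sequence in proteins.items():
        combined[prefix + name] = shuffled_decoy(sequence, rng)
    return combined


# Toy sequences, for illustration only.
targets = {"protA": "MKWVTFISLLK", "protB": "MEEPQSDPSVEPPLSQ"}
database = add_decoys(targets)
print(sorted(database))
```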
ms1searchpy test.features.tsv -d sprot_human_shuffled.fasta -deeplc env_deeplc/bin/deeplc
Here, a file with peptide features is used instead of an mzML file. DeepLC is installed in a separate environment, so the path to its executable is specified.
Output files
ms1searchpy produces several tables:
- identified proteins, FDR-filtered (sample.features_proteins.tsv);
- all identified proteins (sample.features_proteins_full.tsv) - this is the main result;
- all identified proteins based on all PFMs (sample.features_proteins_full_noexclusion.tsv);
- all matched peptide match fingerprints, or peptide-feature matches (PFMs) (sample.features_PFMs.tsv);
- all PFMs with features prepared for Machine Learning (sample.features_PFMs_ML.tsv);
- number of theoretical peptides per protein (sample.features_protsN.tsv);
- log file with estimated mass and RT accuracies (sample.features_log.txt).
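When scripting around a search, it can help to compute the expected output names up front. The sketch below derives them from a feature table name using the suffixes listed above. The function `expected_outputs`, the dictionary keys, and the assumption that outputs are written next to the feature table are all illustrative choices, not documented behavior.

```python
from pathlib import Path

# Suffixes copied from the list of output files above; the keys are
# hypothetical labels chosen for this sketch.
OUTPUT_SUFFIXES = {
    "proteins_fdr": "_proteins.tsv",
    "proteins_full": "_proteins_full.tsv",
    "proteins_full_noexclusion": "_proteins_full_noexclusion.tsv",
    "pfms": "_PFMs.tsv",
    "pfms_ml": "_PFMs_ML.tsv",
    "prots_n": "_protsN.tsv",
    "log": "_log.txt",
}


def expected_outputs(feature_table):
    """Map each output kind to its expected path.

    Assumes (not confirmed by the text) that outputs sit in the same folder
    as the feature table and share its name without the trailing '.tsv'.
    """
    path = Path(feature_table)
    name = path.name
    stem = name[: -len(".tsv")] if name.endswith(".tsv") else name
    return {key: path.with_name(stem + suffix)
            for key, suffix in OUTPUT_SUFFIXES.items()}


for kind, out_path in expected_outputs("sample.features.tsv").items():
    print(kind, out_path)
```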
Combine results from replicates
You can combine the results from several replicate runs with ms1combine by feeding it _PFMs_ML.tsv tables:
ms1combine sample_rep_*.features_PFMs_ML.tsv
Usage tutorial: Quantitation
After obtaining the protein identification results, you can proceed to compare your samples using LFQ.
Using diffacto
Here's an example that uses shell brace expansion (as supported by Bash) for brevity. Each sample contains three replicates:
ms1todiffacto -dif diffacto -S1 sample1_r{1,2,3}.features_proteins.tsv -S2 sample2_r{1,2,3}.features_proteins.tsv -norm median -out diffacto_output.tsv -min_samples 3
ms1todiffacto prepares the input file for diffacto from ms1searchpy output and automatically runs diffacto.
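Brace expansion is not available in every shell, so it may be convenient to build the same command with explicit file names. The sketch below does this in Python, using only the flags shown in the example command; the helper names `replicate_files` and `ms1todiffacto_command` are hypothetical, and the file naming pattern is copied from that example.

```python
import shlex


def replicate_files(sample, replicates, suffix=".features_proteins.tsv"):
    """List per-replicate result files, e.g. sample1_r1.features_proteins.tsv."""
    return [f"{sample}_r{rep}{suffix}" for rep in replicates]


def ms1todiffacto_command(sample1, sample2, replicates=(1, 2, 3), min_samples=3):
    """Build the argument list for the ms1todiffacto call shown above."""
    return [
        "ms1todiffacto", "-dif", "diffacto",
        "-S1", *replicate_files(sample1, replicates),
        "-S2", *replicate_files(sample2, replicates),
        "-norm", "median",
        "-out", "diffacto_output.tsv",
        "-min_samples", str(min_samples),
    ]


command = ms1todiffacto_command("sample1", "sample2")
# Print the command as a single shell-safe line.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to run the tool without relying on shell expansion.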
Using directms1quant
directms1quant, a new LFQ method designed specifically for DirectMS1, is invoked like this:
directms1quant -S1 sample1_r{1,2,3}.features_proteins_full.tsv -S2 sample2_r{1,2,3}.features_proteins_full.tsv -min_samples 3
It produces a filtered table of significantly changed proteins with p-values and fold changes, the full protein table, and a separate file that simply lists the IDs of all significantly changed proteins (e.g. for easy copy-paste into a StringDB search window).
Links
- GitHub repo & issue tracker: https://github.com/markmipt/ms1searchpy
- Mailing list: markmipt@gmail.com
- Diffacto repo: https://github.com/statisticalbiotechnology/diffacto
- DeepLC repo: https://github.com/compomics/DeepLC