Skip to main content

Mirko meets Proteomics: A PRIDE downloader on steroids for deep learning-based DeNovo Sequencing

Project description

MMProteo

Mirko meets Proteomics

Getting started

Requirements

  • Python 3.8 (other versions haven't been tested yet)
  • preferably Linux (other OSes haven't been tested yet)

If you want, set up a virtual environment to encapsulate this software and its dependencies:

virtualenv --python=/usr/bin/python3 venv
source venv/bin/activate

Installation

...from Pip

pip install mmproteo

...from Source Code

cd mmproteo
python setup.py install

...as editable Development Installation from Source Code

cd mmproteo
pip install -e .

Running MMProteo

The installations will make mmproteo available on your command line (and in your PATH), so preferably use this anywhere you want.

Alternatively, you can run the mmproteo-runner.py script in the mmproteo directory.

If you are crazy, you can also run the mmproteo/mmproteo/mmproteo.py script, but this one is already part of the installed mmproteo package and shouldn't really be used directly.

Usage

Simply entering mmproteo will already show you the available parameters. Using mmproteo --help you can get a better image of the functionality. Currently, the following parameters are available:

$ mmproteo -h
usage: mmproteo [-h] [-p PRIDE_PROJECT] [-n MAX_NUM_FILES]
                [--count-failed-files] [-d STORAGE_DIR] [-l LOG_FILE]
                [--log-to-stdout] [-t VALID_FILE_EXTENSIONS] [-e] [-x] [-v]
                [--shown-columns SHOWN_COLUMNS]
                {download,info,ls} [{download,info,ls} ...]

positional arguments:
  {download,info,ls}    the list of actions to be performed on the repository.
                        Every action can only occur once. Duplicates are
                        dropped after the first occurrence.

optional arguments:
  -h, --help            show this help message and exit
  -p PRIDE_PROJECT, --pride-project PRIDE_PROJECT
                        the name of the PRIDE project, e.g. 'PXD010000' from '
                        https://www.ebi.ac.uk/pride/ws/archive/peptide/list/pr
                        oject/PXD010000'.For some commands, this parameter is
                        required. (default: None)
  -n MAX_NUM_FILES, --max-num-files MAX_NUM_FILES
                        the maximum number of files to be downloaded. Set it
                        to '0' to download all files. (default: 0)
  --count-failed-files  Count failed files and do not just skip them. This is
                        relevant for the max-num-files parameter. (default:
                        True)
  -d STORAGE_DIR, --storage-dir STORAGE_DIR
                        the name of the directory, in which the downloaded
                        files and the log file will be stored. (default:
                        ./pride)
  -l LOG_FILE, --log-file LOG_FILE
                        the name of the log file, relative to the download
                        directory. (default: downloader.log)
  --log-to-stdout       Log to stdout instead of stderr. (default: False)
  -t VALID_FILE_EXTENSIONS, --valid-file-extensions VALID_FILE_EXTENSIONS
                        a list of comma-separated allowed file extensions to
                        filter files for. An empty list deactivates filtering.
                        Capitalization does not matter. (default: )
  -e, --no-skip-existing
                        Do not skip existing files. (default: False)
  -x, --no-extract      Do not try to extract downloaded files. (default:
                        False)
  -v, --verbose         Increase output verbosity to debug level. (default:
                        False)
  --shown-columns SHOWN_COLUMNS
                        a list of comma-separated column names. Some commands
                        show their results as tables, so their output columns
                        will be limited to those in this list. An empty list
                        deactivates filtering. Capitalization matters.
                        (default: )

There are multiple commands (positional arguments) as well as several optional arguments available. A single command doesn't require or use all the available arguments. However, one can provide multiple commands (separated by spaces), which will all use their share of the given arguments.

Examples

Request Project Information for the Project PXD010000

$ mmproteo -p PXD010000 info
2021-01-21 23:02:14,035 - MMProteo: Logging to <stderr> and to file ./pride/downloader.log
2021-01-21 23:02:14,035 - MMProteo: Requesting project summary from https://www.ebi.ac.uk:443/pride/ws/archive/project/PXD010000
2021-01-21 23:02:15,847 - MMProteo: Received project summary for project "PXD010000"
{
    "accession": "PXD010000",
    "title": "DeNovo Peptide Identification Deep Learning Training Set",
    "projectDescription": "A benchmark set of bottom-up proteomics data for training deep learning networks. It has data from 51 organisms and includes nearly 1 million peptides.",
    "publicationDate": "2018-06-13",
    "submissionType": "COMPLETE",
    "numAssays": 235,
    "species": [
        "Delftia acidovorans (strain DSM 14801 / SPH-1)",
        "Francisella tularensis subsp. novicida (strain U112)",
        "Cupriavidus necator (strain ATCC 43291 / DSM 13513 / N-1) (Ralstonia eutropha)",
        "Rhizobium radiobacter (Agrobacterium tumefaciens) (Agrobacterium radiobacter)",
        "Shewanella oneidensis (strain MR-1)",
        "Bacteroides thetaiotaomicron (strain ATCC 29148 / DSM 2079 / NCTC 10582 / E50 / VPI-5482)",
        "Synechococcus elongatus (strain PCC 7942) (Anacystis nidulans R2)",
        "bacteria",
        "Micrococcus luteus (Micrococcus lysodeikticus)",
        "Methylomicrobium alcaliphilum (strain DSM 19304 / NCIMB 14124 / VKM B-2133 / 20Z)",
        "Mycobacterium smegmatis",
        "Rhodopseudomonas palustris",
        "Listeria monocytogenes serotype 1/2a (strain 10403S)",
        "Alcaligenes faecalis",
        "Legionella pneumophila",
        "Lactobacillus casei subsp. casei ATCC 393",
        "Acidiphilium cryptum (strain JF-5)",
        "Bacillus cereus (strain ATCC 14579 / DSM 31)",
        "Citrobacter freundii",
        "Chryseobacterium indologenes",
        "Myxococcus xanthus DZ2",
        "Fibrobacter succinogenes subsp. succinogenes S85",
        "Bacteroides fragilis (strain 638R)",
        "Rhodococcus sp. (strain RHA1)",
        "Clostridium ljungdahlii (strain ATCC 55383 / DSM 13528 / PETC)",
        "Coprococcus comes ATCC 27758",
        "Bacillus subtilis subsp. subtilis str. NCIB 3610",
        "Bacillus subtilis subsp. subtilis str. 168",
        "Anaerococcus hydrogenalis DSM 7454",
        "Algoriphagus marincola HL-49",
        "Stigmatella aurantiaca (strain DW4/3-1)",
        "Streptococcus agalactiae",
        "Bifidobacterium longum subsp. infantis (strain ATCC 15697 / DSM 20088 / JCM 1222 / NCTC 11817 / S12)",
        "Sulfobacillus thermosulfidooxidans",
        "Paenibacillus polymyxa ATCC 842",
        "Faecalibacterium prausnitzii SL3/3",
        "Cyanobacterium stanieri",
        "Cellvibrio gilvus (strain ATCC 13127 / NRRL B-14078)",
        "Dorea formicigenerans",
        "Streptomyces griseorubens",
        "Streptomyces sp.",
        "Bifidobacterium bifidum DSM 20456 = JCM 1255",
        "Paracoccus denitrificans",
        "Prevotella ruminicola (strain ATCC 19189 / JCM 8958 / 23)",
        "Campylobacter jejuni",
        "Ruminococcus gnavus ATCC 29149",
        "Pseudomonas putida KT2440",
        "Cellulophaga baltica 18"
    ],
    "tissues": [],
    "ptmNames": [
        "Oxidation"
    ],
    "instrumentNames": [
        "Q Exactive"
    ],
    "projectTags": [],
    "doi": "10.6019/PXD010000",
    "submitter": {
        "title": "Dr",
        "firstName": "Matthew",
        "lastName": "Monroe",
        "email": "matthew.monroe@pnnl.gov",
        "affiliation": "Pacific Northwest National Laboratory"
    },
    "labHeads": [
        {
            "title": "Dr",
            "firstName": "Samuel",
            "lastName": "Payne",
            "email": "samuel.payne@pnnl.gov",
            "affiliation": "Pacific Northwest National Laboratory"
        }
    ],
    "submissionDate": "2018-06-12",
    "reanalysis": "PXD005851",
    "experimentTypes": [
        "Shotgun proteomics"
    ],
    "quantificationMethods": [],
    "keywords": "machine learning, deep learning, bacterial diversity",
    "sampleProcessingProtocol": "Samples were digested with trypsin then analyzed by LC-MS/MS",
    "dataProcessingProtocol": "Data was searched with MSGF+ using PNNL's DMS Processing pipeline",
    "otherOmicsLink": null,
    "numProteins": 1494431,
    "numPeptides": 10324593,
    "numSpectra": 11774613,
    "numUniquePeptides": 7496530,
    "numIdentifiedSpectra": 0,
    "references": []
}

List all Information about 2 Files of Project PXD010000

$ mmproteo -p PXD010000 -n 2 ls
2021-01-21 23:06:08,280 - MMProteo: Logging to <stderr> and to file ./pride/downloader.log
2021-01-21 23:06:08,280 - MMProteo: Requesting list of project files from https://www.ebi.ac.uk/pride/ws/archive/file/list/project/PXD010000
2021-01-21 23:06:08,852 - MMProteo: Received list of 1175 project files
2021-01-21 23:06:08,852 - MMProteo: Showing only the top 2 entries because of the max_num_files parameter
  projectAccession assayAccession fileType fileSource   fileSize                                                                             fileName                                                                                                                                        downloadLink                                                                                                                                asperaDownloadLink
0        PXD010000          93137   RESULT  SUBMITTED   98085358  Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39_msgfplus.mzid.gz  ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39_msgfplus.mzid.gz  prd_ascp@fasp.ebi.ac.uk:pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39_msgfplus.mzid.gz
1        PXD010000          93137     PEAK  SUBMITTED  340204025           Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz           ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz           prd_ascp@fasp.ebi.ac.uk:pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz

Show selected Information about 5 RAW or mzML files of Project PXD010000

$ mmproteo -p PXD010000 -n 5 --shown-columns "projectAccession,fileName,fileSize" -t "raw,mzml" ls
2021-01-21 23:13:57,067 - MMProteo: Logging to <stderr> and to file ./pride/downloader.log
2021-01-21 23:13:57,067 - MMProteo: Requesting list of project files from https://www.ebi.ac.uk/pride/ws/archive/file/list/project/PXD010000
2021-01-21 23:13:57,679 - MMProteo: Received list of 1175 project files
2021-01-21 23:13:57,679 - MMProteo: Filtering repository files based on the following required file extensions ["mzml", "raw"] and the following optional file extensions ["zip", "gz"]
2021-01-21 23:13:57,680 - MMProteo: Showing only the top 5 entries because of the max_num_files parameter
2021-01-21 23:13:57,681 - MMProteo: Limiting the shown columns according to the shown_columns parameter
   projectAccession                                                                    fileName    fileSize
1         PXD010000  Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz   340204025
3         PXD010000      Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.raw   479591415
6         PXD010000                               Cj_media_MH_R5_23Feb15_Arwen_14-12-03.mzML.gz   578988435
8         PXD010000                                   Cj_media_MH_R5_23Feb15_Arwen_14-12-03.raw  1842140687
11        PXD010000                               Cj_media_MH_R4_23Feb15_Arwen_14-12-03.mzML.gz   698723177

Recognize, that the archive extension .gz of the mzML files as well as the capitalization of mzML is ignored while filtering the files. The same would state for the .zip extension. Other archive formats are not supported yet.

Tests

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mmproteo-0.3.0.tar.gz (35.9 kB view hashes)

Uploaded Source

Built Distribution

mmproteo-0.3.0-py3-none-any.whl (31.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page