MSPeeps

A Python package for extracting mass spectrometry spectra from mzML files and converting them to MSP format.

Overview

This tool allows you to:

  • Extract spectra from mzML files by spectrum index or by retention time
  • Apply intensity cutoffs to filter peaks
  • Convert SMILES to InChI and InChIKey when a SMILES string is provided
  • Format the extracted data into MSP files according to standard conventions
  • Process multiple spectra in batch mode via a tabular input file (TSV or Excel)
  • Match peaks to molecular formulas using the fragfit package
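As an illustration of the intensity cutoff listed above, here is a minimal sketch in plain Python. The function name `apply_intensity_cutoff` is hypothetical and not part of the MSPeeps API, and keeping peaks whose intensity equals the cutoff is an assumption; the package may treat the boundary differently.

```python
def apply_intensity_cutoff(mz_values, intensities, cutoff=0.0):
    """Keep only peaks whose intensity is at least `cutoff`.

    Hypothetical helper for illustration; the >= comparison is an assumption.
    """
    kept = [(mz, inten) for mz, inten in zip(mz_values, intensities) if inten >= cutoff]
    # Unzip back into two parallel lists (both empty if no peak survives).
    kept_mz = [mz for mz, _ in kept]
    kept_int = [inten for _, inten in kept]
    return kept_mz, kept_int


mz = [30.033819, 55.054611, 57.070259, 68.049652, 84.080811]
intensity = [2461, 1497, 356, 568, 2834]
filtered_mz, filtered_int = apply_intensity_cutoff(mz, intensity, cutoff=500)
```

With a cutoff of 500, only the peak with intensity 356 is dropped.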

Installation

From PyPI (Recommended)

Install the latest release from PyPI:

pip install mspeeps

From conda

conda install -c gkreder mspeeps

From Source

Clone the repository and install in development mode:

git clone https://github.com/gkreder/mspeeps.git
cd mspeeps
pip install -e .

Using pixi

If you prefer using pixi for dependency management:

git clone https://github.com/gkreder/mspeeps.git
cd mspeeps
pixi install

Usage

Command-line Interface

MSPeeps provides a command-line interface with several subcommands. To list them:

mspeeps --help

Batch Processing (TSV file)

Process an input file using the default settings:

mspeeps batch input_file.tsv
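A batch input file can be written with any tool that produces tab-separated text. The sketch below uses Python's standard `csv` module and the column names from the Input Format section; the row values are taken from the Piperidine example in this document, and the mzML path is a placeholder.

```python
import csv

# Column names as listed in the Input Format section.
columns = [
    "Molecule_name", "SMILES", "Molecular_formula", "Raw_Intensity_Cutoff",
    "Formula_Matching_Tolerance", "m/z", "RT_seconds", "MS_level",
    "Collision_energy", "mzML_filepath", "Spectrum_index",
]

# One row per spectrum; values mirror the Piperidine example output.
row = {
    "Molecule_name": "Piperidine",
    "SMILES": "N1CCCCC1",
    "Molecular_formula": "C5H11N",
    "Raw_Intensity_Cutoff": 100.0,
    "Formula_Matching_Tolerance": 0.002,
    "m/z": 86.09643,
    "RT_seconds": 305.9,
    "MS_level": 2,
    "Collision_energy": "0 V",
    "mzML_filepath": "/path/to/file.mzML",  # placeholder path
    "Spectrum_index": 576,
}

with open("input_file.tsv", "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t")
    writer.writeheader()
    writer.writerow(row)
```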

Specify a custom output directory and log file:

mspeeps batch input_file.tsv --output_dir my_output --log_file custom_log.log

Enable verbose logging:

mspeeps batch input_file.tsv --verbose

Extract Spectra

Extract a spectrum from an mzML file using its index:

mspeeps extract --mzml_file file.mzML --spectrum_index 123 --output spectrum.json

Extract a spectrum using retention time:

mspeeps extract --mzml_file file.mzML --retention_time 305.9 --ms_level 2 --format csv
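Selecting a spectrum by retention time usually means picking the scan at the requested MS level whose time is closest to the target. The sketch below shows that idea on a small in-memory list. It is not MSPeeps's actual implementation: the function name, the dictionary keys, and the scan values are all made up for illustration, and times are assumed to be in seconds.

```python
def nearest_spectrum(spectra, target_rt, ms_level=2):
    """Return the index of the spectrum at `ms_level` closest in time to `target_rt`."""
    candidates = [s for s in spectra if s["ms_level"] == ms_level]
    if not candidates:
        raise ValueError(f"no spectra at MS level {ms_level}")
    best = min(candidates, key=lambda s: abs(s["rt_seconds"] - target_rt))
    return best["index"]


# Illustrative scan list (invented values, not read from a real mzML file).
spectra = [
    {"index": 10, "rt_seconds": 300.0, "ms_level": 1},
    {"index": 11, "rt_seconds": 304.0, "ms_level": 2},
    {"index": 12, "rt_seconds": 306.0, "ms_level": 1},
    {"index": 13, "rt_seconds": 309.0, "ms_level": 2},
]

chosen = nearest_spectrum(spectra, target_rt=305.9, ms_level=2)
```

Here scan 12 is closest in time overall, but it is an MS1 scan, so the MS2 search returns scan 11.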

Convert to MSP Format

Convert a JSON spectrum file to MSP:

mspeeps convert --json_file spectrum.json --output spectrum.msp

Convert separate m/z and intensity files:

mspeeps convert --mz_file mz_values.txt --intensity_file intensities.txt --output spectrum.msp

Match Formulas

Match m/z values to molecular formulas:

mspeeps match-formula --mz_values "30.03,55.05,84.08" --parent_formula "C5H11N" --tolerance 0.002

Or use a file with m/z values:

mspeeps match-formula --mz_values mz_values.txt --parent_formula "C5H11N" --format json

Convert SMILES

Convert a SMILES string to InChI and InChIKey:

mspeeps convert-smiles --smiles "N1CCCCC1"

Get mzML File Information

Get information about an mzML file:

mspeeps info --mzml_file file.mzML

Python API

import mspeeps
import pandas as pd

# Parse input file
df = mspeeps.parse_input_file("input_file.tsv")

# Process each row
for _, row in df.iterrows():
    output_path = mspeeps.process_file(row, output_dir="output")
    if output_path:
        print(f"Successfully processed: {output_path}")

# Or extract a spectrum directly
mz_array, intensity_array, metadata = mspeeps.extract_spectrum(
    mzml_path="file.mzML",
    spectrum_index=123,
    ms_level=2
)

# Format MSP data
msp_data = mspeeps.format_msp(
    mz_array,
    intensity_array,
    metadata,
    row_data={"Molecule_name": "Example"}
)

Input Format

The input should be a TSV or Excel file with the following columns:

Column Name                 Description                                 Required?
Molecule_name               Name of the molecule                        Yes
SMILES                      SMILES notation                             No
Molecular_formula           Chemical formula                            No
Raw_Intensity_Cutoff        Cutoff for peak intensity                   No (default: 0)
Formula_Matching_Tolerance  Tolerance for formula matching (in Da)      No
m/z                         Precursor m/z                               No
RT_seconds                  Retention time in seconds                   No*
RT                          Retention time in minutes                   No*
MS_level                    MS level                                    No (default: 2)
Collision_energy            Collision energy used                       No
mzML_filepath               Path to the mzML file                       Yes
Spectrum_index              Index of the spectrum in the mzML file      No*

* Either Spectrum_index or retention time (RT_seconds/RT) must be provided.

Notes:

  • If both spectrum index and RT are provided, the index is used.
  • RT values in "min" format (e.g., "1.453 min") are automatically converted to seconds.
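The selection rules above can be sketched as a small function over one input row, represented here as a plain dictionary. The names `rt_to_seconds` and `resolve_selection` are hypothetical, and preferring RT_seconds over RT when both are present is an assumption that the notes above do not confirm.

```python
def rt_to_seconds(value):
    """Convert an RT cell given in minutes (e.g. "1.453 min" or 1.453) to seconds."""
    text = str(value).strip()
    if text.lower().endswith("min"):
        text = text[:-3].strip()
    return float(text) * 60.0


def resolve_selection(row):
    """Return ("index", int) or ("rt_seconds", float) for one input row."""
    def present(key):
        value = row.get(key)
        return value is not None and str(value).strip() != ""

    # The spectrum index wins whenever it is provided.
    if present("Spectrum_index"):
        return "index", int(float(row["Spectrum_index"]))
    # Assumption: RT_seconds is preferred over RT if both are filled in.
    if present("RT_seconds"):
        return "rt_seconds", float(row["RT_seconds"])
    if present("RT"):
        return "rt_seconds", rt_to_seconds(row["RT"])
    raise ValueError("row needs Spectrum_index, RT_seconds, or RT")


by_index = resolve_selection({"Spectrum_index": "576", "RT_seconds": "305.9"})
by_minutes = resolve_selection({"RT": "1.453 min"})
```

The first call selects by index even though a retention time is also present; the second converts 1.453 minutes to roughly 87.18 seconds.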

Output Format

Each spectrum is written to its own MSP file in the following format:

NAME: [Molecule_name]
[Additional metadata from input file]
INCHI: [Calculated from SMILES if provided]
INCHIKEY: [Calculated from SMILES if provided]
RETENTIONTIME: [Retention time in seconds]
PRECURSORMZ: [Precursor m/z]
MSLEVEL: [MS level]
NUM PEAKS: [Number of peaks]
[m/z] [intensity]
[m/z] [intensity]
...
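To make the layout above concrete, the following sketch assembles a record of that shape from a name, a metadata dictionary, and a list of peaks. `format_msp_record` is a hypothetical helper, not the package's `format_msp`, and the number formatting (six decimals for m/z, whole-number intensities) is an assumption based on the example output in this document.

```python
def format_msp_record(name, metadata, peaks):
    """Build MSP text: NAME first, then metadata lines, NUM PEAKS, then one peak per line."""
    lines = [f"NAME: {name}"]
    lines.extend(f"{key}: {value}" for key, value in metadata.items())
    lines.append(f"NUM PEAKS: {len(peaks)}")
    lines.extend(f"{mz:.6f} {intensity:.0f}" for mz, intensity in peaks)
    return "\n".join(lines) + "\n"


record = format_msp_record(
    "Piperidine",
    {"PRECURSORMZ": "86.096430", "MSLEVEL": 2},
    [(30.033819, 2461), (84.080811, 2834)],
)
print(record)
```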

Formula Matching

Given the parent formula, the tool can match each fragment peak in the spectrum to the closest possible molecular formula within a specified tolerance. This enables:

  • Fragment Formula Assignment: Each m/z peak is annotated with its most likely molecular formula
  • Exact Mass Calculation: The exact mass of each assigned formula is calculated
  • Enhanced Output Format: Peak lines include formula, exact mass, and m/z difference (actual - theoretical): [m/z] [intensity] "[formula]" [exact_mass] [m/z_difference]
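The matching step can be pictured as a nearest-neighbor search over candidate masses with a tolerance check. The sketch below assumes the candidate formulas and their exact masses are already known and passes them in as a dictionary; how the fragfit package actually enumerates candidates from the parent formula is not shown here. The function name `closest_formula` is hypothetical, and the masses are copied from the example output in this document.

```python
def closest_formula(observed_mz, candidates, tolerance):
    """Return (formula, exact_mass, difference) for the nearest candidate, or None.

    `difference` is observed minus theoretical, matching the output format above.
    """
    best = None
    for formula, exact_mass in candidates.items():
        difference = observed_mz - exact_mass
        if abs(difference) > tolerance:
            continue  # outside the allowed mass window
        if best is None or abs(difference) < abs(best[2]):
            best = (formula, exact_mass, difference)
    return best


# Candidate ion masses taken from the example output in this document.
candidates = {
    "CH4N": 30.033826,
    "C4H7": 55.054227,
    "C4H9": 57.069877,
    "C5H10N": 84.080776,
}

match = closest_formula(55.054611, candidates, tolerance=0.002)
no_match = closest_formula(40.0, candidates, tolerance=0.002)
```

A peak with no candidate inside the tolerance window yields `None`, so it can be left unannotated.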

When formula matching is enabled, the output MSP file will look like:

NAME: Piperidine
SMILES: N1CCCCC1
MOLECULAR FORMULA: C5H11N
RAW INTENSITY CUTOFF: 100.0
FORMULA MATCHING TOLERANCE: 0.002
M/Z: 86.09643
RT SECONDS: 305.9
MS LEVEL: 2
COLLISION ENERGY: 0 V
MZML FILEPATH: /path/to/file.mzML
SPECTRUM INDEX: 576
INCHI: InChI=1S/C5H11N/c1-2-4-6-5-3-1/h6H,1-5H2
INCHIKEY: WEVYAHXRMPXWCK-UHFFFAOYSA-N
RETENTIONTIME: 5.04
PRECURSORMZ: 86.096430
MSLEVEL: 2
NUM PEAKS: 5
30.033819 2461 "CH4N" 30.033826 -0.000007
55.054611 1497 "C4H7" 55.054227 0.000384
57.070259 356 "C4H9" 57.069877 0.000382
68.049652 568 "C4H6N" 68.049476 0.000176
84.080811 2834 "C5H10N" 84.080776 0.000035
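The exact masses in the peak lines above are consistent with the monoisotopic mass of each formula minus one electron mass, that is, a singly charged cation. The sketch below reproduces two of the lines on that basis. The electron correction is inferred from the numbers in the example and is not a confirmed description of how fragfit computes masses; the helper names are hypothetical, and the element table covers only C, H, N, and O.

```python
import re

# Monoisotopic masses in Da for a few common elements.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
}
ELECTRON_MASS = 0.000548579909


def parse_formula(formula):
    """Parse a simple formula such as "C5H10N" into {"C": 5, "H": 10, "N": 1}."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(count) if count else 1)
    return counts


def cation_mz(formula):
    """Monoisotopic m/z of the singly charged cation (assumed ion type)."""
    neutral = sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())
    return neutral - ELECTRON_MASS


def peak_line(observed_mz, intensity, formula):
    """Format one annotated peak line: m/z, intensity, formula, exact mass, difference."""
    exact = cation_mz(formula)
    return f'{observed_mz:.6f} {intensity} "{formula}" {exact:.6f} {observed_mz - exact:.6f}'


print(peak_line(30.033819, 2461, "CH4N"))
print(peak_line(84.080811, 2834, "C5H10N"))
```

Both printed lines agree with the corresponding lines of the example output to six decimal places.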

Development

Running Tests

pytest

With coverage:

pytest --cov=mspeeps

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
