
Tools for mass spectrometry data analysis

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.


ms-toolkit

Tools for mass spectrometry (MS) library searching and model training.

This library provides a pipeline for vectorizing spectra, training Word2Vec models, preselecting candidates with KMeans clustering or Gaussian Mixture Models (GMM), and searching by weighted cosine or embedding similarity. Portions of the code are adapted from the Spec2Vec project.
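To make the weighted cosine idea concrete, here is a minimal standard-library sketch. It is not the code in similarity.py: the function name, the integer m/z binning, and the default exponents (mz_power, intensity_power) are assumptions chosen for illustration.

```python
import math


def weighted_cosine(query, reference, mz_power=1.0, intensity_power=0.5):
    """Cosine similarity between two spectra after peak weighting.

    Each spectrum is a list of (m/z, intensity) tuples. Peaks are matched
    on m/z rounded to the nearest integer (an assumption of this sketch).
    """
    def to_weighted(spectrum):
        weighted = {}
        for mz, intensity in spectrum:
            key = round(mz)
            # Assumed weighting scheme: m/z^a * intensity^b.
            weight = (mz ** mz_power) * (intensity ** intensity_power)
            # Merge peaks that fall into the same integer bin.
            weighted[key] = weighted.get(key, 0.0) + weight
        return weighted

    q = to_weighted(query)
    r = to_weighted(reference)
    dot = sum(q[k] * r[k] for k in q.keys() & r.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_q * norm_r)
```

Raising the m/z exponent gives more influence to high-mass fragments, which tend to be more specific to a compound; which exponents ms-toolkit actually uses is not stated here.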

Features

  • Parse MS library text files with optional progress UI
  • Create SpectrumDocument objects for Word2Vec training
  • Train and load Word2Vec models (w2v.py)
  • Vectorize spectra and perform similarity search (preprocessing.py, similarity.py)
  • Preselect candidates using KMeans or Gaussian Mixture Models (preselector.py)
  • High-level MSToolkit facade wrapping the workflow (api.py)
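The preselection step listed above can be pictured as a nearest-centroid lookup: the query is assigned to a cluster, and only that cluster's library entries are searched in detail. The sketch below assumes a trained KMeans model has already supplied the centroids; nearest_cluster, preselect, and the toy data are hypothetical and do not reflect the contents of preselector.py.

```python
import math


def nearest_cluster(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def preselect(query_vector, centroids, cluster_members):
    """Return only the library entries that share the query's cluster."""
    return cluster_members[nearest_cluster(query_vector, centroids)]


# Toy data: two 2-D centroids and the (hypothetical) compounds in each cluster.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster_members = {0: ["compound_a", "compound_b"], 1: ["compound_c"]}

candidates = preselect((9.0, 8.5), centroids, cluster_members)
print(candidates)
```

Restricting the detailed similarity search to one cluster trades a small risk of missing a match near a cluster boundary for a much shorter candidate list.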

Installation

From the project root, install with pip, which uses the included setup.py:

pip install .

Dependencies include numpy, joblib, gensim, and scikit-learn. Optional UI features require customtkinter or PySide6.
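To see which of the optional UI packages is present in the current environment, a small standard-library check like the following may help. The helper name available_ui_backends is hypothetical, and how ms-toolkit itself chooses between the two packages is not described here.

```python
from importlib.util import find_spec


def available_ui_backends(candidates=("customtkinter", "PySide6")):
    """Return the optional UI packages that can be imported here."""
    return [name for name in candidates if find_spec(name) is not None]


backends = available_ui_backends()
print("UI backends found:", backends or "none")
```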

Quick Start: Open MassBank Workflow

Pretrained models for MassBank are provided for immediate use.

Note: MassBank is an open-source mass spectral library for small-molecule identification. It is freely available and can be downloaded directly through the ms-toolkit API.

from ms_toolkit.api import MSToolkit

# Initialize toolkit (defaults to open MassBank workflow)
toolkit = MSToolkit()

# Download and load the MassBank library (first run will download and cache)
toolkit.download_library()

# Load pretrained MassBank Word2Vec and preselector models
toolkit.load_w2v('models/massbank_25epochs.model')
toolkit.load_preselector('models/massbank_kmeans.pkl')

# Search using a query spectrum (list of (m/z, intensity) tuples)
query = [
    (27.0, 0.09),
    (39.0, 0.04),
    (41.0, 0.09),
    (43.0, 0.33),
    (71.0, 0.28),
    (114.0, 0.05),
]
results = toolkit.search_w2v(query)
for compound, score in results:
    print(f'{compound}: {score:.3f}')
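
The loop above suggests that results is an iterable of (compound, score) pairs. Under that assumption, and assuming a higher score means a better match, a small post-processing helper could keep only the strongest hits. The name top_hits and the placeholder results below are made up for illustration; whether search_w2v offers its own limit or threshold parameter is not stated here.

```python
def top_hits(results, min_score=0.0, limit=5):
    """Keep the best-scoring (compound, score) pairs, highest score first."""
    kept = [(compound, score) for compound, score in results if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Placeholder values, not real search output.
example_results = [("compound_b", 0.84), ("compound_a", 0.91), ("compound_c", 0.42)]

for compound, score in top_hits(example_results, min_score=0.5):
    print(f"{compound}: {score:.3f}")
```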

Advanced Usage

You can also train your own models or use other libraries (e.g., NIST) if available:

# Example: train your own preselector and Word2Vec models
from ms_toolkit.api import MSToolkit

toolkit = MSToolkit(library_txt="your_library.txt")
toolkit.load_library()       # parse the library text file
toolkit.vectorize_library()  # vectorize the library spectra
toolkit.train_preselector(save_path="my_kmeans.pkl")
toolkit.train_w2v(save_path="my_w2v.model")
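
Word2Vec is trained on documents made of tokens, so each spectrum first has to be turned into a list of "words". The sketch below follows the peak@<m/z> token convention used by Spec2Vec; whether ms-toolkit's SpectrumDocument uses exactly this format is an assumption, and spectrum_to_words is a hypothetical helper.

```python
def spectrum_to_words(peaks, n_decimals=0):
    """Turn (m/z, intensity) peaks into tokens such as 'peak@43'."""
    return [f"peak@{mz:.{n_decimals}f}" for mz, _intensity in sorted(peaks)]


peaks = [(43.0, 0.33), (27.0, 0.09), (71.0, 0.28)]
words = spectrum_to_words(peaks)
print(words)
```

Rounding m/z to a fixed number of decimals keeps the vocabulary small enough for peaks from different spectra to share tokens.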

License

This project is licensed under the Apache License 2.0. See LICENSE for details. The NOTICE file explains that some code derives from Spec2Vec, which is also Apache 2.0 licensed.


