Tools for mass spectrometry data analysis
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
ms-toolkit
Tools for mass spectrometry (MS) library searching and model training.
This library provides a pipeline for vectorizing spectra, training Word2Vec models, preselecting candidates using clustering/GMM, and searching using weighted cosine or embedding similarity. Portions of the code are adapted from the Spec2Vec project.
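To make the "weighted cosine" step concrete, here is a minimal sketch of one common formulation: each peak is reweighted as `mz ** a * intensity ** b` before an ordinary cosine comparison. The function name `weighted_cosine`, the integer binning, and the default exponents are illustrative assumptions, not the values or implementation used in `similarity.py`.

```python
import math


def weighted_cosine(query, reference, mz_power=1.0, intensity_power=0.5):
    """Cosine similarity between two spectra after peak reweighting.

    Each spectrum is a list of (m/z, intensity) tuples. Peaks are binned with
    Python's round() on m/z, and each peak contributes
    (m/z ** mz_power) * (intensity ** intensity_power) to its bin.
    The exponents are illustrative defaults (assumption).
    """
    def to_weighted_bins(spectrum):
        bins = {}
        for mz, intensity in spectrum:
            key = round(mz)
            weight = (mz ** mz_power) * (intensity ** intensity_power)
            # Peaks that fall into the same bin have their weights summed.
            bins[key] = bins.get(key, 0.0) + weight
        return bins

    q = to_weighted_bins(query)
    r = to_weighted_bins(reference)

    # Only bins present in both spectra contribute to the dot product.
    dot = sum(w * r[k] for k, w in q.items() if k in r)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_r = math.sqrt(sum(w * w for w in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)
```

Raising `mz_power` emphasizes high-mass fragments, which tend to be more compound-specific; lowering `intensity_power` reduces the dominance of the base peak.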
Features
- Parse MS library text files with optional progress UI
- Create `SpectrumDocument` objects for Word2Vec training
- Train and load Word2Vec models (`w2v.py`)
- Vectorize spectra and perform similarity search (`preprocessing.py`, `similarity.py`)
- Preselect candidates using KMeans or Gaussian Mixture Models (`preselector.py`)
- High-level `MSToolkit` facade wrapping the workflow (`api.py`)
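Word2Vec training treats each spectrum as a "document" made of peak tokens. In Spec2Vec, peaks are encoded as strings such as `peak@43.00`; the sketch below follows that convention. Whether this package's `SpectrumDocument` uses exactly the same token format is an assumption, and `spectrum_to_words` is a hypothetical helper, not part of the API.

```python
def spectrum_to_words(spectrum, n_decimals=0):
    """Turn (m/z, intensity) peaks into Spec2Vec-style word tokens.

    Peaks are sorted by m/z and each becomes "peak@<m/z>", with the m/z
    value formatted to `n_decimals` decimal places. Intensities are ignored
    here; they would typically be used later as weights.
    """
    return [f"peak@{mz:.{n_decimals}f}" for mz, _ in sorted(spectrum)]


words = spectrum_to_words([(43.0, 0.33), (27.0, 0.09), (71.0, 0.28)])
print(words)
```

A list of such token lists, one per library spectrum, is the kind of corpus a Word2Vec model can be trained on.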
Installation
Install with pip using the included setup.py:
pip install .
Dependencies include numpy, joblib, gensim, and scikit-learn. Optional UI
features require customtkinter or PySide6.
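If you are unsure whether an optional UI backend is installed, a quick check with the standard library can tell you before you try to use the progress UI. This is only a convenience sketch; `available_ui_backends` is a hypothetical helper and not something the package provides.

```python
import importlib.util


def available_ui_backends(candidates=("customtkinter", "PySide6")):
    """Return the names of optional UI packages that can be imported."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


print(available_ui_backends())
```

An empty list means neither backend is installed in the current environment.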
Quick Start: Open MassBank Workflow
Pretrained models for MassBank are provided for immediate use.
Note: MassBank is an open-source mass spectral library for small-molecule identification. It is freely available and can be downloaded directly through the ms-toolkit API.
from ms_toolkit.api import MSToolkit
# Initialize toolkit (defaults to open MassBank workflow)
toolkit = MSToolkit()
# Download and load the MassBank library (first run will download and cache)
toolkit.download_library()
# Load pretrained MassBank Word2Vec and preselector models
toolkit.load_w2v('models/massbank_25epochs.model')
toolkit.load_preselector('models/massbank_kmeans.pkl')
# Search using a query spectrum (list of (m/z, intensity) tuples)
query = [
    (27.0, 0.09),
    (39.0, 0.04),
    (41.0, 0.09),
    (43.0, 0.33),
    (71.0, 0.28),
    (114.0, 0.05),
]
results = toolkit.search_w2v(query)
for compound, score in results:
    print(f'{compound}: {score:.3f}')
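The loop above treats search results as an iterable of `(compound, score)` pairs. Assuming that shape, and assuming higher scores mean better matches, a small post-filter like the following can trim the output to the strongest hits. `top_hits`, its thresholds, and the placeholder data are illustrative only.

```python
def top_hits(results, min_score=0.7, limit=5):
    """Keep the best-scoring (compound, score) pairs above a threshold."""
    kept = [(compound, score) for compound, score in results if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Placeholder data shaped like the search output; not real results.
example_results = [("compound A", 0.91), ("compound B", 0.42), ("compound C", 0.78)]
for compound, score in top_hits(example_results):
    print(f"{compound}: {score:.3f}")
```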
Advanced Usage
You can also train your own models or use other libraries (e.g., NIST) if available:
# Example: train your own preselector and Word2Vec models
toolkit = MSToolkit(library_txt="your_library.txt")
toolkit.load_library()
toolkit.vectorize_library()
toolkit.train_preselector(save_path="my_kmeans.pkl")
toolkit.train_w2v(save_path="my_w2v.model")
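The idea behind preselection is to compare a query only against library spectra that fall in the same cluster, rather than against the whole library. The following sketch shows that idea with plain Euclidean nearest-centroid assignment. It assumes centroids and per-spectrum cluster labels have already been computed (for example by KMeans); the function names are hypothetical and this is not necessarily how `preselector.py` works.

```python
import math


def nearest_cluster(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean)."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))


def preselect(query_vector, labels, centroids):
    """Return indices of library entries assigned to the query's cluster.

    `labels[i]` is the cluster index of library spectrum i.
    """
    cluster = nearest_cluster(query_vector, centroids)
    return [i for i, label in enumerate(labels) if label == cluster]


centroids = [(0.0, 0.0), (10.0, 10.0)]
labels = [0, 1, 0, 1]
print(preselect((9.0, 9.5), labels, centroids))
```

Only the returned candidates would then be scored with the more expensive similarity measure.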
License
This project is licensed under the Apache License 2.0. See LICENSE for details.
The NOTICE file explains that some code derives from Spec2Vec, which is also
Apache 2.0 licensed.
File details
Details for the file ms_toolkit_nrel-0.1.2.tar.gz.
File metadata
- Download URL: ms_toolkit_nrel-0.1.2.tar.gz
- Upload date:
- Size: 27.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c28ea88528c8a96a753ca3b946038b6db28722c1c03fe76e960745b90c509930 |
| MD5 | e0155166eef00438d52e4301c283cf9c |
| BLAKE2b-256 | a2852bb535799612fc24679d11d095e0fa564b00575de744322f17254954671f |
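To check a downloaded archive against the SHA256 digest listed above, you can hash the file locally with the standard library. This is a minimal sketch; `sha256_of` is a hypothetical helper, and the commented-out path assumes the file sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED_SHA256 = "c28ea88528c8a96a753ca3b946038b6db28722c1c03fe76e960745b90c509930"

# Uncomment once the archive has been downloaded (path is an assumption):
# print(sha256_of("ms_toolkit_nrel-0.1.2.tar.gz") == EXPECTED_SHA256)
```

Reading in chunks keeps memory use constant regardless of file size.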
Provenance
The following attestation bundles were made for ms_toolkit_nrel-0.1.2.tar.gz:
Publisher: python-publish.yml on calebcoatney/ms-toolkit
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ms_toolkit_nrel-0.1.2.tar.gz
- Subject digest: c28ea88528c8a96a753ca3b946038b6db28722c1c03fe76e960745b90c509930
- Sigstore transparency entry: 240611550
- Sigstore integration time:
- Permalink: calebcoatney/ms-toolkit@898999a9d86a6a37dfb1e02b97b1a022497bf729
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/calebcoatney
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@898999a9d86a6a37dfb1e02b97b1a022497bf729
- Trigger Event: release
File details
Details for the file ms_toolkit_nrel-0.1.2-py3-none-any.whl.
File metadata
- Download URL: ms_toolkit_nrel-0.1.2-py3-none-any.whl
- Upload date:
- Size: 28.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4bf73666464cc77eb7b8a7ec02c1294508054fb6c122eb9d7220f8abe25f505c |
| MD5 | 3bd941d567cefeabc15546a5aa70d6b4 |
| BLAKE2b-256 | 5e4e48a872e927fbd46a782cbec705ce921be53320713c875113f37bb21f0d22 |
Provenance
The following attestation bundles were made for ms_toolkit_nrel-0.1.2-py3-none-any.whl:
Publisher: python-publish.yml on calebcoatney/ms-toolkit
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ms_toolkit_nrel-0.1.2-py3-none-any.whl
- Subject digest: 4bf73666464cc77eb7b8a7ec02c1294508054fb6c122eb9d7220f8abe25f505c
- Sigstore transparency entry: 240611562
- Sigstore integration time:
- Permalink: calebcoatney/ms-toolkit@898999a9d86a6a37dfb1e02b97b1a022497bf729
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/calebcoatney
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@898999a9d86a6a37dfb1e02b97b1a022497bf729
- Trigger Event: release