MassSpecGym: A benchmark for the discovery and identification of molecules


MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

💥 De novo molecular generation (MS/MS spectrum → molecular structure)
- 🎆 Bonus chemical formulae challenge (MS/MS spectrum + chemical formula → molecular structure)
💥 Molecular retrieval (MS/MS spectrum → ranked list of candidate molecular structures)
- 🎆 Bonus chemical formulae challenge (MS/MS spectrum + chemical formula → ranked list of candidate molecular structures)
💥 Spectrum simulation (molecular structure → MS/MS spectrum)

The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems with pre-defined datasets, data splits, and evaluation metrics.

📣 The paper will be available soon!

📦 Installation

Installation is available via pip:

pip install massspecgym

If you use conda, we recommend creating and activating a new environment before installing MassSpecGym:

conda create -n massspecgym python==3.11
conda activate massspecgym

If you are planning to run Jupyter notebooks provided in the repository or contribute to the project, we recommend installing the optional dependencies:

pip install "massspecgym[notebooks,dev]"
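
To confirm that the installation is visible to the active environment, you can query the installed version. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of MassSpecGym.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("massspecgym")
    print(f"massspecgym: {found or 'not installed in this environment'}")
```

If the script reports that the package is not installed, check that the intended conda environment is activated.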

🍩 Getting started with MassSpecGym

MassSpecGym’s infrastructure consists of predefined components that serve as building blocks for the implementation and evaluation of new models.

First, the MassSpecGym dataset is hosted as a Hugging Face dataset and can be loaded directly into a pandas DataFrame as follows:

from massspecgym.utils import load_massspecgym
df = load_massspecgym()

Second, MassSpecGym provides a set of transforms for spectra and molecules, which can be used to preprocess data for machine learning models. These transforms can be applied in conjunction with the MassSpecDataset class (or its subclasses), resulting in a PyTorch Dataset object that implicitly applies the specified transforms to each data point. Note that MassSpecDataset also downloads the dataset from the Hugging Face repository as needed.

from massspecgym.data import MassSpecDataset
from massspecgym.data.transforms import SpecTokenizer, MolFingerprinter

dataset = MassSpecDataset(
    spec_transform=SpecTokenizer(n_peaks=60),
    mol_transform=MolFingerprinter(),
)
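
To make the role of a spectrum transform more concrete, the sketch below shows one plausible way to turn a variable-length peak list into a fixed-size one: keep the `n_peaks` most intense peaks, order them by m/z, and pad with zeros. This is an illustration only. The function name `top_n_peaks` is hypothetical, and the actual behavior of `SpecTokenizer` may differ (for example in ordering, normalization, or additional tokens).

```python
def top_n_peaks(
    peaks: list[tuple[float, float]], n_peaks: int
) -> list[tuple[float, float]]:
    """Return exactly `n_peaks` (m/z, intensity) pairs.

    Assumption for illustration: the most intense peaks are kept,
    sorted by m/z, and missing slots are filled with (0.0, 0.0).
    """
    # Keep the n_peaks most intense peaks
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]
    # Restore ascending m/z order
    strongest.sort(key=lambda p: p[0])
    # Zero-pad so that every spectrum has the same length
    padding = [(0.0, 0.0)] * (n_peaks - len(strongest))
    return strongest + padding


if __name__ == "__main__":
    peaks = [(50.0, 0.1), (77.0, 1.0), (105.0, 0.4), (120.1, 0.05)]
    print(top_n_peaks(peaks, 3))  # three strongest peaks, sorted by m/z
    print(top_n_peaks(peaks, 6))  # all four peaks followed by two padding rows
```

A fixed length per spectrum is what allows spectra to be stacked into a single batch tensor.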

Third, MassSpecGym provides a MassSpecDataModule, a PyTorch Lightning LightningDataModule that automatically handles data splitting into training, validation, and testing folds, as well as loading data into batches.

from massspecgym.data import MassSpecDataModule

data_module = MassSpecDataModule(
    dataset=dataset,
    batch_size=32
)

Finally, MassSpecGym defines evaluation metrics by implementing abstract subclasses of LightningModule for each of the MassSpecGym challenges: DeNovoMassSpecGymModel, RetrievalMassSpecGymModel, and SimulationMassSpecGymModel. To implement a custom model, inherit from the appropriate abstract class and implement the forward and step methods. This procedure is described in the next section. If you are looking for more examples, please see the massspecgym/models folder.

🚀 Train and evaluate your model

MassSpecGym allows you to implement, train, validate, and test your model with a few lines of code. Built on top of PyTorch Lightning, MassSpecGym abstracts data preparation and splitting while eliminating boilerplate code for training and evaluation loops. To train and evaluate your model, you only need to implement your custom architecture and prediction logic.

Below is an example of how to implement a simple model based on DeepSets for the molecule retrieval task. The model is trained to predict the fingerprint of a molecule from its spectrum and then retrieves the most similar molecules from a set of candidates based on fingerprint similarity. For more examples, please see notebooks/demo.ipynb.
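
As background on the DeepSets idea: each peak is embedded independently, the embeddings are summed over peaks, and the pooled vector is mapped to the output, so the result does not depend on the order of the peaks. The pure-Python sketch below demonstrates this property with hand-written stand-ins for the learned networks. The functions `phi`, `rho`, and `deep_sets` here are hypothetical toy versions, not the model trained in this README.

```python
import math


def phi(peak: tuple[float, float]) -> tuple[float, float, float]:
    """Per-peak embedding (toy stand-in for a learned network)."""
    mz, intensity = peak
    return (intensity, intensity * mz, math.sqrt(intensity))


def rho(pooled: list[float]) -> list[float]:
    """Map the pooled embedding to the output (toy stand-in for a learned network)."""
    return [v / (1.0 + abs(v)) for v in pooled]


def deep_sets(peaks: list[tuple[float, float]]) -> list[float]:
    """Embed each peak, sum-pool over peaks, then transform the pooled vector."""
    embedded = [phi(p) for p in peaks]
    # Sum over peaks, one embedding dimension at a time
    pooled = [math.fsum(column) for column in zip(*embedded)]
    return rho(pooled)


if __name__ == "__main__":
    peaks = [(50.0, 0.1), (77.0, 1.0), (105.0, 0.4), (120.1, 0.05)]
    reordered = peaks[::-1]
    same = all(
        math.isclose(a, b)
        for a, b in zip(deep_sets(peaks), deep_sets(reordered))
    )
    print("order-independent:", same)
```

Sum pooling is what makes the model treat a spectrum as a set of peaks rather than as a sequence.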

Import necessary modules:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer

from massspecgym.data import RetrievalDataset, MassSpecDataModule
from massspecgym.data.transforms import SpecTokenizer, MolFingerprinter
from massspecgym.models.base import Stage
from massspecgym.models.retrieval.base import RetrievalMassSpecGymModel

Implement your model:

class MyDeepSetsRetrievalModel(RetrievalMassSpecGymModel):
    def __init__(
        self,
        hidden_channels: int = 128,
        out_channels: int = 4096,  # fingerprint size
        *args,
        **kwargs
    ):
        """Implement your architecture."""
        super().__init__(*args, **kwargs)

        self.phi = nn.Sequential(
            nn.Linear(2, hidden_channels),
            nn.ReLU(),
            nn.Linear(hidden_channels, hidden_channels),
            nn.ReLU(),
        )
        self.rho = nn.Sequential(
            nn.Linear(hidden_channels, hidden_channels),
            nn.ReLU(),
            nn.Linear(hidden_channels, out_channels),
            nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Implement your prediction logic."""
        x = self.phi(x)
        x = x.sum(dim=-2)  # sum over peaks
        x = self.rho(x)
        return x

    def step(
        self, batch: dict, stage: Stage
    ) -> dict[str, torch.Tensor]:
        """Implement your custom logic of using predictions for training and inference."""
        # Unpack inputs
        x = batch["spec"]  # input spectra
        fp_true = batch["mol"]  # true fingerprints
        cands = batch["candidates"]  # candidate fingerprints concatenated for a batch
        batch_ptr = batch["batch_ptr"]  # number of candidates per sample in a batch

        # Predict fingerprint
        fp_pred = self.forward(x)

        # Calculate loss
        loss = nn.functional.mse_loss(fp_true, fp_pred)

        # Calculate final similarity scores between predicted fingerprints and retrieval candidates
        fp_pred_repeated = fp_pred.repeat_interleave(batch_ptr, dim=0)
        scores = nn.functional.cosine_similarity(fp_pred_repeated, cands)

        return dict(loss=loss, scores=scores)
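
The `step` method returns a single flat vector of scores covering all candidates in the batch. The pure-Python sketch below illustrates how such a flat vector can be split back into per-spectrum rankings using the candidate counts in `batch_ptr`. The helper name `rank_candidates` is hypothetical; how MassSpecGym itself consumes `scores` is not shown in this document, so treat this as an illustration under that assumption.

```python
def rank_candidates(scores: list[float], batch_ptr: list[int]) -> list[list[int]]:
    """Split flat scores into per-sample groups and rank each group.

    Returns, for every sample, the candidate indices (local to that sample)
    ordered from highest to lowest score.
    """
    if sum(batch_ptr) != len(scores):
        raise ValueError("batch_ptr must sum to the number of scores")

    rankings = []
    start = 0
    for n_cands in batch_ptr:
        # Scores belonging to the current sample
        group = scores[start:start + n_cands]
        # Indices of the candidates, best score first
        order = sorted(range(n_cands), key=lambda i: group[i], reverse=True)
        rankings.append(order)
        start += n_cands
    return rankings


if __name__ == "__main__":
    # Two spectra: the first has 3 candidates, the second has 2
    scores = [0.2, 0.9, 0.5, 0.7, 0.1]
    batch_ptr = [3, 2]
    print(rank_candidates(scores, batch_ptr))  # [[1, 2, 0], [0, 1]]
```

Concatenating candidates and tracking their counts avoids padding every spectrum to the size of the largest candidate set.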

Train and validate your model:

# Init hyperparameters
n_peaks = 60
fp_size = 4096
batch_size = 32

# Load dataset
dataset = RetrievalDataset(
    spec_transform=SpecTokenizer(n_peaks=n_peaks),
    mol_transform=MolFingerprinter(fp_size=fp_size),
)

# Init data module
data_module = MassSpecDataModule(
    dataset=dataset,
    batch_size=batch_size,
    num_workers=4
)

# Init model
model = MyDeepSetsRetrievalModel(out_channels=fp_size)

# Init trainer
trainer = Trainer(accelerator="cpu", devices=1, max_epochs=5)

# Train
trainer.fit(model, datamodule=data_module)

Test your model:

# Test
trainer.test(model, datamodule=data_module)
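
The metrics reported during testing are defined by the challenge base class and are not listed in this document. As a rough illustration of what a retrieval metric measures, the sketch below computes a hit rate at k: the fraction of spectra whose true molecule appears among the k highest-scoring candidates. The function name `hit_rate_at_k` and the flat boolean `labels` format are assumptions made for this example, not MassSpecGym's API.

```python
def hit_rate_at_k(
    scores: list[float], labels: list[bool], batch_ptr: list[int], k: int
) -> float:
    """Fraction of samples with at least one true candidate in the top k.

    `scores` and `labels` are flat lists over all candidates in the batch;
    `batch_ptr` holds the number of candidates per sample.
    """
    if not batch_ptr:
        raise ValueError("batch_ptr must contain at least one sample")
    if sum(batch_ptr) != len(scores) or len(scores) != len(labels):
        raise ValueError("scores, labels, and batch_ptr are inconsistent")

    hits = 0
    start = 0
    for n_cands in batch_ptr:
        group_scores = scores[start:start + n_cands]
        group_labels = labels[start:start + n_cands]
        # Local indices of the k best-scoring candidates for this sample
        top_k = sorted(
            range(n_cands), key=lambda i: group_scores[i], reverse=True
        )[:k]
        hits += any(group_labels[i] for i in top_k)
        start += n_cands
    return hits / len(batch_ptr)


if __name__ == "__main__":
    scores = [0.2, 0.9, 0.5, 0.7, 0.1]
    labels = [False, False, True, True, False]
    batch_ptr = [3, 2]
    print(hit_rate_at_k(scores, labels, batch_ptr, k=1))  # 0.5
    print(hit_rate_at_k(scores, labels, batch_ptr, k=2))  # 1.0
```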

Submit your results to the leaderboard

TODO

References

If you use MassSpecGym in your work, please cite the following paper:

TODO

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.11

Release history

- 1.3.1: Mar 23, 2025
- 1.3.0: Mar 23, 2025
- 1.2.2: Feb 15, 2025
- 1.2.1: Feb 10, 2025
- 1.2.0: Feb 10, 2025
- 1.1.1: Nov 10, 2024
- 1.1.0: Nov 10, 2024
- 1.0.3: Oct 29, 2024
- 1.0.2: Oct 29, 2024 (this version)
- 1.0.1: Oct 28, 2024
- 1.0.0: Oct 28, 2024

Download files

Source Distribution

massspecgym-1.0.2.tar.gz (47.8 kB)

Uploaded Oct 29, 2024 Source

Built Distribution

massspecgym-1.0.2-py3-none-any.whl (56.7 kB)

Uploaded Oct 29, 2024 Python 3

File details

Details for the file massspecgym-1.0.2.tar.gz.

File metadata

Download URL: massspecgym-1.0.2.tar.gz
Upload date: Oct 29, 2024
Size: 47.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.8

File hashes

Hashes for massspecgym-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`9ee594b55cd20234b4edbf170fecc286de938176017b680cb7736af55bfd2e02`
MD5	`2673658f5cd7090e7d67a4373606a34a`
BLAKE2b-256	`7c943dcc762361329fb1bc97642917ba5fff1474f0e53ee162d0630b24b3222c`

File details

Details for the file massspecgym-1.0.2-py3-none-any.whl.

File metadata

Download URL: massspecgym-1.0.2-py3-none-any.whl
Upload date: Oct 29, 2024
Size: 56.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.8

File hashes

Hashes for massspecgym-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1d1c61900721ac2de023e3b20411c2096b6a6a49972caa33bd16fa5c1d334e37`
MD5	`b467553ba26120c9260208d2125a4d98`
BLAKE2b-256	`5828e3f9f071de71d1a904c1dfb566b518c9deca0ddd4754e5bd165892b20f71`
