3DMolMS: prediction of tandem mass spectra from 3D molecular conformations

3DMolMS

CC BY-NC-SA 4.0 (free for academic use)

3D Molecular Network for Mass Spectra Prediction (3DMolMS) is a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. This model's molecular representation, learned through MS/MS prediction tasks, can be further applied to enhance performance in other molecular-related tasks, such as predicting retention times (RT) and collision cross sections (CCS).

Read paper in Bioinformatics | Try online service at GNPS | Try model on Konia | Install from PyPI

🆕 3DMolMS v1.1.10 is now available for inference on Konia, GNPS, and PyPI!

The change log is available in [CHANGE_LOG.md].

Installation

3DMolMS is available on PyPI (molnetpack). You can install the latest version using pip:

pip install molnetpack

# PyTorch must be installed separately. 
# Please check the official website of PyTorch for the proper version:
# https://pytorch.org/get-started/locally/
# e.g.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

3DMolMS can also be installed from source:

git clone https://github.com/JosieHong/3DMolMS.git
cd 3DMolMS

pip install .

Usage

To get started quickly, instantiate a MolNet object and load a CSV or MGF file for MS/MS prediction:

import torch
from molnetpack import MolNet, plot_msms

# Set the device to CPU for CPU-only usage:
device = torch.device("cpu")

# For GPU usage, set the device as follows (replace '0' with your desired GPU index):
# gpu_index = 0
# device = torch.device(f"cuda:{gpu_index}")

# Instantiate a MolNet object
molnet_engine = MolNet(device, seed=42) # The random seed can be any integer. 

# Load input data (here we use a CSV file as an example)
molnet_engine.load_data(path_to_test_data='./test/input_msms.csv')
"""Load data from the specified path.
Args:
    path_to_test_data (str): Path to the test data file. Supported formats are 'csv', 'mgf', and 'pkl'.
Returns:
    None
"""

# Predict MS/MS
pred_spectra_df = molnet_engine.pred_msms(instrument='qtof')
"""Predict MS/MS spectra.
Args:
    path_to_results (Optional[str]): Path to save the prediction results. Supports '.mgf' or '.csv' formats. If None, the results won't be saved. 
    path_to_checkpoint (Optional[str]): Path to the model checkpoint. If None, the model will be downloaded from a default URL.
    instrument (str): Type of instrument used ('qtof' or 'orbitrap').
Returns:
    pd.DataFrame: DataFrame containing the predicted MS/MS results.
"""

The package also provides a function, plot_msms, for plotting the predicted spectra.

# Plot the predicted MS/MS with 3D molecular conformation
plot_msms(pred_spectra_df, dir_to_img='./img/')

Sample input files in CSV and MGF format are located at ./test/demo_input.csv and ./test/demo_input.mgf, respectively. Note that inputs falling outside the supported ranges are excluded automatically during data loading. The table below lists the supported input types:

Item             | Supported input
Atom number      | <=300
Atom types       | 'C', 'O', 'N', 'H', 'P', 'S', 'F', 'Cl', 'B', 'Br', 'I', 'Na'
Precursor types  | '[M+H]+', '[M-H]-', '[M+H-H2O]+', '[M+Na]+', '[M+2H]2+'
Collision energy | any number
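To see ahead of time which inputs would fall outside the table above, a small pre-check can help. The sketch below is not part of molnetpack: the function name `unsupported_reasons` and the constants are hypothetical, and it assumes the atom limit counts every atom symbol passed in (the table does not say whether hydrogens are counted). The package still applies its own filtering when loading data.

```python
# Hypothetical pre-check mirroring the supported-input table.
# Not part of molnetpack; names and counting convention are assumptions.
SUPPORTED_ATOMS = {"C", "O", "N", "H", "P", "S", "F", "Cl", "B", "Br", "I", "Na"}
SUPPORTED_PRECURSORS = {"[M+H]+", "[M-H]-", "[M+H-H2O]+", "[M+Na]+", "[M+2H]2+"}
MAX_ATOMS = 300


def unsupported_reasons(atom_symbols, precursor_type):
    """Return the reasons (possibly none) an input falls outside the table."""
    reasons = []
    if len(atom_symbols) > MAX_ATOMS:
        reasons.append(f"too many atoms: {len(atom_symbols)} > {MAX_ATOMS}")
    unknown = sorted(set(atom_symbols) - SUPPORTED_ATOMS)
    if unknown:
        reasons.append(f"unsupported atom types: {unknown}")
    if precursor_type not in SUPPORTED_PRECURSORS:
        reasons.append(f"unsupported precursor type: {precursor_type}")
    return reasons


# Ethanol (C2H6O) written out as explicit atom symbols.
ethanol = ["C", "C", "O"] + ["H"] * 6
print(unsupported_reasons(ethanol, "[M+H]+"))
print(unsupported_reasons(["C", "Si"], "[M+K]+"))
```

An empty list means the input passes all three checks; collision energy is not checked because any number is accepted.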

[Figure: example of a predicted MS/MS spectrum plot]

More detailed documentation for various tasks using molnetpack or the source code can be found in the docs/ directory, which includes the following:

  • ./docs/
    • PROP_USAGE.md: Guide on using molnetpack for RT prediction, CCS prediction, and molecular embedding.
    • MSMS_PRED.md: Instructions for using 3DMolMS to predict MS/MS spectra from your own CSV files via the source code. Training details are given in the "Train your own model" section.
    • GEN_REFER_LIB.md: Instructions for using 3DMolMS to generate MS/MS reference libraries from small molecule databases, such as HMDB and RefMet, via the source code.
    • PROP_PRED.md: Instructions for training and testing 3DMolMS on RT and CCS prediction via the source code.
    • PRETRAIN.md: Instructions for pretraining 3DMolMS on the QM9 dataset via the source code.

Train your own model

Step 0: Clone the Repository and Set Up the Environment

Clone the 3DMolMS repository and install the required packages using the following commands:

git clone https://github.com/JosieHong/3DMolMS.git
cd 3DMolMS

# Please install the packages if you have not installed them yet. 
pip install .

Step 1: Obtain the Pretrained Model

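Download the pretrained model (molnet_pre_etkdgv3.pt.zip) from Releases and unzip it; the training command in Step 4 reads the checkpoint from ./check_point/molnet_pre_etkdgv3.pt. You can also train the model from scratch. For details on pretraining the model on the QM9 dataset, refer to PRETRAIN.md.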

Step 2: Prepare the Datasets

Download and organize the datasets into the ./data/ directory. The current version uses four datasets:

  1. Agilent DPCL, provided by Agilent Technologies.
  2. NIST20, available under license for academic use.
  3. MoNA, publicly available.
  4. Waters QTOF, our own experimental dataset.

The data directory structure should look like this:

|- data
  |- origin
    |- Agilent_Combined.sdf
    |- Agilent_Metlin.sdf
    |- hr_msms_nist.SDF
    |- MoNA-export-All_LC-MS-MS_QTOF.sdf
    |- MoNA-export-All_LC-MS-MS_Orbitrap.sdf
    |- waters_qtof.mgf
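Before preprocessing, it can be useful to confirm that the layout above is in place. The following is a minimal sketch, not a script from the repository: the helper name `missing_origin_files` is hypothetical, and the file names are copied from the listing above.

```python
from pathlib import Path

# File names copied from the directory listing; the helper itself is hypothetical.
EXPECTED_FILES = [
    "Agilent_Combined.sdf",
    "Agilent_Metlin.sdf",
    "hr_msms_nist.SDF",
    "MoNA-export-All_LC-MS-MS_QTOF.sdf",
    "MoNA-export-All_LC-MS-MS_Orbitrap.sdf",
    "waters_qtof.mgf",
]


def missing_origin_files(data_dir="./data"):
    """Return the expected files that are absent from <data_dir>/origin."""
    origin = Path(data_dir) / "origin"
    return [name for name in EXPECTED_FILES if not (origin / name).is_file()]


missing = missing_origin_files("./data")
if missing:
    print("Missing from ./data/origin:", ", ".join(missing))
else:
    print("All expected dataset files are present.")
```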

Step 3: Preprocess the Datasets

Run the following command to preprocess the datasets. Specify the datasets with --dataset and the instrument types with --instrument_type (qtof, orbitrap, or both, as in the command below). Use --maxmin_pick to apply the MaxMin algorithm for selecting training molecules; otherwise, selection will be random. The dataset configurations are in ./src/molnetpack/config/preprocess_etkdgv3.yml.

python ./src/preprocess.py --dataset agilent nist mona waters gnps \
--instrument_type qtof orbitrap \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--mgf_dir ./data/mgf_debug/ 
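For readers unfamiliar with MaxMin selection, the idea can be sketched in a few lines: starting from one item, repeatedly add the item whose distance to its nearest already-picked item is largest, which spreads the selection across the data. This is a generic illustration on toy 1D points, not the code behind --maxmin_pick; how the repository measures distances between molecules is not described here, and the function name `maxmin_pick` below is only illustrative.

```python
def maxmin_pick(dist, n_pick, first=0):
    """Greedy MaxMin selection on a symmetric distance matrix (list of lists)."""
    n = len(dist)
    n_pick = min(n_pick, n)
    if n_pick <= 0:
        return []
    picked = [first]
    # min_dist[i]: distance from item i to its nearest already-picked item.
    min_dist = list(dist[first])
    while len(picked) < n_pick:
        candidates = [i for i in range(n) if i not in picked]
        # Choose the candidate that is farthest from everything picked so far.
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        min_dist = [min(a, b) for a, b in zip(min_dist, dist[nxt])]
    return picked


# Toy data: four points on a line; distance is the absolute difference.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(p - q) for q in points] for p in points]
print(maxmin_pick(dist, 3))
```

With these points the outlier at 10.0 is picked right after the starting point, whereas random selection gives no such guarantee of coverage.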

Step 4: Train the Model

Use the following commands to train the model. Configuration settings for the model and training process are located in ./src/molnetpack/config/molnet.yml.

# Train the model starting from the pretrained checkpoint:
# Q-TOF (the Orbitrap command is omitted here):
python ./src/train.py --train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--transfer --resume_path ./check_point/molnet_pre_etkdgv3.pt \
--ex_model_path ./check_point/molnet_qtof_etkdgv3_jit.pt

# Train the model from scratch
# Q-TOF: 
python ./src/train.py --train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--ex_model_path ./check_point/molnet_qtof_etkdgv3_jit.pt
# Orbitrap: 
python ./src/train.py --train_data ./data/orbitrap_etkdgv3_train.pkl \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--ex_model_path ./check_point/molnet_orbitrap_etkdgv3_jit.pt 

Step 5: Evaluation

Let's evaluate the model trained above!

# Predict the spectra: 
# Q-TOF: 
python ./src/pred.py \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--resume_path ./check_point/molnet_qtof_etkdgv3.pt \
--result_path ./result/pred_qtof_etkdgv3_test.mgf 
# Orbitrap: 
python ./src/pred.py \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--resume_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--result_path ./result/pred_orbitrap_etkdgv3_test.mgf 

# Evaluate the cosine similarity between experimental spectra and predicted spectra:
# Q-TOF: 
python ./src/eval.py ./data/qtof_etkdgv3_test.pkl ./result/pred_qtof_etkdgv3_test.mgf \
./eval_qtof_etkdgv3_test.csv ./eval_qtof_etkdgv3_test.png
# Orbitrap: 
python ./src/eval.py ./data/orbitrap_etkdgv3_test.pkl ./result/pred_orbitrap_etkdgv3_test.mgf \
./eval_orbitrap_etkdgv3_test.csv ./eval_orbitrap_etkdgv3_test.png
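As a reference for what a cosine similarity between two spectra means, here is a minimal sketch that bins peaks by m/z and compares the resulting intensity vectors. It is an illustration only and not the implementation in eval.py: the bin width of 1.0, the upper m/z limit of 1500, and the function names are assumptions, and the peak lists are made-up examples.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1500.0):
    """Sum peak intensities into fixed-width m/z bins (assumed settings)."""
    n_bins = int(math.ceil(max_mz / bin_width))
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0, max_mz=1500.0):
    """Cosine of the angle between two binned spectra; 0.0 if either is empty."""
    a = bin_spectrum(peaks_a, bin_width, max_mz)
    b = bin_spectrum(peaks_b, bin_width, max_mz)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# Made-up (m/z, intensity) peak lists for illustration.
experimental = [(91.05, 100.0), (119.08, 50.0)]
predicted = [(91.06, 80.0), (119.09, 40.0), (65.04, 10.0)]
print(round(cosine_similarity(experimental, predicted), 4))
```

A value of 1.0 means the binned spectra have identical relative intensities, and 0.0 means they share no occupied bins.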

Additional applications

3DMolMS can also predict molecular properties and generate reference libraries for molecular identification. For more details, refer to PROP_PRED.md and GEN_REFER_LIB.md, respectively.

Citation

@article{hong20233dmolms,
  title={3DMolMS: prediction of tandem mass spectra from 3D molecular conformations},
  author={Hong, Yuhui and Li, Sujun and Welch, Christopher J and Tichy, Shane and Ye, Yuzhen and Tang, Haixu},
  journal={Bioinformatics},
  volume={39},
  number={6},
  pages={btad354},
  year={2023},
  publisher={Oxford University Press}
}
@article{hong2024enhanced,
  title={Enhanced structure-based prediction of chiral stationary phases for chromatographic enantioseparation from 3D molecular conformations},
  author={Hong, Yuhui and Welch, Christopher J and Piras, Patrick and Tang, Haixu},
  journal={Analytical Chemistry},
  volume={96},
  number={6},
  pages={2351--2359},
  year={2024},
  publisher={ACS Publications}
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
