Transformer-diffusion de novo peptide sequencing for data-independent acquisition mass spectrometry

DiffNovo_DIA

DiffNovo_DIA is an extension of the Transformer_DIA model that integrates a transformer architecture with a diffusion model to improve de novo peptide sequencing from data-independent acquisition (DIA) mass spectrometry.


Installation

To manage dependencies efficiently, we recommend using conda. Start by creating a dedicated conda environment:

conda create --name diffnovo-dia python=3.10

Activate the environment:

conda activate diffnovo-dia

Install DiffNovo_DIA and its dependencies via pip:

pip install diffnovo-dia

To verify a successful installation, check the command-line interface:

diffnovo-dia --help
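The same check can be made from inside Python. The sketch below uses the standard-library importlib.metadata module and assumes the distribution name matches the one passed to pip above (diffnovo-dia); the helper name installed_version is ours, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "diffnovo-dia" is the name used with pip install above.
found = installed_version("diffnovo-dia")
print(found if found else "diffnovo-dia is not installed in this environment")
```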

Data Preprocessing

Feature Extraction

To use DiffNovo_DIA, you must first generate a feature file that serves as structured input to the model. We provide a script that takes your feature and spectrum files as input and produces a pickle file containing the formatted features. The pickle file is organized as follows:

  • Keys: Peptide sequences
  • Values: List containing the following attributes:
    • precursor_mz
    • precursor_charge
    • scan_list_middle
    • ms1
    • mz_list
    • int_list
    • neighbor_right_count
    • neighbor_size_half
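
As a rough illustration of this layout, the sketch below loads such a pickle and attaches the attribute names to each value list. It assumes the file holds a dictionary and that the values appear in the order listed above; neither is confirmed here, so check against a real output file. The toy entry and its numbers are invented for the example. Only unpickle files you trust, since pickle can execute code on load.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed order: the same as the attribute list above.
FIELDS = (
    "precursor_mz", "precursor_charge", "scan_list_middle", "ms1",
    "mz_list", "int_list", "neighbor_right_count", "neighbor_size_half",
)


def load_features(path):
    """Load the pickle and return {peptide: {attribute_name: value}}."""
    with open(path, "rb") as handle:
        raw = pickle.load(handle)
    named = {}
    for peptide, values in raw.items():
        if len(values) != len(FIELDS):
            raise ValueError(
                f"{peptide}: expected {len(FIELDS)} values, got {len(values)}"
            )
        named[peptide] = dict(zip(FIELDS, values))
    return named


# Toy round trip with made-up values, standing in for a real feature file.
toy = {
    "PEPTIDEK": [464.25, 2, [101, 102, 103], [0.1, 0.9, 0.2],
                 [[175.1, 262.1]], [[1.0, 0.5]], 2, 2],
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_features.pkl"
    toy_path.write_bytes(pickle.dumps(toy))
    features = load_features(toy_path)

print(features["PEPTIDEK"]["precursor_charge"])
```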

Use feature_extractor.py to generate the feature file required by the Transformer_DIA and DiffNovo_DIA models. The script takes the following inputs:

  • A feature CSV file
  • An MGF spectrum file

It generates a pickled feature file compatible with both the diffnovo_dia and transformer_dia models:

python feature_extractor.py --feature_file your_feature.csv --spectrum_file your_spectra.mgf --output_file output_features.pkl

MGF Annotation

Use annotate_mgf.py to annotate .mgf files with peptide sequences for models such as PepNet, DiffNovo, Transformer-DIA, and DiffNovo_DIA. The annotation process links each precursor ion in the feature file to its spectra in the MGF file.

For Transformer-DIA and DiffNovo_DIA, the recommended selection mode is five_rt, which automatically selects the five spectra whose retention times are closest to the mean retention time of the corresponding precursor ion. Use the same spectrum selection for annotation as for feature extraction; for example, we used five_rt for both.
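
The idea behind five_rt can be sketched in a few lines. This is an illustration of the selection rule as described above, not the code in annotate_mgf.py; the function name, the tie-breaking by index, and the example retention times are assumptions made for the sketch.

```python
def select_closest_by_rt(spectra_rts, precursor_mean_rt, k=5):
    """Return the indices of the k spectra whose retention time is closest
    to precursor_mean_rt, in ascending index order."""
    ranked = sorted(
        range(len(spectra_rts)),
        # Ties in distance are broken by the lower index (an assumption).
        key=lambda i: (abs(spectra_rts[i] - precursor_mean_rt), i),
    )
    return sorted(ranked[:k])


# Hypothetical retention times (minutes) for seven candidate spectra.
rts = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0]
chosen = select_closest_by_rt(rts, precursor_mean_rt=11.6)
print(chosen)  # [1, 2, 3, 4, 5]
```

If fewer than five spectra are available, this version simply returns all of them.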

To annotate for PepNet, set model_name to pepnet:

python annotate_mgf.py --model_name pepnet --spectrum_file input.mgf --feature_file features.csv --selection five_rt

If you're not using PepNet, simply leave model_name empty:

python annotate_mgf.py --model_name "" --spectrum_file input.mgf --feature_file features.csv --selection five_rt

Both feature_extractor.py and annotate_mgf.py are located in the data_utils/ directory.

Usage

Predict Peptide Sequences

DiffNovo_DIA predicts peptide sequences from MS/MS spectra stored in MGF files. Predictions are saved as a CSV file:

diffnovo-dia --mode=denovo --model=pretrained_checkpoint.ckpt --peak_path=path/to/spectra.mgf --peak_feature=path/to/precursor_feature.pkl
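
When running many files, it can be convenient to assemble this command in Python. The sketch below only uses the flags shown in this document; the helper name build_command is ours, and the paths are the placeholders from the command above.

```python
import shlex


def build_command(mode, peak_path, peak_feature, model=None):
    """Assemble the diffnovo-dia argument list from the documented flags."""
    args = ["diffnovo-dia", f"--mode={mode}"]
    if model is not None:
        args.append(f"--model={model}")
    args += [f"--peak_path={peak_path}", f"--peak_feature={peak_feature}"]
    return args


cmd = build_command(
    "denovo",
    "path/to/spectra.mgf",
    "path/to/precursor_feature.pkl",
    model="pretrained_checkpoint.ckpt",
)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.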

Evaluate de novo Sequencing Performance

To assess the performance of de novo sequencing against known annotations:

diffnovo-dia --mode=eval --model=pretrained_checkpoint.ckpt --peak_path=path/to/spectra.mgf --peak_feature=path/to/precursor_feature.pkl

Annotations in the MGF file must include peptide sequences in the SEQ field.
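
Before evaluating, it may help to confirm that every spectrum carries a sequence. The sketch below is a minimal check, assuming the standard MGF layout of BEGIN IONS / END IONS blocks with KEY=VALUE header lines; the function name and the example spectra are invented for illustration.

```python
def spectra_missing_seq(mgf_lines):
    """Return 0-based indices of spectrum blocks without a non-empty SEQ= line."""
    missing = []
    index = -1
    in_block = False
    has_seq = False
    for line in mgf_lines:
        line = line.strip()
        if line == "BEGIN IONS":
            index += 1
            in_block, has_seq = True, False
        elif line == "END IONS":
            if in_block and not has_seq:
                missing.append(index)
            in_block = False
        elif in_block and line.upper().startswith("SEQ="):
            has_seq = has_seq or bool(line[4:].strip())
    return missing


# Two made-up spectra: the first is annotated, the second is not.
example = """BEGIN IONS
TITLE=scan_1
PEPMASS=464.25
CHARGE=2+
SEQ=PEPTIDEK
175.1 100.0
END IONS
BEGIN IONS
TITLE=scan_2
PEPMASS=512.30
CHARGE=2+
262.1 50.0
END IONS
"""
missing = spectra_missing_seq(example.splitlines())
print(missing)  # [1]
```

For a real file, pass the open file handle instead of example.splitlines().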


Train a New Model

To train a new Transformer model from scratch, provide labeled training and validation datasets in MGF format:

diffnovo-dia --mode=train --peak_path=path/to/train/annotated_spectra.mgf \
--peak_feature=path/to/train/precursor_feature.pkl \
--peak_path_val=path/to/validation/annotated_spectra.mgf \
--peak_feature_val=path/to/validation/precursor_feature.pkl

MGF files must include peptide sequences in the SEQ field.


Fine-Tune an Existing Model

To fine-tune a pretrained Transformer-DIA model, set the --train_from_scratch parameter to false:

diffnovo-dia --mode=train --model=pretrained_checkpoint.ckpt \
--peak_path=path/to/train/annotated_spectra.mgf \
--peak_feature=path/to/train/precursor_feature.pkl \
--peak_path_val=path/to/validation/annotated_spectra.mgf \
--peak_feature_val=path/to/validation/precursor_feature.pkl
