ElFragmentador

Predicts peptide fragmentations using transformers

This repository attempts to implement a neural net that leverages the transformer architecture to predict peptide properties (retention time and fragmentation).

Installation

Since the project is currently under development, the best way to install is with pip from the cloned repo:

git clone https://github.com/jspaezp/elfragmentador.git
cd elfragmentador

pip install .

Nonetheless, there is also a PyPI-installable version:

pip install elfragmentador

Usage

Training

# Be a good person and keep track of your experiments, use wandb
$ wandb login
elfragmentador_train \
     --run_name onecycle_5e_petite_ndl4 \
     --scheduler onecycle \
     --max_epochs 5 \
     --lr_ratio 25 \
     --terminator_patience 20 \
     --lr 0.00005 \
     --gradient_clip_val 1.0 \
     --dropout 0.1 \
     --nhead 4 \
     --nhid 512 \
     --ninp 224 \
     --num_decoder_layers 4 \
     --num_encoder_layers 2 \
     --batch_size 400 \
     --accumulate_grad_batches 1 \
     --precision 16 \
     --gpus 1 \
     --progress_bar_refresh_rate 5 \
     --data_dir  /content/20210217-traindata

Prediction

Check performance

I have implemented a way to compare the model's predictions against an .sptxt file. I generate these files using comet > mokapot > spectrast, but alternatives can be used.

elfragmentador_evaluate --sptxt {my_sptxt_file} {path_to_my_checkpoint}

Predict Spectra

You can use it from Python like so:

from elfragmentador.model import PepTransformerModel

checkpoint_path = "some/path/to/a/checkpoint"
model = PepTransformerModel.load_from_checkpoint(checkpoint_path)

# Set the model to evaluation mode
_ = model.eval()
model.predict_from_seq("MYPEPTIDEK", charge=2, nce=27.0)
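
As a rough sketch of batched usage (the peptide list below is only an illustration; charge and NCE should match your acquisition), the same call can be looped over several precursors:

# Hypothetical peptide list, for illustration only
peptides = ["MYPEPTIDEK", "LESLIEK", "ELVISLIVESK"]

for seq in peptides:
    # Each call returns the model's prediction for one precursor
    pred = model.predict_from_seq(seq, charge=2, nce=27.0)
    print(seq, pred)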

If you want a graphical interface, I am currently working on a Flask app to visualize the results.

It can be run using Flask:

git clone https://github.com/jspaezp/elfragmentador.git
cd elfragmentador/viz_app

# Here you can install the dependencies using poetry
python main.py

Why transformers?

Because we can... Just kidding

The transformer architecture provides several benefits over the standard approach to fragment prediction (LSTM/RNN). On the training side, it allows parallel computation over whole sequences, whereas an LSTM has to process one element at a time. In addition, it gives the model a better chance to learn the direct interactions between the elements of the sequence.

On the other hand, it makes the model much more interpretable, since the self-attention weights over the input can be visualized to see what the model is focusing on while generating the prediction.
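
As a purely generic PyTorch illustration of what "visualizing self-attention" means (this is not ElFragmentador's internal code; the dimensions below simply mirror the --ninp 224 / --nhead 4 training flags), an attention layer returns per-position weights that can be inspected or plotted:

import torch

# A standalone self-attention layer; ElFragmentador's own layers live inside the model
attn = torch.nn.MultiheadAttention(embed_dim=224, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 224)  # a dummy "peptide" of 10 positions
out, weights = attn(x, x, x, need_weights=True)
print(weights.shape)  # (1, 10, 10): how much each position attends to every other position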

Inspiration for this project

Many of the elements of this project are actually a combination of the principles shown in the Prosit paper and the Skyline poster, in particular some of the elements used to encode the peptides and the output fragment ions.

On the transformer side of things, I must admit that many elements of this project are derived from DETR: End-to-End Object Detection with Transformers, in particular the trainable embeddings as an input for the decoder, as well as some of the concepts discussed about it on Yannic Kilcher's YouTube channel (which I highly recommend).

Why the name?

Two main reasons ... it translates to 'The fragmenter' in Spanish and the project intends to predict fragmentations. On the other hand ... the name was free on PyPI.

Resources on transformers

How fast is it?

You can check how fast the model is on your specific system. Right now it tests only on CPU; message me if you need GPU inference times.

poetry run pytest tests/test_model.py --benchmark-histogram 

Currently the inference time on an Intel i5-7260U is ~5.9 ms, or ~167.44 predictions per second. On a GPU it is closer to ~1000 predictions per second.
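
If you just want a rough number outside of pytest, a minimal sketch (the checkpoint path is a placeholder) is to time repeated calls to predict_from_seq:

import time
from elfragmentador.model import PepTransformerModel

model = PepTransformerModel.load_from_checkpoint("some/path/to/a/checkpoint")
_ = model.eval()

# Time repeated single-peptide predictions to estimate throughput on this machine
n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict_from_seq("MYPEPTIDEK", charge=2, nce=27.0)
elapsed = time.perf_counter() - start
print(f"{1000 * elapsed / n:.1f} ms per prediction, {n / elapsed:.1f} predictions/s")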

How big is it?

I have explored many variations on the model, but the one currently distributed is only ~4 MB. Models up to 200 MB have been tried and they do not really give a big improvement in performance.

"Common" questions

  • What scale are the retention times predicted on?
    • The model outputs a scaled version of the Biognosys retention time scale, so if you use the base model you will need to multiply by 100 to get something compatible with the iRT kit (see the short sketch after this list).
  • Is it any good?
    • Well ... yes, but if you want to see whether it is good for your own data, I have added an API to test the model on a spectral library (made with spectrast). Just get a checkpoint of the model and run the command: elfragmentador_evaluate {your_checkpoint.ckpt} {your_splib.sptxt}
    • TODO add some benchmarking metrics to this readme ...
  • Crosslinked peptides?
    • No
  • ETD ?
    • No
  • CID ?
    • No
  • Glycosylation ?
    • No
  • Negative Mode ?
    • No
  • No ?
    • Not really ... I think all of those are interesting questions but AS IT IS RIGHT NOW it is not within the scope of the project. If you want to discuss it, write an issue in the repo and we can see if it is feasible.
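
As a minimal sketch of the retention time rescaling mentioned above (the predicted value here is made up for illustration):

# The base model predicts retention times on a scaled Biognosys (iRT) scale;
# multiplying by 100 gives values compatible with the iRT kit
predicted_rt = 0.55       # hypothetical model output
irt = predicted_rt * 100
print(irt)                # 55.0 on the iRT scale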

Known Issues

  • When setting --max_spec on elfragmentador_evaluate --sptxt, the retention time accuracy is not calculated correctly because the retention times are scaled within the selected range. Since the spectra are subset in their import order, only the first-eluting peptides are used.

TODO list

Urgent

  • Decouple the inference side of things into a separate package with fewer dependencies
  • Make a better logging output for the training script
  • Complete docstrings and add a documentation website
  • Allow training with missing values (done for RT, not for spectra)
  • Migrate training data preparation script to snakemake
    • In Progress

Possible

  • Add neutral losses specific to some PTMs
  • Consider using pyteomics as a backend for most MS-related tasks
  • Translate annotation functions (getting ions) to numpy/torch
  • Add weights during training so PSMs that are more likely to be false positives weigh less

If I get time

  • Write ablation models and benchmark them (remove parts of the model and see how much worse it gets without them)

Acknowledgements

  1. Purdue University for the computational resources for the preparation of the data (Brown Cluster).
  2. The PyTorch Lightning team ... without this being open sourced, this would not be possible.
  3. Weights and Biases (same as above).
