
Predicts peptide fragmentation using transformers



ElFragmentador

This repository implements a neural network that leverages the transformer architecture to predict peptide properties (retention time and fragmentation).

Usage

Currently the documentation lives here: https://jspaezp.github.io/elfragmentador/. Please check out the Quickstart guide for usage instructions.

Why transformers?

Because we can... Just kidding

The transformer architecture provides several benefits over the standard approach to fragment prediction (LSTM/RNN). On the training side, it allows parallel computation over whole sequences, whereas an LSTM has to process one element at a time. It also gives the model a better chance to capture direct interactions between the elements of the sequence.

On the other hand, it allows much better interpretability of the model, since the 'self-attention' weights can be visualized over the input to see what the model is focusing on while generating the prediction (see the sketch below).
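
As a rough illustration (a minimal PyTorch sketch, not code from this package; all names and sizes are made up for the example), a single self-attention call processes the whole sequence in parallel and returns an attention map that can be plotted directly:

import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 64, 4, 12  # illustrative sizes only

# Stand-in for an embedded peptide: (batch, sequence length, embedding dim)
peptide = torch.randn(1, seq_len, embed_dim)

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# One call attends every position to every other position at once;
# an LSTM would have to step through the 12 positions sequentially.
out, weights = attn(peptide, peptide, peptide, need_weights=True)

print(out.shape)      # torch.Size([1, 12, 64])
print(weights.shape)  # torch.Size([1, 12, 12]) -- position-to-position attention map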

Inspiration for this project

Many of the elements of this project are actually a combination of the principles shown in the Prosit paper and in the Skyline poster, in particular some of the ways to encode the peptides and the output fragment ions.

On the transformer side of things, I must admit that many elements of this project are derived from DETR (End-to-End Object Detection with Transformers), in particular the trainable embeddings as an input for the decoder, and from some of the concepts discussed about it on Yannic Kilcher's YouTube channel (which I highly recommend).

Why the name?

Two main reasons ... it translates to 'The fragmenter' in Spanish, and the project intends to predict fragmentation. On the other hand ... the name was free on PyPI.

Resources on transformers

How fast is it?

You can check how fast the model is on your specific system. Right now the CLI tests the speed only on CPU (the model can also run on a GPU).

Here I will predict the SARS-CoV-2 proteome FASTA file:

poetry run elfragmentador predict --fasta tests/data/fasta/uniprot-proteome_UP000464024_reviewed_yes.fasta --nce 32 --charges 2 --missed_cleavages 0 --min_length 20 --out foo.dlib
...
 99%|█████████▉| 1701/1721 [00:14<00:00, 118.30it/s]
...

~100 predictions per second, including pre-/post-processing and writing the EncyclopeDIA library. On a GPU it is closer to ~1000 predictions/sec.

How big is it?

I have explored many variations on the model, but the one currently distributed is only ~4 MB. Models up to 200 MB have been tried, and they don't really give a big improvement in performance.

"Common" questions

  • What scale are the retention times predicted on?
    • The model outputs a scaled version of the Biognosys retention time scale, so if you are using the base model, multiply the output by 100 to get values compatible with the iRT kit (see the sketch after this list).
  • Is it any good?
    • Well ... yes, but if you want to see whether it is good for your own data, I have added an API to test the model on a spectral library (made with SpectraST). Just get a checkpoint of the model and run the command: elfragmentador_evaluate {your_checkpoint.ckpt} {your_splib.sptxt}
    • TODO add some benchmarking metrics to this readme ...
  • Crosslinked peptides?
    • No
  • ETD?
    • No
  • CID?
    • No
  • Glycosylation?
    • No
  • Negative mode?
    • No
  • No?
    • Not really ... I think all of those are interesting questions, but AS IT IS RIGHT NOW they are not within the scope of the project. If you want to discuss any of them, write an issue in the repo and we can see if it is feasible.
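
To make the retention time answer above concrete, here is a minimal sketch of the rescaling it describes. to_irt is a hypothetical helper written for this example, not a function in the package:

def to_irt(model_rt: float) -> float:
    # The base model outputs Biognosys iRT divided by 100, so
    # multiplying by 100 recovers iRT-kit-compatible units.
    return model_rt * 100.0

print(to_irt(0.25))  # -> 25.0, a value on the Biognosys iRT scale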

Acknowledgements

  1. Purdue University for the computational resources used to prepare the data (Brown Cluster).
  2. The PyTorch Lightning team ... without it being open sourced, this would not be possible.
  3. Weights and Biases (same as above).
