De novo peptide sequencing with InstaNovo

The official code repository for InstaNovo. This repo contains the code for training and inference of InstaNovo and InstaNovo+. InstaNovo is a transformer neural network with the ability to translate fragment ion peaks into the sequence of amino acids that make up the studied peptide(s). InstaNovo+, inspired by human intuition, is a multinomial diffusion model that further improves performance by iterative refinement of predicted sequences.

Usage

Installation

To use InstaNovo, install the package via pip:

pip install instanovo

It is recommended to install InstaNovo in a fresh environment, such as Conda or PyEnv. For example, if you have conda/miniconda installed:

conda create -n instanovo python=3.8
conda activate instanovo

Note: InstaNovo is built for Python >= 3.8

Training

To train auto-regressive InstaNovo:

usage: python -m instanovo.transformer.train train_path valid_path [-h] [--config CONFIG] [--n_gpu N_GPU] [--n_workers N_WORKERS]

required arguments:
  train_path        Training data path
  valid_path        Validation data path

optional arguments:
  --config CONFIG   file in configs folder
  --n_workers N_WORKERS

Note: data is expected to be saved in Polars .ipc format. See the conversion script under "Using your own datasets".

To update the InstaNovo model config, modify the config file under configs/instanovo/base.yaml

Prediction

To evaluate InstaNovo:

usage: python -m instanovo.transformer.predict data_path model_path [-h] [--denovo] [--config CONFIG] [--output_path OUTPUT_PATH] [--subset SUBSET] [--knapsack_path KNAPSACK_PATH] [--n_workers N_WORKERS]

required arguments:
  data_path         Evaluation data path
  model_path        Model checkpoint path

optional arguments:
  --denovo          evaluate in de novo mode, will not try to compute metrics
  --output_path OUTPUT_PATH
                    Save predictions to a csv file (required in de novo mode)
  --subset SUBSET
                    portion of set to evaluate
  --knapsack_path KNAPSACK_PATH
                    path to pre-computed knapsack
  --n_workers N_WORKERS
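
When scripting several prediction runs, the command line above can be assembled programmatically. The following is a minimal sketch and not part of InstaNovo: build_predict_command is a hypothetical helper and the file names are placeholders. It uses only flags from the option list above and enforces the rule that --output_path is required in de novo mode.

```python
import sys
from typing import List, Optional


def build_predict_command(
    data_path: str,
    model_path: str,
    denovo: bool = False,
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for instanovo.transformer.predict (hypothetical helper)."""
    if denovo and output_path is None:
        # The option list marks --output_path as required in de novo mode.
        raise ValueError("--output_path is required when --denovo is set")
    cmd = [sys.executable, "-m", "instanovo.transformer.predict", data_path, model_path]
    if denovo:
        cmd.append("--denovo")
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


# Placeholder paths; replace them with your own data and checkpoint.
command = build_predict_command(
    "spectra.ipc", "model.ckpt", denovo=True, output_path="predictions.csv"
)
print(" ".join(command[1:]))
```

Passing the resulting list to subprocess.run(command, check=True) would launch the run in the current Python environment, assuming InstaNovo is installed there.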

Using your own datasets

To use your own datasets, tabulate your data in either Pandas or Polars, one row per labelled MS2 spectrum, with the following schema:

  • sequence (string) [Optional]
    The target peptide sequence excluding post-translational modifications
  • modified_sequence (string)
    The target peptide sequence including post-translational modifications
  • precursor_mz (float64)
    The mass-to-charge of the precursor (from MS1)
  • precursor_charge (int64)
    The charge of the precursor (from MS1)
  • mz_array (list[float64])
    The mass-to-charge values of the MS2 spectrum
  • intensity_array (list[float32])
    The intensity values of the MS2 spectrum
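
Before converting a dataset, it can help to check a few rows against this schema. The sketch below works on plain Python dictionaries so it needs no extra dependencies. check_row is a hypothetical helper, not an InstaNovo function. The column names follow the example DataFrame in this section, and the length check assumes each peak has exactly one m/z value and one intensity value. The example row is shortened to three peaks.

```python
from typing import Any, Dict, List

# Column name -> expected Python type, following the schema in this section.
SCALAR_COLUMNS = {
    "modified_sequence": str,
    "precursor_mz": float,
    "precursor_charge": int,
}
ARRAY_COLUMNS = ("mz_array", "intensity_array")


def check_row(row: Dict[str, Any], denovo: bool = False) -> List[str]:
    """Return a list of problems found in one spectrum row (empty if none)."""
    problems = []
    for name, expected in SCALAR_COLUMNS.items():
        if denovo and name == "modified_sequence":
            continue  # labels are not required for de novo prediction
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name in ARRAY_COLUMNS:
        if name not in row:
            problems.append(f"missing column: {name}")
    if all(name in row for name in ARRAY_COLUMNS):
        # Assumption: the two arrays are paired peak by peak.
        if len(row["mz_array"]) != len(row["intensity_array"]):
            problems.append("mz_array and intensity_array differ in length")
    return problems


row = {
    "modified_sequence": "IGEYK",
    "precursor_mz": 305.165,
    "precursor_charge": 2,
    "mz_array": [107.07023, 110.071236, 111.11693],
    "intensity_array": [1055.4957, 2251.3171, 35508.96],
}
print(check_row(row))  # prints []
```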

For example, the DataFrame for the Nine-Species excluding Yeast dataset looks as follows:

   sequence              modified_sequence           precursor_mz  precursor_charge  mz_array                              intensity_array
0  GRVEGMEAR             GRVEGMEAR                   335.502       3                 [102.05527 104.052956 113.07079 ...]  [ 767.38837 2324.8787 598.8512 ...]
1  IGEYK                 IGEYK                       305.165       2                 [107.07023 110.071236 111.11693 ...]  [ 1055.4957 2251.3171 35508.96 ...]
2  GVSREEIQR             GVSREEIQR                   358.528       3                 [103.039444 109.59844 112.08704 ...]  [801.19995 460.65268 808.3431 ...]
3  SSYHADEQVNEASK        SSYHADEQVNEASK              522.234       3                 [101.07095 102.0552 110.07163 ...]    [ 989.45154 2332.653 1170.6191 ...]
4  DTFNTSSTSNSTSSSSSNSK  DTFNTSSTSN(+.98)STSSSSSNSK  676.282       3                 [119.82458 120.08073 120.2038 ...]    [ 487.86942 4806.1377 516.8846 ...]
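
The precursor_mz values in these rows can be cross-checked against the modified sequence and the charge z using m/z = (M + z * 1.007276) / z, where M is the neutral peptide mass. The sketch below is a sanity check under stated assumptions, not InstaNovo code: theoretical_mz is a hypothetical helper, the residue masses are the standard monoisotopic values, the termini are unmodified, and modifications are assumed to be written as signed mass deltas in parentheses, as in N(+.98) above. Other notations are not handled.

```python
import re

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # H2O accounting for the free N- and C-terminus
PROTON = 1.007276  # mass of one charge-carrying proton

# Assumed notation: a signed mass delta in parentheses, e.g. "N(+.98)".
MODIFICATION = re.compile(r"\(([+-][0-9.]+)\)")


def theoretical_mz(modified_sequence: str, charge: int) -> float:
    """Compute the expected precursor m/z for a (possibly modified) peptide."""
    deltas = [float(d) for d in MODIFICATION.findall(modified_sequence)]
    bare = MODIFICATION.sub("", modified_sequence)  # sequence without modifications
    mass = sum(RESIDUE_MASS[aa] for aa in bare) + WATER + sum(deltas)
    return (mass + charge * PROTON) / charge


# Three rows from the example DataFrame: (modified_sequence, charge, precursor_mz).
for peptide, charge, observed in [
    ("GRVEGMEAR", 3, 335.502),
    ("IGEYK", 2, 305.165),
    ("DTFNTSSTSN(+.98)STSSSSSNSK", 3, 676.282),
]:
    print(peptide, round(theoretical_mz(peptide, charge), 3), observed)
```

For these three rows the computed and listed values agree to within 0.01 m/z. A larger gap in your own data usually points to a wrong charge or an unlisted modification.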

For de novo prediction, the modified_sequence column is not required.

We also provide a conversion script for converting to Polars IPC binary (.ipc):

usage: python -m instanovo.utils.convert_to_ipc source target [-h] [--source_type {mgf,mzml,csv}] [--max_charge MAX_CHARGE] [--verbose]

positional arguments:
  source                source file or folder
  target                target ipc file to be saved

optional arguments:
  -h, --help            show this help message and exit
  --source_type {mgf,mzml,csv}
                        type of input data
  --max_charge MAX_CHARGE
                        maximum charge to filter out

Note: we currently only support mzml, mgf and csv conversions.

If you want to use InstaNovo for evaluating metrics, you will need to manually set the modified_sequence column after conversion.

Roadmap

This code repo is currently under construction.

ToDo:

  • Add data preprocessing pipeline
  • Multi-GPU support

License

Code is licensed under the Apache License, Version 2.0 (see LICENSE)

The model checkpoints are licensed under Creative Commons Non-Commercial (CC BY-NC-SA 4.0)

BibTeX entry and citation info

@article{eloff_kalogeropoulos_2023_instanovo,
	title = {De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments},
	author = {Kevin Eloff and Konstantinos Kalogeropoulos and Oliver Morell and Amandla Mabona and Jakob Berg Jespersen and Wesley Williams and Sam van Beljouw and Marcin Skwark and Andreas Hougaard Laustsen and Stan J. J. Brouns and Anne Ljungars and Erwin Marten Schoof and Jeroen Van Goey and Ulrich auf dem Keller and Karim Beguir and Nicolas Lopez Carranza and Timothy Patrick Jenkins},
	year = {2023},
	doi = {10.1101/2023.08.30.555055},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/10.1101/2023.08.30.555055v2},
	journal = {bioRxiv}
}
