A super simple fast IMS predictor

Flimsay: Fun/Fast Simple IMS Anyone like You can use.

Sebastian Paez

version = 0.4.0

This repository implements a very simple LightGBM (LGBM) model to predict the ion mobility of peptides.

Usage

There are two main ways to interact with flimsay: through the command-line interface (CLI) or through the Python API directly.

CLI

$ pip install flimsay
$ flimsay fill_blib mylibrary.blib mylibrary_with_ims.blib # Writes a copy of the .blib file with ion mobility data added (output name is just an example).
! flimsay fill_blib --help
 Usage: flimsay fill_blib [OPTIONS] BLIB OUT_BLIB                               
                                                                                
 Add ion mobility prediction to a .blib file.                                   
                                                                                
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --overwrite      Whether to overwrite output file, if it exists              │
│ --help           Show this message and exit.                                 │
╰──────────────────────────────────────────────────────────────────────────────╯
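
For scripted use, the same command can be assembled from Python. This is a minimal sketch that relies only on the usage string shown above (BLIB, OUT_BLIB and --overwrite); the helper names and the file names are hypothetical and not part of flimsay.

```python
import subprocess


def build_fill_blib_command(blib, out_blib, overwrite=False):
    """Assemble the argument list for `flimsay fill_blib`.

    Only the arguments listed in the help text above are used.
    """
    command = ["flimsay", "fill_blib"]
    if overwrite:
        command.append("--overwrite")
    command += [str(blib), str(out_blib)]
    return command


def fill_blib(blib, out_blib, overwrite=False):
    """Run the CLI; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_fill_blib_command(blib, out_blib, overwrite), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_fill_blib_command("mylibrary.blib", "mylibrary_with_ims.blib"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.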

Python

Single peptide

from flimsay.model import FlimsayModel

model_instance = FlimsayModel()
model_instance.predict_peptide("MYPEPTIDEK", charge=2)
{'ccs': array([363.36245907]), 'one_over_k0': array([0.92423264])}
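
Each value in the returned dictionary is an array, so a single-peptide call yields one-element arrays. A small sketch of pulling out plain floats follows; first_values is a hypothetical helper, and the dictionary is copied from the output above (as plain lists) so the snippet runs without flimsay installed.

```python
def first_values(prediction):
    """Return the first entry of every array in a prediction dict as a float."""
    return {name: float(values[0]) for name, values in prediction.items()}


# Copied from the output above; plain lists stand in for the NumPy arrays.
prediction = {"ccs": [363.36245907], "one_over_k0": [0.92423264]}
scalars = first_values(prediction)
print(f"CCS: {scalars['ccs']:.1f}, 1/K0: {scalars['one_over_k0']:.3f}")
```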

Many peptides at once

import pandas as pd
from flimsay.features import add_features, FEATURE_COLUMNS

df = pd.DataFrame({
    "Stripped_Seqs": ["LESLIEK", "LESLIE", "LESLKIE"]
})
df = add_features(
    df,
    stripped_sequence_name="Stripped_Seqs",
    calc_masses=True,
    default_charge=2,
)
df
2023-07-25 21:14:45.792 | WARNING  | flimsay.features:add_features:163 - Charge not provided, using default charge of 2
   Stripped_Seqs  StrippedPeptide  PepLength  NumBulky  NumTiny  NumProlines  NumGlycines  NumSerines  NumPos  PosIndexL  PosIndexR  NumNeg  NegIndexL  NegIndexR        Mass  PrecursorCharge  PrecursorMz
0        LESLIEK          LESLIEK          7         3        1            0            0           1       1   0.857143   0.000000       2   0.142857   0.142857  830.474934                2   416.245292
1         LESLIE           LESLIE          6         3        1            0            0           1       0   1.000000   1.000000       2   0.166667   0.000000  702.379971                2   352.197811
2        LESLKIE          LESLKIE          7         3        1            0            0           1       1   0.571429   0.285714       2   0.142857   0.000000  830.474934                2   416.245292
model_instance.predict(df[FEATURE_COLUMNS])
{'ccs': array([315.32424627, 306.70134752, 314.87268797]),
 'one_over_k0': array([0.78718781, 0.72658194, 0.78525451])}
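
The Mass and PrecursorMz columns in the table above can be checked by hand. The sketch below is not flimsay's implementation: it sums standard monoisotopic residue masses plus one water, and converts to m/z by adding 1.007825 Da per charge. That constant is an assumption, inferred here because it reproduces the table values. The function names are hypothetical.

```python
# Standard monoisotopic residue masses in Da (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
# Assumption: per-charge mass inferred from the table, not read from flimsay.
CHARGE_CARRIER = 1.007825


def monoisotopic_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide, in Da."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz_from_mass(mass, charge):
    """m/z of a precursor with the given neutral mass and charge."""
    return mass / charge + CHARGE_CARRIER


for seq in ["LESLIEK", "LESLIE", "LESLKIE"]:
    mass = monoisotopic_mass(seq)
    print(seq, round(mass, 4), round(mz_from_mass(mass, 2), 4))
```

LESLIEK and LESLKIE have the same composition, which is why they share a mass and m/z; only the position-based features distinguish them.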

Performance

Prediction Performance

Prediction Speed

Single peptide prediction

from flimsay.model import FlimsayModel

model_instance = FlimsayModel()

%timeit model_instance.predict_peptide("MYPEPTIDEK", charge=3)
174 µs ± 942 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

On my laptop that takes about 174 microseconds per peptide, or roughly 5,700 peptides per second.

Batch Prediction

# Let's make a dataset of 1M peptides to test
import random
import pandas as pd
from flimsay.features import calc_mass, mass_to_mz, add_features

random.seed(42)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CHARGE_STATES = [2, 3, 4]

# Each peptide has 10 distinct residues (random.sample draws without replacement)
seqs = [random.sample(AMINO_ACIDS, 10) for _ in range(1_000_000)]
# One random charge state per peptide
charges = [random.sample(CHARGE_STATES, 1)[0] for _ in range(1_000_000)]
seqs = ["".join(seq) for seq in seqs]
masses = [calc_mass(x) for x in seqs]
mzs = [mass_to_mz(m, c) for m, c in zip(masses, charges)]

df = pd.DataFrame({
    "Stripped_Seqs": seqs,
    "PrecursorCharge": charges,
    "Mass": masses,
    "PrecursorMz": mzs})
df = add_features(df, stripped_sequence_name="Stripped_Seqs")


# Now we get to run the prediction!
%timeit model_instance.predict(df)
20.6 s ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On my system, a million peptides are predicted in about 20.6 seconds, that is
roughly 48,500 per second.
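
Throughput follows directly from the timings: peptides per second is the number of peptides divided by the elapsed time. A quick check using the two %timeit outputs shown above (174 µs for one peptide, 20.6 s for one million); the function name is hypothetical.

```python
def peptides_per_second(n_peptides, seconds):
    """Throughput implied by predicting n_peptides in the given wall time."""
    return n_peptides / seconds


# Timings taken from the %timeit outputs above.
single = peptides_per_second(1, 174e-6)
batch = peptides_per_second(1_000_000, 20.6)

print(f"single: {single:,.0f}/s, batch: {batch:,.0f}/s, ratio: {batch / single:.1f}x")
```

With these timings, batch prediction handles roughly eight times more peptides per second than repeated single-peptide calls.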

Motivation

There is a fair amount of data on CCS and ion mobility of peptides, but very few models use features that are directly interpretable.

In addition, having a simpler model allows faster predictions on systems that are not equipped with GPUs.

Therefore, this project aims to create a freely available, easy to use, interpretable and fast model to predict ion mobility and collisional cross-section for peptides.

Features

The features used for prediction are meant to be simple, and their implementation can be found here: flimsay/features.py

from flimsay.features import FEATURE_COLUMN_DESCRIPTIONS
for k,v in FEATURE_COLUMN_DESCRIPTIONS.items():
    print(f">>> The Feature '{k}' is defined as: \n\t{v}")
>>> The Feature 'PrecursorMz' is defined as: 
    Measured precursor m/z
>>> The Feature 'Mass' is defined as: 
    Measured precursor mass (Da)
>>> The Feature 'PrecursorCharge' is defined as: 
    Measured precursor charge, from the isotope envelope
>>> The Feature 'PepLength' is defined as: 
    Length of the peptide sequence in amino acids
>>> The Feature 'NumBulky' is defined as: 
    Number of bulky amino acids (LVIFWY)
>>> The Feature 'NumTiny' is defined as: 
    Number of tiny amino acids (AS)
>>> The Feature 'NumProlines' is defined as: 
    Number of proline residues
>>> The Feature 'NumGlycines' is defined as: 
    Number of glycine residues
>>> The Feature 'NumSerines' is defined as: 
    Number of serine residues
>>> The Feature 'NumPos' is defined as: 
    Number of positive amino acids (KRH)
>>> The Feature 'PosIndexL' is defined as: 
    Relative position of the first positive amino acid (KRH)
>>> The Feature 'PosIndexR' is defined as: 
    Relative position of the last positive amino acid (KRH)
>>> The Feature 'NumNeg' is defined as: 
    Number of negative amino acids (DE)
>>> The Feature 'NegIndexL' is defined as: 
    Relative position of the first negative amino acid (DE)
>>> The Feature 'NegIndexR' is defined as: 
    Relative position of the last negative amino acid (DE)
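
To make the sequence-derived definitions concrete, here is a pure-Python sketch that re-derives them for one peptide. It is not the code in flimsay/features.py: the conventions (left index divided by length, right index counted from the C-terminal end, and 1.0 when no matching residue exists) are assumptions inferred from the example table in the Usage section, and the function names are hypothetical.

```python
BULKY, TINY = set("LVIFWY"), set("AS")
POSITIVE, NEGATIVE = set("KRH"), set("DE")


def relative_positions(sequence, residues):
    """Relative index of the first hit (from the left) and last hit (from the right).

    Assumption: both default to 1.0 when no residue matches.
    """
    hits = [i for i, aa in enumerate(sequence) if aa in residues]
    if not hits:
        return 1.0, 1.0
    n = len(sequence)
    return hits[0] / n, (n - 1 - hits[-1]) / n


def sequence_features(sequence):
    """Compute the sequence-only features for a stripped peptide sequence."""
    pos_l, pos_r = relative_positions(sequence, POSITIVE)
    neg_l, neg_r = relative_positions(sequence, NEGATIVE)
    return {
        "PepLength": len(sequence),
        "NumBulky": sum(aa in BULKY for aa in sequence),
        "NumTiny": sum(aa in TINY for aa in sequence),
        "NumProlines": sequence.count("P"),
        "NumGlycines": sequence.count("G"),
        "NumSerines": sequence.count("S"),
        "NumPos": sum(aa in POSITIVE for aa in sequence),
        "PosIndexL": pos_l,
        "PosIndexR": pos_r,
        "NumNeg": sum(aa in NEGATIVE for aa in sequence),
        "NegIndexL": neg_l,
        "NegIndexR": neg_r,
    }


for seq in ["LESLIEK", "LESLIE", "LESLKIE"]:
    print(seq, sequence_features(seq))
```

The remaining features (PrecursorMz, Mass and PrecursorCharge) depend on the precursor rather than on residue counts, so they are left out of this sketch.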

Training

Currently the training logic is handled using DVC (https://dvc.org).

git clone {this repo}
cd flimsay/train
dvc repro

Running this should automatically download the data, train the models, and calculate and update the metrics.

The current version of this repo predominantly uses the data from:

- Meier, F., Köhler, N.D., Brunner, AD. et al. Deep learning the collisional cross sections of the peptide universe from a million experimental values. Nat Commun 12, 1185 (2021). https://doi.org/10.1038/s41467-021-21352-8
