
A Python package for glycan structure prediction from mass spectrometry data


GlycoTrans

GlycoTrans is a package for predicting glycan structures from LC-MS/MS data. The package provides an inference pipeline along with utilities required for glycan structure prediction using our GlycoBERT and GlycoBART transformer-based deep learning models. For more details on the models and how we trained them, please refer to our manuscript.

Installation

From PyPI

pip install glycotrans

From GitHub

pip install git+https://github.com/CABSEL/glycotrans.git

We also offer a Google Colaboratory notebook that lets you run GlycoTrans without any local installation. The notebook contains a ready-to-use example workflow that you can copy, run, and adapt to your needs.

The 21658_Moon_20230505_90_MW055_CALNx1_002.mzML file used in the notebook is available on Google Drive.

Usage

glycobert_inference() from utils.py

Wrapper function for glycan structure inference from LC-MS/MS data using the GlycoBERT model.

Required Arguments:

filepath (string): path to the .mzML or .mzXML LC-MS/MS spectral file

Optional Arguments:

- vocab_path (string, default = 'vocab_glycobert.json'): path to the GlycoBERT vocabulary file
- modelDir (string, default = 'CABSEL/glycobert'): path to the trained GlycoBERT model directory
- batch_size (int, default = 256): number of spectra to process during each batch
- filename (string, default = 'unspecified'): name of the .mzML or .mzXML MS/MS file
- lc (string, default = 'PGC'): type of liquid chromatography (LC) used; options: 'PGC', 'C18', 'HILIC', 'MGC', 'other_lc' (if LC type is unknown or outside the given options)
- mode (string, default = 'negative'): type of ion mode used; options: 'negative', 'positive', 'other_mode' (if mode is unknown or outside the given options)
- modification (string, default = 'reduced'): type of glycan derivatization; options: 'reduced', 'permethylated', '2AA', 'PA', 'native', 'Rapifluor', 'other_mod' (if using a custom modification)
- glycan_type (string, default = 'N'): type of glycan class; options: 'O', 'N', 'lipids', 'free', 'other_type' (if glycan type is unknown or outside the given options)
- trap (string, default = 'linear'): type of ion trap used; options: 'linear', 'orbitrap', 'amazon', 'MSD', 'TOF', 'octopole', 'other_trap' (if trap is unknown or outside the given options)
- ionization (string, default = 'other_ion'): type of ionization used; options: 'ESI', 'MALDI', 'other_ion' (if ionization is unknown or outside the given options)
- fragmentation (string, default = 'CID'): type of fragmentation used; options: 'CID', 'HCD', 'other_frag' (if fragmentation is unknown or outside the given options)
- taxonomy_level (string, default = 'Class'): taxonomic classification level to consider from df_use
- taxonomy_filter (string, default = 'Mammalia'): specific taxonomic Class of glycans to consider from df_use
- df_use (DataFrame, default = None): glycan database with known glycan structures, taxonomy_level, etc. By default, the df_glycan database from Glycowork package is used
- mass_tag (float, default = None): custom modification mass. Set modification to 'other_mod' if using a custom modification mass
- filter_out (set, default = {'Ac', 'Kdn', 'P', 'HexA', 'Pen', 'HexN', 'Me', 'PCho', 'PEtN'}): set of monosaccharide or modification types used to filter out compositions
- glycan_pkl (string, default = 'glycan_classes.pkl'): filepath to glycan classes used in GlycoBERT training
- device (string, default = 'cpu'): type of computing device used; options: 'cpu', 'cuda'
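Because most of the metadata arguments above accept only a fixed set of strings, a typo such as 'Positive' is easy to make. The following is a minimal sketch of a pre-flight check, not part of GlycoTrans: the helper name check_metadata is hypothetical, and the allowed values are copied from the argument list above. How the package itself reacts to an unlisted value is not documented here.

```python
# Allowed values copied from the optional-argument list in this README.
ALLOWED = {
    "lc": {"PGC", "C18", "HILIC", "MGC", "other_lc"},
    "mode": {"negative", "positive", "other_mode"},
    "modification": {"reduced", "permethylated", "2AA", "PA", "native",
                     "Rapifluor", "other_mod"},
    "glycan_type": {"O", "N", "lipids", "free", "other_type"},
    "trap": {"linear", "orbitrap", "amazon", "MSD", "TOF", "octopole",
             "other_trap"},
    "ionization": {"ESI", "MALDI", "other_ion"},
    "fragmentation": {"CID", "HCD", "other_frag"},
    "device": {"cpu", "cuda"},
}


def check_metadata(mass_tag=None, **kwargs):
    """Hypothetical helper: return a dict of problems found in the settings."""
    problems = {}
    for name, value in kwargs.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            problems[name] = value
    # The README asks for modification='other_mod' when mass_tag is given;
    # 'reduced' is the documented default when modification is not passed.
    if mass_tag is not None and kwargs.get("modification", "reduced") != "other_mod":
        problems["mass_tag"] = "requires modification='other_mod'"
    return problems


print(check_metadata(mode="Positive", trap="orbitrap"))  # {'mode': 'Positive'}
```

If the returned dict is empty, the same keyword arguments can be passed on to glycobert_inference().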

Output Arguments

df_out (DataFrame): dataframe containing predicted glycan structure, composition, etc.

Example Usage

df_out = glycobert_inference(filepath = r'C:\files\21658_Moon_20230505_90_MW055_CALNx1_002.mzML', mode='positive', modification='reduced', glycan_type='O')
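Windows paths need care in Python string literals, because backslash sequences such as \f and \216 are interpreted as escape characters. A minimal sketch, using only the standard library and the example file name from this README, of how a plain literal gets corrupted and two ways to spell the path safely:

```python
from pathlib import PureWindowsPath

# Plain literal: "\f" becomes a form feed and "\216" an octal escape,
# so this string no longer names the intended file.
plain = 'C:\files\21658_Moon_20230505_90_MW055_CALNx1_002.mzML'
print('\x0c' in plain)  # True

# Option 1: a raw string keeps every backslash as written.
raw = r'C:\files\21658_Moon_20230505_90_MW055_CALNx1_002.mzML'

# Option 2: forward slashes, which Windows also accepts.
forward = 'C:/files/21658_Moon_20230505_90_MW055_CALNx1_002.mzML'

# Both safe spellings refer to the same file.
print(PureWindowsPath(raw).name)  # 21658_Moon_20230505_90_MW055_CALNx1_002.mzML
print(PureWindowsPath(raw) == PureWindowsPath(forward))  # True
```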

filter_glycans_glycobert() from utils.py

Wrapper function for the downstream processing of glycan structures predicted by glycobert_inference(). Predicted glycan structures are retained or removed based on quality control filters such as precursor mass and diagnostic ions. Run this function after glycobert_inference().

Required Arguments:

df_out (DataFrame): output dataframe from the glycobert_inference function

Optional Arguments:

- pred_thresh (float, default = 0.01): prediction confidence threshold used for filtering. Glycan structures with prediction confidence below pred_thresh are removed
- frag_threshold (float, default = 1): fraction of MS/MS peaks to consider, ranging from 0 to 1. MS/MS peaks are sorted in descending order of intensity before filtering
- MS1_ppm (int, default = 10): MS1 mass tolerance in ppm
- MS2_tolerance (float, default = 0.5): MS2 mass tolerance in Da
- annotation_thresh (int, default = 3): threshold for the number of ion matches
- filter_diag_ion (list, default = ['Neu5Gc', 'Kdn', 'S']): list of monosaccharides not expected in the prediction. Glycan structures with the listed monosaccharides will be removed

see glycobert_inference() for the remaining arguments
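To make the tolerance and threshold arguments above concrete, here is a minimal sketch of the arithmetic they usually stand for. It is an illustration under stated assumptions, not the GlycoTrans implementation: the function names, the (m/z, intensity) tuple layout, and the rounding rule in top_fraction are all hypothetical.

```python
def within_ppm(observed_mz, theoretical_mz, ppm=10):
    """True if the relative mass error is within `ppm` parts per million."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= ppm


def top_fraction(peaks, fraction=1.0):
    """Keep the most intense `fraction` of (m/z, intensity) peaks.

    Peaks are sorted by descending intensity first; truncating with int()
    is an assumption made for this sketch.
    """
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    return ranked[:int(len(ranked) * fraction)]


def count_matches(peak_mzs, fragment_mzs, tol_da=0.5):
    """Count theoretical fragments with at least one peak within tol_da."""
    return sum(
        any(abs(peak - fragment) <= tol_da for peak in peak_mzs)
        for fragment in fragment_mzs
    )


print(within_ppm(1000.005, 1000.0))  # True  (about 5 ppm)
print(within_ppm(1000.02, 1000.0))   # False (about 20 ppm)

peaks = [(100.0, 5), (200.0, 50), (300.0, 20), (400.0, 1)]
print(top_fraction(peaks, 0.5))      # [(200.0, 50), (300.0, 20)]

print(count_matches([204.1, 366.0], [204.09, 500.0, 366.4]))  # 2
```

The match count is the quantity that annotation_thresh is compared against.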

Output Arguments

- df_before_deduplication (DataFrame): dataframe containing glycan predictions from GlycoBERT model after all the quality filters
- df_filtered (DataFrame): dataframe after removing duplicate glycan predictions from df_before_deduplication dataframe

Example Usage

df_filtered, df_before_deduplication = filter_glycans_glycobert(df_out)

glycobart_inference() from utils.py

Wrapper function for glycan structure inference from LC-MS/MS data using the GlycoBART model.

Required Arguments:

filepath (string): path to the .mzML or .mzXML LC-MS/MS spectral file

Optional Arguments:

- vocab_path (string, default = 'vocab_glycobart.json'): path to the GlycoBART vocabulary file
- modelDir (string, default = 'CABSEL/glycobart'): path to the trained GlycoBART model directory
- num_beam (int, default = 32): number of glycan structures to consider during GlycoBART inference
- num_return (int, default = 32): number of glycan structures to return from GlycoBART inference. Should not be greater than num_beam

see glycobert_inference() for the remaining arguments
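The constraint that num_return should not exceed num_beam can be checked before starting a long inference run. A minimal sketch, assuming a hypothetical helper named check_beam_settings that is not part of GlycoTrans; its defaults are the ones documented above:

```python
def check_beam_settings(num_beam=32, num_return=32):
    """Hypothetical helper: validate beam settings before inference."""
    if num_beam < 1 or num_return < 1:
        raise ValueError("num_beam and num_return must be positive integers")
    if num_return > num_beam:
        raise ValueError(
            f"num_return ({num_return}) should not be greater than "
            f"num_beam ({num_beam})"
        )
    return {"num_beam": num_beam, "num_return": num_return}


# Valid settings can be unpacked into the inference call as keyword arguments.
beam_kwargs = check_beam_settings(num_beam=4, num_return=4)
print(beam_kwargs)  # {'num_beam': 4, 'num_return': 4}
```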

Output Arguments

df_out (DataFrame): dataframe containing predicted glycan structure, composition, etc.

Example Usage

df_out = glycobart_inference(filepath = r'C:\files\21658_Moon_20230505_90_MW055_CALNx1_002.mzML', mode='positive', modification='reduced', glycan_type='O', num_beam = 4, num_return = 4)

filter_glycans_glycobart() from utils.py

Wrapper function for the downstream processing of glycan structures predicted by glycobart_inference(). Predicted glycan structures are retained or removed based on quality control filters such as precursor mass and diagnostic ions. Run this function after glycobart_inference().

Required Arguments:

df_out (DataFrame): output dataframe from the glycobart_inference function

Optional Arguments:

see glycobert_inference() and filter_glycans_glycobert()

Output Arguments

- df_before_deduplication (DataFrame): dataframe containing glycan predictions from GlycoBART model after all the quality filters
- df_filtered (DataFrame): dataframe after removing duplicate glycan predictions from df_before_deduplication dataframe

Example Usage

df_filtered, df_before_deduplication = filter_glycans_glycobart(df_out)
