A Python package for glycan structure prediction from mass spectrometry data

Project description

glycoTrans

A repository for transformer-based models for glycan structure prediction from mass spectrometry data. Presently, the repository includes two models: GlycoBERT and GlycoBART.

(Ejas Althaf Abtheen, Arun Singh, Shyam Sriram, Changyou Chen, Sriram Neelamegham, Rudiyanto Gunawan. bioRxiv 2025.07.02.662857; doi: https://doi.org/10.1101/2025.07.02.662857)

Overview

glycoTrans applies transformer architectures to glycomics to predict glycan structures from tandem mass spectrometry (MS/MS) data. Mass spectra and glycan structures are treated as sequences, called MS sentences and glycan sentences, respectively, and the models are trained to capture contextual relationships in the spectral data.

Key Models

  • GlycoBERT: A BERT-based sequence classifier for high-accuracy glycan structure prediction
  • GlycoBART: A BART-based generative model capable of de novo glycan structure inference

Installation

From PyPI

pip install glycotrans

From GitHub

pip install git+https://github.com/CABSEL/glycotrans.git
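
To confirm that either install command succeeded, the Python standard library can report the installed version. This is a minimal sketch: it assumes only the distribution name `glycotrans` used in the commands above, and the helper name `installed_version` is illustrative rather than part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("glycotrans")
    if found is None:
        print("glycotrans is not installed in this environment")
    else:
        print(f"glycotrans version: {found}")
```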

Features

Superior Performance

  • 95.1% structural accuracy for GlycoBERT on test datasets, compared with 90.3% for the CNN-based CandyCrunch
  • Robust performance across diverse MS analysis parameters and glycan types

De Novo Discovery

  • GlycoBART's generative capability enables prediction of novel glycan structures not present in training data
  • Overcomes database-dependent limitations

Transformer Architecture

  • Self-attention mechanisms capture long-range dependencies in spectral data
  • Bidirectional encoding in BERT and in the BART encoder for full-context understanding
  • Custom tokenization for MS spectra and glycan structures

Technical Innovation

MS Sentence Representation

Our novel approach converts MS/MS spectra into "MS sentences" containing:

  • Experimental metadata (LC type, ion mode, fragmentation method, etc.)
  • Normalized retention time
  • Precursor m/z and fragment ion information
  • Peak intensity encoding through positional embeddings
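
The components listed above can be pictured as one token sequence per spectrum. The following is a rough sketch, not the package's tokenizer: the function name `ms_sentence`, the token spellings (`rt=`, `prec=`, `frag=`), the `[CLS]`/`[SEP]` markers, the one-decimal m/z rounding, and the choice to order fragments by descending intensity so that token position reflects intensity rank are all assumptions made for illustration.

```python
def ms_sentence(metadata, retention_time, run_length, precursor_mz, peaks, decimals=1):
    """Build an illustrative token list from one MS/MS spectrum.

    metadata: dict of experimental settings (hypothetical keys).
    peaks: list of (mz, intensity) pairs.
    """
    tokens = ["[CLS]"]

    # Experimental metadata as categorical tokens, in a fixed (sorted) key order.
    for key in sorted(metadata):
        tokens.append(f"{key}={metadata[key]}")

    # Retention time normalized by the run length, so it falls in [0, 1].
    tokens.append(f"rt={retention_time / run_length:.2f}")

    # Precursor m/z, rounded to a fixed number of decimals.
    tokens.append(f"prec={precursor_mz:.{decimals}f}")

    # Fragment ions sorted by descending intensity: the position of each
    # fragment token then carries its intensity rank (an assumption here).
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    for mz, _intensity in ranked:
        tokens.append(f"frag={mz:.{decimals}f}")

    tokens.append("[SEP]")
    return tokens


example = ms_sentence(
    metadata={"lc": "PGC", "mode": "negative"},
    retention_time=12.0,
    run_length=60.0,
    precursor_mz=911.33,
    peaks=[(204.0867, 50.0), (366.1395, 100.0), (138.0550, 10.0)],
)
print(example)
```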

Glycan Sentence Format

Glycan structures are represented as sequences of constituent antennae:

  • Terminal-to-core monosaccharide ordering
  • Linkage information preservation
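
One way to picture this representation is to walk a glycan tree and emit one terminal-to-core path per antenna. The sketch below is illustrative only: the `Residue` class, the `antennae` function, and the `Name(linkage)` token spelling are hypothetical and may differ from the format the models actually use.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Residue:
    name: str
    linkage: Optional[str] = None  # linkage to the parent; None at the reducing end
    children: List["Residue"] = field(default_factory=list)


def antennae(core):
    """Return one token list per leaf residue, ordered terminal-to-core."""
    paths = []

    def walk(node, path_from_core):
        # Keep the linkage next to the residue so it is preserved in the output.
        label = node.name if node.linkage is None else f"{node.name}({node.linkage})"
        path = path_from_core + [label]
        if not node.children:
            paths.append(path[::-1])  # reverse: terminal residue first, core last
        for child in node.children:
            walk(child, path)

    walk(core, [])
    return paths


# A small branched example: two mannose antennae on a trimannosyl-type core.
core = Residue("GlcNAc", None, [
    Residue("GlcNAc", "b1-4", [
        Residue("Man", "b1-4", [
            Residue("Man", "a1-3"),
            Residue("Man", "a1-6"),
        ]),
    ]),
])

for antenna in antennae(core):
    print(" ".join(antenna))
```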

Model Architecture

GlycoBERT

  • Base: BERT encoder with 12 transformer layers
  • Parameters: 96 million trainable parameters
  • Task: Multi-class classification (3,590 glycan classes)
  • Attention: 12 attention heads per layer
  • Embedding: 768-dimensional representations
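
The figures above can be sanity-checked with a little arithmetic. The sketch below uses the stated hidden size, layer count, head count, and class count; the feed-forward width of 4 × 768 and the standard BERT layer layout are assumptions, since they are not stated here.

```python
HIDDEN = 768          # embedding dimension (from the list above)
LAYERS = 12           # transformer layers (from the list above)
HEADS = 12            # attention heads per layer (from the list above)
CLASSES = 3590        # glycan classes (from the list above)
FFN = 4 * HIDDEN      # assumption: standard BERT intermediate size


def encoder_layer_params(hidden, ffn):
    """Trainable parameters in one standard BERT encoder layer."""
    attention = 4 * (hidden * hidden + hidden)            # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                        # two LayerNorms, scale + bias
    return attention + feed_forward + layer_norms


head_dim = HIDDEN // HEADS
encoder_params = LAYERS * encoder_layer_params(HIDDEN, FFN)
classifier_params = HIDDEN * CLASSES + CLASSES            # linear classification head

print(f"per-head dimension: {head_dim}")
print(f"encoder layers:     {encoder_params:,}")
print(f"classifier head:    {classifier_params:,}")
```

Under these assumptions the encoder layers and the classification head account for roughly 88 million parameters, so the rest of the reported 96 million would sit in the embeddings and any pooling layer.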

GlycoBART

  • Base: BART encoder-decoder architecture
  • Parameters: 207 million trainable parameters
  • Task: Conditional sequence generation
  • Architecture: 12-layer encoder + 12-layer decoder
  • Attention: 16 attention heads per layer
  • Generation: Beam search with 32 beams
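
Beam search keeps the best-scoring partial sequences at each decoding step instead of committing to a single token. The toy sketch below shows the general technique with 2 beams and a hand-written scoring function; `beam_search`, `toy_step`, the token names, and the end-of-sequence marker are illustrative assumptions, and the sketch applies no length normalization or other refinements a production decoder may use.

```python
import math


def beam_search(step_fn, num_beams, max_len, eos="</s>"):
    """Return up to num_beams (sequence, log-probability) pairs, best first.

    step_fn(prefix) must return a dict mapping next token -> log-probability.
    """
    beams = [((), 0.0)]   # unfinished hypotheses
    finished = []         # hypotheses that have emitted the end token

    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))

        # Keep only the best num_beams expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:num_beams]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break

    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:num_beams]


def toy_step(prefix):
    """Hand-written next-token distribution standing in for a decoder."""
    if len(prefix) == 0:
        return {"Gal": math.log(0.6), "Fuc": math.log(0.4)}
    if len(prefix) == 1:
        return {"GlcNAc": math.log(0.9), "</s>": math.log(0.1)}
    return {"</s>": 0.0}


results = beam_search(toy_step, num_beams=2, max_len=3)
for seq, score in results:
    print(" ".join(seq), round(math.exp(score), 2))
```

The ranked list returned by a search like this is what makes top-k evaluation possible: the first entry is the top-1 prediction, and the first five entries form the top-5 set.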

Performance Metrics

Accuracy Levels

  1. Mass Accuracy: Monoisotopic mass matching
  2. Composition Accuracy: Monosaccharide composition matching
  3. Topological Accuracy: Branching pattern recognition
  4. Structural Accuracy: Complete linkage-specific identification
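
The first two levels can be illustrated with a short sketch. It assumes each glycan has already been reduced to a list of monosaccharide classes (Hex, HexNAc, dHex, NeuAc) and uses standard monoisotopic residue masses; the function names and the 0.01 Da tolerance are illustrative choices, not the evaluation code behind the reported numbers.

```python
from collections import Counter

# Monoisotopic residue masses in Da (monosaccharide minus water).
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106  # added once for the free reducing end


def monoisotopic_mass(residues):
    """Monoisotopic mass of an underivatized glycan from its residue list."""
    return sum(RESIDUE_MASS[r] for r in residues) + WATER


def mass_match(predicted, target, tol=0.01):
    """Level 1: masses agree within a tolerance (illustrative value)."""
    return abs(monoisotopic_mass(predicted) - monoisotopic_mass(target)) <= tol


def composition_match(predicted, target):
    """Level 2: same monosaccharide counts, regardless of arrangement."""
    return Counter(predicted) == Counter(target)


predicted = ["Hex"] * 5 + ["HexNAc"] * 2
target = ["HexNAc", "Hex", "HexNAc", "Hex", "Hex", "Hex", "Hex"]

print(round(monoisotopic_mass(predicted), 4))
print(mass_match(predicted, target), composition_match(predicted, target))
```

Topological and structural accuracy need the branching and linkage information as well, so they cannot be computed from residue counts alone.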

Benchmark Results

Model               Mass     Composition   Topology   Structure
GlycoBERT           98.8%    98.8%         96.7%      95.1%
GlycoBART (top-1)   93.2%    93.1%         90.4%      89.1%
GlycoBART (top-5)   95.5%    95.5%         93.3%      93.1%
CandyCrunch         94.1%    94.0%         91.2%      90.3%
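
The top-1 and top-5 rows follow the usual top-k convention: a prediction counts as correct if the target appears among the k highest-ranked candidates. A minimal sketch of that convention, with a hypothetical function name and toy labels rather than real glycans:

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k ranked predictions."""
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy data: each inner list is ranked best-first.
ranked = [["A", "B"], ["B", "A"], ["C", "A"]]
truth = ["A", "A", "B"]

print(top_k_accuracy(ranked, truth, k=1))
print(top_k_accuracy(ranked, truth, k=2))
```

Top-k accuracy can only stay the same or rise as k grows, which matches the pattern in the GlycoBART rows.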

Example

The Google Colab notebook Example_Inference.ipynb shows an example inference that can be modified for specific use cases.
The example file used in the notebook is available on Google Drive.

Data and Models

Training data are available on Zenodo under DOI.

The full versions of GlycoBERT and GlycoBART are available on Hugging Face.

Funding

This work was supported by funding from NHLBI (HL103411).

Download files

Source Distribution

glycotrans-0.1.1.tar.gz (144.7 kB view details)

Uploaded Source

Built Distribution

glycotrans-0.1.1-py3-none-any.whl (142.0 kB view details)

Uploaded Python 3

File details

Details for the file glycotrans-0.1.1.tar.gz.

File metadata

  • Download URL: glycotrans-0.1.1.tar.gz
  • Upload date:
  • Size: 144.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for glycotrans-0.1.1.tar.gz
Algorithm Hash digest
SHA256 40dcfa202c2aee2476e1b2952f14e082092cec173847df40248643d375801c9f
MD5 76c682375ab5bb7c06ec239833fd3272
BLAKE2b-256 ba8afa45453ca9bb0e5258b08e46e688a250abfb6815f6505cd5493c433075fd
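
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are illustrative, and the local file path is an assumption about where the download was saved.

```python
import hashlib
import os

# SHA256 digest listed above for glycotrans-0.1.1.tar.gz.
EXPECTED_SHA256 = "40dcfa202c2aee2476e1b2952f14e082092cec173847df40248643d375801c9f"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    path = "glycotrans-0.1.1.tar.gz"  # assumed download location
    if os.path.exists(path):
        print("digest OK" if matches(path, EXPECTED_SHA256) else "digest MISMATCH")
    else:
        print(f"{path} not found")
```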

File details

Details for the file glycotrans-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: glycotrans-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 142.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for glycotrans-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 dd4c00953b0307b0b4e59f708cd1b7ae9e24f3027bb77c3da11d558eaa784029
MD5 34c22ab7fb970803f00d02d4fca44273
BLAKE2b-256 09f7ef34f1e67509f60fe606059959cfc4965fd230113247d0c72bb39a54f087
