A Python package for glycan structure prediction from mass spectrometry data

glycoTrans

A repository for transformer-based models for glycan structure prediction from mass spectrometry data. Presently, the repository includes two models: GlycoBERT and GlycoBART.

(Ejas Althaf Abtheen, Arun Singh, Shyam Sriram, Changyou Chen, Sriram Neelamegham, Rudiyanto Gunawan. bioRxiv 2025.07.02.662857; doi: https://doi.org/10.1101/2025.07.02.662857)

Overview

glycoTrans introduces state-of-the-art transformer architectures to glycomics, enhancing how we predict glycan structures from tandem mass spectrometry (MS/MS) data. Mass spectra and glycan structures are treated as sequences, called MS sentences and glycan sentences, respectively, and the models are trained to capture complex contextual relationships in spectral data.

Key Models

  • GlycoBERT: A BERT-based sequence classifier for high-accuracy glycan structure prediction
  • GlycoBART: A BART-based generative model capable of de novo glycan structure inference

Installation

From PyPI

pip install glycotrans

From GitHub

pip install git+https://github.com/CABSEL/glycotrans.git
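
After either install route, the installed version can be checked from Python. This is a minimal sketch that relies only on the standard library; the helper name `installed_version` is illustrative and not part of glycotrans.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("glycotrans")
    print(f"glycotrans: {found or 'not installed'}")
```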

Features

Superior Performance

  • 95.1% structural accuracy for GlycoBERT on test datasets, outperforming the state-of-the-art CNN-based method CandyCrunch (90.3%)
  • Robust performance across diverse MS analysis parameters and glycan types

De Novo Discovery

  • GlycoBART's generative capability enables prediction of novel glycan structures not present in training data
  • Overcomes database-dependent limitations

Transformer Architecture

  • Self-attention mechanisms capture long-range dependencies in spectral data
  • Bidirectional processing using BERT and BART for comprehensive context understanding
  • Custom tokenization for MS spectra and glycan structures

Technical Innovation

MS Sentence Representation

Our novel approach converts MS/MS spectra into "MS sentences" containing:

  • Experimental metadata (LC type, ion mode, fragmentation method, etc.)
  • Normalized retention time
  • Precursor m/z and fragment ion information
  • Peak intensity encoding through positional embeddings
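
The components listed above can be pictured with a small sketch. It is not the glycoTrans tokenizer: the `Spectrum` fields, the bracketed token spelling, the retention-time binning, and the choice to order fragment tokens by intensity rank (so that position stands in for intensity) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Spectrum:
    """Minimal MS/MS record; the field names are illustrative only."""
    lc_type: str
    ion_mode: str
    fragmentation: str
    retention_time: float              # assumed already normalized to [0, 1]
    precursor_mz: float
    peaks: List[Tuple[float, float]]   # (m/z, intensity) pairs


def to_ms_sentence(spec: Spectrum, max_peaks: int = 5, rt_bins: int = 10) -> List[str]:
    """Build a hypothetical token sequence ("MS sentence") for one spectrum."""
    # Discretize the normalized retention time into one of rt_bins buckets.
    rt_bin = min(int(spec.retention_time * rt_bins), rt_bins - 1)
    tokens = [
        f"[LC={spec.lc_type}]",
        f"[MODE={spec.ion_mode}]",
        f"[FRAG={spec.fragmentation}]",
        f"[RT={rt_bin}]",
        f"[PRE={spec.precursor_mz:.1f}]",
    ]
    # Most intense peaks first, so a token's position reflects its intensity rank.
    ranked = sorted(spec.peaks, key=lambda peak: peak[1], reverse=True)[:max_peaks]
    tokens += [f"{mz:.1f}" for mz, _ in ranked]
    return tokens


if __name__ == "__main__":
    example = Spectrum("PGC", "negative", "CID", 0.37, 911.34,
                       [(204.1, 10.0), (366.1, 80.0), (528.2, 45.0)])
    print(" ".join(to_ms_sentence(example)))
```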

Glycan Sentence Format

Glycan structures are represented as sequences of constituent antennae:

  • Terminal-to-core monosaccharide ordering
  • Linkage information preservation
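
One way to read this format is as a walk over the glycan tree, one antenna at a time. The sketch below assumes a hypothetical `Node` class and an IUPAC-like token spelling; the actual glycan sentence vocabulary used by glycoTrans may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One monosaccharide; `linkage` is the bond to its parent (None at the core)."""
    name: str
    linkage: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def antennae(node: Node, path: Tuple[Node, ...] = ()) -> Iterator[Tuple[Node, ...]]:
    """Yield every core-to-terminal path through the glycan tree."""
    path = path + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from antennae(child, path)


def to_glycan_sentence(core: Node) -> List[str]:
    """Write each antenna terminal-first, keeping the linkage after each residue."""
    sentence = []
    for path in antennae(core):
        parts = []
        for node in reversed(path):  # terminal-to-core ordering
            parts.append(f"{node.name}({node.linkage})" if node.linkage else node.name)
        sentence.append("".join(parts))
    return sentence


if __name__ == "__main__":
    # Disialylated core 1 O-glycan, used only as a small branched example.
    glycan = Node("GalNAc", children=[
        Node("Gal", "b1-3", [Node("Neu5Ac", "a2-3")]),
        Node("Neu5Ac", "a2-6"),
    ])
    print(to_glycan_sentence(glycan))
```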

Model Architecture

GlycoBERT

  • Base: BERT encoder with 12 transformer layers
  • Parameters: 96 million trainable parameters
  • Task: Multi-class classification (3,590 glycan classes)
  • Attention: 12 attention heads per layer
  • Embedding: 768-dimensional representations
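
The GlycoBERT figures above can be cross-checked with a little arithmetic. The sketch below counts the weights of a standard BERT encoder stack and a linear classification head. The feed-forward width of 3,072 is an assumption (the usual BERT-base value, not stated here), and the function names are illustrative.

```python
def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Weights and biases in one standard BERT encoder layer."""
    attention = 4 * (hidden * hidden + hidden)       # Q, K, V and output projections
    feed_forward = 2 * hidden * intermediate + intermediate + hidden
    layer_norms = 2 * (2 * hidden)                   # two LayerNorms, weight + bias each
    return attention + feed_forward + layer_norms


def classifier_params(hidden: int, num_classes: int) -> int:
    """Weights and biases in a single linear classification layer."""
    return hidden * num_classes + num_classes


if __name__ == "__main__":
    LAYERS, HIDDEN, CLASSES = 12, 768, 3590          # values from the list above
    INTERMEDIATE = 3072                              # assumed, not stated in this document
    encoder_total = LAYERS * encoder_layer_params(HIDDEN, INTERMEDIATE)
    head_total = classifier_params(HIDDEN, CLASSES)
    print(f"encoder stack: {encoder_total:,}")
    print(f"classification head: {head_total:,}")
```

Under that assumption, the 12 encoder layers account for about 85 million parameters and the 3,590-class head for about 2.8 million, so the remainder of the reported 96 million would sit in the embeddings and pooler.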

GlycoBART

  • Base: BART encoder-decoder architecture
  • Parameters: 207 million trainable parameters
  • Task: Conditional sequence generation
  • Architecture: 12-layer encoder + 12-layer decoder
  • Attention: 16 attention heads per layer
  • Generation: Beam search with 32 beams
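
Beam search keeps a fixed number of partial sequences alive at each decoding step instead of committing to the single most likely token. The toy sketch below shows the general technique in plain Python; `toy_model`, its probabilities, and the `</s>` end token are invented for illustration and have nothing to do with GlycoBART's actual decoder.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[float, List[str]]  # (cumulative log-probability, tokens so far)


def beam_search(next_probs: Callable[[List[str]], Dict[str, float]],
                num_beams: int, max_len: int, eos: str = "</s>") -> List[Beam]:
    """Return up to num_beams finished hypotheses, best first."""
    beams: List[Beam] = [(0.0, [])]
    finished: List[Beam] = []
    for _ in range(max_len):
        candidates: List[Beam] = []
        for score, tokens in beams:
            for token, prob in next_probs(tokens).items():
                if prob <= 0.0:
                    continue
                extended = (score + math.log(prob), tokens + [token])
                # Hypotheses that emit the end token leave the active set.
                (finished if token == eos else candidates).append(extended)
        if not candidates:
            break
        candidates.sort(key=lambda beam: beam[0], reverse=True)
        beams = candidates[:num_beams]  # prune to the beam width
    # Hypotheses still unfinished after max_len steps are dropped.
    return sorted(finished, key=lambda beam: beam[0], reverse=True)[:num_beams]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution over a tiny vocabulary."""
    if not prefix:
        return {"Neu5Ac": 0.6, "Fuc": 0.4}
    if prefix[-1] in ("Neu5Ac", "Fuc"):
        return {"Gal": 0.7, "GlcNAc": 0.3}
    return {"</s>": 1.0}


if __name__ == "__main__":
    for log_prob, tokens in beam_search(toy_model, num_beams=2, max_len=3):
        print(f"{math.exp(log_prob):.2f}", " ".join(tokens))
```

A wider beam returns more ranked hypotheses per spectrum, which is what makes reporting both top-1 and top-5 predictions possible.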

Performance Metrics

Accuracy Levels

  1. Mass Accuracy: Monoisotopic mass matching
  2. Composition Accuracy: Monosaccharide composition matching
  3. Topological Accuracy: Branching pattern recognition
  4. Structural Accuracy: Complete linkage-specific identification
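
The four levels form a hierarchy: a structural match implies a topological match, which implies matching composition and mass. The sketch below illustrates this on IUPAC-like strings. The parsing rules, the 0.01 Da tolerance, and the definition of topology as "the structure with linkages removed" are simplifying assumptions, not the evaluation code behind the reported results. The residue masses are standard monoisotopic values.

```python
import re
from collections import Counter

# Map monosaccharide names to composition classes (small illustrative subset).
MONO_CLASS = {"Gal": "Hex", "Man": "Hex", "Glc": "Hex", "GlcNAc": "HexNAc",
              "GalNAc": "HexNAc", "Fuc": "dHex", "Neu5Ac": "Neu5Ac"}
# Monoisotopic residue masses in Da; one water is added for the free glycan.
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579, "Neu5Ac": 291.0954}
WATER = 18.0106

_LINKAGE = re.compile(r"\([^)]*\)")
# Longer names first so that "GlcNAc" is not read as "Glc".
_MONO = re.compile("|".join(sorted(MONO_CLASS, key=len, reverse=True)))


def composition(structure: str) -> Counter:
    """Count composition classes after stripping linkage annotations."""
    bare = _LINKAGE.sub("", structure)
    return Counter(MONO_CLASS[name] for name in _MONO.findall(bare))


def mass(structure: str) -> float:
    """Monoisotopic mass of the free glycan implied by its composition."""
    return sum(RESIDUE_MASS[cls] * n for cls, n in composition(structure).items()) + WATER


def match_levels(predicted: str, reference: str, tol: float = 0.01) -> dict:
    """Report which of the four accuracy levels a prediction satisfies."""
    return {
        "mass": abs(mass(predicted) - mass(reference)) <= tol,
        "composition": composition(predicted) == composition(reference),
        "topology": _LINKAGE.sub("", predicted) == _LINKAGE.sub("", reference),
        "structure": predicted == reference,
    }


if __name__ == "__main__":
    # Same residues and order, different linkage: correct up to topology only.
    print(match_levels("Gal(b1-4)GlcNAc", "Gal(b1-3)GlcNAc"))
```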

Benchmark Results

Model               Mass     Composition   Topology   Structure
GlycoBERT           98.8%    98.8%         96.7%      95.1%
GlycoBART (top-1)   93.2%    93.1%         90.4%      89.1%
GlycoBART (top-5)   95.5%    95.5%         93.3%      93.1%
CandyCrunch         94.1%    94.0%         91.2%      90.3%
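
The top-1 and top-5 rows differ only in how many ranked candidates per spectrum are allowed to count as a hit. A minimal sketch of that calculation, with made-up predictions and a hypothetical function name:

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[List[str]],
                   references: Sequence[str], k: int) -> float:
    """Fraction of spectra whose reference appears among the first k candidates."""
    hits = sum(ref in ranked[:k] for ranked, ref in zip(ranked_predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Made-up candidate lists for three spectra, best candidate first.
    ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "D"]]
    truth = ["A", "A", "A"]
    print(top_k_accuracy(ranked, truth, k=1), top_k_accuracy(ranked, truth, k=2))
```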

Example

Example_Inference.ipynb (Open in Colab)

The Google Colab notebook shows an example inference that can be modified for specific use cases.
The example file used in the notebook can be found on Google Drive.

Data and Models

Training data are available on Zenodo under DOI.

The full versions of GlycoBERT and GlycoBART are available on Hugging Face.

Funding

This work was supported by funding from NHLBI (HL103411).

Download files

Source Distribution

glycotrans-0.1.0.tar.gz (238.9 kB)

Built Distribution

glycotrans-0.1.0-py3-none-any.whl (236.4 kB)

File details

Details for the file glycotrans-0.1.0.tar.gz.

File metadata

  • Download URL: glycotrans-0.1.0.tar.gz
  • Upload date:
  • Size: 238.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for glycotrans-0.1.0.tar.gz

Algorithm     Hash digest
SHA256        0bfad97ef461ecaf1aef102498a7d6309fa84ceecaf4a12b93d0136d4bbe4c1d
MD5           7e3e1aadb7aa70e243b17914696b5963
BLAKE2b-256   4e2d606e469b4fff57356672c1379e66efa27d27c8316ff5ffd8d9666e7efb5d
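
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are illustrative, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = Path("glycotrans-0.1.0.tar.gz")  # assumed local download location
    expected = "0bfad97ef461ecaf1aef102498a7d6309fa84ceecaf4a12b93d0136d4bbe4c1d"
    if archive.exists():
        print("OK" if matches(archive, expected) else "MISMATCH")
```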

File details

Details for the file glycotrans-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: glycotrans-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 236.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for glycotrans-0.1.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        5928e86ec625ab50237be94ca9395f276507a5c9982a3d87beac566fe1f777e7
MD5           1e8d385357030087d685216b50afaa52
BLAKE2b-256   0b589e34dcda5fcf990e36a73ddfbe1e44d47d463b084ce3141282b764c4cd91
