A Python package for glycan structure prediction from mass spectrometry data
glycoTrans
A repository for transformer-based models for glycan structure prediction from mass spectrometry data. Presently, the repository includes two models: GlycoBERT and GlycoBART.
(Ejas Althaf Abtheen, Arun Singh, Shyam Sriram, Changyou Chen, Sriram Neelamegham, Rudiyanto Gunawan. bioRxiv 2025.07.02.662857; doi: https://doi.org/10.1101/2025.07.02.662857)
Overview
glycoTrans applies transformer architectures to glycomics to predict glycan structures from tandem mass spectrometry (MS/MS) data. Mass spectra and glycan structures are treated as sequences, called MS sentences and glycan sentences, respectively, so that the models can learn contextual relationships in spectral data.
Key Models
- GlycoBERT: A BERT-based sequence classifier for high-accuracy glycan structure prediction
- GlycoBART: A BART-based generative model capable of de novo glycan structure inference
Installation
From PyPI
pip install glycotrans
From GitHub
pip install git+https://github.com/CABSEL/glycotrans.git
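After either install command, a quick way to confirm that the distribution is visible to the current interpreter is to query its metadata. The sketch below uses only the standard library and the distribution name `glycotrans` from the commands above; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "glycotrans"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"glycotrans version: {found}" if found else "glycotrans is not installed")
```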
Features
Superior Performance
- 95.1% structural accuracy (GlycoBERT) on test datasets, outperforming CNN-based methods such as CandyCrunch
- Robust performance across diverse MS analysis parameters and glycan types
De Novo Discovery
- GlycoBART's generative capability enables prediction of novel glycan structures not present in training data
- Overcomes database-dependent limitations
Transformer Architecture
- Self-attention mechanisms capture long-range dependencies in spectral data
- Bidirectional processing using BERT and BART for comprehensive context understanding
- Custom tokenization for MS spectra and glycan structures
Technical Innovation
MS Sentence Representation
Our novel approach converts MS/MS spectra into "MS sentences" containing:
- Experimental metadata (LC type, ion mode, fragmentation method, etc.)
- Normalized retention time
- Precursor m/z and fragment ion information
- Peak intensity encoding through positional embeddings
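To make the idea concrete, the following is a minimal sketch of how a spectrum could be turned into a token sequence. It is not the package's tokenizer: the function name `ms_sentence`, the token spellings, the 0.1 m/z bin width, and the reading of "positional" intensity encoding as ordering peaks by descending intensity are all assumptions made for illustration.

```python
def ms_sentence(metadata, retention_time, precursor_mz, peaks, mz_bin=0.1, max_peaks=5):
    """Build a toy token list from one MS/MS spectrum.

    metadata: dict of experimental settings (hypothetical keys).
    peaks: list of (mz, intensity) pairs.
    """
    tokens = [f"{key}={value}" for key, value in metadata.items()]
    tokens.append(f"rt={retention_time:.2f}")
    tokens.append(f"precursor={precursor_mz:.1f}")
    # Sort by descending intensity so that a peak's position in the sentence
    # reflects its intensity rank; the intensity value itself is not a token.
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:max_peaks]
    for mz, _intensity in ranked:
        tokens.append(f"mz={round(mz / mz_bin) * mz_bin:.1f}")
    return tokens


example = ms_sentence(
    metadata={"lc": "PGC", "mode": "negative", "frag": "CID"},
    retention_time=0.35,
    precursor_mz=748.3,
    peaks=[(204.087, 0.4), (366.139, 1.0), (528.192, 0.7)],
)
print(example)
```

In this toy version the most intense peak (366.139) comes first among the fragment tokens, so a position-aware model could recover intensity rank from token order alone.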
Glycan Sentence Format
Glycan structures are represented as sequences of constituent antennae:
- Terminal-to-core monosaccharide ordering
- Linkage information preservation
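The antenna-based representation can be illustrated with a small tree walk. This is a sketch under assumptions: the nested-tuple tree layout, the function name `antennae`, and the `" | "` separator are hypothetical and may differ from the format glycoTrans actually uses.

```python
def antennae(node, path_to_core=()):
    """Yield one terminal-to-core token list per leaf of a glycan tree.

    node: (residue, linkage_to_parent, children); the core residue has linkage None.
    """
    residue, linkage, children = node
    step = (residue,) if linkage is None else (residue, linkage)
    path = step + path_to_core  # prepend, so the terminal residue stays first
    if not children:
        yield list(path)
    for child in children:
        yield from antennae(child, path)


# Toy tree for Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc
core = ("GlcNAc", None, [
    ("GlcNAc", "b1-4", [
        ("Man", "b1-4", [
            ("Man", "a1-3", []),
            ("Man", "a1-6", []),
        ]),
    ]),
])

sentences = list(antennae(core))
glycan_sentence = " | ".join(" ".join(tokens) for tokens in sentences)
print(glycan_sentence)
```

Each antenna lists residues from the terminal monosaccharide down to the core and keeps every linkage next to the residue it belongs to, which mirrors the two properties listed above.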
Model Architecture
GlycoBERT
- Base: BERT encoder with 12 transformer layers
- Parameters: 96 million trainable parameters
- Task: Multi-class classification (3,590 glycan classes)
- Attention: 12 attention heads per layer
- Embedding: 768-dimensional representations
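As a rough sanity check on these figures, the encoder and classifier sizes can be estimated from the layer count, hidden size, and class count listed above. The sketch assumes a standard BERT-base layer layout and a feed-forward width of 4 × 768 = 3072; neither is stated in this README, so the result is an estimate only.

```python
def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Parameter count of one standard transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms, weight and bias each
    return attention + feed_forward + layer_norms


hidden, layers, classes = 768, 12, 3590
intermediate = 4 * hidden  # assumption: BERT-base default, not stated in this README

encoder = layers * encoder_layer_params(hidden, intermediate)
classifier = hidden * classes + classes  # linear head over 3,590 glycan classes

print(f"encoder layers:   {encoder:,}")
print(f"classifier head:  {classifier:,}")
print(f"subtotal:         {encoder + classifier:,}")
```

Under these assumptions the 12 layers and the classification head account for about 88 million parameters, which would leave roughly 8 million of the reported 96 million for token and position embeddings and other components.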
GlycoBART
- Base: BART encoder-decoder architecture
- Parameters: 207 million trainable parameters
- Task: Conditional sequence generation
- Architecture: 12-layer encoder + 12-layer decoder
- Attention: 16 attention heads per layer
- Generation: Beam search with 32 beams
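For readers unfamiliar with beam search, the sketch below shows the core idea on a toy next-token model. It is a simplified illustration, not GlycoBART's decoder: the transition table, token names, and function names are hypothetical, and refinements such as length penalties are omitted.

```python
import math


def beam_search(step_log_probs, start, end, num_beams=32, max_len=20):
    """Return finished sequences as (tokens, log-probability), best first.

    step_log_probs(prefix) must return a dict of next_token -> log-probability.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the num_beams highest-scoring extensions.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda cand: cand[1], reverse=True)
    return finished


# Toy first-order model: the next token depends only on the previous one.
TABLE = {
    "<s>": {"Gal": 0.6, "Fuc": 0.4},
    "Gal": {"GlcNAc": 0.9, "</s>": 0.1},
    "Fuc": {"GlcNAc": 1.0},
    "GlcNAc": {"</s>": 1.0},
}


def toy_model(prefix):
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}


results = beam_search(toy_model, "<s>", "</s>", num_beams=32)
for tokens, score in results:
    print(" ".join(tokens), round(math.exp(score), 2))
```

Keeping many beams alive lets the decoder return several ranked candidates, which is what makes top-k evaluation (for example, top-1 versus top-5) possible.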
Performance Metrics
Accuracy Levels
- Mass Accuracy: Monoisotopic mass matching
- Composition Accuracy: Monosaccharide composition matching
- Topological Accuracy: Branching pattern recognition
- Structural Accuracy: Complete linkage-specific identification
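The four levels form a hierarchy from coarse to fine. The sketch below shows one way to score a single prediction against a reference written in IUPAC-condensed-like notation. It is an illustration under assumptions, not the evaluation code behind the reported results: the `MASS_CLASS` grouping stands in for a real mass calculation, and plain string comparison does not account for equivalent branch orderings.

```python
import re
from collections import Counter

# Assumption: residues in the same isomer class share a monoisotopic mass.
MASS_CLASS = {
    "Gal": "Hex", "Man": "Hex", "Glc": "Hex",
    "GlcNAc": "HexNAc", "GalNAc": "HexNAc",
    "Fuc": "dHex", "Neu5Ac": "NeuAc",
}
LINKAGE = re.compile(r"\([^)]*\)")


def topology(glycan: str) -> str:
    """Drop linkage annotations but keep residue order and branch brackets."""
    return LINKAGE.sub("-", glycan)


def residues(glycan: str) -> list:
    return re.findall(r"[A-Za-z][A-Za-z0-9]*", topology(glycan))


def accuracy_levels(predicted: str, reference: str) -> dict:
    pred, ref = residues(predicted), residues(reference)
    return {
        "mass": Counter(MASS_CLASS.get(r, r) for r in pred)
        == Counter(MASS_CLASS.get(r, r) for r in ref),
        "composition": Counter(pred) == Counter(ref),
        "topology": topology(predicted) == topology(reference),
        "structure": predicted == reference,
    }


reference = "Gal(b1-4)GlcNAc(b1-2)Man"
print(accuracy_levels("Gal(b1-3)GlcNAc(b1-2)Man", reference))  # wrong linkage only
print(accuracy_levels("Man(b1-4)GlcNAc(b1-2)Man", reference))  # isomeric residue swapped
```

In the first call only the linkage-specific structure level fails; in the second the prediction still matches at the mass level but fails from composition onward.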
Benchmark Results
| Model | Mass | Composition | Topology | Structure |
|---|---|---|---|---|
| GlycoBERT | 98.8% | 98.8% | 96.7% | 95.1% |
| GlycoBART (top-1) | 93.2% | 93.1% | 90.4% | 89.1% |
| GlycoBART (top-5) | 95.5% | 95.5% | 93.3% | 93.1% |
| CandyCrunch | 94.1% | 94.0% | 91.2% | 90.3% |
Example
Example_Inference.ipynb 
The Google Colab notebook shows an example inference workflow that can be adapted to specific use cases.
The example file used in the notebook can be found at
Data and Models
Training data are available on Zenodo under .
The full versions of GlycoBERT and GlycoBART are available on Hugging Face.
Funding
This work was supported by funding from NHLBI (HL103411).