A benchmark collection for glycan property prediction with tasks from various glycobiologically relevant domains.
Project description
Glycan property prediction is an increasingly popular area of machine learning research. Supervised learning approaches have shown promise in glycan modeling; however, the current literature is fragmented regarding datasets and standardized evaluation techniques, hampering progress in understanding these complex, branched carbohydrates that play crucial roles in biological processes. To facilitate progress, we introduce GlycoGym, a comprehensive benchmark suite containing six biologically relevant supervised learning tasks spanning different domains of glycobiology: glycosylation linkage identification, tissue expression prediction, taxonomy classification, tandem mass spectrometry fragmentation prediction, lectin-glycan interaction modeling, and structural property estimation. We curate tasks into specific training, validation, and test splits using multi-class stratification to ensure that each task tests biologically relevant generalization that transfers to real-life glycan property prediction scenarios. GlycoGym will help the machine learning community to focus their efforts on scientifically relevant glycan prediction problems.
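To illustrate what a stratified train/validation/test split does, here is a minimal sketch in plain Python. It is not GlycoGym's actual splitting code, which this description does not show. The function name `stratified_split`, the 70/20/10 fractions, and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return (train, val, test) index lists that preserve per-class proportions.

    Hypothetical helper for illustration; not part of the glycogym package.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        # Split each class separately so every subset sees every class.
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Toy, imbalanced label set (made-up data).
labels = ["class_a"] * 70 + ["class_b"] * 30
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Because each class is split on its own, a rare class keeps roughly the same share in all three subsets, which a purely random split does not guarantee.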
Installation
You can install GlycoGym via pip:
pip install glycogym
Usage
The main purpose of this package is to build the benchmark for upload to Zenodo every time the datasets in glycowork or GlyContact are significantly updated.
It can also be used to build local versions of the benchmark between updates of the Zenodo repository.
from glycogym import build_glycosylation, build_taxonomy, build_tissue, build_lgi
df, mapping = build_glycosylation()
df_taxonomy = build_taxonomy("Kingdom")
df_tissue = build_tissue()
df_r, df_cl, df_cg = build_lgi()
Tandem Mass Spectrometry Fragmentation Prediction
One special dataset is the MS fragmentation prediction dataset, which can be built as follows:
from glycogym import build_spectrum
df_ms = build_spectrum(root="path/to/folder/with/pkl/files")
Here, the root argument defines the path to the folder containing the .pkl files that make up the MS fragmentation prediction dataset by CandyCrunch, which can be downloaded from here.
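Before calling build_spectrum, it can help to confirm that the folder really contains .pkl files. The sketch below uses only the standard library. The helper name `find_pkl_files` is hypothetical, and it assumes the files sit directly inside the root folder rather than in subfolders, which the description above does not confirm.

```python
from pathlib import Path


def find_pkl_files(root):
    """Return the sorted .pkl files directly inside `root`, or raise if there are none.

    Hypothetical helper for illustration; not part of the glycogym package.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"{root_path} is not a directory")
    # Assumption: the .pkl files are not nested in subfolders.
    pkl_files = sorted(root_path.glob("*.pkl"))
    if not pkl_files:
        raise FileNotFoundError(f"No .pkl files found in {root_path}")
    return pkl_files
```

Calling `find_pkl_files("path/to/folder/with/pkl/files")` first gives a clear error message if the download is missing or the path is mistyped.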
Structural Property Estimation
The second dataset that requires special handling is the structural property estimation dataset. Currently, it needs to be built with the GlyContact package, which can be installed with the following command:
pip install -e git+https://github.com/lthomes/glycontact.git#egg=glycontact[ml]
Then, the dataset can be built as follows:
from glycontact.learning import create_dataset
train, val, test = create_dataset(splits=[0.7, 0.2, 0.1])
Zenodo
The latest version of the GlycoGym benchmark can be found on Zenodo: https://doi.org/10.5281/zenodo.17313055
Citation
tbd
File details
Details for the file glycogym-1.0.1.tar.gz.
File metadata
- Download URL: glycogym-1.0.1.tar.gz
- Upload date:
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d88fae68e75f41c398156ce71e2e8fca014230cb4280602d94d01b7ed4a75ca1 |
| MD5 | 3d0fc26201c26dbdf04d5d721b64944f |
| BLAKE2b-256 | 7b9cea48c0a1023339e42cddb6279c366993884c49a1c2187ac531a4e8a0f093 |
File details
Details for the file glycogym-1.0.1-py3-none-any.whl.
File metadata
- Download URL: glycogym-1.0.1-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b4a4e040725cfbfdf3f9388b5ea1eee9212b6c513e2a4ed64957fb1aaf153bc |
| MD5 | 300ac64ce2f477d9887c32ee1b77c6e8 |
| BLAKE2b-256 | 24fa29aac16b9463cc3159901462c0798a8715e9da2b2ab10e4ee8f6d4776298 |
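A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path is whatever location you downloaded the file to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("glycogym-1.0.1.tar.gz", "<SHA256 from the table>")` should return True for an intact source distribution.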