
A benchmark collection for glycan property prediction, with tasks drawn from various glycobiologically relevant domains.



Glycan property prediction is an increasingly popular area of machine learning research. Supervised learning approaches have shown promise in glycan modeling; however, the current literature is fragmented regarding datasets and standardized evaluation techniques, hampering progress in understanding these complex, branched carbohydrates that play crucial roles in biological processes. To facilitate progress, we introduce GlycoGym, a comprehensive benchmark suite containing six biologically relevant supervised learning tasks spanning different domains of glycobiology: glycosylation linkage identification, tissue expression prediction, taxonomy classification, tandem mass spectrometry fragmentation prediction, lectin-glycan interaction modeling, and structural property estimation. We curate tasks into specific training, validation, and test splits using multi-class stratification to ensure that each task tests biologically relevant generalization that transfers to real-life glycan property prediction scenarios. GlycoGym will help the machine learning community to focus their efforts on scientifically relevant glycan prediction problems.
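To make the idea of stratified train/validation/test splits concrete, here is a minimal, standard-library sketch. It is not the GlycoGym implementation: the function name `stratified_split`, the 0.7/0.2/0.1 fractions, and the toy labels are assumptions chosen for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return (train, val, test) index lists with similar class proportions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the class into three parts.
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return sorted(train), sorted(val), sorted(test)


# Hypothetical toy labels: 10 samples of one class, 20 of another.
labels = ["N"] * 10 + ["O"] * 20
train, val, test = stratified_split(labels)
```

Because the split is made per class, each subset keeps roughly the same class ratio as the full dataset, which is the property stratification is meant to preserve.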

Installation

You can install GlycoGym via pip:

pip install glycogym

Usage

The main purpose of this package is to build the benchmark for upload to Zenodo every time the datasets in glycowork or GlyContact are significantly updated.

It can also be used to build local versions of the benchmark between updates of the Zenodo repository.

from glycogym import build_glycosylation, build_taxonomy, build_tissue, build_lgi

df, mapping = build_glycosylation()
df_taxonomy = build_taxonomy("Kingdom")
df_tissue = build_tissue()
df_r, df_cl, df_cg = build_lgi()
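For a local build, it can be handy to write the resulting tables to disk. The following is a sketch under the assumption that the build functions return pandas-style DataFrames with a `to_csv` method, which the variable names above suggest but this README does not state. The helper names `save_tables` and `build_local_benchmark`, the output folder, and the file names are hypothetical.

```python
from pathlib import Path


def save_tables(tables, out_dir):
    """Write each named table to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, table in tables.items():
        path = out_dir / f"{name}.csv"
        # Assumption: `table` behaves like a pandas DataFrame.
        table.to_csv(path, index=False)
        paths.append(path)
    return paths


def build_local_benchmark(out_dir="glycogym_local"):
    """Build two of the tasks shown above and store them as CSV files."""
    # Imported lazily so this module loads even without glycogym installed.
    from glycogym import build_taxonomy, build_tissue

    tables = {
        "taxonomy_kingdom": build_taxonomy("Kingdom"),
        "tissue": build_tissue(),
    }
    return save_tables(tables, out_dir)
```

Calling `build_local_benchmark()` would then leave one CSV file per task in the chosen folder, provided the assumption about the return types holds.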

Tandem Mass Spectrometry Fragmentation Prediction

One special dataset is the MS fragmentation prediction dataset, which can be built as follows:

from glycogym import build_spectrum

df_ms = build_spectrum(root="path/to/folder/with/pkl/files")

Here, the root argument defines the path to the folder containing the .pkl files that make up the CandyCrunch MS fragmentation prediction dataset, which can be downloaded from here.
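Before calling build_spectrum, it may help to confirm that the folder actually contains the downloaded files. The check below is a small sketch using only the standard library; the helper name `find_pkl_files` is hypothetical, and it assumes the .pkl files sit directly inside the folder rather than in subfolders.

```python
from pathlib import Path


def find_pkl_files(root):
    """Return the sorted .pkl files directly inside `root`, or raise if none."""
    root = Path(root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    # Assumption: files are not nested in subfolders.
    files = sorted(root.glob("*.pkl"))
    if not files:
        raise FileNotFoundError(f"no .pkl files found in {root}")
    return files
```

Running `find_pkl_files("path/to/folder/with/pkl/files")` first gives a clear error message if the download is missing or the path is wrong.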

Structural Property Estimation

The second dataset that requires special handling is the structural property estimation dataset. Currently, it needs to be built with the GlyContact package, which can be installed with the following command:

pip install -e "git+https://github.com/lthomes/glycontact.git#egg=glycontact[ml]"

Then, the dataset can be built as follows:

from glycontact.learning import create_dataset

train, val, test = create_dataset(splits=[0.7, 0.2, 0.1])

Zenodo

The latest version of the GlycoGym benchmark can be found on Zenodo: https://doi.org/10.5281/zenodo.17313055

Citation

tbd
