MethylGPT_clean

This is the official codebase for methylGPT: a foundation model for the DNA methylome.

Updates:

  • [2025.11.04] Manuscript available on arXiv.
  • [2025.02.10] methylGPT is now available on PyPI.
  • [2024.12.10] Initial release of the methylGPT codebase.

Installation

Architecture note: methylGPT's backend architecture is largely based on scGPT, developed by the Wang Lab. As a result, our project follows similar dependencies and architectural patterns. We acknowledge and thank the scGPT team for their foundational work.

methylGPT works with Python >= 3.9.10 and R >= 3.6.1. Please make sure the correct versions of Python and R are installed before installing methylGPT.
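As a quick sanity check before installing, the Python requirement above can be tested from the interpreter itself. The following is a minimal sketch rather than part of methylGPT: the helper name `python_is_supported` is hypothetical, and the R version is not checked here.

```python
import sys

# Minimum interpreter version stated in this README.
MIN_PYTHON = (3, 9, 10)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, micro) version meets the minimum.

    Hypothetical helper for illustration; not part of the methylGPT package.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {found}: {status}")
```

Running the script with the interpreter you plan to install into reports whether it meets the stated minimum.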

methylGPT is available on PyPI. To install methylGPT, run the following command:

pip install methylgpt "flash-attn<1.0.5"  # flash-attn is optional but recommended
# As of 2023.09, pip install may fail with newer versions of the google orbax package. If you encounter related issues, use the following command instead:
# pip install methylgpt "flash-attn<1.0.5" "orbax<0.1.8"

[Optional] We recommend using wandb for logging and visualization.

pip install wandb

For development, we use the Poetry package manager. To install Poetry, follow the official Poetry installation instructions.

$ git clone this-repo-url
$ cd MethylGPT_clean
$ poetry install

Note: The flash-attn dependency usually requires a specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. As of May 2023, we recommend using CUDA 11.7 and flash-attn<1.0.5 because of various issues reported when installing newer versions of flash-attn.
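To confirm that an existing environment respects the `flash-attn<1.0.5` pin, the installed version can be read from package metadata. This is a rough sketch using only the standard library: the helper names are hypothetical, and the version parsing is a simplification that keeps only leading numeric components rather than implementing full PEP 440 ordering.

```python
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '1.0.4' into (1, 0, 4).

    Parsing stops at the first non-numeric component, so suffixes such as
    'post1' or '+cu117' are ignored. Simplified on purpose.
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_below(version, upper_bound):
    """Return True if `version` is strictly lower than `upper_bound`."""
    a, b = version_tuple(version), version_tuple(upper_bound)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


if __name__ == "__main__":
    found = installed_version("flash-attn")
    if found is None:
        print("flash-attn is not installed (it is optional).")
    elif is_below(found, "1.0.5"):
        print(f"flash-attn {found} satisfies the <1.0.5 pin.")
    else:
        print(f"flash-attn {found} is newer than the recommended pin.")
```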

Running pretraining

The primary pretraining code is implemented in methylgpt.pretraining.py. During training, model checkpoints are automatically saved to the save/ directory at the end of each epoch.

For a detailed walkthrough of the pretraining process, refer to our step-by-step examples in the pretraining tutorials.
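Because a checkpoint is written at the end of every epoch, it is often handy to pick up the most recent one when resuming or evaluating. The sketch below assumes only that checkpoints are ordinary files somewhere under save/. The file naming and layout are not specified here, so the helper `latest_checkpoint` and its `pattern` argument are hypothetical and may need adjusting to the actual output.

```python
from pathlib import Path


def latest_checkpoint(save_dir="save", pattern="*"):
    """Return the most recently modified file under save_dir, or None.

    Assumption: checkpoints are plain files under `save_dir`; pass a
    narrower glob pattern if other files are stored there as well.
    """
    root = Path(save_dir)
    if not root.is_dir():
        return None
    files = [path for path in root.rglob(pattern) if path.is_file()]
    if not files:
        return None
    return max(files, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    newest = latest_checkpoint("save")
    print(newest if newest is not None else "No checkpoint files found.")
```

Selecting by modification time avoids relying on any particular file name scheme, at the cost of being misled if older files are touched later.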

(TODO) Pretrained methylGPT Model Zoo

Below is the list of pretrained models, with links for downloading the checkpoint folders. We recommend the whole-human model as the default for most applications. If your fine-tuning dataset shares a similar cell-type context with the training data of an organ-specific model, that model can usually achieve competitive performance as well. Each checkpoint folder includes a paired vocabulary file mapping gene names to ids. If ENSEMBL ids are needed, see the conversion in gene_info.csv.

| Model name | Description | Download |
| --- | --- | --- |
| whole-human (recommended) | Pretrained on 33 million normal human cells. | link |
| continual pretrained | For zero-shot cell embedding related tasks. | link |
| brain | Pretrained on 13.2 million brain cells. | link |
| blood | Pretrained on 10.3 million blood and bone marrow cells. | link |
| heart | Pretrained on 1.8 million heart cells. | link |
| lung | Pretrained on 2.1 million lung cells. | link |
| kidney | Pretrained on 814 thousand kidney cells. | link |
| pan-cancer | Pretrained on 5.7 million cells of various cancer types. | link |

Fine-tune methylGPT for age prediction

Please see our example code in tutorials/finetuning_age_prediction. By default, the script assumes that the methylGPT checkpoint folder is stored in the examples/save directory.

To-do list

  • Upload the pretrained model checkpoint
  • Publish to PyPI
  • Provide the pretraining code with generative attention masking
  • More tutorial examples for disease prediction
  • Publish to the Hugging Face model hub

Contributing

We welcome contributions to methylGPT. Please submit a pull request if you have any ideas or bug fixes. We also welcome reports of any issues you encounter while using methylGPT.

Acknowledgements

We sincerely thank the authors of the following open-source projects:

Citing methylGPT

@article{ying2024methylgpt,
  title={MethylGPT: a foundation model for the DNA methylome},
  author={Ying, Kejun and Song, Jinyeop and Cui, Haotian and Zhang, Yikun and Li, Siyuan and Chen, Xingyu and Liu, Hanna and Eames, Alec and McCartney, Daniel L and Marioni, Riccardo E and others},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Download files

Source Distribution

methylgpt-0.1.2.tar.gz (12.2 MB)

Uploaded Source

Built Distribution

methylgpt-0.1.2-py3-none-any.whl (12.3 MB)

Uploaded Python 3

File details

Details for the file methylgpt-0.1.2.tar.gz.

File metadata

  • Download URL: methylgpt-0.1.2.tar.gz
  • Upload date:
  • Size: 12.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.1 CPython/3.9.10 Linux/6.1.0-23-cloud-amd64

File hashes

Hashes for methylgpt-0.1.2.tar.gz
Algorithm Hash digest
SHA256 35b7ade708bc2871af66773cb6faae8f7acb025d03c4de6242f18867c9920558
MD5 f7a143df86cfe0b109c880ae04f358d9
BLAKE2b-256 37d40140caacf3075d42497481264b2995eadc0757bdf2cac9b7e83ee1e1340c

File details

Details for the file methylgpt-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: methylgpt-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.1 CPython/3.9.10 Linux/6.1.0-23-cloud-amd64

File hashes

Hashes for methylgpt-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 3a2185963b84495b008d1ec731b3d8a2fb76e6b956f3c2b28391187085b9d76c
MD5 76b227703170598596715900e2578c0c
BLAKE2b-256 cbf0133e7501506c223f88cdb5c727a14c2649f7244c3040dbf98bd36ee4ea0b
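The digests listed above can be used to verify a downloaded file before installing it. The following sketch computes a file's SHA256 with the standard library and compares it against the listed values. The helper names are hypothetical, and lookup is by file name, so a renamed download will not match.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file listings above.
EXPECTED_SHA256 = {
    "methylgpt-0.1.2.tar.gz":
        "35b7ade708bc2871af66773cb6faae8f7acb025d03c4de6242f18867c9920558",
    "methylgpt-0.1.2-py3-none-any.whl":
        "3a2185963b84495b008d1ec731b3d8a2fb76e6b956f3c2b28391187085b9d76c",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_hash(path):
    """Return True only if the file name is listed and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_SHA256:
        if Path(name).is_file():
            verdict = "OK" if matches_listed_hash(name) else "MISMATCH"
            print(f"{name}: {verdict}")
        else:
            print(f"{name}: not found in the current directory")
```

Reading in chunks keeps memory use flat for the roughly 12 MB archives, and the same approach works for larger files.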
