Large-scale generative pretraining of single cells using transformers.

Project description

scGPT

This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.

Preprint   Documentation   License   PyPI version

UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT Model Zoo section below for more details.

[2023.11.07] As requested by many, flash-attention is now an optional dependency. The pretrained weights can be loaded on PyTorch CPU, GPU, and flash-attn backends using the same load_pretrained function, load_pretrained(target_model, torch.load("path_to_ckpt.pt")). An example usage is also here.
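A minimal sketch of this loading path (the import location of load_pretrained, the checkpoint file names, and the build_model helper are assumptions for illustration, not the exact API):

import torch
from scgpt.utils import load_pretrained  # import path is an assumption

# Load the checkpoint onto CPU; no GPU or flash-attn required.
state_dict = torch.load("save/scGPT_human/best_model.pt", map_location="cpu")

# target_model must be an scGPT model instantiated with the same architecture as
# the checkpoint; build_model is a hypothetical helper standing in for that step.
model = build_model("save/scGPT_human/args.json")
load_pretrained(model, state_dict)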

[2023.09.05] We have released a new feature for reference mapping: map query samples to a custom reference dataset or to all of the millions of cells collected from CellXGene! With the help of the faiss library, we achieve great time and memory efficiency. The index of over 33 million cells takes less than 1 GB of memory, and the similarity search takes less than 1 second for 10,000 query cells on GPU. Please see the Reference mapping tutorial for more details.
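The tutorial covers the full workflow; the snippet below is only a rough sketch of the underlying faiss pattern, with the index type, embedding dimension, and data all placeholders rather than what the tutorial necessarily uses:

import faiss
import numpy as np

d = 512  # cell-embedding dimension (placeholder; depends on the model)
ref_emb = np.random.rand(100_000, d).astype("float32")   # reference cell embeddings
query_emb = np.random.rand(10_000, d).astype("float32")  # query cell embeddings

index = faiss.IndexFlatL2(d)  # exact L2 index; very large references may want a compressed index
index.add(ref_emb)
distances, neighbors = index.search(query_emb, 10)  # 10 nearest reference cells per query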

Online apps

scGPT is now also available in the following online apps, so you can get started right in your browser!

Installation

scGPT works with Python >= 3.7.13 and R >= 3.6.1. Please make sure you have the correct versions of Python and R installed before proceeding.

scGPT is available on PyPI. To install scGPT, run the following command:

pip install scgpt "flash-attn<1.0.5"  # optional, recommended
# As of 2023.09, pip install may not work with the newest versions of the google orbax package.
# If you encounter related issues, use the following command instead:
# pip install scgpt "flash-attn<1.0.5" "orbax<0.1.8"

[Optional] We recommend using wandb for logging and visualization.

pip install wandb

For development, we use the Poetry package manager. To install Poetry, follow the instructions here.

$ git clone this-repo-url
$ cd scGPT
$ poetry install

Note: The flash-attn dependency usually requires a specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. As of May 2023, we recommend CUDA 11.7 and flash-attn<1.0.5, due to various issues reported with installing newer versions of flash-attn.

Pretrained scGPT Model Zoo

Here is the list of pretrained models. Please find the download links for the checkpoint folders below. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares a similar cell-type context with the training data of one of the organ-specific models, that model can usually deliver competitive performance as well. A paired vocabulary file mapping gene names to ids is provided in each checkpoint folder. If ENSEMBL ids are needed, please find the conversion in gene_info.csv.

Model name                   Description                                                Download
whole-human (recommended)    Pretrained on 33 million normal human cells.               link
brain                        Pretrained on 13.2 million brain cells.                    link
blood                        Pretrained on 10.3 million blood and bone marrow cells.    link
heart                        Pretrained on 1.8 million heart cells.                     link
lung                         Pretrained on 2.1 million lung cells.                      link
kidney                       Pretrained on 814 thousand kidney cells.                   link
pan-cancer                   Pretrained on 5.7 million cells of various cancer types.   link
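
As noted above, each checkpoint folder ships a paired vocabulary file mapping gene names to ids. A minimal sketch of reading it, assuming the file is named vocab.json and maps gene symbols to integer ids:

import json

with open("save/scGPT_human/vocab.json") as f:  # file name and path are assumptions
    vocab = json.load(f)

print(len(vocab))           # vocabulary size
print(vocab.get("MALAT1"))  # token id for one gene symbol, if present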

Fine-tune scGPT for scRNA-seq integration

Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder is stored in the examples/save directory.

To-do-list

  • Upload the pretrained model checkpoint
  • Publish to PyPI
  • Provide the pretraining code with generative attention masking
  • Fine-tuning examples for multi-omics integration, cell type annotation, perturbation prediction, and cell generation
  • Example code for gene regulatory network analysis
  • Documentation website with Read the Docs
  • Bump up to PyTorch 2.0
  • New pretraining on larger datasets
  • Reference mapping example
  • Publish to the Hugging Face model hub

Contributing

We warmly welcome contributions to scGPT. Please submit a pull request if you have ideas or bug fixes. We also welcome reports of any issues you encounter while using scGPT.

Acknowledgements

We sincerely thank the authors of the following open-source projects:

Citing scGPT

@article{cui2023scGPT,
  title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
  author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scGPT-0.2.1.tar.gz (809.6 kB)

Built Distribution

scgpt-0.2.1-py3-none-any.whl (829.2 kB)

File details

Details for the file scGPT-0.2.1.tar.gz.

File metadata

  • Download URL: scGPT-0.2.1.tar.gz
  • Upload date:
  • Size: 809.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.0 CPython/3.7.12 Linux/5.4.0-146-generic

File hashes

Hashes for scGPT-0.2.1.tar.gz

Algorithm    Hash digest
SHA256       bb6571b87dc7b379356351a102d24afb29165bb8967e3d8d0dcdfccd349abc6d
MD5          116b707b37d0b1209907339ac97aa569
BLAKE2b-256  b1c79ef17fbdbfa8881215906c954eef0d42518bda286eb3ee664054f78dc1dc

See more details on using hashes here.
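
For a quick local check, you can compare a downloaded file against the published SHA256 digest above, for example (a minimal sketch assuming the sdist was saved to the current directory):

import hashlib

EXPECTED = "bb6571b87dc7b379356351a102d24afb29165bb8967e3d8d0dcdfccd349abc6d"
with open("scGPT-0.2.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == EXPECTED else "hash mismatch!")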

File details

Details for the file scgpt-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: scgpt-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 829.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.0 CPython/3.7.12 Linux/5.4.0-146-generic

File hashes

Hashes for scgpt-0.2.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       a0b819c60c39bc96b35c529f527afaf538b4c7eb350fe7593bd09760c6d5cc57
MD5          2fc9505642371a31385553ca7067e3be
BLAKE2b-256  5bd00f9593f5cb5318b5b04ec126a7da8fbc80709e10d9b46f7f589ae837461c

See more details on using hashes here.
