
Foundation model for single-cell epigenomic data.

EpiAgent

Large-scale foundation models have recently opened new avenues for artificial general intelligence. This research paradigm has shown considerable promise in the analysis of single-cell sequencing data, although efforts to date have centered on the transcriptome. In contrast to gene expression, chromatin accessibility provides more decisive insights into cell states, shaping the chromatin regulatory landscapes that control transcription in distinct cell types. Yet challenges persist due to the abundance of features, high data sparsity, and the quasi-binary nature of these data. Here, we introduce EpiAgent, the first foundation model for single-cell epigenomic data, pretrained on a large-scale Human-scATAC-Corpus comprising approximately 5 million cells and 35 billion tokens. EpiAgent encodes chromatin accessibility patterns of cells as concise “cell sentences” and employs bidirectional attention to capture cellular heterogeneity behind regulatory networks. With comprehensive benchmarks, we demonstrate that EpiAgent excels in typical downstream tasks, including unsupervised feature extraction, supervised cell annotation, and data imputation. By incorporating external embeddings, EpiAgent facilitates the prediction of cellular responses to both out-of-sample stimulated and unseen genetic perturbations, as well as reference data integration and query data mapping. By simulating the knockout of key cis-regulatory elements, EpiAgent enables in-silico treatment for cancer analysis. We further extend the zero-shot capabilities of EpiAgent, allowing direct cell type annotation on newly sequenced datasets without additional training.


Updates / News

  • 2024.12.21: Our paper was published on bioRxiv. Read the preprint here.
  • 2024.12.27: Source code and Python package released on PyPI under the name epiagent (v0.0.1). Install it via pip install epiagent.
  • 2024.12.28: Updated GitHub repository with pretrained EpiAgent model and two supervised models for cell type annotation: EpiAgent-B and EpiAgent-NT. Models and example datasets can be downloaded from Google Drive. Additionally, we added usage demos for zero-shot applications (link).
  • 2025.02.12: Updated the epiagent PyPI package to version 0.0.2, adding fine-tuning code for unsupervised feature extraction and supervised cell type annotation. We also provided demos of the fine-tuning code, available here.
  • 2025.03.03: Updated the epiagent PyPI package to version 0.0.3. This release includes new fine-tuning code for: a) data imputation, b) reference data integration and query data mapping, and c) cellular response prediction of out-of-sample stimulated perturbation. In addition, several bugs in the previous version have been fixed. Demo notebooks for fine-tuning EpiAgent for data imputation and for reference data integration and query data mapping are available here.

Installation

Environment Setup

EpiAgent is built on the PyTorch 2.0 framework with FlashAttention v2. We recommend using CUDA 11.7 for optimal performance.

Step 1: Set up a Python environment

We recommend creating a virtual Python environment with Anaconda:

$ conda create -n EpiAgent python=3.11
$ conda activate EpiAgent

Step 2: Install PyTorch

Install PyTorch based on your system configuration. Refer to PyTorch installation instructions for the exact command. For example:

$ pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 # torch 2.0.1 + cuda 11.7

Step 3: Install FlashAttention (if not already installed)

Install flash-attn by following the instructions below (adapted from the FlashAttention GitHub repository):

  1. FlashAttention uses ninja to compile its C++/CUDA components efficiently. Check if ninja is already installed and working correctly:
$ ninja --version
$ echo $?

If the above commands return a nonzero exit code or you encounter errors, reinstall ninja to ensure it works properly:

$ pip uninstall -y ninja && pip install ninja
  2. Install FlashAttention:

After ensuring ninja is installed, proceed with the FlashAttention installation. Use the following command to install a compatible version:

$ pip install flash-attn==2.5.8 --no-build-isolation

Step 4: Install EpiAgent and dependencies

To install EpiAgent, run:

$ pip install epiagent
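
After the four steps above, a short script can help confirm that the environment matches the versions named in this section. The following is a minimal sketch using only the Python standard library; the helper names (installed_version, ninja_works, check_environment) are hypothetical and not part of the epiagent package, and the pinned versions are simply the ones from the example commands above.

```python
import shutil
import subprocess
from importlib import metadata

# Versions taken from the example commands above; None means "any version".
EXPECTED = {"torch": "2.0.1", "flash-attn": "2.5.8", "epiagent": None}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def ninja_works():
    """Mirror `ninja --version; echo $?`: True only if ninja exits with code 0."""
    exe = shutil.which("ninja")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe, "--version"], capture_output=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


def check_environment():
    """Collect the found version and a pass/fail flag for each requirement."""
    report = {}
    for name, wanted in EXPECTED.items():
        found = installed_version(name)
        # Ignore local build suffixes such as "+cu117" when comparing versions.
        ok = found is not None and (wanted is None or found.split("+")[0] == wanted)
        report[name] = {"found": found, "ok": ok}
    report["ninja"] = {"found": shutil.which("ninja"), "ok": ninja_works()}
    return report


if __name__ == "__main__":
    for name, entry in check_environment().items():
        status = "OK" if entry["ok"] else "CHECK"
        print(f"{status:5s} {name}: {entry['found']}")
```

A CHECK line only signals a difference from the example versions; other PyTorch and FlashAttention combinations may also work, depending on your CUDA setup.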

Data Preprocessing

EpiAgent uses a unified set of candidate cis-regulatory elements (cCREs) as features. We recommend starting from fragment files to process input data compatible with EpiAgent. The preprocessing steps include:

  1. Reference Genome Conversion (Optional):

    • Our cCRE coordinates are based on hg38. If your fragment files use hg19, use liftOver to convert them to hg38.
  2. Fragment Overlap Calculation:

    • Use bedtools to calculate overlaps between fragments and cCREs.
  3. Cell-by-cCRE Matrix Construction:

    • Use epiagent.preprocessing.construct_cell_by_ccre_matrix to create the cell-by-cCRE matrix and add metadata.
  4. TF-IDF and Tokenization:

    • Perform global TF-IDF to assign importance to accessible cCREs, followed by tokenization to generate cell sentences.
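
The logic of steps 2 to 4 can be illustrated on toy data. The sketch below is pure Python and does not call bedtools or the epiagent package; every coordinate, barcode, and function name is hypothetical. It assumes a binary accessibility matrix, a simple smoothed TF-IDF computed from the toy cells themselves (a global TF-IDF would instead draw cCRE frequencies from a reference corpus), and cell sentences formed by ordering accessible cCREs by descending score. The package's actual formulas and ordering may differ, so see the demo notebook for the real workflow.

```python
import math
from collections import defaultdict

# Toy cCREs as (chrom, start, end); hypothetical hg38-style coordinates.
CCRES = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]

# Toy fragments as (chrom, start, end, cell barcode); hypothetical values.
FRAGMENTS = [
    ("chr1", 150, 250, "cellA"),
    ("chr1", 520, 580, "cellA"),
    ("chr1", 90, 110, "cellB"),
    ("chr2", 60, 70, "cellB"),
    ("chr2", 100, 400, "cellB"),
    ("chr1", 300, 400, "cellB"),  # overlaps no cCRE
]


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval overlap, the convention used by BED files."""
    return a_start < b_end and b_start < a_end


def accessible_ccres(fragments, ccres):
    """Map each cell to the set of cCRE indices hit by at least one fragment."""
    cells = defaultdict(set)
    for chrom, start, end, barcode in fragments:
        for idx, (c_chrom, c_start, c_end) in enumerate(ccres):
            if chrom == c_chrom and overlaps(start, end, c_start, c_end):
                cells[barcode].add(idx)
    return dict(cells)


def tfidf_scores(cells, n_ccres):
    """Score each accessible cCRE: within-cell frequency times smoothed IDF."""
    n_cells = len(cells)
    doc_freq = [0] * n_ccres
    for accessible in cells.values():
        for idx in accessible:
            doc_freq[idx] += 1
    scores = {}
    for barcode, accessible in cells.items():
        tf = 1.0 / len(accessible)  # binary data: each accessible cCRE counts once
        scores[barcode] = {
            idx: tf * math.log(1 + n_cells / doc_freq[idx]) for idx in accessible
        }
    return scores


def cell_sentence(cell_scores):
    """Order cCRE indices by descending score, breaking ties by index."""
    return sorted(cell_scores, key=lambda idx: (-cell_scores[idx], idx))


if __name__ == "__main__":
    cells = accessible_ccres(FRAGMENTS, CCRES)
    scores = tfidf_scores(cells, len(CCRES))
    for barcode in sorted(scores):
        print(barcode, cell_sentence(scores[barcode]))
```

In this toy example the cCRE shared by both cells (index 0) receives the lowest score, so it is placed last in each cell sentence, while the cell-specific cCREs come first.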

For a detailed example, refer to the demo notebook: Data Preprocessing.ipynb.


Downstream Analysis

Zero-shot unsupervised feature extraction with the pretrained EpiAgent model

Fine-tuning EpiAgent for unsupervised feature extraction

Fine-tuning EpiAgent for supervised cell type annotation

Data Imputation

Reference Data Integration and Query Data Mapping

Zero-shot cell type annotation with EpiAgent-B and EpiAgent-NT

Two supervised models, EpiAgent-B and EpiAgent-NT, are designed for direct cell type annotation. These models and their example datasets can be downloaded here. For specific demos:

Other tasks

  • Prediction of Cellular Responses to Stimulations and Genetic Perturbations
  • In-silico Treatment Simulations

Fine-tuning code and additional demos for these tasks will be added soon.


Citation

If you use EpiAgent in your research, please cite our paper:

Chen X, Li K, Cui X, Wang Z, Jiang Q, Lin J, Li Z, Gao Z, Jiang R. EpiAgent: Foundation model for single-cell epigenomic data. bioRxiv. 2024:2024-12.


Contact

For questions about the paper or code, please email: xychen20@mails.tsinghua.edu.cn
