
TECSAS

Transformer of Epigenetics to Chromatin Structural AnnotationS

Documentation | Tutorials | Installation

Overview

TECSAS (Transformer of Epigenetics to Chromatin Structural AnnotationS) is a deep learning model based on the Transformer architecture designed to predict chromatin subcompartment annotations directly from epigenomic data. TECSAS leverages information from histone modifications, transcription factor binding profiles, and RNA-seq data to decode the relationship between the biochemical composition of chromatin and its 3D structural behavior.

Chromatin within the nucleus adopts complex three-dimensional structures that are crucial for gene regulation and cellular function. Recent studies have revealed the presence of distinct chromatin subcompartments beyond the traditional A/B compartments (eu- and hetero-chromatin), each exhibiting unique structural and functional properties. TECSAS achieves high accuracy in predicting subcompartment annotations and reveals the influence of long-range epigenomic context on chromatin organization.

TECSAS Overview

The framework enables:

  • Chromatin subcompartment prediction: Classification of genomic regions into subcompartments (A1, A2, B1, B2, B3) at 25-50kb resolution
  • Nuclear body association prediction: Identification of lamina-associated domains (LADs), nucleolus-associated domains (NADs), and nuclear speckle-associated domains (SPADs)
  • Transfer learning: Pre-trained encoder on reference cell lines (e.g., GM12878) can be fine-tuned for target cell lines
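As a minimal illustration of how the five output states can be translated into subcompartment names, the sketch below maps argmax class indices to labels. The index order shown here is an assumption for illustration, not taken from the package; the actual ordering is fixed by the training labels, so verify it against the reference data in TECSAS/share/.

```python
# Hypothetical index-to-label mapping for the five output states
# (ostates=5). Confirm the ordering against the package's reference
# annotations before using it in analysis.
SUBCOMPARTMENTS = ["A1", "A2", "B1", "B2", "B3"]

def label_of(class_index: int) -> str:
    """Translate an argmax class index into a subcompartment name."""
    return SUBCOMPARTMENTS[class_index]
```

A tensor of predictions from `argmax(dim=-1)` can then be decoded element-wise with `label_of`.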

TECSAS processes epigenomic signal tracks at specified genomic resolution (default 50kb bins), normalizes signals using z-score standardization, and uses sliding window context (default ±14 neighboring bins) to capture spatial dependencies. Unlike methods that rely on Hi-C contact maps, TECSAS predicts 3D genome organization directly from the epigenome, enabling analysis across diverse cell types without requiring proximity ligation experiments.
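The preprocessing described above can be sketched as follows. This is a simplified stand-in, assuming signal tracks are stored as a bins × experiments array; the function names and shapes here are illustrative and are not the package's actual API.

```python
import numpy as np

def zscore(signals: np.ndarray) -> np.ndarray:
    """Standardize each experiment's track to zero mean, unit variance."""
    mean = signals.mean(axis=0, keepdims=True)
    std = signals.std(axis=0, keepdims=True)
    # Guard against constant tracks to avoid division by zero
    return (signals - mean) / np.where(std == 0, 1.0, std)

def window_features(signals: np.ndarray, n_neighbors: int = 14) -> np.ndarray:
    """Concatenate each bin with its +/- n_neighbors context bins.

    signals: (n_bins, n_exp) -> (n_windows, n_exp * (2 * n_neighbors + 1)),
    where only bins with a full context window are kept.
    """
    n_bins, _ = signals.shape
    w = 2 * n_neighbors + 1
    rows = [signals[i:i + w].ravel() for i in range(n_bins - w + 1)]
    return np.asarray(rows)

# Example: 100 bins x 155 experiments -> 72 windows x 4495 features each
X = window_features(zscore(np.random.rand(100, 155)), n_neighbors=14)
```

With 155 experiments and ±14 neighbors this yields 155 × 29 = 4495 features per window, matching the `nfeatures` used in the Quick Start below.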

Usage

For complete examples, see the Tutorials directory.

Resources

  • Tutorials: Step-by-step notebooks in the Tutorials/ directory
    • Load_model_GM12878_155exp_50kbp.ipynb: Load and use pre-trained GM12878 subcompartment model
    • Test_GM12878_155exp_50kbp.ipynb: Evaluate the pre-trained GM12878 model with per-class accuracy and confusion matrices
    • Load_model_K562_124_exp_25kbp.ipynb: Load K562 model at 25kb resolution
    • train_and_predict_HistMod_example.ipynb: Training workflow using histone modifications
    • train_and_predict_XADS_HistMod_RNASeq.ipynb: Complete workflow for nuclear body association (LADs/NADs/SPADs) prediction using transfer learning
  • Pre-trained models: Model weights in TECSAS/share/models/
    • bv_GM12878_155.pt: GM12878 model trained with 155 experiments at 50kbp resolution (75.8% overall accuracy)
  • Reference data: Subcompartment annotations and nuclear body association labels (LADs, NADs, SPADs) in TECSAS/share/

Installation

Requirements

TECSAS requires Python 3.6+ and the following dependencies:

  • PyTorch (>=1.7.0)
  • NumPy (>=1.18)
  • pyBigWig
  • requests
  • joblib
  • tqdm
  • urllib3

Install from PyPI

pip install TECSAS

Install from source

Clone the repository and install:

git clone https://github.com/ed29rice/TECSAS.git
cd TECSAS
pip install -e .

Install dependencies

pip install torch numpy pyBigWig requests joblib tqdm urllib3

Note: For GPU acceleration, ensure you have a CUDA-compatible PyTorch build installed.

Quick Start

Option A: Use pre-trained weights

Pre-trained model weights for GM12878 (155 experiments, 50kbp resolution) are included in TECSAS/share/models/. You can load and use them directly without retraining:

import torch
from TECSAS import TECSAS

# Model configuration matching the pre-trained weights
n_neighbors = 14   # Neighboring bins on each side (context window)
n_predict = 3      # Number of loci to predict
NEXP = 155         # Number of experiments in GM12878
nfeatures = NEXP * (2 * n_neighbors + 1)  # 155 * 29 = 4495

model = TECSAS(n_predict, emsize=128, nhead=8, d_hid=64, nlayers=2,
               nfeatures=nfeatures, ostates=5, dropout=0.01)

# Load pre-trained weights (keys have a 'module.' prefix from DataParallel)
state = torch.load('TECSAS/share/models/bv_GM12878_155.pt', map_location='cpu')
model.load_state_dict({'.'.join(k.split('.')[1:]): v for k, v in state.items()})
model.eval()

See Tutorials/Load_model_GM12878_155exp_50kbp.ipynb for a complete evaluation example.

Option B: Train from scratch

If you want to retrain the model on your own data or a different cell line:

  1. Import TECSAS:

    from TECSAS import data_process, TECSAS
    
  2. Download and process epigenomic data from ENCODE:

    dp = data_process(cell_line='GM12878', assembly='hg19', histones=True, tf=True)
    dp.download_and_process_cell_line_data(nproc=10)
    dp.download_and_process_ref_data(nproc=10)
    
  3. Generate training data:

    train, val, test, averages, indices = dp.training_data(n_neigbors=14, train_per=0.8)
    
  4. Initialize and train the model:

    model = TECSAS(n_predict=3, emsize=128, nhead=8, d_hid=64, nlayers=2,
                   nfeatures=NEXP*(2*14+1), ostates=5, dropout=0.01)
    # ... training loop (see Tutorials/train_and_predict_HistMod_example.ipynb)
    
  5. Make predictions on a target cell line:

    test_data = dp.test_set(chr=1)
    predictions = model(test_data, None)[0].argmax(dim=-1)
    

See the Tutorials/ directory for complete training and prediction workflows.
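For orientation, the training loop in step 4 might look like the following sketch. It uses a plain linear classifier as a stand-in for the TECSAS transformer, with random data in place of the processed windows; batching, validation, and the real TECSAS forward signature are omitted here (see Tutorials/train_and_predict_HistMod_example.ipynb for the actual workflow).

```python
import torch
import torch.nn as nn

# Dimensions matching the GM12878 example: 155 experiments, +/-14 neighbors
nfeatures, ostates = 155 * (2 * 14 + 1), 5

# Placeholder model; in practice this is the TECSAS transformer
model = nn.Linear(nfeatures, ostates)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy batch: 32 windows of z-scored signals with subcompartment labels
x = torch.randn(32, nfeatures)
y = torch.randint(0, ostates, (32,))

model.train()
for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)  # cross-entropy over the 5 classes
    loss.backward()
    optimizer.step()
```

The same loop structure carries over when the placeholder is replaced by the TECSAS model and the dummy tensors by the `train`/`val` splits from `dp.training_data(...)`.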

Citation

If you use TECSAS in your research, please cite:

Dodero-Rojas, E., Mendieta, A., Fehlis, Y., Mayala, N., Contessoto, V. G., & Onuchic, J. N. (2025). Epigenetics is all you need: A transformer to decode chromatin structural compartments from the epigenome. PLOS Computational Biology, 21(12), e1012326. https://doi.org/10.1371/journal.pcbi.1012326

License

TECSAS is released under the MIT License. See LICENSE for details.

Acknowledgments

This research was supported by the Center for Theoretical Biological Physics, sponsored by the NSF (Grants PHY-2019745 and PHY-2210291) and by the Welch Foundation (Grant C-1792). We thank AMD (Advanced Micro Devices, Inc.) for the donation of critical hardware and support resources from its HPC Fund that made this work possible.

Contact

For questions, issues, or collaborations, please open an issue on GitHub or contact the developers.


Copyright (c) 2020-2025 The Center for Theoretical Biological Physics (CTBP) - Rice University
