Hi-Compass: Depth-aware deep learning framework for cell-type-specific chromatin interaction prediction from ATAC-seq
Hi-Compass is a depth-aware multi-modal deep learning framework for predicting cell-type-specific chromatin interactions from ATAC-seq data.
Table of Contents
- Overview
- Key Features
- Installation
- Training Data Preparation
- Preprocessing
- Training
- Prediction
- Advanced Tutorial
- Acknowledgments
- License
- Contact
Overview
Three-dimensional genome organization controls cell-type-specific gene expression through chromatin interactions. Hi-Compass addresses the challenge of predicting 3D genome structure from chromatin accessibility data by integrating four key inputs:
- ATAC-seq signal
- ATAC-seq sequencing depth
- DNA sequence (fixed input, provided for hg38 and mm10)
- Generalized CTCF binding profile (fixed input, provided for hg38 and mm10)
The model incorporates a depth-aware module that dynamically accommodates sequencing depth variations, enabling robust predictions across the full spectrum of data resolution—from sparse single-cell to high-coverage bulk profiles. Hi-Compass predicts Hi-C contact matrices at 10 kb resolution with 2 Mb window size.
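The stated window and resolution pin down the output shape: each predicted window spans 2 Mb at 10 kb bins, which works out to a 200 × 200 contact map per window (the shape is inferred from these two numbers, not stated explicitly by the package):

```python
window_bp = 2_000_000   # prediction window size
resolution_bp = 10_000  # Hi-C bin size

bins_per_window = window_bp // resolution_bp
print(bins_per_window)  # 200 -> each window is a 200 x 200 contact map
```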
Key Features
- Cell-type-specific Hi-C prediction requiring only ATAC-seq as user-provided input; all other reference data (DNA sequence, generalized CTCF, etc.) are provided or reusable across samples
- Complete preprocessing pipeline and training functionality, allowing users to train models on their own datasets
- Support for human (hg38), mouse (mm10), and other species (custom species require user-provided generalized CTCF data)
- Direct output of balanced .cool files compatible with downstream analysis tools such as cooltools, HiGlass, and Juicebox

Installation
Step 1: Install PyTorch
Hi-Compass requires PyTorch but does not install it automatically, as the correct version depends on your system and CUDA configuration.
Please install PyTorch first following the instructions at pytorch.org.
Step 2: Install Hi-Compass
pip install hicompass
Typical installation time: ~5 minutes on a standard desktop computer.
Step 3: Install training dependencies (for training only)
If you plan to train your own models, please install PyTorch Lightning in a version that matches your installed PyTorch.
Step 4: Install preprocessing tools (for preprocessing only)
The preprocessing commands require the following external tools:
# Using conda
conda install -c bioconda samtools bedtools ucsc-bedgraphtobigwig
Training Data Preparation
Required Training Data
Hi-Compass requires the following input data:
| Data Type | User Provided | Pre-built Available | Description |
|---|---|---|---|
| ATAC-seq | Yes | - | Cell-type-specific BAM file (for training) or BigWig file (for prediction) |
| Hi-C | Yes (training only) | - | 10 kb resolution .cool file |
| DNA sequence | No | hg38, mm10 | One-hot encoded sequences per chromosome |
| Generalized CTCF | No | hg38, mm10 | Pan-tissue CTCF binding profile |
| Chromosome sizes | No | hg38, mm10 | Chromosome size file |
| Centromere regions | No | hg38, mm10 | BED file for filtering (optional) |
Download Pre-built Reference Data
Pre-built reference data for human (hg38) are available for download:
Required for Prediction
| File | Description | Download |
|---|---|---|
| DNA sequences | One-hot encoded DNA sequences per chromosome | DNA.zip |
| Generalized CTCF | Pan-tissue CTCF binding profile | CTCF.zip |
| Model weights | Pre-trained Hi-Compass model | model_weights.zip |
| Centromere regions | BED file for filtering (optional) | centromere.zip |
Required for Training Only
| File | Description | Download |
|---|---|---|
| ATAC-seq BAM | Example bulk ATAC-seq data (GM12878, IMR90) | IMR90_GM12878_ATAC_bam.zip |
| Hi-C cool | Example Hi-C matrices (GM12878, IMR90) | IMR90_GM12878_HiC_cool.zip |
Download and organize the files according to the directory structure below.
Directory Structure
Hi-Compass expects data organized in the following structure for training (take hg38 as example):
/home/user/hicompass_data/
├── ATAC/
│ └── hg38/
│ ├── GM12878~ATAC~bulk.bw
│ ├── GM12878~ATAC~1e6.bw
│ ├── GM12878~ATAC~5e5.bw
│ ├── K562~ATAC~bulk.bw
│ ├── K562~ATAC~1e6.bw
│ └── K562~ATAC~5e5.bw
├── HiC/
│ └── hg38/
│ ├── GM12878/
│ │ ├── chr1.npz
│ │ ├── chr2.npz
│ │ └── ...
│ └── K562/
│ ├── chr1.npz
│ ├── chr2.npz
│ └── ...
├── DNA/
│ └── hg38/
│ ├── chr1.fa.gz
│ ├── chr2.fa.gz
│ └── ...
├── CTCF/
│ └── hg38/
│ └── generalized_CTCF.bw
└── centromere/
└── hg38/
└── centromere.bed
ATAC-seq BigWig files follow the naming convention:
{CellType}~ATAC~{Depth}.bw
Examples:
- GM12878~ATAC~bulk.bw - Bulk ATAC-seq (full depth)
- GM12878~ATAC~1e6.bw - Subsampled to 1,000,000 reads
- GM12878~ATAC~5e5.bw - Subsampled to 500,000 reads
This naming convention is automatically generated by the preprocessing pipeline and used by the training module to extract cell type and depth information.
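Because the filename encodes both cell type and depth, the two fields can be recovered by splitting on `~`. A minimal illustrative parser (this helper is ours, not part of the hicompass API):

```python
from pathlib import Path

def parse_atac_filename(path):
    """Split a '{CellType}~ATAC~{Depth}.bw' filename into (cell_type, depth)."""
    cell_type, assay, depth = Path(path).stem.split("~")
    if assay != "ATAC":
        raise ValueError(f"unexpected assay field: {assay}")
    return cell_type, depth

print(parse_atac_filename("data/ATAC/hg38/GM12878~ATAC~1e6.bw"))
# → ('GM12878', '1e6')
```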
Preprocessing
Hi-Compass provides preprocessing commands to prepare your data for training.
ATAC-seq Preprocessing
Convert bulk ATAC-seq BAM files to multi-depth BigWig files through stratified subsampling:
hicompass preprocess-atac \
--input GM12878_bulk_ATAC.bam \
--cell-type GM12878 \
--output data/ATAC/hg38 \
--chrom-sizes hg38.chrom.sizes
This command generates BigWig files at multiple sequencing depths, enabling the model to learn depth-aware representations.
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input BAM/SAM file |
| --cell-type, -c | Yes | - | Cell type name (used in output filenames) |
| --output, -o | Yes | - | Output directory |
| --chrom-sizes, -s | Yes | - | Chromosome sizes file |
| --depths, -d | No | - | Custom depths (comma-separated or @file.txt) |
| --min-depth | No | 2e5 | Minimum depth for range mode |
| --max-depth | No | 2e7 | Maximum depth for range mode |
| --step | No | 2e4 | Step size for range mode |
| --no-bulk | No | False | Skip bulk BigWig generation |
| --seed | No | 42 | Random seed for reproducibility |
Output
The command generates BigWig files following the naming convention {CellType}~ATAC~{Depth}.bw:
data/ATAC/hg38/
├── GM12878~ATAC~bulk.bw
├── GM12878~ATAC~2e5.bw
├── GM12878~ATAC~2.2e5.bw
├── GM12878~ATAC~2.4e5.bw
└── ...
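The depth labels are plain scientific notation, so converting a label back to a read count is a float parse. A small illustrative helper (hicompass handles this internally; the function name here is ours):

```python
def depth_label_to_reads(label):
    """Convert a depth label such as '2.2e5' to an integer read count.
    'bulk' means full depth and has no fixed count."""
    if label == "bulk":
        return None
    return int(float(label))

for label in ["bulk", "2e5", "2.2e5", "1e6"]:
    print(label, "->", depth_label_to_reads(label))
```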
Hi-C Preprocessing
Hi-C preprocessing consists of two steps:
Step 1: Contrast Stretching Normalization
Apply contrast stretching to enhance Hi-C matrix features:
hicompass preprocess-hic-norm \
--input GM12878_raw.cool \
--output GM12878_normalized.cool \
--genome hg38
| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input cool file |
| --output, -o | Yes | - | Output cool file |
| --genome, -g | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --chrom-sizes, -s | No | - | Required when --genome=custom |
| --resolution, -r | No | 10000 | Resolution in bp (10 kb recommended) |
| --percentile-max | No | 98.0 | Upper percentile for contrast stretching |
Note: Only 10 kb resolution is currently supported. A warning will be issued for other resolutions.
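The exact transform lives inside hicompass; conceptually, percentile-based contrast stretching clips extreme counts at the upper percentile and rescales the remainder. A minimal numpy sketch under that assumption (not the package's actual code):

```python
import numpy as np

def contrast_stretch(matrix, percentile_max=98.0):
    """Clip values above the given percentile of nonzero entries
    and rescale the matrix to [0, 1]."""
    m = matrix.astype(float)
    nonzero = m[m > 0]
    if nonzero.size == 0:
        return m
    upper = np.percentile(nonzero, percentile_max)
    if upper <= 0:
        return m
    return np.clip(m, 0, upper) / upper

mat = np.array([[0.0, 1.0, 50.0], [1.0, 200.0, 2.0], [50.0, 2.0, 0.0]])
stretched = contrast_stretch(mat)
print(stretched.max())  # 1.0 after rescaling
```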
Step 2: Convert to NPZ Format
Convert the normalized cool file to NPZ format for training:
hicompass preprocess-hic-to-npz \
--input GM12878_normalized.cool \
--output data/HiC/hg38/GM12878
| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input cool file (from step 1) |
| --output, -o | Yes | - | Output directory for NPZ files |
| --resolution, -r | No | 10000 | Resolution in bp |
| --window, -w | No | 256 | Number of diagonals to extract |
Output
data/HiC/hg38/GM12878/
├── chr1.npz
├── chr2.npz
├── ...
└── chr22.npz
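Since only a fixed number of diagonals is extracted (--window), each NPZ most likely stores a diagonal band of the contact matrix rather than the full square. The exact layout is not documented here; the following is an illustrative sketch of one diagonal-band representation:

```python
import numpy as np

def extract_diagonal_band(matrix, window=256):
    """Stack the first `window` diagonals of a square contact matrix
    into an (n, window) array; out-of-range entries are zero-padded."""
    n = matrix.shape[0]
    band = np.zeros((n, window), dtype=matrix.dtype)
    for d in range(window):
        diag = np.diagonal(matrix, offset=d)  # length n - d
        band[: n - d, d] = diag
    return band

mat = np.arange(16, dtype=float).reshape(4, 4)
band = extract_diagonal_band(mat, window=3)
print(band)
```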
DNA Sequence Preparation
DNA sequences should be provided as gzipped FASTA files per chromosome.
If you have a whole-genome FASTA file, split it using:
# Index the genome
samtools faidx hg38.fa
# Extract each chromosome
for chr in chr{1..22}; do
samtools faidx hg38.fa $chr | gzip > DNA/hg38/${chr}.fa.gz
done
Expected structure:
DNA/hg38/
├── chr1.fa.gz
├── chr2.fa.gz
├── ...
└── chr22.fa.gz
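The pre-built DNA files are one-hot encoded per chromosome. If you need to build your own for another assembly, the encoding itself is simple; the sketch below assumes A/C/G/T channel order (the channel order hicompass actually expects is an assumption here), with ambiguous bases as all-zero rows:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    """One-hot encode a DNA string into an (L, 4) float array;
    N and other ambiguous bases become all-zero rows."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            arr[i, j] = 1.0
    return arr

print(one_hot_encode("ACGTN"))
```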
Generating Generalized CTCF Profile for Custom Species
For species other than human (hg38) and mouse (mm10), you need to generate your own generalized CTCF binding profile. This profile represents the pan-tissue CTCF binding probability across multiple samples.
Data Collection
- Collect CTCF ChIP-seq peak files (BED format) from multiple samples/tissues of your species
  - Recommended: 50+ samples for robust generalization
  - Data sources: Cistrome DB, ENCODE, GEO, or your own experiments
- Ensure all BED files use the same genome assembly
Processing Steps
# Step 1: List all CTCF peak BED files
ls /path/to/ctcf_peaks/*.bed > peak_files.txt
# Step 2: Merge all peaks using bedtools multiinter
# This creates a file where each position shows how many samples have CTCF binding
bedtools multiinter -i $(cat peak_files.txt | tr '\n' ' ') > ctcf_merged.bed
# Step 3: Convert to bedGraph format with normalized scores
# The 4th column of multiinter output is the count of overlapping samples
# Normalize by total sample count to get probability (0-1)
TOTAL_SAMPLES=$(wc -l < peak_files.txt)
awk -v n="$TOTAL_SAMPLES" 'BEGIN{OFS="\t"} {print $1, $2, $3, $4/n}' ctcf_merged.bed > ctcf_normalized.bedGraph
# Step 4: Sort the bedGraph file
sort -k1,1 -k2,2n ctcf_normalized.bedGraph > ctcf_sorted.bedGraph
# Step 5: Convert to bigWig format
bedGraphToBigWig ctcf_sorted.bedGraph your_genome.chrom.sizes CTCF.bw
Output
The resulting CTCF.bw file contains binding probability scores ranging from 0 to 1:
- 0: No CTCF binding observed in any sample
- 1: CTCF binding observed in all samples
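After normalization, every score column should be a probability. A quick illustrative sanity check on the intermediate bedGraph (this check is ours, not part of the pipeline):

```python
def check_bedgraph_scores(lines):
    """Return True if every bedGraph score column lies in [0, 1]."""
    for line in lines:
        chrom, start, end, score = line.rstrip("\n").split("\t")
        if not 0.0 <= float(score) <= 1.0:
            raise ValueError(f"score out of range: {line}")
    return True

example = ["chr1\t100\t200\t0.52", "chr1\t200\t300\t1.0"]
print(check_bedgraph_scores(example))  # True
```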
Place this file in your data directory:
/your/data_root/CTCF/{genome}/CTCF.bw
Training
Complete Data Structure Example
Before training, ensure your data is organized as follows. Here we use /home/user/hicompass_data as an example data root:
/home/user/hicompass_data/
├── ATAC/
│ └── hg38/
│ ├── GM12878~ATAC~bulk.bw
│ ├── GM12878~ATAC~1e6.bw
│ ├── GM12878~ATAC~5e5.bw
│ ├── K562~ATAC~bulk.bw
│ ├── K562~ATAC~1e6.bw
│ └── K562~ATAC~5e5.bw
├── HiC/
│ └── hg38/
│ ├── GM12878/
│ │ ├── chr1.npz
│ │ ├── chr2.npz
│ │ └── ...
│ └── K562/
│ ├── chr1.npz
│ ├── chr2.npz
│ └── ...
├── DNA/
│ └── hg38/
│ ├── chr1.fa.gz
│ ├── chr2.fa.gz
│ └── ...
├── CTCF/
│ └── hg38/
│ └── generalized_CTCF.bw
└── centromere/
└── hg38/
└── centromere.bed
Multi-Cell-Type & Multi-Depth Training
hicompass training \
--data-root /home/user/hicompass_data \
--cell-type GM12878 K562 \
--train-chr 1-17 \
--valid-chr 18-19 \
--train-depth bulk 1e6 5e5 \
--valid-depth bulk 1e6 \
--genome hg38 \
--save-path /home/user/hicompass_checkpoints
If only one cell type is provided, the Hi-Compass discriminator will not be activated.
Training Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| --data-root | Yes | - | Root directory containing all data subdirectories |
| --cell-type | Yes | - | Training cell type(s), space-separated |
| --train-chr | Yes | - | Training chromosomes (e.g., 1-17, chr1-chr17, 1 2 3) |
| --valid-chr | Yes | - | Validation chromosomes |
| --genome | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --cell-type-valid | No | same as train | Validation cell types |
| --train-depth | No | bulk | Training depths, space-separated |
| --valid-depth | No | bulk | Validation depths |
| --batch-size | No | 2 | Batch size per GPU |
| --max-epochs | No | 100 | Maximum training epochs |
| --gpu-id | No | auto | GPU ID(s) to use (e.g., 0 or 0 1 2 3) |
| --save-path | No | checkpoints | Directory for saving model checkpoints |
| --ckpt-path | No | - | Path to checkpoint for resuming training |
Chromosome Specification
The --train-chr and --valid-chr parameters support flexible formats:
| Format | Example | Expands to |
|---|---|---|
| Range | 1-5 | chr1, chr2, chr3, chr4, chr5 |
| Range with prefix | chr1-chr5 | chr1, chr2, chr3, chr4, chr5 |
| List | 1 3 5 | chr1, chr3, chr5 |
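The expansion rules in the table can be re-implemented in a few lines; this is an illustrative version, not the hicompass internal parser:

```python
import re

def expand_chromosomes(tokens):
    """Expand chromosome specs like '1-5', 'chr1-chr5', or '1 3 5'
    (given as a token list) into a list of chrN names."""
    chroms = []
    for token in tokens:
        m = re.fullmatch(r"(?:chr)?(\d+)-(?:chr)?(\d+)", token)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            chroms.extend(f"chr{i}" for i in range(lo, hi + 1))
        else:
            chroms.append(token if token.startswith("chr") else f"chr{token}")
    return chroms

print(expand_chromosomes(["1-3"]))        # ['chr1', 'chr2', 'chr3']
print(expand_chromosomes(["1", "3", "5"]))  # ['chr1', 'chr3', 'chr5']
```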
Custom Genome Training
For genomes other than hg38 and mm10:
- Prepare a chromosome sizes file at {data-root}/chromsize/custom_genome_name/custom_genome_name.chrom.sizes
- Prepare a generalized CTCF BigWig at {data-root}/CTCF/custom_genome_name/generalized_CTCF.bw
- Specify --genome custom_genome_name
hicompass training \
--data-root /home/user/hicompass_data \
--cell-type CellTypeA \
--train-chr 1-15 \
--valid-chr 16-17 \
--genome custom_genome_name \
--save-path /home/user/hicompass_checkpoints
Output
Training outputs are saved to the specified --save-path.
Prediction
Required Files
For prediction, you need the following files (can be placed anywhere):
- ATAC-seq BigWig file: Your cell-type-specific ATAC-seq data
- Model weights: Provided pre-trained or your own trained model (.pth file)
- Generalized CTCF BigWig: Provided for hg38/mm10, or generate your own
- DNA sequence directory: Directory containing chr*.fa.gz files
- Centromere BED file (optional): For filtering centromeric regions
Calculating ATAC-seq Depth
Before prediction, you need to know the sequencing depth of your ATAC-seq data. The depth is the total number of mapped reads.
# Count mapped reads from a BAM file (-F 4 excludes unmapped records)
samtools view -c -F 4 your_sample.bam
Prediction with Example Data
We provide a pre-processed K562 ATAC-seq BigWig file with known sequencing depth here for demonstration:
hicompass predicting \
--genome hg38 \
--model-path /path/to/hicompass_hg38.pth \
--atac-path /path/to/hg38/k562~ATAC~8e5.bw \
--ctcf-path /path/to/hg38/generalized_CTCF.bw \
--dna-dir /path/to/DNA/hg38 \
--output /path/to/output/my_sample_predicted.cool \
--centromere-bed /path/to/hg38/centromere.bed \
--depth 8000000 \
--chromosomes 1-22 \
--device cuda:0
Prediction Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| --model-path | Yes | - | Path to trained model checkpoint |
| --atac-path | Yes | - | Path(s) to ATAC-seq BigWig file(s) |
| --ctcf-path | Yes | - | Path to generalized CTCF BigWig |
| --dna-dir | Yes | - | Directory containing chr*.fa.gz files |
| --output | Yes | - | Output path for predicted .cool file |
| --depth | Yes | - | Sequencing depth (total mapped reads) |
| --genome | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --chrom-sizes | No | - | Chromosome sizes file (required for custom genome) |
| --centromere-bed | No | - | BED file for centromere/telomere filtering |
| --chromosomes | No | 1-22 | Chromosomes to predict (e.g., 1,2,3 or 1-22) |
| --stride | No | 50 | Stride for sliding window (bins) |
| --device | No | cpu | Computation device (cpu, cuda, cuda:0) |
| --batch-size | No | 2 | Batch size for prediction |
| --num-workers | No | 16 | Number of data loading workers |
Output Format
The output is a .cool file with balanced weights, compatible with downstream analysis tools:
import cooler
# Load predicted Hi-C
clr = cooler.Cooler('/path/to/output/my_sample_predicted.cool')
# Use with cooltools
import cooltools
# Calculate insulation score
insulation = cooltools.insulation(clr, window_bp=100000)
Advanced Tutorial
This README covers basic usage of Hi-Compass. For advanced applications including single-cell analysis and downstream analyses, please refer to our Advanced Tutorial.
The advanced tutorial covers:
- Meta-cell Hi-C Prediction - Aggregate single-cell ATAC-seq into meta-cells and predict mcHi-C for each cell type
- mcHi-C Clustering - Perform dimensionality reduction and clustering on predicted mcHi-C matrices
- Loop Detection - Call chromatin loops and differential loops using Mustache
- Promoter-Enhancer Annotation - Annotate detected loops as potential promoter-enhancer interactions
- GWAS Variant Annotation - Link GWAS variants to target genes by identifying loops with both anchors overlapping GWAS-associated regions
Acknowledgments
We thank the developers of C.Origami for their pioneering work on cell-type-specific Hi-C prediction. The NPZ conversion strategy for Hi-C data preprocessing and the basic dataset backbone in Hi-Compass were adapted from their implementation.
License
Hi-Compass is released under the MIT License. See LICENSE for details.
Contact
For questions and feedback, please open an issue on GitHub or contact Yuanchen Sun (suneddiesyc@gmail.com).