
Hi-Compass: Depth-aware deep learning framework for cell-type-specific chromatin interaction prediction from ATAC-seq


Hi-Compass

Hi-Compass is a depth-aware multi-modal deep learning framework for predicting cell-type-specific chromatin interactions from ATAC-seq data.

Table of Contents

Overview

Three-dimensional genome organization controls cell-type-specific gene expression through chromatin interactions. Hi-Compass addresses the challenge of predicting 3D genome structure from chromatin accessibility data by integrating four key inputs:

  • ATAC-seq signal
  • ATAC-seq sequencing depth
  • DNA sequence (fixed input, provided for hg38 and mm10)
  • Generalized CTCF binding profile (fixed input, provided for hg38 and mm10)

The model incorporates a depth-aware module that dynamically accommodates sequencing depth variations, enabling robust predictions across the full spectrum of data resolution—from sparse single-cell to high-coverage bulk profiles. Hi-Compass predicts Hi-C contact matrices at 10 kb resolution with 2 Mb window size.

Key Features

  • Cell-type-specific Hi-C prediction requiring only ATAC-seq as user-provided input; all other reference data (DNA sequence, generalized CTCF, etc.) are provided or reusable across samples

  • Complete preprocessing pipeline and training functionality, allowing users to train models on their own datasets

  • Support for human (hg38), mouse (mm10), and other species (custom species require user-provided generalized CTCF data)

  • Direct output of balanced .cool files compatible with downstream analysis tools such as cooltools, HiGlass, and Juicebox

Installation

Step 1: Install PyTorch

Hi-Compass requires PyTorch but does not install it automatically, as the correct version depends on your system and CUDA configuration.

Please install PyTorch first following the instructions at pytorch.org.

Step 2: Install Hi-Compass

pip install hicompass

Step 3: Install training dependencies (for training only)

If you plan to train your own models, install a version of PyTorch Lightning that matches your PyTorch installation.

Step 4: Install preprocessing tools (for preprocessing only)

The preprocessing commands require the following external tools:

# Using conda
conda install -c bioconda samtools bedtools ucsc-bedgraphtobigwig
    

Training Data Preparation

Training Required Data

Hi-Compass requires the following input data:

| Data Type | User Provided | Pre-built Available | Description |
|---|---|---|---|
| ATAC-seq | Yes | - | Cell-type-specific BAM file (for training) or BigWig file (for prediction) |
| Hi-C | Yes (training only) | - | 10 kb resolution .cool file |
| DNA sequence | No | hg38, mm10 | One-hot encoded sequences per chromosome |
| Generalized CTCF | No | hg38, mm10 | Pan-tissue CTCF binding profile |
| Chromosome sizes | No | hg38, mm10 | Chromosome size file |
| Centromere regions | No | hg38, mm10 | BED file for filtering (optional) |

Download Pre-built Reference Data

Pre-built reference data for human (hg38) are available for download:

Required for Prediction

| File | Description | Download |
|---|---|---|
| DNA sequences | One-hot encoded DNA sequences per chromosome | DNA.zip |
| Generalized CTCF | Pan-tissue CTCF binding profile | CTCF.zip |
| Model weights | Pre-trained Hi-Compass model | model_weights.zip |
| Centromere regions | BED file for filtering (optional) | centromere.zip |

Required for Training Only

| File | Description | Download |
|---|---|---|
| ATAC-seq BAM | Example bulk ATAC-seq data (GM12878, IMR90) | IMR90_GM12878_ATAC_bam.zip |
| Hi-C cool | Example Hi-C matrices (GM12878, IMR90) | IMR90_GM12878_HiC_cool.zip |

Download and organize the files according to the directory structure below.

Directory Structure

Hi-Compass expects data organized in the following structure for training (take hg38 as example):

/home/user/hicompass_data/
├── ATAC/
│   └── hg38/
│       ├── GM12878~ATAC~bulk.bw
│       ├── GM12878~ATAC~1e6.bw
│       ├── GM12878~ATAC~5e5.bw
│       ├── K562~ATAC~bulk.bw
│       ├── K562~ATAC~1e6.bw
│       └── K562~ATAC~5e5.bw
├── HiC/
│   └── hg38/
│       ├── GM12878/
│       │   ├── chr1.npz
│       │   ├── chr2.npz
│       │   └── ...
│       └── K562/
│           ├── chr1.npz
│           ├── chr2.npz
│           └── ...
├── DNA/
│   └── hg38/
│       ├── chr1.fa.gz
│       ├── chr2.fa.gz
│       └── ...
├── CTCF/
│   └── hg38/
│       └── generalized_CTCF.bw
└── centromere/
    └── hg38/
        └── centromere.bed

ATAC-seq BigWig files follow the naming convention:

{CellType}~ATAC~{Depth}.bw

Examples:

  • GM12878~ATAC~bulk.bw - Bulk ATAC-seq (full depth)
  • GM12878~ATAC~1e6.bw - Subsampled to 1,000,000 reads
  • GM12878~ATAC~5e5.bw - Subsampled to 500,000 reads

This naming convention is automatically generated by the preprocessing pipeline and used by the training module to extract cell type and depth information.
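The training module handles this parsing internally, so the helper below is purely illustrative; it shows how the `{CellType}~ATAC~{Depth}.bw` convention can be decomposed in a few lines of Python when scripting around these files:

```python
import re

def parse_atac_filename(filename):
    """Split a {CellType}~ATAC~{Depth}.bw filename into its components.

    Returns (cell_type, depth), where depth is the string "bulk" or a
    float read count. Illustrative sketch; not the package's own parser.
    """
    m = re.match(r"^(?P<cell>[^~]+)~ATAC~(?P<depth>[^~]+)\.bw$", filename)
    if m is None:
        raise ValueError(f"not a Hi-Compass ATAC filename: {filename}")
    cell, depth = m.group("cell"), m.group("depth")
    return cell, ("bulk" if depth == "bulk" else float(depth))

print(parse_atac_filename("GM12878~ATAC~bulk.bw"))  # ('GM12878', 'bulk')
print(parse_atac_filename("GM12878~ATAC~1e6.bw"))   # ('GM12878', 1000000.0)
```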

Preprocessing

Hi-Compass provides preprocessing commands to prepare your data for training.

ATAC-seq Preprocessing

Convert bulk ATAC-seq BAM files to multi-depth BigWig files through stratified subsampling:

hicompass preprocess-atac \
    --input GM12878_bulk_ATAC.bam \
    --cell-type GM12878 \
    --output data/ATAC/hg38 \
    --chrom-sizes hg38.chrom.sizes

This command generates BigWig files at multiple sequencing depths, enabling the model to learn depth-aware representations.

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input BAM/SAM file |
| --cell-type, -c | Yes | - | Cell type name (used in output filenames) |
| --output, -o | Yes | - | Output directory |
| --chrom-sizes, -s | Yes | - | Chromosome sizes file |
| --depths, -d | No | - | Custom depths (comma-separated or @file.txt) |
| --min-depth | No | 2e5 | Minimum depth for range mode |
| --max-depth | No | 2e7 | Maximum depth for range mode |
| --step | No | 2e4 | Step size for range mode |
| --no-bulk | No | False | Skip bulk BigWig generation |
| --seed | No | 42 | Random seed for reproducibility |

Output

The command generates BigWig files following the naming convention {CellType}~ATAC~{Depth}.bw:

data/ATAC/hg38/
├── GM12878~ATAC~bulk.bw
├── GM12878~ATAC~2e5.bw
├── GM12878~ATAC~2.2e5.bw
├── GM12878~ATAC~2.4e5.bw
└── ...
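Range mode enumerates read counts from --min-depth to --max-depth in --step increments and names each file with compact scientific notation. A rough Python sketch of that enumeration; the formatter here is an illustration, as the package's exact string formatting is not documented on this page:

```python
def format_depth(reads):
    """Render a read count compactly, e.g. 200000 -> '2e5', 220000 -> '2.2e5'.

    Hypothetical formatter; the actual pipeline's formatting may differ.
    """
    reads = int(reads)
    exponent = len(str(reads)) - 1
    mantissa = f"{reads / 10**exponent:g}"  # '%g' drops a trailing '.0'
    return f"{mantissa}e{exponent}"

# Depths produced by the documented defaults: min 2e5, max 2e7, step 2e4
depths = [format_depth(d) for d in range(200_000, 20_000_001, 20_000)]
print(depths[:3])  # ['2e5', '2.2e5', '2.4e5']
```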

Hi-C Preprocessing

Hi-C preprocessing consists of two steps:

Step 1: Contrast Stretching Normalization

Apply contrast stretching to enhance Hi-C matrix features:

hicompass preprocess-hic-norm \
    --input GM12878_raw.cool \
    --output GM12878_normalized.cool \
    --genome hg38

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input cool file |
| --output, -o | Yes | - | Output cool file |
| --genome, -g | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --chrom-sizes, -s | No | - | Required when --genome=custom |
| --resolution, -r | No | 10000 | Resolution in bp (10 kb recommended) |
| --percentile-max | No | 98.0 | Upper percentile for contrast stretching |

Note: Only 10 kb resolution is currently supported. A warning will be issued for other resolutions.
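Contrast stretching clips extreme counts at an upper percentile and rescales the matrix. A minimal NumPy sketch of the idea, assuming clipping at the 98th percentile of nonzero entries; the exact transform applied by preprocess-hic-norm may differ in detail:

```python
import numpy as np

def contrast_stretch(matrix, percentile_max=98.0):
    """Clip a raw Hi-C matrix at an upper percentile of its nonzero
    entries and rescale to [0, 1]. Illustrative only; not the exact
    transform used by `hicompass preprocess-hic-norm`."""
    nonzero = matrix[matrix > 0]
    if nonzero.size == 0:
        return matrix.astype(float)
    upper = np.percentile(nonzero, percentile_max)
    return np.clip(matrix, 0, upper) / upper

m = np.array([[0., 1., 5.], [1., 100., 2.], [5., 2., 0.]])
print(contrast_stretch(m))  # the outlier 100 is clipped to 1.0
```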

Step 2: Convert to NPZ Format

Convert the normalized cool file to NPZ format for training:

hicompass preprocess-hic-to-npz \
    --input GM12878_normalized.cool \
    --output data/HiC/hg38/GM12878

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input cool file (from step 1) |
| --output, -o | Yes | - | Output directory for NPZ files |
| --resolution, -r | No | 10000 | Resolution in bp |
| --window, -w | No | 256 | Number of diagonals to extract |

Output

data/HiC/hg38/GM12878/
├── chr1.npz
├── chr2.npz
├── ...
└── chr22.npz
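Because a 2 Mb prediction window only needs contacts near the main diagonal, storing the first 256 diagonals compacts the matrix considerably. The sketch below shows one way such a band extraction can look; the actual NPZ layout and key names used by Hi-Compass may differ:

```python
import numpy as np

def matrix_to_diagonal_band(matrix, window=256):
    """Store the first `window` diagonals of a symmetric contact matrix
    as an (n_bins, window) band with band[i, k] = matrix[i, i + k].
    Illustrative of the compaction `preprocess-hic-to-npz` performs;
    the on-disk format may differ."""
    n = matrix.shape[0]
    band = np.zeros((n, window), dtype=matrix.dtype)
    for k in range(window):
        # diagonal at offset k has n - k entries; the tail stays zero
        band[: n - k, k] = np.diagonal(matrix, offset=k)
    return band

band = matrix_to_diagonal_band(np.arange(16.).reshape(4, 4), window=2)
# np.savez_compressed("chr1.npz", band=band)  # hypothetical key name
```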

DNA Sequence Preparation

DNA sequences should be provided as gzipped FASTA files per chromosome.

If you have a whole-genome FASTA file, split it using:

# Index the genome
samtools faidx hg38.fa

# Extract each chromosome
for chr in chr{1..22}; do
    samtools faidx hg38.fa $chr | gzip > DNA/hg38/${chr}.fa.gz
done

Expected structure:

DNA/hg38/
├── chr1.fa.gz
├── chr2.fa.gz
├── ...
└── chr22.fa.gz
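For reference, one-hot encoding a chromosome sequence can be sketched as follows; the channel order, dtype, and handling of ambiguous bases in the packaged DNA.zip are assumptions here:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order

def one_hot_encode(seq):
    """One-hot encode a DNA string into an (L, 4) float array.
    N and other ambiguous bases become all-zero rows. Illustrative;
    the pre-built reference data may use a different layout."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            arr[i, j] = 1.0
    return arr

print(one_hot_encode("ACGTN").shape)  # (5, 4)
```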

Generating Generalized CTCF Profile for Custom Species

For species other than human (hg38) and mouse (mm10), you need to generate your own generalized CTCF binding profile. This profile represents the pan-tissue CTCF binding probability across multiple samples.

Data Collection

  1. Collect CTCF ChIP-seq peak files (BED format) from multiple samples/tissues of your species

    • Recommended: 50+ samples for robust generalization
    • Data sources: Cistrome DB, ENCODE, GEO, or your own experiments
  2. Ensure all BED files use the same genome assembly

Processing Steps

# Step 1: List all CTCF peak BED files
ls /path/to/ctcf_peaks/*.bed > peak_files.txt

# Step 2: Merge all peaks using bedtools multiinter
#         This creates a file where each position shows how many samples have CTCF binding
bedtools multiinter -i $(cat peak_files.txt | tr '\n' ' ') > ctcf_merged.bed

# Step 3: Convert to bedGraph format with normalized scores
#         The 4th column of multiinter output is the count of overlapping samples
#         Normalize by total sample count to get probability (0-1)
TOTAL_SAMPLES=$(wc -l < peak_files.txt)
awk -v n="$TOTAL_SAMPLES" 'BEGIN{OFS="\t"} {print $1, $2, $3, $4/n}' ctcf_merged.bed > ctcf_normalized.bedGraph

# Step 4: Sort the bedGraph file
sort -k1,1 -k2,2n ctcf_normalized.bedGraph > ctcf_sorted.bedGraph

# Step 5: Convert to bigWig format
bedGraphToBigWig ctcf_sorted.bedGraph your_genome.chrom.sizes CTCF.bw

Output

The resulting CTCF.bw file contains binding probability scores ranging from 0 to 1:

  • 0: No CTCF binding observed in any sample
  • 1: CTCF binding observed in all samples

Place this file in your data directory:

/your/data_root/CTCF/{genome}/CTCF.bw

Training

Complete Data Structure Example

Before training, ensure your data is organized according to the Directory Structure section above, using your chosen data root (e.g., /home/user/hicompass_data).

Multi-Cell-Type & Multi-Depth Training

hicompass training \
    --data-root /home/user/hicompass_data \
    --cell-type GM12878 K562 \
    --train-chr 1-17 \
    --valid-chr 18-19 \
    --train-depth bulk 1e6 5e5 \
    --valid-depth bulk 1e6 \
    --genome hg38 \
    --save-path /home/user/hicompass_checkpoints

If you provide only one cell type, the Hi-Compass discriminator will not be activated.

Training Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --data-root | Yes | - | Root directory containing all data subdirectories |
| --cell-type | Yes | - | Training cell type(s), space-separated |
| --train-chr | Yes | - | Training chromosomes (e.g., 1-17, chr1-chr17, 1 2 3) |
| --valid-chr | Yes | - | Validation chromosomes |
| --genome | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --cell-type-valid | No | same as train | Validation cell types |
| --train-depth | No | bulk | Training depths, space-separated |
| --valid-depth | No | bulk | Validation depths |
| --batch-size | No | 2 | Batch size per GPU |
| --max-epochs | No | 100 | Maximum training epochs |
| --gpu-id | No | auto | GPU ID(s) to use (e.g., 0 or 0 1 2 3) |
| --save-path | No | checkpoints | Directory for saving model checkpoints |
| --ckpt-path | No | - | Path to checkpoint for resuming training |

Chromosome Specification

The --train-chr and --valid-chr parameters support flexible formats:

| Format | Example | Expands to |
|---|---|---|
| Range | 1-5 | chr1, chr2, chr3, chr4, chr5 |
| Range with prefix | chr1-chr5 | chr1, chr2, chr3, chr4, chr5 |
| List | 1 3 5 | chr1, chr3, chr5 |
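The expansion rules above can be sketched in a few lines of Python; this is an illustration of the documented behavior, not the package's actual parser:

```python
def expand_chromosomes(tokens):
    """Expand --train-chr/--valid-chr style tokens into chrN names.
    Accepts ranges ('1-5', 'chr1-chr5') and plain lists (['1', '3', '5']).
    Illustrative sketch of the documented formats."""
    chroms = []
    for tok in tokens:
        tok = tok.replace("chr", "")
        if "-" in tok:
            start, end = tok.split("-")
            chroms.extend(f"chr{i}" for i in range(int(start), int(end) + 1))
        else:
            chroms.append(f"chr{tok}")
    return chroms

print(expand_chromosomes(["1-3"]))       # ['chr1', 'chr2', 'chr3']
print(expand_chromosomes(["1", "3"]))    # ['chr1', 'chr3']
```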

Custom Genome Training

For genomes other than hg38 and mm10:

  1. Prepare chromosome sizes file at {data-root}/chromsize/custom_genome_name/custom_genome_name.chrom.sizes
  2. Prepare generalized CTCF BigWig at {data-root}/CTCF/custom_genome_name/generalized_CTCF.bw
  3. Specify --genome custom_genome_name

hicompass training \
    --data-root /home/user/hicompass_data \
    --cell-type CellTypeA \
    --train-chr 1-15 \
    --valid-chr 16-17 \
    --genome custom_genome_name \
    --save-path /home/user/hicompass_checkpoints

Output

Training outputs are saved to the specified --save-path.

Prediction

Required Files

For prediction, you need the following files (can be placed anywhere):

  1. ATAC-seq BigWig file: Your cell-type-specific ATAC-seq data
  2. Model weights: Provided pre-trained or your own trained model (.pth file)
  3. Generalized CTCF BigWig: Provided for hg38/mm10, or generate your own
  4. DNA sequence directory: Directory containing chr*.fa.gz files
  5. Centromere BED file (optional): For filtering centromeric regions

Calculating ATAC-seq Depth

Before prediction, you need to know the sequencing depth of your ATAC-seq data. The depth is the total number of mapped reads.

# From BAM file
samtools view -c your_sample.bam

Prediction with Example Data

We provide a pre-processed K562 ATAC-seq BigWig file with a known sequencing depth for demonstration:

hicompass predicting \
    --genome hg38 \
    --model-path /path/to/hicompass_hg38.pth \
    --atac-path /path/to/hg38/k562~ATAC~8e5.bw \
    --ctcf-path /path/to/hg38/generalized_CTCF.bw \
    --dna-dir /path/to/DNA/hg38 \
    --output /path/to/output/my_sample_predicted.cool \
    --centromere-bed /path/to/hg38/centromere.bed \
    --depth 8000000 \
    --chromosomes 1-22 \
    --device cuda:0

Prediction Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --model-path | Yes | - | Path to trained model checkpoint |
| --atac-path | Yes | - | Path(s) to ATAC-seq BigWig file(s) |
| --ctcf-path | Yes | - | Path to generalized CTCF BigWig |
| --dna-dir | Yes | - | Directory containing chr*.fa.gz files |
| --output | Yes | - | Output path for predicted .cool file |
| --depth | Yes | - | Sequencing depth (total mapped reads) |
| --genome | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --chrom-sizes | No | - | Chromosome sizes file (required for custom genome) |
| --centromere-bed | No | - | BED file for centromere/telomere filtering |
| --chromosomes | No | 1-22 | Chromosomes to predict (e.g., 1,2,3 or 1-22) |
| --stride | No | 50 | Stride for sliding window (bins) |
| --device | No | cpu | Computation device (cpu, cuda, cuda:0) |
| --batch-size | No | 2 | Batch size for prediction |
| --num-workers | No | 16 | Number of data loading workers |
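The --stride parameter controls how far the 2 Mb prediction window (200 bins at 10 kb) advances along the chromosome, so neighboring windows overlap. A common way to combine overlapping window predictions is to average them; the sketch below assumes simple averaging, which may not be exactly what Hi-Compass does internally:

```python
import numpy as np

def stitch_windows(predictions, starts, n_bins, window=200):
    """Average overlapping per-window predictions into one chromosome-wide
    matrix. predictions[i] is a (window, window) block whose top-left
    corner sits at bin starts[i]. Averaging is an assumption; the actual
    stitching rule used by `hicompass predicting` may differ."""
    total = np.zeros((n_bins, n_bins))
    counts = np.zeros((n_bins, n_bins))
    for block, s in zip(predictions, starts):
        total[s:s + window, s:s + window] += block
        counts[s:s + window, s:s + window] += 1
    return total / np.maximum(counts, 1)  # bins never covered stay zero
```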

Output Format

The output is a .cool file with balanced weights, compatible with downstream analysis tools:

import cooler

# Load predicted Hi-C
clr = cooler.Cooler('/path/to/output/my_sample_predicted.cool')

# Use with cooltools
import cooltools

# Calculate insulation score
insulation = cooltools.insulation(clr, window_bp=100000)

Advanced Tutorial

This README covers basic usage of Hi-Compass. For advanced applications including single-cell analysis and downstream analyses, please refer to our Advanced Tutorial.

The advanced tutorial covers:

  1. Meta-cell Hi-C Prediction - Aggregate single-cell ATAC-seq into meta-cells and predict mcHi-C for each cell type
  2. mcHi-C Clustering - Perform dimensionality reduction and clustering on predicted mcHi-C matrices
  3. Loop Detection - Call chromatin loops and differential loops using Mustache
  4. Promoter-Enhancer Annotation - Annotate detected loops as potential promoter-enhancer interactions
  5. GWAS Variant Annotation - Link GWAS variants to target genes by identifying loops with both anchors overlapping GWAS-associated regions

Acknowledgments

We thank the developers of C.Origami for their pioneering work on cell-type-specific Hi-C prediction. The NPZ conversion strategy for Hi-C data preprocessing and the basic dataset backbone in Hi-Compass were adapted from their implementation.

License

Hi-Compass is released under the MIT License. See LICENSE for details.

Contact

For questions and feedback, please open an issue on GitHub or contact Yuanchen Sun (suneddiesyc@gmail.com).
