
Hi-Compass: Depth-aware deep learning framework for cell-type-specific chromatin interaction prediction from ATAC-seq


Hi-Compass

Hi-Compass is a depth-aware multi-modal deep learning framework for predicting cell-type-specific chromatin interactions from ATAC-seq data.

Table of Contents

Overview

Three-dimensional genome organization controls cell-type-specific gene expression through chromatin interactions. Hi-Compass addresses the challenge of predicting 3D genome structure from chromatin accessibility data by integrating four key inputs:

  • ATAC-seq signal
  • ATAC-seq sequencing depth
  • DNA sequence (fixed input, provided for hg38 and mm10)
  • Generalized CTCF binding profile (fixed input, provided for hg38 and mm10)

The model incorporates a depth-aware module that dynamically accommodates sequencing depth variations, enabling robust predictions across the full spectrum of data resolution—from sparse single-cell to high-coverage bulk profiles. Hi-Compass predicts Hi-C contact matrices at 10 kb resolution with 2 Mb window size.

Key Features

  • Cell-type-specific Hi-C prediction requiring only ATAC-seq as user-provided input; all other reference data (DNA sequence, generalized CTCF, etc.) are provided or reusable across samples

  • Complete preprocessing pipeline and training functionality, allowing users to train models on their own datasets

  • Support for human (hg38), mouse (mm10), and other species (custom species require user-provided generalized CTCF data)

  • Direct output of balanced .cool files compatible with downstream analysis tools such as cooltools, HiGlass, and Juicebox

Installation

Step 1: Install PyTorch

Hi-Compass requires PyTorch but does not install it automatically, as the correct version depends on your system and CUDA configuration.

Please install PyTorch first following the instructions at pytorch.org.

Step 2: Install Hi-Compass

pip install hicompass

Step 3: Install training dependencies (for training only)

If you plan to train your own models, install a version of PyTorch Lightning that matches your PyTorch installation.

Step 4: Install preprocessing tools (for preprocessing only)

The preprocessing commands require the following external tools:

# Using conda
conda install -c bioconda samtools bedtools ucsc-bedgraphtobigwig
    

Training Data Preparation

Training Required Data

Hi-Compass requires the following input data:

| Data Type | User Provided | Pre-built Available | Description |
|---|---|---|---|
| ATAC-seq | Yes | - | Cell-type-specific BAM file (for training) or BigWig file (for prediction) |
| Hi-C | Yes (training only) | - | 10 kb resolution .cool file |
| DNA sequence | No | hg38, mm10 | One-hot encoded sequences per chromosome |
| Generalized CTCF | No | hg38, mm10 | Pan-tissue CTCF binding profile |
| Chromosome sizes | No | hg38, mm10 | Chromosome size file |
| Centromere regions | No | hg38, mm10 | BED file for filtering (optional) |

Download Pre-built Reference Data

Pre-built reference data for human (hg38) are available for download:

Required for Prediction

| File | Description | Download |
|---|---|---|
| DNA sequences | One-hot encoded DNA sequences per chromosome | DNA.zip |
| Generalized CTCF | Pan-tissue CTCF binding profile | CTCF.zip |
| Model weights | Pre-trained Hi-Compass model | model_weights.zip |
| Centromere regions | BED file for filtering (optional) | centromere.zip |

Required for Training Only

| File | Description | Download |
|---|---|---|
| ATAC-seq BAM | Example bulk ATAC-seq data (GM12878, IMR90) | IMR90_GM12878_ATAC_bam.zip |
| Hi-C cool | Example Hi-C matrices (GM12878, IMR90) | IMR90_GM12878_HiC_cool.zip |

Download and organize the files according to the directory structure below.

Directory Structure

Hi-Compass expects data organized in the following structure for training (take hg38 as example):

/home/user/hicompass_data/
├── ATAC/
│   └── hg38/
│       ├── GM12878~ATAC~bulk.bw
│       ├── GM12878~ATAC~1e6.bw
│       ├── GM12878~ATAC~5e5.bw
│       ├── K562~ATAC~bulk.bw
│       ├── K562~ATAC~1e6.bw
│       └── K562~ATAC~5e5.bw
├── HiC/
│   └── hg38/
│       ├── GM12878/
│       │   ├── chr1.npz
│       │   ├── chr2.npz
│       │   └── ...
│       └── K562/
│           ├── chr1.npz
│           ├── chr2.npz
│           └── ...
├── DNA/
│   └── hg38/
│       ├── chr1.fa.gz
│       ├── chr2.fa.gz
│       └── ...
├── CTCF/
│   └── hg38/
│       └── generalized_CTCF.bw
└── centromere/
    └── hg38/
        └── centromere.bed

ATAC-seq BigWig files follow the naming convention:

{CellType}~ATAC~{Depth}.bw

Examples:

  • GM12878~ATAC~bulk.bw - Bulk ATAC-seq (full depth)
  • GM12878~ATAC~1e6.bw - Subsampled to 1,000,000 reads
  • GM12878~ATAC~5e5.bw - Subsampled to 500,000 reads

This naming convention is automatically generated by the preprocessing pipeline and used by the training module to extract cell type and depth information.
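The training module handles this parsing internally, so the helper below is purely illustrative; it shows how the `{CellType}~ATAC~{Depth}.bw` convention can be decomposed in a few lines of Python when scripting around these files:

```python
import re

def parse_atac_filename(filename):
    """Split a {CellType}~ATAC~{Depth}.bw filename into its components.

    Returns (cell_type, depth), where depth is the string "bulk" or a
    float read count. Illustrative sketch; not the package's own parser.
    """
    m = re.match(r"^(?P<cell>[^~]+)~ATAC~(?P<depth>[^~]+)\.bw$", filename)
    if m is None:
        raise ValueError(f"not a Hi-Compass ATAC filename: {filename}")
    cell, depth = m.group("cell"), m.group("depth")
    return cell, ("bulk" if depth == "bulk" else float(depth))

print(parse_atac_filename("GM12878~ATAC~bulk.bw"))  # ('GM12878', 'bulk')
print(parse_atac_filename("GM12878~ATAC~1e6.bw"))   # ('GM12878', 1000000.0)
```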

Preprocessing

Hi-Compass provides preprocessing commands to prepare your data for training.

ATAC-seq Preprocessing

Convert bulk ATAC-seq BAM files to multi-depth BigWig files through stratified subsampling:

hicompass preprocess-atac \
    --input GM12878_bulk_ATAC.bam \
    --cell-type GM12878 \
    --output data/ATAC/hg38 \
    --chrom-sizes hg38.chrom.sizes

This command generates BigWig files at multiple sequencing depths, enabling the model to learn depth-aware representations.

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input BAM/SAM file |
| --cell-type, -c | Yes | - | Cell type name (used in output filenames) |
| --output, -o | Yes | - | Output directory |
| --chrom-sizes, -s | Yes | - | Chromosome sizes file |
| --depths, -d | No | - | Custom depths (comma-separated or @file.txt) |
| --min-depth | No | 2e5 | Minimum depth for range mode |
| --max-depth | No | 2e7 | Maximum depth for range mode |
| --step | No | 2e4 | Step size for range mode |
| --no-bulk | No | False | Skip bulk BigWig generation |
| --seed | No | 42 | Random seed for reproducibility |

Output

The command generates BigWig files following the naming convention {CellType}~ATAC~{Depth}.bw:

data/ATAC/hg38/
├── GM12878~ATAC~bulk.bw
├── GM12878~ATAC~2e5.bw
├── GM12878~ATAC~2.2e5.bw
├── GM12878~ATAC~2.4e5.bw
└── ...
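Range mode enumerates read counts from --min-depth to --max-depth in --step increments and names each file with compact scientific notation. A rough Python sketch of that enumeration; the formatter here is an illustration, as the package's exact string formatting is not documented on this page:

```python
def format_depth(reads):
    """Render a read count compactly, e.g. 200000 -> '2e5', 220000 -> '2.2e5'.

    Hypothetical formatter; the actual pipeline's formatting may differ.
    """
    reads = int(reads)
    exponent = len(str(reads)) - 1
    mantissa = f"{reads / 10**exponent:g}"  # '%g' drops a trailing '.0'
    return f"{mantissa}e{exponent}"

# Depths produced by the documented defaults: min 2e5, max 2e7, step 2e4
depths = [format_depth(d) for d in range(200_000, 20_000_001, 20_000)]
print(depths[:3])  # ['2e5', '2.2e5', '2.4e5']
```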

Hi-C Preprocessing

Hi-C preprocessing consists of two steps:

Step 1: Contrast Stretching Normalization

Apply contrast stretching to enhance Hi-C matrix features:

hicompass preprocess-hic-norm \
    --input GM12878_raw.cool \
    --output GM12878_normalized.cool \
    --genome hg38

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input cool file |
| --output, -o | Yes | - | Output cool file |
| --genome, -g | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --chrom-sizes, -s | No | - | Required when --genome=custom |
| --resolution, -r | No | 10000 | Resolution in bp (10 kb recommended) |
| --percentile-max | No | 98.0 | Upper percentile for contrast stretching |

Note: Only 10 kb resolution is currently supported. A warning will be issued for other resolutions.
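Contrast stretching clips extreme counts at an upper percentile and rescales the matrix. A minimal NumPy sketch of the idea, assuming clipping at the 98th percentile of nonzero entries; the exact transform applied by preprocess-hic-norm may differ in detail:

```python
import numpy as np

def contrast_stretch(matrix, percentile_max=98.0):
    """Clip a raw Hi-C matrix at an upper percentile of its nonzero
    entries and rescale to [0, 1]. Illustrative only; not the exact
    transform used by `hicompass preprocess-hic-norm`."""
    nonzero = matrix[matrix > 0]
    if nonzero.size == 0:
        return matrix.astype(float)
    upper = np.percentile(nonzero, percentile_max)
    return np.clip(matrix, 0, upper) / upper

m = np.array([[0., 1., 5.], [1., 100., 2.], [5., 2., 0.]])
print(contrast_stretch(m))  # the outlier 100 is clipped to 1.0
```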

Step 2: Convert to NPZ Format

Convert the normalized cool file to NPZ format for training:

hicompass preprocess-hic-to-npz \
    --input GM12878_normalized.cool \
    --output data/HiC/hg38/GM12878

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input cool file (from step 1) |
| --output, -o | Yes | - | Output directory for NPZ files |
| --resolution, -r | No | 10000 | Resolution in bp |
| --window, -w | No | 256 | Number of diagonals to extract |

Output

data/HiC/hg38/GM12878/
├── chr1.npz
├── chr2.npz
├── ...
└── chr22.npz
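Because a 2 Mb prediction window only needs contacts near the main diagonal, storing the first 256 diagonals compacts the matrix considerably. The sketch below shows one way such a band extraction can look; the actual NPZ layout and key names used by Hi-Compass may differ:

```python
import numpy as np

def matrix_to_diagonal_band(matrix, window=256):
    """Store the first `window` diagonals of a symmetric contact matrix
    as an (n_bins, window) band with band[i, k] = matrix[i, i + k].
    Illustrative of the compaction `preprocess-hic-to-npz` performs;
    the on-disk format may differ."""
    n = matrix.shape[0]
    band = np.zeros((n, window), dtype=matrix.dtype)
    for k in range(window):
        # diagonal at offset k has n - k entries; the tail stays zero
        band[: n - k, k] = np.diagonal(matrix, offset=k)
    return band

band = matrix_to_diagonal_band(np.arange(16.).reshape(4, 4), window=2)
# np.savez_compressed("chr1.npz", band=band)  # hypothetical key name
```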

DNA Sequence Preparation

DNA sequences should be provided as gzipped FASTA files per chromosome.

If you have a whole-genome FASTA file, split it using:

# Index the genome
samtools faidx hg38.fa

# Extract each chromosome
for chr in chr{1..22}; do
    samtools faidx hg38.fa $chr | gzip > DNA/hg38/${chr}.fa.gz
done

Expected structure:

DNA/hg38/
├── chr1.fa.gz
├── chr2.fa.gz
├── ...
└── chr22.fa.gz
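For reference, one-hot encoding a chromosome sequence can be sketched as follows; the channel order, dtype, and handling of ambiguous bases in the packaged DNA.zip are assumptions here:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order

def one_hot_encode(seq):
    """One-hot encode a DNA string into an (L, 4) float array.
    N and other ambiguous bases become all-zero rows. Illustrative;
    the pre-built reference data may use a different layout."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            arr[i, j] = 1.0
    return arr

print(one_hot_encode("ACGTN").shape)  # (5, 4)
```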

Generating Generalized CTCF Profile for Custom Species

For species other than human (hg38) and mouse (mm10), you need to generate your own generalized CTCF binding profile. This profile represents the pan-tissue CTCF binding probability across multiple samples.

Data Collection

  1. Collect CTCF ChIP-seq peak files (BED format) from multiple samples/tissues of your species

    • Recommended: 50+ samples for robust generalization
    • Data sources: Cistrome DB, ENCODE, GEO, or your own experiments
  2. Ensure all BED files use the same genome assembly

Processing Steps

# Step 1: List all CTCF peak BED files
ls /path/to/ctcf_peaks/*.bed > peak_files.txt

# Step 2: Merge all peaks using bedtools multiinter
#         This creates a file where each position shows how many samples have CTCF binding
bedtools multiinter -i $(cat peak_files.txt | tr '\n' ' ') > ctcf_merged.bed

# Step 3: Convert to bedGraph format with normalized scores
#         The 4th column of multiinter output is the count of overlapping samples
#         Normalize by total sample count to get probability (0-1)
TOTAL_SAMPLES=$(wc -l < peak_files.txt)
awk -v n="$TOTAL_SAMPLES" 'BEGIN{OFS="\t"} {print $1, $2, $3, $4/n}' ctcf_merged.bed > ctcf_normalized.bedGraph

# Step 4: Sort the bedGraph file
sort -k1,1 -k2,2n ctcf_normalized.bedGraph > ctcf_sorted.bedGraph

# Step 5: Convert to bigWig format
bedGraphToBigWig ctcf_sorted.bedGraph your_genome.chrom.sizes CTCF.bw

Output

The resulting CTCF.bw file contains binding probability scores ranging from 0 to 1:

  • 0: No CTCF binding observed in any sample
  • 1: CTCF binding observed in all samples

Place this file in your data directory:

/your/data_root/CTCF/{genome}/CTCF.bw

Training

Complete Data Structure Example

Before training, ensure your data is organized according to the Directory Structure section above, using your chosen data root (e.g., /home/user/hicompass_data).

Multi-Cell-Type & Multi-Depth Training

hicompass training \
    --data-root /home/user/hicompass_data \
    --cell-type GM12878 K562 \
    --train-chr 1-17 \
    --valid-chr 18-19 \
    --train-depth bulk 1e6 5e5 \
    --valid-depth bulk 1e6 \
    --genome hg38 \
    --save-path /home/user/hicompass_checkpoints

If you provide only one cell type, the Hi-Compass discriminator will not be activated.

Training Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --data-root | Yes | - | Root directory containing all data subdirectories |
| --cell-type | Yes | - | Training cell type(s), space-separated |
| --train-chr | Yes | - | Training chromosomes (e.g., 1-17, chr1-chr17, 1 2 3) |
| --valid-chr | Yes | - | Validation chromosomes |
| --genome | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --cell-type-valid | No | same as train | Validation cell types |
| --train-depth | No | bulk | Training depths, space-separated |
| --valid-depth | No | bulk | Validation depths |
| --batch-size | No | 2 | Batch size per GPU |
| --max-epochs | No | 100 | Maximum training epochs |
| --gpu-id | No | auto | GPU ID(s) to use (e.g., 0 or 0 1 2 3) |
| --save-path | No | checkpoints | Directory for saving model checkpoints |
| --ckpt-path | No | - | Path to checkpoint for resuming training |

Chromosome Specification

The --train-chr and --valid-chr parameters support flexible formats:

| Format | Example | Expands to |
|---|---|---|
| Range | 1-5 | chr1, chr2, chr3, chr4, chr5 |
| Range with prefix | chr1-chr5 | chr1, chr2, chr3, chr4, chr5 |
| List | 1 3 5 | chr1, chr3, chr5 |
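The expansion rules above can be sketched in a few lines of Python; this is an illustration of the documented behavior, not the package's actual parser:

```python
def expand_chromosomes(tokens):
    """Expand --train-chr/--valid-chr style tokens into chrN names.
    Accepts ranges ('1-5', 'chr1-chr5') and plain lists (['1', '3', '5']).
    Illustrative sketch of the documented formats."""
    chroms = []
    for tok in tokens:
        tok = tok.replace("chr", "")
        if "-" in tok:
            start, end = tok.split("-")
            chroms.extend(f"chr{i}" for i in range(int(start), int(end) + 1))
        else:
            chroms.append(f"chr{tok}")
    return chroms

print(expand_chromosomes(["1-3"]))       # ['chr1', 'chr2', 'chr3']
print(expand_chromosomes(["1", "3"]))    # ['chr1', 'chr3']
```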

Custom Genome Training

For genomes other than hg38 and mm10:

  1. Prepare chromosome sizes file at {data-root}/chromsize/custom_genome_name/custom_genome_name.chrom.sizes
  2. Prepare generalized CTCF BigWig at {data-root}/CTCF/custom_genome_name/generalized_CTCF.bw
  3. Specify --genome custom_genome_name

hicompass training \
    --data-root /home/user/hicompass_data \
    --cell-type CellTypeA \
    --train-chr 1-15 \
    --valid-chr 16-17 \
    --genome custom_genome_name \
    --save-path /home/user/hicompass_checkpoints

Output

Training outputs are saved to the specified --save-path.

Prediction

Required Files

For prediction, you need the following files (can be placed anywhere):

  1. ATAC-seq BigWig file: Your cell-type-specific ATAC-seq data
  2. Model weights: Provided pre-trained or your own trained model (.pth file)
  3. Generalized CTCF BigWig: Provided for hg38/mm10, or generate your own
  4. DNA sequence directory: Directory containing chr*.fa.gz files
  5. Centromere BED file (optional): For filtering centromeric regions

Calculating ATAC-seq Depth

Before prediction, you need to know the sequencing depth of your ATAC-seq data. The depth is the total number of mapped reads.

# From BAM file
samtools view -c your_sample.bam

Prediction with Example Data

We provide a pre-processed K562 ATAC-seq BigWig file with a known sequencing depth for demonstration:

hicompass predicting \
    --genome hg38 \
    --model-path /path/to/hicompass_hg38.pth \
    --atac-path /path/to/hg38/k562~ATAC~8e5.bw \
    --ctcf-path /path/to/hg38/generalized_CTCF.bw \
    --dna-dir /path/to/DNA/hg38 \
    --output /path/to/output/my_sample_predicted.cool \
    --centromere-bed /path/to/hg38/centromere.bed \
    --depth 8000000 \
    --chromosomes 1-22 \
    --device cuda:0

Prediction Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --model-path | Yes | - | Path to trained model checkpoint |
| --atac-path | Yes | - | Path(s) to ATAC-seq BigWig file(s) |
| --ctcf-path | Yes | - | Path to generalized CTCF BigWig |
| --dna-dir | Yes | - | Directory containing chr*.fa.gz files |
| --output | Yes | - | Output path for predicted .cool file |
| --depth | Yes | - | Sequencing depth (total mapped reads) |
| --genome | No | hg38 | Genome assembly (hg38, mm10, or custom) |
| --chrom-sizes | No | - | Chromosome sizes file (required for custom genome) |
| --centromere-bed | No | - | BED file for centromere/telomere filtering |
| --chromosomes | No | 1-22 | Chromosomes to predict (e.g., 1,2,3 or 1-22) |
| --stride | No | 50 | Stride for sliding window (bins) |
| --device | No | cpu | Computation device (cpu, cuda, cuda:0) |
| --batch-size | No | 2 | Batch size for prediction |
| --num-workers | No | 16 | Number of data loading workers |
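The --stride parameter controls how far the 2 Mb prediction window (200 bins at 10 kb) advances along the chromosome, so neighboring windows overlap. A common way to combine overlapping window predictions is to average them; the sketch below assumes simple averaging, which may not be exactly what Hi-Compass does internally:

```python
import numpy as np

def stitch_windows(predictions, starts, n_bins, window=200):
    """Average overlapping per-window predictions into one chromosome-wide
    matrix. predictions[i] is a (window, window) block whose top-left
    corner sits at bin starts[i]. Averaging is an assumption; the actual
    stitching rule used by `hicompass predicting` may differ."""
    total = np.zeros((n_bins, n_bins))
    counts = np.zeros((n_bins, n_bins))
    for block, s in zip(predictions, starts):
        total[s:s + window, s:s + window] += block
        counts[s:s + window, s:s + window] += 1
    return total / np.maximum(counts, 1)  # bins never covered stay zero
```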

Output Format

The output is a .cool file with balanced weights, compatible with downstream analysis tools:

import cooler

# Load predicted Hi-C
clr = cooler.Cooler('/path/to/output/my_sample_predicted.cool')

# Use with cooltools
import cooltools

# Calculate insulation score
insulation = cooltools.insulation(clr, window_bp=100000)

Advanced Tutorial

This README covers basic usage of Hi-Compass. For advanced applications including single-cell analysis and downstream analyses, please refer to our Advanced Tutorial.

The advanced tutorial covers:

  1. Meta-cell Hi-C Prediction - Aggregate single-cell ATAC-seq into meta-cells and predict mcHi-C for each cell type
  2. mcHi-C Clustering - Perform dimensionality reduction and clustering on predicted mcHi-C matrices
  3. Loop Detection - Call chromatin loops and differential loops using Mustache
  4. Promoter-Enhancer Annotation - Annotate detected loops as potential promoter-enhancer interactions
  5. GWAS Variant Annotation - Link GWAS variants to target genes by identifying loops with both anchors overlapping GWAS-associated regions

Acknowledgments

We thank the developers of C.Origami for their pioneering work on cell-type-specific Hi-C prediction. The NPZ conversion strategy for Hi-C data preprocessing and the basic dataset backbone in Hi-Compass were adapted from their implementation.

License

Hi-Compass is released under the MIT License. See LICENSE for details.

Contact

For questions and feedback, please open an issue on GitHub or contact Yuanchen Sun (suneddiesyc@gmail.com).
