
EMD-guided masked autoencoder for chromatin interaction map restoration


EMMA

EMMA is an EMD-guided (empirical mode decomposition) restoration toolkit for chromatin interaction maps. It restores missing or low-quality genomic-bin regions in Hi-C, Pore-C, and other contact-like 3D genome matrices by combining distance-diagonal signal decomposition, masked IMF (intrinsic mode function) autoencoder correction, and mode-weighted reconstruction.


What EMMA Does

EMMA supports three common workflows:

  • Restore known missing regions from a user-provided mask.
  • Automatically detect low-coverage or missing genomic bins, then restore them.
  • Reconstruct or lightly enhance a contact matrix without an explicit missing mask.

The restore mode keeps observed entries unchanged and only replaces entries marked by the imputation mask.
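
The rule above amounts to a masked merge. A minimal NumPy sketch, assuming prediction stands in for the model output; the function and variable names are illustrative and are not EMMA internals:

```python
import numpy as np

def merge_restored(observed, prediction, mask):
    """Keep observed entries; take predicted values only where mask is True."""
    observed = np.asarray(observed, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not (observed.shape == prediction.shape == mask.shape):
        raise ValueError("observed, prediction, and mask must share one shape")
    return np.where(mask, prediction, observed)

# Toy 3x3 matrix whose last bin (row and column 2) is missing.
observed = np.array([[4.0, 1.0, 0.0],
                     [1.0, 5.0, 0.0],
                     [0.0, 0.0, 0.0]])
prediction = np.full((3, 3), 2.0)   # placeholder for a model prediction
mask = np.zeros((3, 3), dtype=bool)
mask[2, :] = True
mask[:, 2] = True

restored = merge_restored(observed, prediction, mask)
```

Entries outside the mask are copied from observed, so the prediction cannot alter measured contacts.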

Installation

Option A. Install From PyPI

Install the released package from PyPI:

pip install emma-3dgenome

Option B. Install Directly From GitHub

pip install "git+https://github.com/ydduanran/EMMA.git"

Option C. Install From Local Source

Clone the repository and install it locally:

git clone https://github.com/ydduanran/EMMA.git
cd EMMA
pip install .

For editable development:

pip install -e ".[dev]"

If your machine has no internet access but already has the required build tools installed, use:

pip install -e . --no-build-isolation

Core dependencies include numpy, scipy, torch, cooler, EMD-signal, scikit-learn, and scikit-image.

Version 0.2.0 Performance Notes

EMMA 0.2.0 changes the default masked-autoencoder training path to avoid recomputing PyEMD inside every DataLoader sample. The default training pipeline now pseudo-masks IMF channels directly and keeps EMD decomposition in the matrix-level preprocessing stage. This makes GPU training substantially less CPU-bound.
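
The idea can be sketched as follows: the decomposition runs once, and each training sample only hides positions in the precomputed channels. This is a NumPy illustration under assumed shapes (channels by positions); the class name, the mask_fraction parameter, and the zero-fill choice are hypothetical and do not describe EMMA's actual training code:

```python
import numpy as np

class PseudoMaskedChannels:
    """Draw pseudo-masked samples from channels that were decomposed once."""

    def __init__(self, channels, mask_fraction=0.15, seed=0):
        # channels: array of shape (n_channels, n_positions), e.g. IMFs plus residual
        self.channels = np.asarray(channels, dtype=np.float32)
        self.mask_fraction = mask_fraction
        self.rng = np.random.default_rng(seed)

    def sample(self):
        """Return (masked input, target, hidden positions) without re-running EMD."""
        n_positions = self.channels.shape[1]
        n_hidden = max(1, int(round(self.mask_fraction * n_positions)))
        hidden = np.zeros(n_positions, dtype=bool)
        hidden[self.rng.choice(n_positions, size=n_hidden, replace=False)] = True
        masked = self.channels.copy()
        masked[:, hidden] = 0.0          # hide the same positions in every channel
        return masked, self.channels, hidden

# Random numbers stand in for real precomputed IMF channels.
channels = np.random.default_rng(1).normal(size=(6, 100))
dataset = PseudoMaskedChannels(channels, mask_fraction=0.15)
masked, target, hidden = dataset.sample()
```

Each call to sample costs one array copy and one random draw, which is the kind of per-sample work that keeps a DataLoader from becoming CPU-bound.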

To reproduce the slower 0.1.x training behavior that recomputes EMD for every pseudo-masked sample, use:

emma restore sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --mask-regions missing_regions.bed \
  --recompute-pseudo-emd \
  --output emma_out_legacy/

For CUDA runs on larger windows, start with:

emma restore sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --mask-regions missing_regions.bed \
  --device cuda:0 \
  --batch-size 256 \
  --num-workers 8 \
  --inference-batch-size 256 \
  --output emma_out/

Quick Start

1. Restore With A BED Missing-Region File

Use this when you know which genomic intervals need imputation.

emma restore sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --mask-regions missing_regions.bed \
  --output emma_out/

missing_regions.bed should use:

chrom  start  end

Example:

chr2    2000000    2250000
chr2    7600000    7900000
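
To see how such intervals relate to matrix bins, the sketch below converts half-open BED intervals to bin indices at a given resolution and then marks every matrix entry in an affected row or column. The helper names are hypothetical, and the row-or-column convention is an assumption about how a region mask is expanded, not a statement of EMMA's behavior:

```python
import numpy as np

def bed_intervals_to_bins(intervals, chrom, resolution, n_bins):
    """Return a boolean vector flagging bins overlapped by [start, end) intervals."""
    flagged = np.zeros(n_bins, dtype=bool)
    for iv_chrom, start, end in intervals:
        if iv_chrom != chrom or end <= start:
            continue                           # other chromosome or empty interval
        first = start // resolution
        last = (end - 1) // resolution         # last bin touched by [start, end)
        flagged[first:min(last + 1, n_bins)] = True
    return flagged

def bins_to_matrix_mask(flagged):
    """Mark every entry whose row or column belongs to a flagged bin."""
    return flagged[:, None] | flagged[None, :]

intervals = [("chr2", 2_000_000, 2_250_000), ("chr2", 7_600_000, 7_900_000)]
flagged = bed_intervals_to_bins(intervals, "chr2", resolution=10_000, n_bins=1000)
mask = bins_to_matrix_mask(flagged)
```

At 10 kb resolution the two example intervals cover bins 200 to 224 and 760 to 789.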

2. Restore With A Boolean Matrix Mask

Use this when you already have a square boolean mask where True marks entries to impute.

emma restore sample.npy \
  --mask mask.npy \
  --output emma_out/
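
Before running the command, it can help to confirm that the mask matches the matrix. The check below is a small sketch; check_mask is a hypothetical helper, and marking both the rows and the columns of the affected bins is an assumption about how such a mask is usually built:

```python
import numpy as np

def check_mask(matrix, mask):
    """Raise ValueError unless mask is a boolean array matching a square matrix."""
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError(f"matrix must be square, got shape {matrix.shape}")
    if mask.shape != matrix.shape:
        raise ValueError(f"mask shape {mask.shape} != matrix shape {matrix.shape}")
    if mask.dtype != np.bool_:
        raise ValueError(f"mask dtype must be bool, got {mask.dtype}")
    return int(mask.sum())                 # number of entries that will be imputed

matrix = np.random.default_rng(0).poisson(3.0, size=(6, 6)).astype(float)
mask = np.zeros(matrix.shape, dtype=bool)
mask[2:4, :] = True                        # rows of the bins to impute
mask[:, 2:4] = True                        # matching columns
n_masked = check_mask(matrix, mask)
# np.save("mask.npy", mask) would then write the file passed to --mask
```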

3. Auto-Detect Missing Bins And Restore

Use this when missing or low-coverage bins are not known in advance.

emma restore sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --auto-mask \
  --auto-mask-mode balanced \
  --output emma_auto_out/

Available auto-mask modes:

  • conservative
  • balanced
  • aggressive
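
This README does not state the detection criteria behind each mode. As a rough illustration of what a stricter or looser mode could mean, the sketch below flags bins whose log coverage falls far below the median, using a robust z-score. The cutoffs and the function name are hypothetical and chosen only for the example:

```python
import numpy as np

# Hypothetical cutoffs in robust z-score units; EMMA's real criteria may differ.
ILLUSTRATIVE_CUTOFFS = {"conservative": -4.0, "balanced": -3.0, "aggressive": -2.0}

def flag_low_coverage_bins(matrix, mode="balanced"):
    """Flag bins that are empty or whose log coverage is far below the median."""
    coverage = np.nansum(matrix, axis=0)
    empty = coverage <= 0
    if empty.all():
        return empty
    log_cov = np.log1p(coverage[~empty])
    median = np.median(log_cov)
    mad = np.median(np.abs(log_cov - median))
    scale = 1.4826 * mad if mad > 0 else 1.0
    z = np.full(coverage.shape, -np.inf)   # empty bins always fall below the cutoff
    z[~empty] = (log_cov - median) / scale
    return z < ILLUSTRATIVE_CUTOFFS[mode]

# Synthetic symmetric matrix with one empty bin (10) and one weak bin (20).
rng = np.random.default_rng(0)
upper = np.triu(rng.poisson(20.0, size=(50, 50)).astype(float))
matrix = upper + np.triu(upper, 1).T
matrix[10, :] = matrix[:, 10] = 0.0
matrix[20, :] *= 0.05
matrix[:, 20] *= 0.05

flags = {mode: flag_low_coverage_bins(matrix, mode) for mode in ILLUSTRATIVE_CUTOFFS}
```

In this sketch every bin flagged by the conservative cutoff is also flagged by the aggressive one, so the modes differ only in how many borderline bins they add.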

You can exclude assembly gaps, centromeres, telomeres, blacklist regions, or other regions that should not be imputed:

emma restore sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --auto-mask \
  --exclude-bed hg38_exclude_regions.bed \
  --output emma_auto_out/

4. Reconstruct Without Explicit Missing Regions

Use this for conservative EMMA-style matrix reconstruction or enhancement.

emma reconstruct sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --mode conservative \
  --blend 0.2 \
  --output reconstructed_out/

Modes:

  • conservative: lightly blends the reconstruction with the original matrix.
  • full: uses the reconstructed matrix directly.
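
The difference between the two modes can be sketched as a linear blend. This assumes that --blend is the weight given to the reconstruction, so 0.2 keeps 80% of the original signal; that reading and the function name are assumptions, not a description of EMMA's code:

```python
import numpy as np

def blend_reconstruction(original, reconstructed, mode="conservative", blend=0.2):
    """Combine an original matrix with its reconstruction."""
    if mode == "full":
        return reconstructed.copy()        # use the reconstruction directly
    if mode != "conservative":
        raise ValueError(f"unknown mode: {mode!r}")
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    # Assumed convention: blend is the share taken from the reconstruction.
    return (1.0 - blend) * original + blend * reconstructed

original = np.full((2, 2), 10.0)
reconstructed = np.full((2, 2), 20.0)
light = blend_reconstruction(original, reconstructed, mode="conservative", blend=0.2)
full = blend_reconstruction(original, reconstructed, mode="full")
```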

Input Formats

EMMA currently supports:

  • .cool
  • .mcool
  • .npy
  • .npz

Rules:

  • .mcool requires --resolution.
  • .cool and .mcool require --chrom.
  • .npy should contain a square contact matrix.
  • .npz: EMMA reads the array stored under the key matrix if present; otherwise it reads the first array in the archive. Use --key to choose a specific array.
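
The .npz selection rule can be mirrored in a few lines of NumPy. This is a sketch of the rule as described above, with a hypothetical helper name; it is not EMMA's own loader:

```python
import io
import numpy as np

def load_npz_matrix(source, key=None):
    """Pick an array from an .npz: explicit key, else 'matrix', else the first one."""
    with np.load(source) as archive:
        names = archive.files
        if key is not None:
            chosen = key                   # raises KeyError below if absent
        elif "matrix" in names:
            chosen = "matrix"
        else:
            chosen = names[0]
        matrix = np.asarray(archive[chosen])
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {matrix.shape}")
    return matrix

# In-memory archive holding two arrays; 'matrix' wins when no key is given.
buffer = io.BytesIO()
np.savez(buffer, raw=np.eye(3), matrix=np.ones((4, 4)))
buffer.seek(0)
loaded = load_npz_matrix(buffer)
```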

Output Files

emma restore writes:

restored.npy
prediction_only.npy
masked_input.npy
mask.npy
mask_regions.bed
config.json
report.json
diag_stats.json
log.txt
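
A quick way to inspect these outputs is to load the arrays and confirm that entries outside the mask were left alone. The sketch below uses the file names listed above and assumes that masked_input.npy holds the observed values outside the mask; the helper name and the synthetic stand-in files are for illustration only:

```python
import tempfile
from pathlib import Path

import numpy as np

def summarize_restore_output(out_dir):
    """Load restore outputs and report basic consistency figures."""
    out_dir = Path(out_dir)
    restored = np.load(out_dir / "restored.npy")
    masked_input = np.load(out_dir / "masked_input.npy")
    mask = np.load(out_dir / "mask.npy").astype(bool)
    unchanged = np.allclose(restored[~mask], masked_input[~mask], equal_nan=True)
    return {
        "n_bins": int(restored.shape[0]),
        "masked_fraction": float(mask.mean()),
        "observed_unchanged": bool(unchanged),
    }

# Demo on synthetic stand-ins; point the function at a real output folder instead.
with tempfile.TemporaryDirectory() as tmp:
    mask = np.zeros((4, 4), dtype=bool)
    mask[1, :] = True
    mask[:, 1] = True
    observed = np.arange(16, dtype=float).reshape(4, 4)
    arrays = {
        "mask.npy": mask,
        "masked_input.npy": np.where(mask, 0.0, observed),
        "restored.npy": np.where(mask, 7.0, observed),
    }
    for name, array in arrays.items():
        np.save(Path(tmp) / name, array)
    summary = summarize_restore_output(tmp)
```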

emma detect writes:

mask.npy
detected_missing_bins.tsv
detected_missing_regions.bed
excluded_bins.tsv
auto_mask_diagnostics.tsv
report.json

emma reconstruct writes:

reconstructed.npy
difference.npy
config.json
report.json
diag_stats.json
log.txt

Python API

from emma_3dgenome import EmmaRestorer
from emma_3dgenome.io import load_contact_matrix
from emma_3dgenome.masks import load_mask_regions

matrix = load_contact_matrix(
    "sample.mcool",
    chrom="chr2",
    resolution=10000,
)

mask_info = load_mask_regions(
    "missing_regions.bed",
    chrom="chr2",
    resolution=10000,
    n_bins=matrix.shape[0],
)

restorer = EmmaRestorer(preset="default", device="cuda:0")
result = restorer.restore(
    matrix,
    mask=mask_info.mask,
    regions=mask_info.regions,
)
result.save("emma_out")

Auto-mask restoration:

from emma_3dgenome import EmmaRestorer
from emma_3dgenome.io import load_contact_matrix

matrix = load_contact_matrix("sample.mcool", chrom="chr2", resolution=10000)
restorer = EmmaRestorer(preset="default", device="cuda:0")

result = restorer.restore_auto(
    matrix,
    chrom="chr2",
    resolution=10000,
    auto_mask_mode="balanced",
)
result.save("emma_auto_out")

Matrix reconstruction:

result = restorer.reconstruct(matrix, mode="conservative", blend=0.2)
result.save("reconstructed_out")

IMF Parameters

Default mode-weighted reconstruction parameters:

max_imfs = 5
imf_weights = 0.08 1.35 1.20 1.90 0.80
residual_weight = 1.0
diag_calib_strength = 0.20

Interpretation:

  • IMF1: high-frequency noise component, strongly down-weighted.
  • IMF2: local structure component, enhanced.
  • IMF3: intermediate-scale structure component, enhanced.
  • IMF4: domain or boundary-related structure component, strongly enhanced.
  • IMF5: low-frequency structure component, mildly down-weighted.
  • Residual: global trend, retained.
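
Read this way, mode-weighted reconstruction is a weighted sum of the IMFs plus the weighted residual. The sketch below applies the default weights to synthetic components; the sinusoids stand in for real IMFs (no EMD is run here), and the function name is hypothetical:

```python
import numpy as np

IMF_WEIGHTS = np.array([0.08, 1.35, 1.20, 1.90, 0.80])   # defaults listed above
RESIDUAL_WEIGHT = 1.0

def weighted_recombine(imfs, residual, imf_weights=IMF_WEIGHTS,
                       residual_weight=RESIDUAL_WEIGHT):
    """Return sum_k w_k * IMF_k + residual_weight * residual."""
    imfs = np.asarray(imfs, dtype=float)             # shape (n_imfs, n_positions)
    weights = np.asarray(imf_weights, dtype=float)
    if imfs.shape[0] > weights.shape[0]:
        raise ValueError("more IMFs than weights")
    weights = weights[: imfs.shape[0]]               # a signal may yield fewer IMFs
    return np.tensordot(weights, imfs, axes=1) + residual_weight * np.asarray(residual)

# Synthetic stand-ins: five oscillations from fast to slow, plus a linear trend.
x = np.linspace(0.0, 1.0, 200)
imfs = np.stack([np.sin(2 * np.pi * f * x) for f in (64, 32, 16, 8, 4)])
residual = 1.0 - x
signal = imfs.sum(axis=0) + residual                 # what unit weights would return
recombined = weighted_recombine(imfs, residual)
```

With all weights set to 1.0 the function returns the original signal, so the defaults can be read as departures from an identity reconstruction.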

Override from CLI:

emma restore sample.mcool \
  --resolution 10000 \
  --chrom chr2 \
  --mask-regions missing_regions.bed \
  --max-imfs 5 \
  --imf-weights 0.08 1.35 1.20 1.90 0.80 \
  --residual-weight 1.0 \
  --diag-calib-strength 0.20 \
  --output emma_out/

Presets

Available presets:

  • default
  • paper
  • smooth
  • sharp
  • conservative
  • fast

Use fast for small smoke tests. Use default or paper for standard restoration.

Minimal Test

python -m compileall -q src tests
python -m pytest -q tests

If pytest is not installed:

pip install -e ".[dev]"

Citation

If you use EMMA in your work, please cite the EMMA manuscript when it becomes available.

@article{emma2026,
  title = {EMMA: EMD-guided masked autoencoder restoration of chromatin interaction maps},
  author = {To be updated},
  journal = {To be updated},
  year = {2026}
}

License

This project is released under the MIT License.
