Recovery of high-quality eukaryotic genomes from complex metagenomes
REMAG
REcovery of eukaryotic genomes using contrastive learning. A specialized metagenomic binning tool designed for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.
Index
- Installation
- Quick Start
- Usage
- Common Options
- How It Works
- Output
- Requirements
- Acknowledgments
- License
- Citation
Installation
Conda (recommended)
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
PyPI
Install miniprot separately first:
conda create -n remag python=3.9
conda activate remag
conda install -c bioconda miniprot
pip install remag
From source
conda create -n remag python=3.9
conda activate remag
git clone https://github.com/danielzmbp/remag.git
cd remag
conda install -c bioconda miniprot
pip install .
Development installation
pip install -e ".[dev]"
Docker
docker pull danielzmbp/remag:latest
Optional plotting dependencies
pip install "remag[plotting]"
GPU acceleration
REMAG uses PyTorch and will use GPU acceleration automatically when a supported backend is available. No extra REMAG flag is required.
Conda with NVIDIA CUDA
If you want a CUDA-enabled PyTorch build, install REMAG first and then replace the CPU PyTorch package with the CUDA-enabled one that matches your system:
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
conda install -c pytorch -c nvidia pytorch pytorch-cuda=12.1
Adjust the CUDA version to match your driver and platform.
Apple Silicon
On Apple Silicon, PyTorch can use Metal (mps) automatically when available. In most cases no extra REMAG-specific setup is needed beyond installing a current PyTorch build.
PyPI installs
If you install REMAG with pip, install the PyTorch build you want first, then install REMAG:
conda create -n remag python=3.9
conda activate remag
conda install -c bioconda miniprot
# Install the desired PyTorch build first
pip install torch
# Then install REMAG
pip install remag
For NVIDIA systems, use the PyTorch install command from the official PyTorch selector so the wheel matches your CUDA runtime.
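To confirm which backend the installed PyTorch build exposes before starting a run, a small check like the following can help. This is a minimal sketch: the preference order (CUDA, then mps, then CPU) and the helper names pick_device and detect_device are assumptions for illustration and are not taken from REMAG's own device-selection code.

```python
def pick_device(cuda_available, mps_available):
    """Hypothetical preference order: CUDA first, then Apple Metal (mps), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query the installed PyTorch build; report 'cpu' if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent in older PyTorch releases, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```

If this prints cpu on a machine with a GPU, the installed PyTorch wheel is most likely a CPU-only build.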
Quick Start
Conda
remag contigs.fasta -c alignments.bam
Docker
docker run --rm -v "$(pwd)":/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam -o /data/output
Singularity
singularity build remag.sif docker://danielzmbp/remag:latest
singularity run --bind "$(pwd)":/data remag.sif \
/data/contigs.fasta -c /data/alignments.bam -o /data/output
Usage
Command line interface
After installation, you can use REMAG via the command line:
# Basic usage
remag contigs.fasta -c alignments.bam
# With explicit output directory
remag contigs.fasta -c alignments.bam -o output_directory
# Multiple samples
remag contigs.fasta -c sample1.bam -c sample2.bam
# Multiple samples using shell-expanded globs
remag contigs.fasta -c samples/*.bam
# Using precomputed coverage tables (one TSV per sample)
remag contigs.fasta -c sample1.tsv -c sample2.tsv
# Only run eukaryotic filtering (skip binning)
remag contigs.fasta -c alignments.bam --filter-only
# Use single-cell mode (adjusts k-NN defaults and skips eukaryotic filtering)
remag contigs.fasta -c alignments.bam -m single-cell
# Keep intermediate files
remag contigs.fasta -c alignments.bam -k
Python module mode
python -m remag contigs.fasta -c alignments.bam
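Module mode also makes it straightforward to launch REMAG from a Python script. The sketch below assembles the argument list from flags documented in this README (-c, -o, -k); the helper names build_remag_command and run_remag are hypothetical and not part of the REMAG package.

```python
import subprocess
import sys


def build_remag_command(fasta, coverage_files, output_dir=None, keep_intermediate=False):
    """Assemble an argv list for `python -m remag` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "remag", str(fasta)]
    for path in coverage_files:
        cmd += ["-c", str(path)]  # one -c per sample
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if keep_intermediate:
        cmd.append("-k")
    return cmd


def run_remag(*args, **kwargs):
    """Run REMAG and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_remag_command(*args, **kwargs), check=True)


cmd = build_remag_command(
    "contigs.fasta", ["sample1.bam", "sample2.bam"], output_dir="output_directory"
)
print(" ".join(cmd))
```

Using sys.executable keeps the call inside the active environment, so the same interpreter that has REMAG installed runs it.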
Coverage TSV format
Precomputed coverage TSVs are supported as an alternative to BAM/CRAM. Use one TSV per sample.
- Column 1: contig ID
- Last column: coverage value for that contig
- No header row
Example:
contig_1 12.4
contig_2 3.8
contig_3 0.0
TSV input provides contig-level coverage only. REMAG cannot infer fragment-specific coverage for augmented fragments from a TSV, so every fragment from the same contig gets the same coverage value. Use BAM/CRAM if you want fragment-level augmented coverage features. Do not mix TSV inputs with BAM/CRAM inputs in the same run.
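The layout described above (contig ID in the first column, coverage in the last, no header) can be checked before a run with a few lines of Python. This is a minimal sketch of a reader for that layout, not REMAG's own parser; read_coverage_tsv is a hypothetical name.

```python
import csv
import io


def read_coverage_tsv(handle):
    """Return {contig_id: coverage} from a headerless, tab-separated table."""
    coverage = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected at least 2 columns, got: {row!r}")
        # Column 1 is the contig ID; the last column is the coverage value.
        coverage[row[0]] = float(row[-1])
    return coverage


example = io.StringIO("contig_1\t12.4\ncontig_2\t3.8\ncontig_3\t0.0\n")
coverage = read_coverage_tsv(example)
print(coverage)
```

Because only the first and last columns are read, tables with extra columns in between are handled the same way.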
Common Options
- -c, --coverage: one or more BAM, CRAM, or TSV coverage inputs
- -o, --output: output directory; defaults to remag_output next to the input FASTA
- -k, --keep-intermediate: retain embeddings, features, model weights, and other intermediate files
- --filter-only: stop after eukaryotic filtering and write filtered FASTA output
- -m, --mode: select presets such as metagenomics or single-cell
- --save-filtered-contigs: also write the contigs removed by the eukaryotic filter
Use remag -h for a quick reference and remag --help for the full CLI, including training, clustering, filtering, and rescue options.
How It Works
REMAG recovers eukaryotic bins with a multi-stage pipeline:
- Eukaryotic filtering: By default, REMAG filters contigs with the integrated HyenaDNA classifier. This step can be disabled with --skip-bacterial-filter.
- Feature extraction: REMAG combines 4-mer composition with optional multi-sample coverage data. Large contigs are augmented into multiple fragments for training.
- Contrastive learning: A Siamese network trained with Barlow Twins learns embeddings that place fragments from the same contig close together.
- Core gene annotation: miniprot maps eukaryotic single-copy core genes to support clustering and quality checks.
- Greedy clustering and rescue: REMAG applies greedy Leiden clustering across multiple resolutions, then merges or rescues bins when single-copy gene checks support it.
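To make the feature-extraction step more concrete, the following sketch computes a normalized 4-mer composition vector for a sequence. It illustrates the general technique only: REMAG's actual feature code may differ, for example in how it treats reverse complements or ambiguous bases, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import product


def all_kmers(k=4):
    """List every k-mer over ACGT in a fixed order (256 entries for k=4)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies; windows with non-ACGT characters are ignored."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only look up valid k-mers, which drops any window containing N or other symbols.
    valid = [counts.get(kmer, 0) for kmer in all_kmers(k)]
    total = sum(valid)
    return [c / total if total else 0.0 for c in valid]


vec = kmer_frequencies("ACGTACGT")
print(len(vec), vec[all_kmers(4).index("ACGT")])
```

Fragments cut from the same contig tend to give similar vectors, which is the signal the contrastive step builds on.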
Output
Core outputs
- bins/: Directory containing FASTA files for each bin
- bins.csv: Final contig-to-bin assignments
- embeddings.csv: Contig embeddings from the neural network
- remag.log: Detailed log file
- *_eukaryotic_filtered.fasta: Filtered FASTA written when eukaryotic filtering is enabled
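Since bins/ holds one FASTA file per bin, a quick size summary needs only the standard library. The sketch below counts contigs and total bases per file; it assumes every file in the directory is a FASTA file, and the file name bin_1.fa in the demo is made up for illustration rather than taken from REMAG's naming scheme.

```python
import tempfile
from pathlib import Path


def fasta_lengths(path):
    """Return {record_id: sequence_length} for one FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif line and current is not None:
                lengths[current] += len(line)
    return lengths


def summarize_bins(bins_dir):
    """Return {file_name: (n_contigs, total_bp)} for every file in bins_dir."""
    summary = {}
    for path in sorted(Path(bins_dir).iterdir()):
        if path.is_file():
            lengths = fasta_lengths(path)
            summary[path.name] = (len(lengths), sum(lengths.values()))
    return summary


# Demo on a throwaway directory with a made-up bin file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bin_1.fa").write_text(">contig_1\nACGT\nAC\n>contig_2\nGG\n")
    demo = summarize_bins(tmp)
print(demo)
```

On a real run, point summarize_bins at the bins/ directory inside the output directory.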
Additional outputs with -k / --keep-intermediate
- siamese_model.pt: Trained Siamese neural network model
- kmer_embeddings.csv: K-mer encoder embeddings (before fusion)
- coverage_embeddings.csv: Coverage encoder embeddings (before fusion)
- params.json: Run parameters for reproducibility
- features.csv: Extracted k-mer and coverage features
- fragments.pkl: Fragment information used during training
- *_hyenadna_classification.tsv: HyenaDNA eukaryotic classification results (tab-separated)
- gene_contig_mappings.json: Cached gene-to-contig mappings
- core_gene_duplication_results.json: Core gene duplication analysis
- knn_graph_edges.csv: k-NN graph edge list used for Leiden clustering
- knn_graph_stats.json: k-NN graph construction statistics
- temp_miniprot/: Temporary directory for miniprot alignments
Filtering output
- *_non_eukaryotic.fasta: Contigs removed by the HyenaDNA filter when --save-filtered-contigs is used
Visualization
With plotting dependencies installed, you can generate UMAP plots from embeddings.csv and bins.csv:
pip install "remag[plotting]"
python scripts/plot_features.py --features output_directory/embeddings.csv --clusters output_directory/bins.csv --output output_directory
- umap_coordinates.csv: UMAP projections for visualization
- umap_plot.pdf: UMAP visualization plot with cluster assignments
Requirements
- Python 3.9+
- miniprot is required for core gene analysis when installing outside conda packages or the project Docker image
- Plotting extras are optional:
pip install "remag[plotting]"
The package includes a pre-trained HyenaDNA classifier model for eukaryotic contig filtering.
Acknowledgments
The integrated HyenaDNA classifier uses a pre-trained genomic foundation model:
- Repository: HazyResearch/hyena-dna
- Paper: Nguyen E, Poli M, Faizi M, et al. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023.
License
MIT License - see LICENSE file for details.
Citation
If you use REMAG in your research, please cite:
@article {G{\'o}mez-P{\'e}rez2026.03.05.709928,
author = {G{\'o}mez-P{\'e}rez, Daniel and Raguideau, S{\'e}bastien and Warring, Sally and James, Robert and Hildebrand, Falk and Quince, Christopher},
title = {REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning},
elocation-id = {2026.03.05.709928},
year = {2026},
doi = {10.64898/2026.03.05.709928},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2026/03/08/2026.03.05.709928},
eprint = {https://www.biorxiv.org/content/early/2026/03/08/2026.03.05.709928.full.pdf},
journal = {bioRxiv}
}