Recovery of high-quality eukaryotic genomes from complex metagenomes
REMAG
REcovery of eukaryotic genomes using contrastive learning. A specialized metagenomic binning tool designed for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.
Quick Start
Option 1: Using Conda (Recommended - handles all dependencies)
# Create environment and install everything
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
# Run REMAG (output directory optional - defaults to remag_output)
remag contigs.fasta -c alignments.bam
Option 2: Using Docker (No local installation needed)
docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam -o /data/output
Option 3: Using pip
# Create environment first
conda create -n remag python=3.9
conda activate remag
# Install dependencies and REMAG
conda install -c bioconda miniprot
pip install remag
# Run REMAG
remag contigs.fasta -c alignments.bam
Installation
Recommended: Conda Installation
This is the easiest method as conda handles all dependencies automatically:
# Create a new environment with all dependencies
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
# Verify installation
remag --help
Note: miniprot is pulled in automatically as a dependency of the conda package; no separate installation is required when installing remag via conda.
Alternative: PyPI Installation
If you prefer pip, you'll need to install the external dependency separately:
# Step 1: Create and activate environment
conda create -n remag python=3.9
conda activate remag
# Step 2: Install external dependency
conda install -c bioconda miniprot
# Step 3: Install REMAG from PyPI
pip install remag
Advanced Conda Setup
For additional features:
# Basic installation
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
# Add optional plotting capabilities
conda install -c conda-forge matplotlib umap-learn
Using Docker
# Pull and run the latest version (output directory defaults to remag_output)
docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam
# Or specify output directory
docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam -o /data/output
# For interactive use
docker run -it --rm -v $(pwd):/data danielzmbp/remag:latest /bin/bash
Using Singularity
# Pull and run the latest version directly
singularity run docker://danielzmbp/remag:latest \
contigs.fasta -c alignments.bam
# Build Singularity image from Docker Hub
singularity build remag_v0.3.4.sif docker://danielzmbp/remag:v0.3.4
# Or build latest version
singularity build remag_latest.sif docker://danielzmbp/remag:latest
# Run with Singularity
singularity run --bind $(pwd):/data remag_v0.3.4.sif \
/data/contigs.fasta -c /data/alignments.bam
# Or use exec for direct command execution
singularity exec --bind $(pwd):/data remag_v0.3.4.sif \
remag /data/contigs.fasta -c /data/alignments.bam -o /data/output
# For interactive shell
singularity shell --bind $(pwd):/data remag_v0.3.4.sif
# Build a local Singularity image file (optional)
singularity build remag.sif docker://danielzmbp/remag:latest
singularity run remag.sif contigs.fasta -c alignments.bam
From source
# Create and activate conda environment
conda create -n remag python=3.9
conda activate remag
# Clone and install
git clone https://github.com/danielzmbp/remag.git
cd remag
pip install .
Development installation
For contributors and developers:
# Install with development dependencies
pip install -e ".[dev]"
Optional Features Installation
For visualization capabilities:
# Install with plotting dependencies
pip install "remag[plotting]"
Usage
Command line interface
After installation, you can use REMAG via the command line:
# Basic usage (output defaults to remag_output in FASTA directory)
remag contigs.fasta -c alignments.bam
# With explicit output directory
remag contigs.fasta -c alignments.bam -o output_directory
# Multiple samples using glob patterns
remag contigs.fasta -c "samples/*.bam"
# Using explicit -f flag (both styles work)
remag -f contigs.fasta -c alignments.bam
# Keep intermediate files with -k shorthand
remag contigs.fasta -c alignments.bam -k
# Only run eukaryotic filtering (skip binning)
remag contigs.fasta -c alignments.bam --filter-only
# Use single-cell mode (adjusts k-NN and clustering defaults)
remag contigs.fasta -c alignments.bam -m single-cell
Python module mode
python -m remag contigs.fasta -c alignments.bam
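When REMAG is driven from a Python script or workflow, the module entry point above can be wrapped with `subprocess`. The following is a minimal sketch, not part of REMAG's API: the helper names `build_remag_command` and `run_remag` are hypothetical, and only flags documented in this README (`-c`, `-o`, `-t`, `-k`) are used.

```python
import subprocess
import sys


def build_remag_command(fasta, coverage, output=None, threads=None,
                        keep_intermediate=False):
    """Assemble the argument list for `python -m remag` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "remag", str(fasta), "-c", str(coverage)]
    if output is not None:
        cmd += ["-o", str(output)]      # explicit output directory
    if threads is not None:
        cmd += ["-t", str(threads)]     # number of CPU cores
    if keep_intermediate:
        cmd.append("-k")                # keep intermediate files
    return cmd


def run_remag(fasta, coverage, **options):
    """Run REMAG and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_remag_command(fasta, coverage, **options),
                          check=True)


# Show the command that would be executed, without running it.
print(" ".join(build_remag_command("contigs.fasta", "samples/*.bam",
                                   output="output_directory", threads=8)))
```

Because the arguments are passed as a list and no shell is involved, a glob pattern such as `samples/*.bam` reaches REMAG unexpanded, which matches the quoted form used on the command line.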
Getting help
# Quick reference (basic options)
remag -h
# Full documentation (all advanced options)
remag --help
How REMAG Works
REMAG uses a multi-stage pipeline designed specifically for eukaryotic genome recovery:
- Eukaryotic Filtering: By default, REMAG automatically filters for eukaryotic contigs using the integrated HyenaDNA LLM-based classifier (can be disabled with --skip-bacterial-filter)
- Feature Extraction: Combines k-mer composition (4-mers) with coverage profiles across multiple samples. Large contigs are split into overlapping fragments for augmentation during training
- Contrastive Learning: Trains a Siamese neural network using the Barlow Twins self-supervised loss function. This creates embeddings where fragments from the same contig are close together
- Eukaryotic Gene Marker Annotation: Uses miniprot to annotate contigs with eukaryotic single-copy core genes, providing the quality metrics needed for clustering decisions
- Greedy Clustering: Iteratively extracts bins using a greedy Leiden approach -- at each step, tests multiple Leiden resolutions on the remaining contigs, selects the single best-quality cluster (by F1 score of completeness vs. contamination), removes it from the graph, and repeats
- Bin Rescue: Merges fragmented bins into larger bins based on embedding similarity and single-copy gene safety, and rescues unbinned contigs into matching bins
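The greedy clustering step can be illustrated with a small, self-contained sketch. It is not REMAG's implementation and rests on several assumptions: `cluster_fn` is a stand-in for a Leiden run at a given resolution, completeness is the fraction of single-copy marker genes present, contamination is the fraction of marker hits that are extra copies, and the F1 score combines completeness with purity (1 - contamination). All function names and the `min_f1` threshold are hypothetical.

```python
from collections import Counter


def bin_quality(contigs, markers_by_contig, n_markers):
    """Return (completeness, contamination, f1) for one candidate bin."""
    counts = Counter(m for c in contigs for m in markers_by_contig.get(c, ()))
    if not counts:
        return 0.0, 0.0, 0.0
    completeness = len(counts) / n_markers
    # Assumption: every extra copy of a single-copy marker is contamination.
    contamination = sum(n - 1 for n in counts.values()) / sum(counts.values())
    purity = 1.0 - contamination
    f1 = 2 * completeness * purity / (completeness + purity)
    return completeness, contamination, f1


def greedy_extract(contigs, cluster_fn, resolutions, markers_by_contig,
                   n_markers, min_f1=0.5):
    """Repeatedly pick the best-scoring cluster over all resolutions."""
    remaining = set(contigs)
    bins = []
    while remaining:
        best, best_f1 = None, min_f1
        for res in resolutions:
            for cluster in cluster_fn(remaining, res):
                f1 = bin_quality(cluster, markers_by_contig, n_markers)[2]
                if f1 > best_f1:
                    best, best_f1 = set(cluster), f1
        if best is None:          # no cluster passes the threshold: stop
            break
        bins.append(best)
        remaining -= best         # remove the extracted bin and repeat
    return bins


# Toy data: two genomes, each carrying both markers exactly once.
GROUPS = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
MARKERS = {"a": {"m1"}, "b": {"m2"}, "c": {"m1"}, "d": {"m2"}}


def toy_cluster(remaining, resolution):
    """Stand-in for Leiden: low resolution gives one cluster, high splits by group."""
    if resolution < 1.0:
        return [set(remaining)]
    grouped = {}
    for contig in sorted(remaining):
        grouped.setdefault(GROUPS[contig], set()).add(contig)
    return [grouped[key] for key in sorted(grouped)]


bins = greedy_extract(GROUPS, toy_cluster, [0.5, 1.0], MARKERS, n_markers=2)
print(bins)
```

In the toy data, the single low-resolution cluster contains both markers twice, so it scores lower than either high-resolution cluster. This is why different bins can end up being taken from different resolutions.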
Key Features
- Automatic Eukaryotic Filtering: The HyenaDNA classifier uses a pre-trained genomic foundation model to identify and retain eukaryotic sequences
- Multi-Sample Support: Can process coverage information from multiple samples (BAM/CRAM files) simultaneously
- Greedy Multi-Resolution Clustering: Iteratively extracts bins by testing multiple Leiden resolutions at each step, allowing different bins to use different resolutions for optimal quality
- Barlow Twins Loss: Uses a self-supervised contrastive learning approach that doesn't require negative pairs
- Fragment Augmentation: Large contigs are split into multiple overlapping fragments during training to improve representation learning
- Bin Rescue: Merges fragmented bins and rescues unbinned contigs into existing bins based on embedding similarity and single-copy gene safety
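To make the "no negative pairs" property concrete, the sketch below computes the Barlow Twins objective in plain Python for two batches of embeddings, where row i of each batch is a different view (for example, a different fragment) of the same contig. It follows the published formulation of the loss and is not REMAG's training code; the weight `lambd=5e-3` is the default from the Barlow Twins paper, and the value REMAG uses is not stated here.

```python
import math


def standardize(columns):
    """Scale each embedding dimension to zero mean and unit variance over the batch."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col)) or 1.0
        scaled.append([(x - mean) / std for x in col])
    return scaled


def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """z_a, z_b: lists of N embeddings (length D each) for two views of the same items."""
    n = len(z_a)
    cols_a = standardize(list(zip(*z_a)))   # D columns of length N
    cols_b = standardize(list(zip(*z_b)))
    on_diag = off_diag = 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            # Entry (i, j) of the cross-correlation matrix between the views.
            c_ij = sum(x * y for x, y in zip(col_a, col_b)) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2   # invariance: pull diagonal to 1
            else:
                off_diag += c_ij ** 2          # redundancy reduction: push to 0
    return on_diag + lambd * off_diag


views = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(barlow_twins_loss(views, views))
```

The loss only compares matching rows of the two views through the cross-correlation matrix, so no explicit negative examples are sampled.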
Options
Use remag -h for quick reference or remag --help for full documentation.
Essential Options
FASTA_ARG Input FASTA file (positional argument). Can also use -f/--fasta
-f, --fasta PATH Input FASTA file with contigs to bin. Can be gzipped.
-c, --coverage PATH Coverage files for calculation. Supports BAM, CRAM (indexed), and TSV formats.
Auto-detects format by extension. Supports space-separated paths and glob patterns
(e.g., "*.bam", "*.cram", "*.tsv"). Use quotes around glob patterns.
-o, --output PATH Output directory for results. [default: remag_output in FASTA directory]
-t, --threads INTEGER Number of CPU cores to use for parallel processing. [default: 8]
-v, --verbose Enable verbose logging.
-k, --keep-intermediate Keep intermediate files (embeddings, features, model, etc.).
-h, --help Show quick reference or full help.
Advanced Options
For the complete list of advanced options (neural network parameters, clustering settings, refinement options, etc.), run:
remag --help
Output
REMAG produces several output files:
Core output files (always created):
- bins/: Directory containing FASTA files for each bin
- bins.csv: Final contig-to-bin assignments
- embeddings.csv: Contig embeddings from the neural network
- remag.log: Detailed log file
- *_eukaryotic_filtered.fasta: Filtered FASTA file with only eukaryotic contigs retained (when eukaryotic filtering is enabled)
Additional files (with -k / --keep-intermediate option):
- siamese_model.pt: Trained Siamese neural network model
- kmer_embeddings.csv: K-mer encoder embeddings (before fusion)
- coverage_embeddings.csv: Coverage encoder embeddings (before fusion)
- params.json: Complete run parameters for reproducibility
- features.csv: Extracted k-mer and coverage features
- fragments.pkl: Fragment information used during training
- *_hyenadna_classification.tsv: HyenaDNA eukaryotic classification results (tab-separated)
- gene_contig_mappings.json: Cached gene-to-contig mappings for faster processing
- core_gene_duplication_results.json: Core gene duplication analysis
- chimera_detection_results.json: Chimera detection results for large contigs
- knn_graph_edges.csv: k-NN graph edge list used for Leiden clustering
- knn_graph_stats.json: k-NN graph construction statistics
- temp_miniprot/: Temporary directory for miniprot alignments (removed unless --keep-intermediate)
Visualization (optional, requires plotting dependencies):
To generate UMAP visualization plots:
# Install plotting dependencies if not already installed
pip install "remag[plotting]"
# Generate UMAP visualization from embeddings
python scripts/plot_features.py --features output_directory/embeddings.csv --clusters output_directory/bins.csv --output output_directory
This creates:
- umap_coordinates.csv: UMAP projections for visualization
- umap_plot.pdf: UMAP visualization plot with cluster assignments
Requirements
Core dependencies (always installed):
- Python 3.9+
- PyTorch (≥1.11.0)
- einops (≥0.6.0) - for HyenaDNA model operations
- scikit-learn (≥1.0.0)
- leidenalg (≥0.9.0) - for graph-based clustering
- igraph (≥0.10.0) - for graph construction in Leiden clustering
- pandas (≥1.3.0)
- numpy (≥1.21.0)
- scipy (≥1.6.0)
- pysam (≥0.18.0)
- loguru (≥0.6.0)
- tqdm (≥4.62.0)
- rich-click (≥1.5.0)
External dependencies (must be installed separately for pip or source installs; the conda package pulls them in automatically):
- miniprot - Required for core gene analysis and quality assessment
- Install with:
conda install -c bioconda miniprot
Optional dependencies:
- For visualization: matplotlib (≥3.5.0), umap-learn (≥0.5.0)
- Install with:
pip install "remag[plotting]"
The package includes a pre-trained HyenaDNA classifier model for eukaryotic contig filtering. The HyenaDNA model is a genomic foundation model based on the Hyena operator architecture.
Acknowledgments
The integrated HyenaDNA classifier uses a pre-trained genomic foundation model:
- Repository: HazyResearch/hyena-dna
- Paper: Nguyen E, Poli M, Faizi M, et al. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023.
License
MIT License - see LICENSE file for details.
Citation
If you use REMAG in your research, please cite:
@software{gomez_perez_2025_remag,
author = {Gómez-Pérez, Daniel},
title = {REMAG: Recovering high-quality Eukaryotic genomes from complex metagenomes},
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.16443991},
url = {https://doi.org/10.5281/zenodo.16443991}
}
Note: The DOI 10.5281/zenodo.16443991 represents all versions and will always resolve to the latest release. A manuscript describing REMAG is in preparation.