Recovery of high-quality eukaryotic genomes from complex metagenomes
REMAG
REMAG (REcovery of eukaryotic genomes using contrastive learning) is a specialized metagenomic binning tool for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.
Quick Start
# Install via pip (recommended)
pip install remag
# Or via conda
conda install -c bioconda remag
# Or use Docker
docker run --rm -v "$(pwd)":/data danielzmbp/remag:latest \
-f /data/contigs.fasta -c /data/alignments.bam -o /data/output
# Run REMAG (if installed locally)
remag -f contigs.fasta -c alignments.bam -o output_directory
Installation
From PyPI (recommended)
# Create conda environment (optional but recommended)
conda create -n remag python=3.9
conda activate remag
# Install from PyPI
pip install remag
From conda (bioconda)
# Install directly from bioconda
conda install -c bioconda remag
# Or create a new environment
conda create -n remag -c bioconda remag
conda activate remag
Using Docker
# Pull and run the latest version
docker run --rm -v "$(pwd)":/data danielzmbp/remag:latest \
-f /data/contigs.fasta -c /data/alignments.bam -o /data/output
# Or use a specific version
docker run --rm -v "$(pwd)":/data danielzmbp/remag:0.1.2 \
-f /data/contigs.fasta -c /data/alignments.bam -o /data/output
# For interactive use
docker run -it --rm -v "$(pwd)":/data danielzmbp/remag:latest /bin/bash
From source
# Create and activate conda environment
conda create -n remag python=3.9
conda activate remag
# Clone and install
git clone https://github.com/danielzmbp/remag.git
cd remag
pip install .
Development installation
For contributors and developers:
# Install with development dependencies
pip install -e ".[dev]"
GPU-accelerated installation
For GPU-accelerated clustering (requires NVIDIA GPU):
# Install with RAPIDS support
pip install "remag[gpu]"
Usage
Command line interface
After installation, you can use REMAG via the command line:
remag -f contigs.fasta -c alignments.bam -o output_directory
Python module mode
python -m remag -f contigs.fasta -c alignments.bam -o output_directory
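When REMAG is launched from a workflow script, the same call can be assembled in Python. The sketch below is only an illustration: `build_remag_command` and `run_remag` are hypothetical helpers, not part of REMAG, and they use only the `-f`, `-c`, `-o`, and `-t` flags listed under Options. Because the arguments are passed as a list, a glob pattern such as `*.bam` reaches REMAG unexpanded, which matches the advice to quote glob patterns on the command line.

```python
import shlex
import subprocess


def build_remag_command(fasta, coverage, output_dir, threads=8):
    """Return the argv list for one REMAG run (hypothetical helper, not part of REMAG)."""
    return [
        "remag",
        "-f", str(fasta),
        "-c", str(coverage),  # one path or one glob pattern, passed as a single argument
        "-o", str(output_dir),
        "-t", str(threads),
    ]


def run_remag(cmd):
    """Launch REMAG and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(cmd, check=True)


cmd = build_remag_command("contigs.fasta", "*.bam", "output_directory")
# Show the equivalent shell command; call run_remag(cmd) to actually start the run.
print(shlex.join(cmd))
```

Printing the command before running it makes it easy to log exactly what was executed for each sample set.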
How REMAG Works
REMAG runs a multi-stage pipeline designed specifically for eukaryotic genome recovery:
- Bacterial Pre-filtering: By default, REMAG automatically filters out bacterial contigs using the integrated 4CAC classifier (can be disabled with --skip-bacterial-filter)
- Feature Extraction: Combines k-mer composition (4-mers) with coverage profiles across multiple samples. Large contigs are split into overlapping fragments for augmentation during training
- Contrastive Learning: Trains a Siamese neural network using the Barlow Twins self-supervised loss function. This creates embeddings where fragments from the same contig are close together
- HDBSCAN Clustering: Density-based clustering on the learned contig embeddings to form bins
- Quality Assessment: Uses miniprot to align bins against a database of eukaryotic core genes to detect contamination
- Iterative Refinement: Automatically splits contaminated bins based on core gene duplications to improve bin quality
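The feature-extraction step above can be sketched in a few lines of plain Python. This is a minimal illustration, not REMAG's implementation: the function names, the fragment size, and the step are assumptions, and the sketch counts all 256 4-mers without merging reverse complements, which a real binner may handle differently.

```python
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 256 possible 4-mers
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(seq):
    """Normalized 4-mer frequency vector; windows containing non-ACGT characters are skipped."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def overlapping_fragments(seq, size, step):
    """Split a contig into fixed-size windows; consecutive windows overlap by (size - step) bp."""
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


# Toy contig: every fragment gets its own composition vector.
contig = "ACGTACGTGGCCAATT" * 4
fragments = overlapping_fragments(contig, size=32, step=16)  # illustrative sizes only
profiles = [kmer_profile(f) for f in fragments]
print(len(fragments), len(profiles[0]))
```

Fragments cut from the same contig share a genome of origin, so their composition vectors are natural positive pairs for the contrastive training step.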
Key Features
- Automatic Bacterial Filtering: The 4CAC classifier automatically identifies and removes bacterial sequences before binning
- Multi-Sample Support: Can process coverage information from multiple samples (BAM/CRAM files) simultaneously
- Barlow Twins Loss: Uses a self-supervised contrastive learning approach that doesn't require negative pairs
- Fragment Augmentation: Large contigs are split into multiple overlapping fragments during training to improve representation learning
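To make the "no negative pairs" property concrete, here is a minimal pure-Python sketch of the Barlow Twins objective as described in the original Barlow Twins paper. It is not REMAG's training code: the function names and the off-diagonal weight of 0.005 are illustrative assumptions. The loss takes two batches of embeddings, one per view of the same fragments, standardizes each dimension, and pushes their cross-correlation matrix toward the identity.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale every embedding dimension to zero mean and unit variance over the batch."""
    scaled = []
    for col in zip(*rows):
        mu = fmean(col)
        sigma = pstdev(col) or 1.0  # guard against constant dimensions
        scaled.append([(x - mu) / sigma for x in col])
    return scaled  # list of columns, each of length batch_size


def barlow_twins_loss(view_a, view_b, off_diag_weight=0.005):
    """Sum of (1 - C_ii)^2 on the diagonal plus weighted C_ij^2 off the diagonal."""
    cols_a = standardize_columns(view_a)
    cols_b = standardize_columns(view_b)
    n = len(view_a)
    loss = 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            c_ij = sum(x * y for x, y in zip(col_a, col_b)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # matching dimensions should correlate perfectly
            else:
                loss += off_diag_weight * c_ij ** 2  # different dimensions should be decorrelated
    return loss


# Toy batch of four 2-dimensional embeddings with uncorrelated dimensions.
view = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(barlow_twins_loss(view, view))
```

Only paired views enter the loss; the decorrelation term, rather than explicit negative examples, keeps the embedding from collapsing to a constant.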
Options
-f, --fasta PATH Input FASTA file with contigs to bin. Can be gzipped. [required]
-c, --coverage PATH Coverage files for calculation. Supports BAM, CRAM (indexed), and TSV formats. Auto-detects format by extension. Each file represents one sample. Supports space-separated paths and glob patterns (e.g., "*.bam", "*.cram", "*.tsv"). Use quotes around glob patterns.
-o, --output PATH Output directory for results. [required]
--epochs INTEGER RANGE Training epochs for neural network. [default: 400; 50<=x<=2000]
--batch-size INTEGER RANGE Batch size for training. [default: 2048; 64<=x<=8192]
--embedding-dim INTEGER RANGE Embedding dimension for contrastive learning. [default: 256; 64<=x<=512]
--base-learning-rate FLOAT RANGE
Base learning rate for optimizer. [default: 0.008; 0.00001<=x<=0.1]
--min-cluster-size INTEGER RANGE
Minimum fragments per cluster. [default: 2; 2<=x<=100]
--min-samples INTEGER RANGE Minimum samples for HDBSCAN core points. [default: None; 1<=x<=100]
--cluster-selection-epsilon FLOAT RANGE
Epsilon for HDBSCAN cluster selection. [default: 0.0; 0.0<=x<=1.0]
--min-contig-length INTEGER RANGE
Minimum contig length in bp. [default: 1000; 500<=x<=10000]
--max-positive-pairs INTEGER RANGE
Maximum positive pairs for contrastive learning. [default: 5000000; 100000<=x<=10000000]
-t, --threads INTEGER RANGE Number of CPU threads. [default: 8; 1<=x<=64]
--min-bin-size INTEGER RANGE Minimum bin size in bp. [default: 100000; 50000<=x<=10000000]
-v, --verbose Enable verbose logging.
--skip-bacterial-filter Skip bacterial contig filtering (4CAC classifier + contrastive learning).
--skip-refinement Skip bin refinement.
--skip-kmeans-filtering Skip K-means filtering on embeddings.
--max-refinement-rounds INTEGER RANGE
Maximum refinement rounds. [default: 2; 1<=x<=10]
--num-augmentations INTEGER RANGE
Number of random fragments per contig. [default: 8; 1<=x<=32]
--keep-intermediate Keep intermediate files (training fragments, etc.).
-h, --help Show this message and exit.
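Before submitting a long run, it can be convenient to check planned settings against the ranges printed in the help text above. The sketch below copies a subset of those ranges into a dictionary; `OPTION_RANGES` and `out_of_range` are hypothetical names for illustration and are not part of REMAG.

```python
# Ranges copied from the option list above: name -> (minimum, maximum, default)
OPTION_RANGES = {
    "epochs": (50, 2000, 400),
    "batch_size": (64, 8192, 2048),
    "embedding_dim": (64, 512, 256),
    "base_learning_rate": (0.00001, 0.1, 0.008),
    "min_cluster_size": (2, 100, 2),
    "min_contig_length": (500, 10000, 1000),
    "threads": (1, 64, 8),
    "min_bin_size": (50000, 10000000, 100000),
    "max_refinement_rounds": (1, 10, 2),
    "num_augmentations": (1, 32, 8),
}


def out_of_range(settings):
    """Return a list of readable problems; an empty list means every value is within range."""
    problems = []
    for name, value in settings.items():
        if name not in OPTION_RANGES:
            problems.append(f"{name}: not in the range table")
            continue
        low, high, _default = OPTION_RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems


print(out_of_range({"epochs": 400, "threads": 8}))
print(out_of_range({"epochs": 10}))
```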
Output
REMAG produces several output files:
Core output files (always created):
- bins/: Directory containing FASTA files for each bin
- bins.csv: Final contig-to-bin assignments
- remag.log: Detailed log file
- *_non_bacterial_filtered.fasta: Filtered FASTA file with bacterial contigs removed (when bacterial filtering is enabled)
Additional files (with --keep-intermediate option):
- embeddings.csv: Contig embeddings from the neural network
- umap_embeddings.csv: UMAP projections for visualization
- umap_plot.pdf: UMAP visualization plot with cluster assignments
- siamese_model.pt: Trained Siamese neural network model
- params.json: Complete run parameters for reproducibility
- features.csv: Extracted k-mer and coverage features
- fragments.pkl: Fragment information used during training
- classification_results.csv: 4CAC bacterial classification results
- refinement_summary.json: Summary of the bin refinement process
- kmeans_filtering_stats.json: Statistics from k-means pre-filtering (if enabled)
- core_gene_duplication_results.json: Core gene duplication analysis from refinement
- temp_miniprot/: Temporary directory for miniprot alignments (removed unless --keep-intermediate)
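A quick way to inspect a finished run is to count how many contigs landed in each bin. The sketch below assumes that bins.csv has a header row with a column named `bin`; that column name is a guess, so check the header of your own file and pass the real name as `bin_column`. The example rows are made up and stand in for a real file.

```python
import csv
import io
from collections import Counter


def contigs_per_bin(handle, bin_column="bin"):
    """Count contigs assigned to each bin; `bin_column` is an assumed header name."""
    reader = csv.DictReader(handle)
    return Counter(row[bin_column] for row in reader)


# Made-up rows standing in for a real bins.csv
example = io.StringIO("contig,bin\nc1,bin_0\nc2,bin_0\nc3,bin_1\n")
counts = contigs_per_bin(example)
for name, n in counts.most_common():
    print(name, n)
```

For a real run, replace the `StringIO` object with `open("output_directory/bins.csv", newline="")`.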
Requirements
- Python 3.8+
- PyTorch (≥1.11.0)
- scikit-learn (≥1.0.0)
- XGBoost (≥1.6.0) - for 4CAC classifier
- HDBSCAN (≥0.8.28)
- UMAP (≥0.5.0)
- pandas (≥1.3.0)
- numpy (≥1.21.0)
- matplotlib (≥3.5.0)
- pysam (≥0.18.0)
- loguru (≥0.6.0)
- tqdm (≥4.62.0)
- rich-click (≥1.5.0)
- joblib (≥1.1.0)
The package includes a pre-trained 4CAC classifier model for bacterial contig filtering. The 4CAC classifier code and models are adapted from the Shamir-Lab/4CAC repository.
Acknowledgments
The integrated 4CAC classifier (xgbclass module) is adapted from the work by Shamir Lab:
- Repository: Shamir-Lab/4CAC
- Paper: Pu L, Shamir R. 4CAC: 4-class classifier of metagenome contigs using machine learning and assembly graphs. Nucleic Acids Res. 2024;52(19):e94–e94.
License
MIT License - see LICENSE file for details.
Citation
If you use REMAG in your research, please cite:
@software{gomez_perez_2025_remag,
author = {Gómez-Pérez, Daniel},
title = {REMAG: Recovering high-quality Eukaryotic genomes from complex metagenomes},
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.16443991},
url = {https://doi.org/10.5281/zenodo.16443991}
}
Note: The DOI 10.5281/zenodo.16443991 represents all versions and will always resolve to the latest release. A manuscript describing REMAG is in preparation.