Increasing the quality of metagenome-assembled genomes with deep learning
Introduction
ResMiCo is a deep learning model capable of detecting metagenome assembly errors. ResMiCo's input is summary data derived from re-aligning reads against the putative genomes. ResMiCo's output is a number between 0 and 1 representing the likelihood that a particular genome was misassembled.
The tool is divided into two main parts:
- ResMiCo-SM
  - A snakemake pipeline for:
    - generating train/test datasets from reference genomes
    - creating feature tables from real-world assemblies
      - input: >=1 fasta of contigs, along with associated Illumina paired-end reads
  - See the ResMiCo-SM README
- ResMiCo (DL)
  - A python package for misassembly detection via deep learning
Citation
If using ResMiCo in your work, please cite: ResMiCo: increasing the quality of metagenome-assembled genomes with deep learning
Installation
Currently, please use `pip` to install ResMiCo, but install the dependencies via mamba (or conda):
mamba env create -n resmico_env -f $RESMICO_BASE_DIR/environment.yml
mamba activate resmico_env
pip install resmico
WARNING: the resmico bioconda recipe is currently set to an old version of resmico. That old version does not match the current user interface (e.g., it lacks `resmico bam2feat`). So, we do not recommend using the bioconda recipe for installing resmico at this time.
Running the ResMiCo package tests
Install `pytest` and `pytest-console-scripts`. For example:
mamba install pytest pytest-console-scripts
Run the tests with `pytest`:
pytest -s --hide-run-results --script-launch-mode=subprocess ./resmico/tests/
General usage
ResMiCo-SM snakemake pipeline
Use ResMiCo-SM to create feature files from real data or to simulate new data.
See the ResMiCo-SM README
Note: `resmico bam2feat` can also be used to create feature tables from real data: contig fasta files & associated BAM files (mapped reads).
ResMiCo package
Main interface: resmico -h
Note: Although ResMiCo can be run on a CPU, it is orders of magnitude slower than on a GPU, so we only recommend running on CPUs for testing.
Creating feature tables
See resmico bam2feat -h
Predicting with existing model
See resmico evaluate -h
Filtering out misassembled contigs
See resmico filter -h
Training a new model
See resmico train -h
Example 1: predicting misassemblies with the "default" model
If you already have metagenome reads mapped to your contigs, you can process your own data much like in this example.
The model was trained with data produced via mapping Illumina paired-end reads with Bowtie2.
Working directory
mkdir example1 && cd example1
Get the example dataset
The dataset consists of a few UHGG genomes (MAGs) and associated BAM files. The BAM files were generated by using Bowtie2 to map the associated metagenome paired-end reads (from which the MAGs were assembled) to the MAG contigs.
So, the input consists of fasta files (contigs) and BAM files (mapped reads).
A simple tab-delimited table is used to map the fasta & BAM files.
Map file format:
* A tab-delimited table with the columns (any order is allowed):
* `Taxon` => name associated with the fasta file of contigs
* `Fasta` => path to the fasta file of contigs
* `Sample` => name associated with the BAM file
* `BAM` => path to the BAM file of reads mapped to the contigs in `Fasta`
See the `map.tsv` file for an example.
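If you are building a map file for your own data, the format above can be produced with a few lines of Python. This is a minimal sketch, not part of ResMiCo: the column names come from the format description, while the function name `write_map_file` and every taxon, sample, and file name below are hypothetical placeholders.

```python
import csv
from pathlib import Path

# Column names taken from the map file format described above.
MAP_COLUMNS = ["Taxon", "Fasta", "Sample", "BAM"]


def write_map_file(rows, out_path):
    """Write a tab-delimited map file; each row is a dict keyed by MAP_COLUMNS."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=MAP_COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in rows:
            # Refuse rows with an empty or absent column value.
            missing = [col for col in MAP_COLUMNS if not row.get(col)]
            if missing:
                raise ValueError(f"row is missing columns: {missing}")
            writer.writerow(row)
    return Path(out_path)


# Hypothetical taxon, sample, and file names, for illustration only.
example_rows = [
    {"Taxon": "MAG_A", "Fasta": "contigs/MAG_A.fna.gz",
     "Sample": "sample1", "BAM": "bam/MAG_A_sample1.bam"},
    {"Taxon": "MAG_B", "Fasta": "contigs/MAG_B.fna.gz",
     "Sample": "sample1", "BAM": "bam/MAG_B_sample1.bam"},
]
```

Calling `write_map_file(example_rows, "map.tsv")` writes a header line followed by one line per fasta/BAM pair.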
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_bam2feat.tar.gz
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_bam2feat.md5
md5sum --check UHGG-n9_bam2feat.md5
tar -pzxvf UHGG-n9_bam2feat.tar.gz && rm -f UHGG-n9_bam2feat.*
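On systems without `md5sum`, the checksum step can be approximated in Python. The sketch below assumes the `.md5` file uses the usual `md5sum` line format (`<digest>  <filename>`), which is what `md5sum --check` expects; the helper names are hypothetical and not part of ResMiCo.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_file(path):
    """Parse md5sum-style lines ('<digest>  <filename>') into {filename: digest}."""
    expected = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                digest, name = line.split(maxsplit=1)
                # A leading '*' marks binary mode in md5sum output; drop it.
                expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(md5_path):
    """Return {filename: True/False} for every entry in the checksum file.

    File names are resolved relative to the current working directory,
    as md5sum does.
    """
    return {
        name: md5_of_file(name) == digest
        for name, digest in parse_md5_file(md5_path).items()
    }
```

For the download above, `verify("UHGG-n9_bam2feat.md5")` should be run before the archive and checksum file are removed.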
Convert BAM files to feature tables
Create a feature table for each sorted BAM file:
resmico bam2feat --outdir features UHGG-n9_bam2feat/map.tsv
Note: the parameters are the same as used for creating the "default" model from Mineeva et al., 2022, which is critical for getting accurate predictions.
Predict misassemblies
resmico evaluate \
--min-avg-coverage 0.01 \
--save-path predictions \
--save-name default-model \
--feature-files-path features
Note: `--min-avg-coverage` is set to "0.01" here due to the abnormally low coverage in these small example BAM files. DO NOT use such a low coverage cutoff with real data.
Filter contigs
Filter out contigs predicted to be misassembled
resmico filter \
--outdir filtered predictions/default-model.csv \
UHGG-n9_bam2feat/*.fna.gz
Note: change the `--score-cutoff` parameter to alter the number of contigs filtered.
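To get a feel for how a cutoff partitions a set of prediction scores before choosing a value, the following sketch counts contigs on each side of several cutoffs. It does not call `resmico` or read its output files; the contig names and scores are made up, and `split_by_cutoff` is a hypothetical helper. Which side of the cutoff `resmico filter` removes is not shown here; check `resmico filter -h`.

```python
def split_by_cutoff(scores, cutoff):
    """Split {contig: score} into (below, at_or_above) lists of contig names."""
    below = [name for name, score in scores.items() if score < cutoff]
    at_or_above = [name for name, score in scores.items() if score >= cutoff]
    return below, at_or_above


# Hypothetical contig names and scores in [0, 1], for illustration only.
example_scores = {
    "contig_1": 0.02,
    "contig_2": 0.35,
    "contig_3": 0.81,
    "contig_4": 0.97,
}

for cutoff in (0.2, 0.5, 0.9):
    below, at_or_above = split_by_cutoff(example_scores, cutoff)
    print(f"cutoff={cutoff}: {len(below)} below, {len(at_or_above)} at or above")
```

Raising the cutoff moves contigs from the "at or above" group into the "below" group, which is why the number of filtered contigs changes with `--score-cutoff`.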
Example 2: Training & using a new model
Working directory
mkdir example2 && cd example2
Get the example dataset
Training data: simulated from 10 genomes in the GTDB
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/genomes-n10_features.tar.gz
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/genomes-n10_features.md5
md5sum --check genomes-n10_features.md5
tar -pzxvf genomes-n10_features.tar.gz && rm -f genomes-n10_features.*
Test data: simulated from 9 genomes in the UHGG
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_features.tar.gz
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_features.md5
md5sum --check UHGG-n9_features.md5
tar -pzxvf UHGG-n9_features.tar.gz && rm -f UHGG-n9_features.*
Filter out contigs predicted to be misassembled
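Filter out contigs with prediction scores below a specific cutoff. This step reads predictions/default-model.csv, which is produced by the 'Predict using the "default" model' step further down.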
resmico filter \
--outdir filtered-contigs \
predictions/default-model.csv \
UHGG-n9_features/fasta/*fna.gz
You may need to adjust the `--score-cutoff` in order to filter some contigs.
Training on the example train data
Train a new model with the example train dataset.
resmico train --log-progress --n-procs 4 --n-epochs 2 \
--save-path model-n10 --stats-file='' \
--save-name genomes-n10 \
--feature-files-path genomes-n10_features
Predict using the "default" model
Predict on the example test data using the "default" resmico model from the Mineeva et al., 2022 manuscript. This provides a comparison to our newly trained model.
resmico evaluate --n-procs 4 \
--save-path predictions \
--save-name default-model \
--feature-files-path UHGG-n9_features/
Tutorials
See the wiki
Notes
Benchmarking
Model evaluation
Benchmarking `resmico evaluate` on the CAMI2-gut dataset:
- One GPU (NVIDIA RTX A5000): 108 +/- 0.7 contigs per second
- One CPU (AMD Epyc): 38.7 +/- 10.3 contigs per second
CAMI2-gut metagenome assembly stats:
- No. of metagenomes: 10
- No. of contigs per sample (1000's): 18 +/- 6.4
- Avg. contig length (kbp): 4.1 +/- 0.9
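The throughput figures above can be turned into a rough runtime estimate for a dataset of similar size. The sketch below uses only the mean values reported in this section (the +/- ranges are ignored) and assumes throughput scales linearly with the number of contigs, which may not hold for contigs of very different lengths. The function and constant names are hypothetical.

```python
def estimated_seconds(n_contigs, contigs_per_second):
    """Estimate wall-clock seconds, assuming constant throughput."""
    if contigs_per_second <= 0:
        raise ValueError("contigs_per_second must be positive")
    return n_contigs / contigs_per_second


# Mean values from the benchmark above.
GPU_RATE = 108.0             # contigs per second, one NVIDIA RTX A5000
CPU_RATE = 38.7              # contigs per second, one AMD Epyc CPU
CONTIGS_PER_SAMPLE = 18_000  # mean contigs per sample
N_SAMPLES = 10               # number of metagenomes

total_contigs = CONTIGS_PER_SAMPLE * N_SAMPLES
gpu_minutes = estimated_seconds(total_contigs, GPU_RATE) / 60
cpu_minutes = estimated_seconds(total_contigs, CPU_RATE) / 60
print(f"GPU: ~{gpu_minutes:.0f} min, CPU: ~{cpu_minutes:.0f} min")
```

With these means, the 180,000 contigs of the full dataset correspond to roughly 28 minutes on the GPU and roughly 78 minutes on the CPU.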
Training
We highly recommend using multiple GPUs for model training on large datasets, as done in the ResMiCo paper. Training on CPUs with such large datasets is not feasible.