Increasing the quality of metagenome-assembled genomes with deep learning

ResMiCo

Introduction

ResMiCo is a deep learning model capable of detecting metagenome assembly errors. ResMiCo's input is summary data derived from re-aligning reads against the putative genomes. ResMiCo's output is a number between 0 and 1 representing the likelihood that a particular contig was misassembled.

The tool is divided into two main parts:

  • ResMiCo-SM
    • A snakemake pipeline for:
      • creating feature tables from real-world assemblies
        • input: >=1 fasta of contigs, along with associated Illumina paired-end reads
      • generating train/test datasets from reference genomes
    • See the ResMiCo-SM README
  • ResMiCo (DL)
    • A python package for misassembly detection via deep learning

Citation

If using ResMiCo in your work, please cite: ResMiCo: increasing the quality of metagenome-assembled genomes with deep learning

Installation

For now, install ResMiCo itself with pip, but install its dependencies with mamba (or conda):

mamba env create -n resmico_env -f $RESMICO_BASE_DIR/environment.yml
mamba activate resmico_env
pip install resmico

WARNING: the resmico bioconda recipe is currently set to an old version of resmico. That old version does not match the current user interface (e.g., lacks resmico bam2feat). So, we do not recommend using the bioconda recipe for installing resmico at this time.

Running the ResMiCo package tests

Install pytest and pytest-console-scripts. For example:

mamba install pytest pytest-console-scripts

Run tests with pytest

pytest -s --hide-run-results --script-launch-mode=subprocess ./resmico/tests/

General usage

ResMiCo-SM snakemake pipeline

Use ResMiCo-SM to create feature files from real data or to simulate new data.

See the ResMiCo-SM README

Note: resmico bam2feat can also be used to create feature tables from real data, given contig fasta files and the associated BAM files (mapped reads).

ResMiCo package

Main interface: resmico -h

Note: Although ResMiCo can be run on a CPU, it is orders of magnitude slower than on a GPU, so we only recommend running on CPUs for testing.

Creating feature tables

See resmico bam2feat -h

Predicting with existing model

See resmico evaluate -h

Filtering out misassembled contigs

See resmico filter -h

Training a new model

See resmico train -h

Example 1: predicting misassemblies with the "default" model

If you already have metagenome reads mapped to your contigs, you can process your own data much like in this example.

The model was trained with data produced via mapping Illumina paired-end reads with Bowtie2.

Working directory

mkdir example1 && cd example1

Get the example dataset

The dataset consists of a few UHGG genomes (MAGs) and associated BAM files. The BAM files were generated by using Bowtie2 to map the associated metagenome paired-end reads (from which the MAGs were assembled) to the MAG contigs.

So, the input consists of fasta files (contigs) and BAM files (mapped reads).

A simple tab-delimited table is used to map the fasta & BAM files.

Map file format:

  • A tab-delimited table with the following columns (any order is allowed):
    • Taxon => name associated with the fasta file of contigs
    • Fasta => path to the fasta file of contigs
    • Sample => name associated with the BAM file
    • BAM => path to the BAM file of reads mapped to the contigs in Fasta

See the map.tsv file for an example.
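As a quick sanity check before running resmico bam2feat, the map file format described above can be validated with a few lines of Python. This is a minimal sketch and not part of ResMiCo: the function name read_map_table and the example rows are hypothetical, and only the four column names come from the format description.

```python
import csv
import io

# Column names taken from the map file format; order does not matter.
REQUIRED_COLUMNS = {"Taxon", "Fasta", "Sample", "BAM"}


def read_map_table(handle):
    """Parse a tab-delimited map table and return its rows as dicts.

    Raises ValueError if any required column is missing.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"map file is missing columns: {sorted(missing)}")
    return list(reader)


# Hypothetical content; the column order differs from the list above on purpose.
example = io.StringIO(
    "Sample\tBAM\tTaxon\tFasta\n"
    "sampleA\tbam/sampleA.bam\tgenome1\tfasta/genome1.fna.gz\n"
)
rows = read_map_table(example)
print(rows[0]["Taxon"], rows[0]["BAM"])
```

To check a real file, pass an open file handle instead of the in-memory example, for instance open("UHGG-n9_bam2feat/map.tsv", newline="").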

wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_bam2feat.tar.gz
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_bam2feat.md5
md5sum --check UHGG-n9_bam2feat.md5
tar -pzxvf UHGG-n9_bam2feat.tar.gz && rm -f UHGG-n9_bam2feat.*
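On systems without md5sum, the checksum step can be approximated in Python. The sketch below assumes the .md5 file uses the usual md5sum line layout (hex digest, whitespace, file name); the helper names md5_of_file and check_md5_line are hypothetical, and the demo uses a temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_line(line, base_dir="."):
    """Check one '<digest>  <filename>' line (assumed md5sum layout)."""
    expected, name = line.split(maxsplit=1)
    name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
    return md5_of_file(os.path.join(base_dir, name)) == expected.lower()


# Demo on a throwaway file; the digest below is the MD5 of b"hello".
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.tar.gz"), "wb") as fh:
        fh.write(b"hello")
    demo_ok = check_md5_line(
        "5d41402abc4b2a76b9719d911017c592  demo.tar.gz", base_dir=tmp
    )
print(demo_ok)
```

Run the check before extracting and deleting the archive, since the rm step above removes both the archive and the .md5 file.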

Convert BAM files to feature tables

Create a feature table for each sorted BAM file:

resmico bam2feat --outdir features UHGG-n9_bam2feat/map.tsv

Note: the parameters are the same as used for creating the "default" model from Mineeva et al., 2022, which is critical for getting accurate predictions.

Predict misassemblies

resmico evaluate \
  --min-avg-coverage 0.01 \
  --save-path predictions \
  --save-name default-model \
  --feature-files-path features

Note: --min-avg-coverage is set to "0.01" here due to the abnormally low coverage in these small example BAM files. DO NOT use such a low coverage cutoff with real data.

Filter contigs

Filter out contigs predicted to be misassembled

resmico filter \
  --outdir filtered \
  predictions/default-model.csv \
  UHGG-n9_bam2feat/*.fna.gz

Note: change the --score-cutoff parameter to alter the number of contigs filtered.
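To get a feel for how the cutoff changes the number of filtered contigs, the idea can be sketched in Python. This is an illustration only, not the resmico filter implementation: the scores are made up, the function split_by_score is hypothetical, and the sketch assumes that a contig is treated as misassembled when its score is at or above the cutoff.

```python
def split_by_score(scores, cutoff):
    """Split {contig_name: score} into (kept, removed) name lists.

    Assumption: scores at or above the cutoff count as misassembled.
    """
    kept = [name for name, score in scores.items() if score < cutoff]
    removed = [name for name, score in scores.items() if score >= cutoff]
    return kept, removed


# Hypothetical prediction scores between 0 and 1.
scores = {"contig_1": 0.02, "contig_2": 0.55, "contig_3": 0.97}

for cutoff in (0.5, 0.9):
    kept, removed = split_by_score(scores, cutoff)
    print(f"cutoff={cutoff}: kept {len(kept)}, removed {len(removed)}")
```

Under that assumption, a higher cutoff removes fewer contigs, so it is the more conservative choice when false positives are costly. See resmico filter -h for the actual behavior of --score-cutoff.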

Example 2: Training & using a new model

Working directory

mkdir example2 && cd example2

Get the example dataset

Training data: simulated from 10 genomes in the GTDB

wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/genomes-n10_features.tar.gz
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/genomes-n10_features.md5
md5sum --check genomes-n10_features.md5
tar -pzxvf genomes-n10_features.tar.gz && rm -f genomes-n10_features.*

Test data: simulated from 9 genomes in the UHGG

wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_features.tar.gz
wget http://ftp.tue.mpg.de/ebio/projects/ResMiCo/UHGG-n9_features.md5
md5sum --check UHGG-n9_features.md5
tar -pzxvf UHGG-n9_features.tar.gz && rm -f UHGG-n9_features.*

Filter out contigs predicted to be misassembled

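Filter out contigs with prediction scores below a specific cutoff. This command reads predictions/default-model.csv, which is written by the "Predict using the "default" model" step below, so run that step first.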

resmico filter \
  --outdir filtered-contigs \
  predictions/default-model.csv \
  UHGG-n9_features/fasta/*fna.gz

Note: you may need to adjust the --score-cutoff parameter for any contigs to be filtered.

Training on the example train data

Train a new model with the example train dataset.

resmico train --log-progress --n-procs 4 --n-epochs 2 \
  --save-path model-n10 --stats-file='' \
  --save-name genomes-n10 \
  --feature-files-path genomes-n10_features

Predict using the "default" model

Predict on the example test data with the "default" ResMiCo model from the Mineeva et al., 2022 manuscript. This provides a comparison to our newly trained model.

resmico evaluate --n-procs 4 \
  --save-path predictions \
  --save-name default-model \
  --feature-files-path UHGG-n9_features/

Tutorials

See the wiki

Notes

Benchmarking

Model evaluation

Benchmarking resmico evaluate on the CAMI2-gut dataset:

  • One GPU (NVIDIA RTX A5000): 108 +/- 0.7 contigs per second
  • One CPU (AMD Epyc): 38.7 +/- 10.3 contigs per second

CAMI2-gut metagenome assembly stats:

  • No. of metagenomes: 10
  • No. of contigs per sample (1000's): 18 +/- 6.4
  • Avg. contig length (kbp): 4.1 +/- 0.9
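The throughput and assembly numbers above can be combined into a rough runtime estimate. The sketch below is back-of-the-envelope arithmetic using only the mean values reported here; it ignores the reported spread and any I/O or model-loading time, and the function name seconds_needed is hypothetical.

```python
# Mean values from the benchmark and assembly stats above.
CONTIGS_PER_SAMPLE = 18_000
N_SAMPLES = 10
RATES = {  # contigs per second
    "GPU (NVIDIA RTX A5000)": 108.0,
    "CPU (AMD Epyc)": 38.7,
}


def seconds_needed(n_contigs, contigs_per_second):
    """Estimated prediction time in seconds at a constant throughput."""
    return n_contigs / contigs_per_second


for device, rate in RATES.items():
    per_sample = seconds_needed(CONTIGS_PER_SAMPLE, rate)
    total_min = per_sample * N_SAMPLES / 60
    print(f"{device}: ~{per_sample:.0f} s per sample, ~{total_min:.1f} min for all samples")
```

Swap in the contig count of your own assembly to estimate how long resmico evaluate might take on comparable hardware.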

Training

We highly recommend using multiple GPUs for model training on large datasets, as done in the ResMiCo paper. Training on CPUs with such large datasets is not feasible.
