HELEN: an RNN-based assembly polisher that works paired with MarginPolish.

H.E.L.E.N.

H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)


A pre-print describing the methods and giving an overview of the suggested de novo assembly pipeline is now available:

Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit


Overview

HELEN is a polisher intended for human-genome assemblies. It operates on the pileup summaries generated by MarginPolish. MarginPolish uses a probabilistic graphical model to encode read alignments through a draft assembly and find the maximum-likelihood consensus sequence. The graphical model operates in run-length space, which helps reduce errors in homopolymeric regions. MarginPolish can produce tensor-like summaries encapsulating its internal likelihood weights. These weights are assigned to each genomic position over multiple likely outcomes, which makes them suitable for inference by a deep neural network model.
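To make "run-length space" concrete, here is a minimal Python sketch of run-length encoding a DNA string. It only illustrates the idea; the function name `run_length_encode` is hypothetical and this is not MarginPolish's implementation.

```python
from itertools import groupby


def run_length_encode(sequence):
    """Collapse each homopolymer run into a (base, run_length) pair."""
    return [(base, len(list(run))) for base, run in groupby(sequence)]


encoded = run_length_encode("GATTTTACCA")
# The run "TTTT" becomes a single ('T', 4) entry and "CC" becomes ('C', 2).
print(encoded)
# The compressed sequence keeps one symbol per run.
print("".join(base for base, _ in encoded))
```

In this representation a homopolymer length error changes only the run-length value of one entry, not the number of entries, which is why working in run-length space can simplify reasoning about homopolymeric regions.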

HELEN uses a Recurrent-Neural-Network (RNN) based Multi-Task Learning (MTL) model that can predict a base and a run-length for each genomic position using the weights generated by MarginPolish.
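As a rough illustration of how a per-position base and run-length prediction can be turned back into a sequence, the sketch below expands such predictions into plain text. The gap symbol `_`, the function name `expand_predictions`, and the input layout are assumptions made for this example; they are not taken from HELEN's code.

```python
def expand_predictions(bases, run_lengths, gap="_"):
    """Expand per-position (base, run-length) predictions into a sequence.

    Positions predicted as a gap, or with run-length 0, contribute nothing.
    """
    if len(bases) != len(run_lengths):
        raise ValueError("bases and run_lengths must have the same length")
    pieces = []
    for base, length in zip(bases, run_lengths):
        if base == gap or length == 0:
            continue
        pieces.append(base * length)
    return "".join(pieces)


# Seven positions, one of which is predicted as a gap.
print(expand_predictions("GAT_ACA", [1, 1, 4, 0, 1, 2, 1]))
```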

© 2019 Kishwar Shafin, Trevor Pesout, Benedict Paten.
Computational Genomics Lab (CGL), University of California, Santa Cruz.

Why MarginPolish-HELEN?

  • MarginPolish-HELEN outperforms other graph-based and Neural-Network based polishing pipelines.
  • Easily usable via Docker for both GPU and CPU.
  • Highly optimized pipeline that is faster than any other available polishing tool (~4 hours for HELEN).
  • We have sequenced-assembled-polished 11 samples to ensure robustness, runtime-consistency and cost-efficiency.
  • We tested GPU usage on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to ensure scalability.
  • Open source (MIT License).

Walkthrough

A demo walkthrough is available here: demo

Workflow

The workflow is as follows:

  • Generate an assembly with Shasta.
  • Create a mapping between reads and the assembly using Minimap2.
  • Use MarginPolish to generate the images.
  • Use HELEN to generate a polished consensus sequence.

pipeline.svg

Installation

We have docker support for both MarginPolish and HELEN. Users can install MarginPolish and HELEN on Ubuntu 18.04 or any other Linux-based system by following the instructions from our Installation Guide.

If you have installed MarginPolish-HELEN locally, please follow the Local Install Usage Guide.

Usage

MarginPolish requires a draft assembly and a mapping of reads to the draft assembly. We recommend using Shasta as the initial assembler and Minimap2 for the mapping.

Step 1: Generate an initial assembly

Although any assembler can be used to generate the initial assembly, we highly recommend using Shasta.

Please see the quick start documentation to learn how to use Shasta. Shasta is memory-intensive.

For a human size assembly, AWS instance type x1.32xlarge is recommended. It is usually available at a cost around $4/hour on the AWS spot market and should complete the human size assembly in a few hours, at coverage around 60x.

An assembly can be generated by running:

# you may need to convert the fastq to a fasta file
./shasta-Linux-0.1.0 --input <reads.fa> --output <path_to_shasta_output>
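If your reads are in FASTQ and you need FASTA for the assembler, a conversion can be sketched in a few lines of Python. This is a minimal sketch that assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines); the function name `fastq_to_fasta` is hypothetical, and dedicated tools may be faster on large read sets.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip blank lines between or after records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header starting with '@'")
        sequence = fastq_handle.readline()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality string, dropped in FASTA
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence.rstrip("\n") + "\n")
        count += 1
    return count


# Small in-memory demonstration; for real data open the files instead, e.g.
# with open("reads.fq") as src, open("reads.fa", "w") as dst: fastq_to_fasta(src, dst)
demo_in = io.StringIO("@read1\nACGT\n+\nIIII\n")
demo_out = io.StringIO()
fastq_to_fasta(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```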

Step 2: Create an alignment between reads and shasta assembly

We recommend using Minimap2 to generate the mapping between the reads and the assembly.

# we recommend using FASTQ as MarginPolish uses quality values
# This command runs Minimap2 with 32 threads; adjust the number as needed.
minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools sort -@ 32 | samtools view -hb -F 0x104 > reads_2_assembly.bam
samtools index -@32 reads_2_assembly.bam

#  the -F 0x104 flag removes unaligned and secondary sequences
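The mask 0x104 is the bitwise OR of two SAM flag bits: 0x4 (read unmapped) and 0x100 (secondary alignment). The Python sketch below mimics the exclusion rule of `samtools view -F` on integer flags to show which records survive; the helper name `is_kept` is hypothetical.

```python
UNMAPPED = 0x4      # SAM flag bit: read is unmapped
SECONDARY = 0x100   # SAM flag bit: secondary alignment
EXCLUDE_MASK = UNMAPPED | SECONDARY  # equals 0x104


def is_kept(flag, exclude_mask=EXCLUDE_MASK):
    """Keep a record only if none of the bits in exclude_mask are set."""
    return (flag & exclude_mask) == 0


for flag in (0, 16, 4, 256, 272, 2048):
    print(flag, is_kept(flag))
```

Note that supplementary alignments (bit 0x800, flag 2048 above) are not part of this mask, so they are kept.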

Step 3: Generate images using MarginPolish

Run MarginPolish using docker

MarginPolish can be used in a docker container. You can get the image from:

docker pull kishwars/margin_polish:latest
docker run kishwars/margin_polish:latest --help

To generate images with the MarginPolish Docker image, first collect all your input data (shasta_assembly.fa, reads_2_assembly.bam, allParams.np.human.guppy-ff-235.json) in a directory, e.g. </your/data/dir>. Then run:

docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/margin_polish:latest reads_2_assembly.bam \
shasta_assembly.fa \
/opt/MarginPolish/params/<model_name.json> \
-t <number_of_threads> \
-o output/marginpolish_images \
-f

You can get the params.json from path/to/marginpolish/params/allParams.np.human.guppy-ff-235.json.

Step 4: Run HELEN

Download Model

Before running call_consensus.py please download the appropriate model suitable for your data. Please read our model guideline to understand which model to pick.

Get docker images (GPU)

Please install CUDA 10.0 to run the GPU-supported Docker image for HELEN.

sudo apt-get install nvidia-docker2
sudo docker pull kishwars/helen:0.0.1.gpu
sudo nvidia-docker run kishwars/helen:0.0.1.gpu call_consensus.py -h

Run call_consensus.py

Please gather all your data in an input directory. Then run call_consensus.py using the following command:

sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu call_consensus.py \
-i <marginpolish_images> \
-b <batch_size> \
-m <r941_flip235_v001.pkl> \
-o <output_dir/> \
-p <output_filename_prefix> \
-w 0 \
-t 1 \
-g

Arguments:
  -h, --help            show this help message and exit
  -i IMAGE_FILE, --image_file IMAGE_FILE
                        [REQUIRED] Path to a directory where all MarginPolish
                        generated images are.
  -m MODEL_PATH, --model_path MODEL_PATH
                        [REQUIRED] Path to a trained model (pkl file). Please
                        see our github page to see options.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size for testing, default is 512. Please set to
                        512 or 1024 for a balanced execution time.
  -w NUM_WORKERS, --num_workers NUM_WORKERS
                        Number of workers to assign to the dataloader. Should
                        be 0 if using Docker.
  -t THREADS, --threads THREADS
                        Number of PyTorch threads to use, default is 1. This
                        may be helpful during CPU-only inference.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to the output directory.
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the output file. Default is:
                        HELEN_prediction
  -g, --gpu_mode        If set then PyTorch will use GPUs for inference.

Run stitch.py

Finally you can run stitch.py to get a consensus sequence:

sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu \
stitch.py \
-i <output_dir/helen_predictions_XX.hdf> \
-t <number_of_threads> \
-o <output_dir/> \
-p <output_prefix>

Arguments:
  -i INPUT_HDF, --input_hdf INPUT_HDF
                        [REQUIRED] Path to a HDF5 file that was generated
                        using call consensus.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        [REQUIRED] Path to the output directory.
  -t THREADS, --threads THREADS
                        [REQUIRED] Number of threads.
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the output file. Default is: HELEN_consensus

Get docker images (CPU) (not recommended)

If you want to try running the inference on CPU, pull the CPU image:

sudo docker pull kishwars/helen:0.0.1.cpu
sudo docker run kishwars/helen:0.0.1.cpu call_consensus.py -h

Run call_consensus.py (CPU)

Please gather all your data in an input directory. Then run call_consensus.py using the following command:

docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu call_consensus.py \
-i <marginpolish_images> \
-b <batch_size> \
-m <r941_flip235_v001.pkl> \
-o <output_dir/> \
-p <output_filename_prefix> \
-w 0 \
-t <number_of_threads>

Run stitch.py

Finally you can run stitch.py to get a consensus sequence:

docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu stitch.py \
-i <output_dir/helen_predictions_XX.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>

Models

Released models

Changes in the basecaller algorithm can directly affect the outcome of HELEN. We will release trained models for new basecallers as they come out.

Model Name | Release Date | Intended base-caller | Link | Comment
r941_flip231_v001.pkl | 29/05/2019 | Guppy 2.3.1 | Model_link | The model is trained on chr1-6 of CHM13 with Guppy 2.3.1 base called data.
r941_flip233_v001.pkl | 29/05/2019 | Guppy 2.3.3 | Model_link | The model is trained on autosomes of HG002 except chr 20 with Guppy 2.3.3 base called data.
r941_flip235_v001.pkl | 29/05/2019 | Guppy 2.3.5 | Model_link | The model is trained on autosomes of HG002 except chr 20 with Guppy 2.3.5 base called data.
r941_flip305_v001.pkl | 06/11/2019 | Guppy 3.0.5 | Model_link | The model is trained on autosomes of HG002 except chr 20 with Guppy 3.0.5 base called data.

We have seen significant differences in homopolymer base-calls between basecallers. It is important to pick the model that matches your basecaller version for the best polishing results.

Confusion matrix of Guppy 2.3.1 on CHM13 chromosome X: guppy235

Model Schema

HELEN implements a Recurrent-Neural-Network (RNN) based Multi-task learning model with hard parameter sharing. It implements a sliding window method where it slides through the input sequence in chunks. As each input sequence is evaluated independently, it allows HELEN to use mini-batch during training and testing.
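The sliding-window idea can be sketched as a generator of overlapping index ranges over an input of a given length. This is only an illustration: the function name `sliding_windows` and the window and overlap values used below are assumptions, not HELEN's actual settings.

```python
def sliding_windows(length, window_size, overlap):
    """Yield (start, end) pairs that cover range(length) with overlapping windows."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    step = window_size - overlap
    start = 0
    while start < length:
        end = min(start + window_size, length)
        yield start, end
        if end == length:
            break  # the last window reached the end of the input
        start += step


# Each (start, end) chunk can be processed independently and grouped into mini-batches.
print(list(sliding_windows(10, 4, 1)))
```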

pipeline.svg

Runtime and Cost

MarginPolish-HELEN ensures runtime consistency and cost efficiency. We have tested our pipeline on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to ensure scalability.

We studied several samples of 50-60x coverage and created a suggestion framework for running the polishing pipeline. Please be advised that these are cost-optimized suggestions. For better run-time performance you can use more resources.

Google Cloud Platform (GCP)

For MarginPolish please use n1-standard-64 (64 vCPUs, 240GB RAM) instance.
Our estimated run-time is: 12 hours
Estimated cost for MarginPolish: $33

For HELEN, our suggested instance type is:

  • Instance type: n1-standard-32 (32 vCPUs, 120GB RAM)
  • GPUs: 2 x NVIDIA Tesla P100
  • Disk: 2TB SSD
  • Cost: $4.65/hour

The estimated runtime with this instance type is 4 hours.
The estimated cost for HELEN is $28.

Total estimated run-time for polishing: 18 hours.
Total estimated cost for polishing: $61

Amazon Web Services (AWS)

For MarginPolish we recommend c5.18xlarge (72 CPU, 144GiB RAM) instance.
Our estimated run-time is: 12 hours
Estimated cost for MarginPolish: $39

We recommend using p2.8xlarge instance type for HELEN. The configuration is as follows:

  • Instance type: p2.8xlarge (32 vCPUs, 488GB RAM)
  • GPUs: 8 x NVIDIA Tesla K80
  • Disk: 2TB SSD
  • Cost: $7.20/hour
  • Suggested AMI: Deep Learning AMI (Ubuntu) Version 23.0

The estimated runtime with this instance type: 4 hours
The estimated cost for HELEN is: $36

Total estimated run-time for polishing: 16 hours.
Total estimated cost for polishing: $75
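The total cost figures above are the sums of the per-stage estimates. A small Python sketch of that arithmetic, using only the dollar amounts listed in this section (the dictionary layout and function name are hypothetical):

```python
# Per-stage cost estimates in USD, copied from the figures in this section.
STAGE_COSTS_USD = {
    "GCP": {"MarginPolish": 33, "HELEN": 28},
    "AWS": {"MarginPolish": 39, "HELEN": 36},
}


def total_polishing_cost(platform):
    """Sum the MarginPolish and HELEN cost estimates for one platform."""
    return sum(STAGE_COSTS_USD[platform].values())


for platform in STAGE_COSTS_USD:
    print(platform, total_polishing_cost(platform))
```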

Please see our detailed run-time case study documentation for better insight.

We also see a significant improvement in run time over other available polishing algorithms:

pipeline.svg

Results

We compared Medaka and HELEN as polishing pipelines on a Shasta assembly using the assess_assembly module from Pomoxis. A summary of the resulting quality is shown here:

error_rate

We also see that MarginPolish-HELEN performs consistently across multiple assemblers.

Multiple_assembler_error_rate

Eleven high-quality assemblies

We have sequenced, assembled, and polished 11 human genomes at University of California, Santa Cruz with our pipeline. They can be downloaded from our Google bucket.

To download a file, copy its link from the table below and run wget:

wget <link>

The eleven assemblies with their download links:

Sample name Download link
HG00733 HG00733_download_link
HG01109 HG01109_download_link
HG01243 HG01243_download_link
HG02055 HG02055_download_link
HG02080 HG02080_download_link
HG02723 HG02723_download_link
HG03098 HG03098_download_link
HG03492 HG03492_download_link
GM24143 GM24143_download_link
GM24149 GM24149_download_link
GM24385/HG002 GM24385_download_link

We also polished the CHM13 genome assembly available from the Telomere-to-telomere consortium project.
The polished CHM13 assembly is available for download here: CHM13_download_link

Help

Please open a GitHub issue if you face any difficulties.

Acknowledgement

We are thankful to Sergey Koren and Karen Miga for their help with CHM13 data and evaluation.

We downloaded our data from the Telomere-to-telomere consortium to evaluate our pipeline against CHM13.

We acknowledge the work of the developers of these packages:

Fun Fact

The name "HELEN" is inspired by the A.I. created by Tony Stark in Marvel Comics (Earth-616). HELEN was created to control the city Tony was building, named "Troy", making the A.I. "HELEN of Troy".

READ MORE: HELEN

