HELEN, an RNN-based assembly polisher. It works paired with MarginPolish.
Project description
H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)
A pre-print describing the methods and giving an overview of the suggested de novo assembly pipeline is now available:
Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit
Overview
HELEN is a polisher intended for polishing human genome assemblies. HELEN operates on the pileup summary generated by MarginPolish. MarginPolish uses a probabilistic graphical model to encode read alignments through a draft assembly and find the maximum-likelihood consensus sequence. The graphical model operates in run-length space, which helps reduce errors in homopolymeric regions. MarginPolish can produce tensor-like summaries encapsulating its internal likelihood weights; the weights are assigned to each genomic position over multiple likely outcomes, making them suitable for inference by a deep neural network. HELEN uses a recurrent neural network (RNN) based multi-task learning (MTL) model that predicts a base and a run length for each genomic position from the weights generated by MarginPolish.
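To make "run-length space" concrete, here is a minimal Python sketch (illustrative only, not HELEN's actual encoding; the function name is ours) that run-length encodes a DNA string. Each homopolymer contributes one base symbol plus an integer length, so a homopolymer-length error becomes a single mis-predicted integer rather than a run of inserted or deleted bases.

from itertools import groupby

def run_length_encode(seq):
    # Collapse each homopolymer run to (base, run length).
    runs = [(base, len(list(group))) for base, group in groupby(seq)]
    return "".join(base for base, _ in runs), [length for _, length in runs]

print(run_length_encode("AAACCGTTT"))  # -> ('ACGT', [3, 2, 1, 3])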
© 2019 Kishwar Shafin, Trevor Pesout, Benedict Paten.
Computational Genomics Lab (CGL), University of California, Santa Cruz.
Why MarginPolish-HELEN?
- MarginPolish-HELEN outperforms other graph-based and neural-network-based polishing pipelines.
- Easily usable via Docker for both GPU and CPU.
- Highly optimized pipeline that is faster than other available polishing tools (~4 hours for HELEN).
- We have sequenced, assembled, and polished 11 samples to ensure robustness, runtime consistency, and cost efficiency.
- We tested GPU usage on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to ensure scalability.
- Open source (MIT License).
Walkthrough
A demo walkthrough is available here: demo
Table of contents
- Workflow
- Installation
- Usage
- Models
- Runtime and Cost
- Results
- Eleven high-quality assemblies
- Help
- Acknowledgement
Workflow
The workflow is as follows (an end-to-end command sketch appears after this list):
- Generate an assembly with Shasta.
- Create a mapping between reads and the assembly using Minimap2.
- Use MarginPolish to generate the images.
- Use HELEN to generate a polished consensus sequence.
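As a glanceable summary, here is a hedged Python sketch tying the four steps together, assuming local installs of Shasta, minimap2, samtools, MarginPolish, and HELEN. Binary names, file names, and the params/model paths are placeholders; the per-step sections below give the exact (Docker-based) commands.

import subprocess

THREADS = 32  # adjust to your machine

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. Draft assembly with Shasta (expects FASTA input).
run("./shasta-Linux-0.1.0 --input reads.fa --output shasta_out")

# 2. Map reads to the draft; drop unmapped (0x4) and secondary (0x100) records.
run(f"minimap2 -ax map-ont -t {THREADS} shasta_assembly.fa reads.fq"
    f" | samtools sort -@ {THREADS}"
    " | samtools view -hb -F 0x104 > reads_2_assembly.bam")
run(f"samtools index -@ {THREADS} reads_2_assembly.bam")

# 3. Generate pileup summary images with MarginPolish (placeholder params file).
run("marginPolish reads_2_assembly.bam shasta_assembly.fa params.json"
    f" -t {THREADS} -o marginpolish_images -f")

# 4. HELEN: per-window predictions, then stitch them into a consensus FASTA.
run("call_consensus.py -i marginpolish_images -b 512 -m model.pkl"
    " -o helen_out/ -p HELEN_prediction -w 0 -t 1 -g")
run(f"stitch.py -i helen_out/<predictions.hdf> -t {THREADS}"  # substitute the real file name
    " -o helen_out/ -p HELEN_consensus")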
Installation
We provide Docker support for both MarginPolish and HELEN. Users can install MarginPolish and HELEN on Ubuntu 18.04 or any other Linux-based system by following the instructions in our Installation Guide.
If you have installed MarginPolish-HELEN locally, please follow the Local Install Usage Guide.
Usage
MarginPolish requires a draft assembly and a mapping of reads to the draft assembly. We recommend using Shasta as the initial assembler and minimap2 for the mapping.
Step 1: Generate an initial assembly
Although any assembler can be used to generate the initial assembly, we highly recommend using Shasta.
Please see the quick start documentation to learn how to use Shasta. Shasta is memory intensive: for a human-scale assembly, the AWS instance type x1.32xlarge is recommended. It is usually available for around $4/hour on the AWS spot market and should complete a human-scale assembly at around 60x coverage in a few hours.
An assembly can be generated by running:
# you may need to convert the fastq to a fasta file
./shasta-Linux-0.1.0 --input <reads.fa> --output <path_to_shasta_output>
Step 2: Create an alignment between reads and the Shasta assembly
We recommend using minimap2 to generate the mapping between the reads and the assembly.
# We recommend FASTQ input, as MarginPolish uses base quality values.
# This command runs minimap2 with 32 threads; change the thread count as needed.
# The -F 0x104 flag removes unmapped (0x4) and secondary (0x100) alignments.
minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools sort -@ 32 | samtools view -hb -F 0x104 > reads_2_assembly.bam
samtools index -@ 32 reads_2_assembly.bam
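Optionally, you can sanity-check the BAM before moving on. This minimal sketch assumes the pysam package (not part of this pipeline) and simply confirms the filtering worked:

import pysam

# Count records that should have been removed by the -F 0x104 filter.
with pysam.AlignmentFile("reads_2_assembly.bam", "rb") as bam:
    leftover = sum(1 for r in bam if r.is_unmapped or r.is_secondary)
print("unmapped/secondary records:", leftover)  # expected: 0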
Step 3: Generate images using MarginPolish
Run MarginPolish using Docker
MarginPolish can be used in a Docker container. You can get the image from:
docker pull kishwars/margin_polish:latest
docker run kishwars/margin_polish:latest --help
To generate images with the MarginPolish Docker image, first collect all your input data (shasta_assembly.fa, reads_2_assembly.bam, and a parameters file such as allParams.np.human.guppy-ff-235.json) in a single directory, e.g. </your/data/dir>.
Then please run:
docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/data/dir>:/data kishwars/margin_polish:latest reads_2_assembly.bam \
shasta_assembly.fa \
/opt/MarginPolish/params/<model_name.json> \
-t <number_of_threads> \
-o output/marginpolish_images \
-f
You can get the params file from path/to/marginpolish/params/allParams.np.human.guppy-ff-235.json; inside the Docker image, the params directory is /opt/MarginPolish/params/, as used above.
Step 4: Run HELEN
Download Model
Before running call_consensus.py, please download the model appropriate for your data. Please read our model guideline to understand which model to pick.
Get docker images (GPU)
Please install CUDA 10.0 to run the GPU-enabled Docker image for HELEN.
sudo apt-get install nvidia-docker2
sudo docker pull kishwars/helen:0.0.1.gpu
sudo nvidia-docker run kishwars/helen:0.0.1.gpu call_consensus.py -h
Run call_consensus.py
Please gather all your data in an input directory, then run call_consensus.py using the following command:
sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu call_consensus.py \
-i <marginpolish_images> \
-b <batch_size> \
-m <r941_flip235_v001.pkl> \
-o <output_dir/> \
-p <output_filename_prefix> \
-w 0 \
-t 1 \
-g
Arguments:
-h, --help show this help message and exit
-i IMAGE_FILE, --image_file IMAGE_FILE
[REQUIRED] Path to a directory where all MarginPolish
generated images are.
-m MODEL_PATH, --model_path MODEL_PATH
[REQUIRED] Path to a trained model (pkl file). Please
see our github page to see options.
-b BATCH_SIZE, --batch_size BATCH_SIZE
Batch size for testing, default is 512. Please set to
512 or 1024 for a balanced execution time.
-w NUM_WORKERS, --num_workers NUM_WORKERS
Number of workers to assign to the dataloader. Should
be 0 if using Docker.
-t THREADS, --threads THREADS
Number of PyTorch threads to use, default is 1. This
may be helpful during CPU-only inference.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Path to the output directory.
-p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix for the output file. Default is:
HELEN_prediction
-g, --gpu_mode If set then PyTorch will use GPUs for inference.
Run stitch.py
Finally, run stitch.py to get a consensus sequence:
sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu \
stitch.py \
-i <output_dir/helen_predictions_XX.hdf> \
-t <number_of_threads> \
-o <output_dir/> \
-p <output_prefix>
Arguments:
-i INPUT_HDF, --input_hdf INPUT_HDF
[REQUIRED] Path to an HDF5 file that was generated
using call_consensus.py.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
[REQUIRED] Path to the output directory.
-t THREADS, --threads THREADS
[REQUIRED] Number of threads.
-p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix for the output file. Default is: HELEN_consensus
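If you want to peek at the predictions before stitching, here is a minimal sketch assuming the h5py package; the file name is a placeholder, and the internal dataset layout varies by HELEN version:

import h5py

with h5py.File("helen_predictions_00.hdf", "r") as f:
    f.visit(print)  # print every group/dataset path in the prediction file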
Get Docker images (CPU) (not recommended)
If you want to try running inference on CPU:
sudo docker pull kishwars/helen:0.0.1.cpu
sudo docker run kishwars/helen:0.0.1.cpu call_consensus.py -h
Run call_consensus.py (CPU)
Please gather all your data in an input directory, then run call_consensus.py using the following command:
docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu call_consensus.py \
-i <marginpolish_images> \
-b <batch_size> \
-m <r941_flip235_v001.pkl> \
-o <output_dir/> \
-p <output_filename_prefix> \
-w 0 \
-t <number_of_threads>
Run stitch.py
Finally, run stitch.py to get a consensus sequence:
docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu stitch.py \
-i <output_dir/helen_predictions_XX.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>
Models
Released models
Changes in the basecalling algorithm can directly affect the outcome of HELEN. We will release trained models for new basecallers as they come out.
Model Name | Release Date | Intended basecaller | Link | Comment
---|---|---|---|---
r941_flip231_v001.pkl | 29/05/2019 | Guppy 2.3.1 | Model_link | The model is trained on chr1-6 of CHM13 with Guppy 2.3.1 base called data. |
r941_flip233_v001.pkl | 29/05/2019 | Guppy 2.3.3 | Model_link | The model is trained on autosomes of HG002 except chr 20 with Guppy 2.3.3 base called data. |
r941_flip235_v001.pkl | 29/05/2019 | Guppy 2.3.5 | Model_link | The model is trained on autosomes of HG002 except chr 20 with Guppy 2.3.5 base called data. |
r941_flip305_v001.pkl | 06/11/2019 | Guppy 3.0.5 | Model_link | The model is trained on autosomes of HG002 except chr 20 with Guppy 3.0.5 base called data. |
We have seen significant differences in homopolymer base calls between different basecallers, so it is important to pick the right model version for the best polishing results.
(Figure: confusion matrix of Guppy 2.3.1 on CHM13 chromosome X.)
Model Schema
HELEN implements a recurrent neural network (RNN) based multi-task learning model with hard parameter sharing. It uses a sliding-window method, moving through the input sequence in chunks; because each window is evaluated independently, HELEN can process mini-batches of windows during both training and inference.
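As a simplified PyTorch sketch of this idea (layer types, sizes, and names are our placeholders, not HELEN's actual hyperparameters): hard parameter sharing means one shared encoder feeds two task-specific heads, one for the base and one for the run length.

import torch
import torch.nn as nn

class TwoHeadPolisher(nn.Module):
    def __init__(self, feature_size=10, hidden_size=128,
                 num_bases=5, max_run_length=50):
        super().__init__()
        # Shared parameters: a bidirectional GRU over each window.
        self.encoder = nn.GRU(feature_size, hidden_size,
                              batch_first=True, bidirectional=True)
        # Task-specific heads read the shared representation.
        self.base_head = nn.Linear(2 * hidden_size, num_bases)
        self.rle_head = nn.Linear(2 * hidden_size, max_run_length + 1)

    def forward(self, window):
        # window: (batch, window_length, feature_size) summary weights
        hidden, _ = self.encoder(window)
        return self.base_head(hidden), self.rle_head(hidden)

# Windows are evaluated independently, so they batch naturally:
model = TwoHeadPolisher()
windows = torch.randn(64, 1000, 10)        # 64 windows of 1,000 positions
base_logits, rle_logits = model(windows)   # (64, 1000, 5), (64, 1000, 51)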
Runtime and Cost
MarginPolish-HELEN ensures runtime consistency and cost efficiency. We have tested our pipeline on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to ensure scalability.
We studied several samples of 50-60x coverage and derived suggested configurations for running the polishing pipeline. Please be advised that these are cost-optimized suggestions; for better runtime performance you can use more resources.
Google Cloud Platform (GCP)
For MarginPolish, please use an n1-standard-64 (64 vCPUs, 240 GB RAM) instance.
Estimated runtime for MarginPolish: 12 hours
Estimated cost for MarginPolish: $33
For HELEN, our suggested instance type is:
- Instance type: n1-standard-32 (32 vCPUs, 120GB RAM)
- GPUs: 2 x NVIDIA Tesla P100
- Disk: 2TB SSD
- Cost: $4.65/hour
The estimated runtime with this instance type is 4 hours.
The estimated cost for HELEN is $28.
Total estimated runtime for polishing: 16 hours.
Total estimated cost for polishing: $61
Amazon Web Services (AWS)
For MarginPolish we recommend a c5.18xlarge (72 vCPUs, 144 GiB RAM) instance.
Estimated runtime for MarginPolish: 12 hours
Estimated cost for MarginPolish: $39
We recommend the p2.8xlarge instance type for HELEN. The configuration is as follows:
- Instance type: p2.8xlarge (32 vCPUs, 488GB RAM)
- GPUs: 8 x NVIDIA Tesla K80
- Disk: 2TB SSD
- Cost: $7.20/hour
- Suggested AMI: Deep Learning AMI (Ubuntu) Version 23.0
The estimated runtime with this instance type is 4 hours.
The estimated cost for HELEN is $36.
Total estimated run-time for polishing: 16 hours.
Total estimated cost for polishing: $75
Please see our detailed runtime case-study documentation for more insight.
We also see a significant improvement in runtime over other available polishing algorithms.
Results
We compared Medaka and HELEN as polishing pipelines on Shasta assemblies using the assess_assembly module available from Pomoxis. (Figure: summary of the resulting assembly quality.)
We also see that MarginPolish-HELEN performs consistently across multiple assemblers.
Eleven high-quality assemblies
We have sequenced, assembled, and polished 11 human genomes at the University of California, Santa Cruz with our pipeline. They can be downloaded from our Google bucket.
For quick links, copy a link from the table below and run wget to download the file:
wget <link>
The eleven assemblies with their download links:
Sample name | Download link
---|---
HG00733 | HG00733_download_link |
HG01109 | HG01109_download_link |
HG01243 | HG01243_download_link |
HG02055 | HG02055_download_link |
HG02080 | HG02080_download_link |
HG02723 | HG02723_download_link |
HG03098 | HG03098_download_link |
HG03492 | HG03492_download_link |
GM24143 | GM24143_download_link |
GM24149 | GM24149_download_link |
GM24385/HG002 | GM24385_download_link |
We also polished the CHM13 genome assembly available from the Telomere-to-Telomere consortium project. The polished CHM13 assembly is available for download here: CHM13_download_link
Help
Please open a GitHub issue if you face any difficulties.
Acknowledgement
We are thankful to Sergey Koren and Karen Miga for their help with the CHM13 data and evaluation.
We downloaded data from the Telomere-to-Telomere consortium to evaluate our pipeline against CHM13.
We acknowledge the work of the developers of the packages used in this pipeline, including Shasta, minimap2, samtools, and PyTorch.
Fun Fact
The name "HELEN" is inspired by the A.I. created by Tony Stark in Marvel Comics (Earth-616). HELEN was created to control the city Tony was building, named "Troy", making the A.I. "HELEN of Troy".
READ MORE: HELEN
Download files
Source distribution: helen-0.0.6.tar.gz
Built distribution: helen-0.0.6-cp36-cp36m-macosx_10_9_x86_64.whl
File details
Details for the file helen-0.0.6.tar.gz.
File metadata
- Download URL: helen-0.0.6.tar.gz
- Upload date:
- Size: 1.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0.post20191101 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 84d851c93251d3fcceb880bae3c3f60d0e0b57729358f07c1d6d3146d2c17440
MD5 | 902bb530c5b7d639b0708dd4495e8c9d
BLAKE2b-256 | feb4fc653721ab1ee96d8ab8856c05671d6eb14a2890c277a1c53d43b026973e
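As a quick way to check a download against the digests above, here is a standard-library Python sketch (the local file path is whatever you saved the archive as):

import hashlib

def sha256sum(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

expected = "84d851c93251d3fcceb880bae3c3f60d0e0b57729358f07c1d6d3146d2c17440"
assert sha256sum("helen-0.0.6.tar.gz") == expected, "checksum mismatch"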
File details
Details for the file helen-0.0.6-cp36-cp36m-macosx_10_9_x86_64.whl.
File metadata
- Download URL: helen-0.0.6-cp36-cp36m-macosx_10_9_x86_64.whl
- Upload date:
- Size: 528.3 kB
- Tags: CPython 3.6m, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0.post20191101 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | ba4df3f9844f7f618f8a7a6609d048d047f825383ce26247769feb7301c0bcc1
MD5 | 0044dd93a475102e51054f904573151c
BLAKE2b-256 | c66cd5905b744861f345a9574a33de7f768503f9b13492d65d24d19312c5a117