Skip to main content

A versatile sequence read processor for nanopore direct RNA sequencing

Project description

Poreplex

Signal-level preprocessor for Oxford Nanopore direct RNA sequencing (DRS) data. Poreplex does many preprocessing steps required before the downstream analyses for RNA Biology and yields the processed data in the ready-to-use forms.

Features

Installation

Poreplex requires Python 3.5+ and pip to install. This pip command installs poreplex with its essential dependencies. Currently, you need to install pomegranate manually before installing poreplex due to a memory leakage issue in the released versions of pomegranate. You may use the following command.

pip install cython && pip install git+https://github.com/jmschrei/pomegranate.git
pip install poreplex

To install it together with all optional dependencies (except albacore), use this command:

pip install 'poreplex[full]'

Additional (Optional) Dependency

As its inputs, poreplex requires the FAST5 files that were basecalled using ONT albacore in advance. Alternatively, poreplex can also internally call albacore during the processing without a prior basecalling if the albacore package is available from the environment.

Quick Start

Produce FASTQ files without 3′ adapter sequences from a bunch of FAST5 files.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter

Four direct RNA sequencing libraries can be barcoded, pooled and sequenced together. Porplex can demultiplex the librariess into separate FASTQ files.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter --barcoding

In addition, poreplex can create directories containing hard-links to the original FAST5 files, organized separately by the barcodes.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter --barcoding --fast5

In case the FAST5 files are not basecalled yet, just a switch lets poreplex call albacore internally. Multicore machines help.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter --barcoding --fast5 --basecall --parallel 40

With the --live switch, All tasks can be processed as soon as reads are produced from MinKNOW.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter --barcoding --basecall --parallel 40 --live

One may want to output aligned reads directly to BAM files instead of FASTQ outputs. Poreplex streams the processed reads to minimap2 and update the BAM outputs real-time. A pre-built index (not a FASTA) generated using minimap2 must be provided for this.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter --barcoding --basecall \
  --parallel 40 --live --align GRCz11-transcriptome.mmidx

More vibrant feedback is provided if you turn on the dashboard switch.

poreplex -i path/to/fast5 -o path/to/output --trim-adapter --barcoding --basecall \
  --parallel 40 --live --align GRCz11-transcriptome.mmidx --dashboard

By default, poreplex discards pseudo-fusion reads which may originate from insufficiently segmented signals. You can suppress the filtering by a switch.

poreplex -i path/to/fast5 -o path/to/output --keep-unsplit

Barcoding direct RNA sequencing libraries

The official kits and protocols does not support barcoding in the direct RNA sequencing yet. Poreplex enables pooling multiple libraries into a single DRS run.

ONT direct RNA sequencing libraries are prepared by subsequently attaching two different 3' adapters, RTA and RMX, respectively. Both are double-stranded DNAs with Y-burged ends on the 3'-sides. Barcoded libraries for poreplex can be built with modified versions of RTA adapters. Unlike in the DNA sequencing libraries, poreplex demultiplexes in signal-level to ensure the highest accuracy. The distribution contains demultiplexer models pre-trained with four different DNA barcodes. Order these sequences and replace the original RTA adapters as many as you need in the experiment.

BC1 Oligo A: 5'-/5Phos/CCTCCCCTAAAAACGAGCCGCATTTGCGTAGTAGGTTC-3'
BC1 Oligo B: 5'-GAGGCGAGCGGTCAATTTTCGCAAATGCGGCTCGTTTTTAGGGGAGGTTTTTTTTTT-3'
BC2 Oligo A: 5'-/5Phos/CCTCGTCGGTTCTAGGCATCGCGTATGCTAGTAGGTTC-3'
BC2 Oligo B: 5'-GAGGCGAGCGGTCAATTTTGCATACGCGATGCCTAGAACCGACGAGGTTTTTTTTTT-3'
BC3 Oligo A: 5'-/5Phos/CCTCCCACTTTCACACGCACTAACCAGGTAGTAGGTTC-3'
BC3 Oligo B: 5'-GAGGCGAGCGGTCAATTTTCCTGGTTAGTGCGTGTGAAAGTGGGAGGTTTTTTTTTT-3'
BC4 Oligo A: 5'-/5Phos/CCTCCTTCAGAAGAGGGTCGCTTCTACCTAGTAGGTTC-3'
BC4 Oligo B: 5'-GAGGCGAGCGGTCAATTTTGGTAGAAGCGACCCTCTTCTGAAGGAGGTTTTTTTTTT-3'

Basecalling with the ONT Albacore

Most studies requiring signal-level analysis need re-basecalling with the ONT albacore to get the event tables in the FAST5 files. Poreplex can internally call the basecaller core routines of albacore to yield the sequences and tables for the downstream analyses. In fact, running albacore via poreplex is remarkably faster than running albacore itself in a multi-core machine thanks to more efficient scheduling of the computational loads.

Live basecalling and processing

One can start the poreplex pipeline at any time even before the sequencing begins. With the --live switch turned on, it monitors every update in input directories and picks the newly created files up for the whole process of the analysis. In the live mode, the program keeps running unless a user presses Ctrl-C (in the standard progress view) or Q (in the full-screen dashboard view). The inotify module is required to allow poreplex to run in the live mode.

In case the points of sequencing and analysis are different, a real-time directory synchronization software like DirSync Pro may help. Poreplex detects new files introduced by moving or closing a file after writing. Files that are made visible by creating hard or symbolic links or changing permissions may remain undetected.

Real-time sequence alignments

Poreplex aligns the reads to the reference transcriptome using minimap2 and writes the results to BAM files when the index file for the reference is provided. Some options that affect the performance of the alignments can be specified on generating the minimap2 index.

wget 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.transcripts.fa.gz'
minimap2 -H -k 13 -w 10 -d gencode.v28.transcripts.mmidx gencode.v28.transcripts.fa.gz
poreplex -i /path/to/input -o /path/output --basecall --align gencode.v28.transcripts.mmidx

By default, switching on the alignments suppresses the FASTQ outputs. Those can be recovered by adding --fastq to the command line.

Real-time reports

The results from the real-time alignments with the overall progression of the pipeline can be visualized as a full-screen dashboard view in a text terminal. Poreplex shows the real-time report when the command line includes the --dashboard switch along with --align for the index of the reference transcriptome. The names of mapped sequences are shown with the sequence name in the reference minimap2 index. To see them in more familiar names, supply a file containing IDs and names with the --contig-aliases switch. The files must be a tab-separated text file with two columns consists of ID (in the reference index) and name (to show in the screen) in a line. The read counts window in the middle of the screen represents the summary of reads categorized by error status or detected barcodes for multiplexed libraries. Users can choose a group to show in the window with up and down keys. To stop the process and close the dashboard, press Q key at any time.

Pseudo-fusion filter

In the Oxford Nanopore strand sequencing, a read is a snippet from a very long contiguous signal from a channel. In most cases, there should be enough gap between two different molecules and the control software cuts signals at the end of sequences. However, the gap between strands is sometimes not enough. A small fraction of reads is found to carry two or more molecules which can be identified by the characteristic signal pattern in the DNA adapters. This phenomenon can be particularly problematic in the pooled libraries with barcodes and the fusion gene studies. In a few runs in our testing, up to 1% of reads could be derived from insufficiently segmented signals. The following plot shows a signal sequence continued without any gap between the ends of two different adapted RNAs which were prepared independently and pooled for sequencing in the final step.

Poreplex detects potential artifacts with multiple appearances of the signature of the DNA adapter in a single read. The default parameters for the filtering can be too sensitive for some experiments. You can disable the pseudo-fusion filter with the --keep-unsplit option.

Output formats

FASTQ

Sequences and quality scores are written to bgzip-compressed FASTQ files in the fastq subdirectory. Each FASTQ file contains the entire sequences of a group classified by processing status and detected barcode.

File name Description
fastq/pass.fastq.gz All sequences with basecalled and passed the basic quality filters in poreplex. With --barcoding, the passed sequences that were not detected of a barcode comes to this file.
fastq/BC#.fastq.gz Sequences with identifiable barcode signals.
fastq/fail.fastq.gz Too short sequences that could not be calibrated for the signal processing.
fastq/artifact.fastq.gz Sequences that were classified as potential artifacts.

FASTQ outputs are suppressed when BAM outputs are activated with the --align option. Please add --fastq to restore the FASTQ outputs.

FAST5

To reduce the disk I/O, poreplex utilizes the links instead of copying the original FAST5 to append basecalled results to the file. With the --fast5 option, poreplex creates hard links of the original FAST5 files reorganized in subdirectories representing each processing status or barcode. Symbolic links are created in case the hard links are not possible or --symlink-fast5 is specified.

The basecalled events, which are stored in Analyses/Basecall_1D_00* of the standard FAST5 files, are written to the events subdirectory instead upon request by --dump-basecalled-events. The basecall event tables for all reads are accessible through a single HDF5 file, events/inventory.h5, by the read id. These tables include an additional scaled_mean column, which contains mean current levels scaled to match the ONT's reference pore model.

BAM

The alignments to the reference transcriptome go into BAM files inside the bam subdirectory. The reference sequences must be indexed using minimap2 before providing it with the --align option (see above). The BAM files are not sorted and not filtered thoroughly. FASTQ or FASTA sequence files can be generated from the BAM files without loss using bedtools. Please use these sequence alignments in the BAM files for quality checks and sketchy analyses only.

Nanopolish database

Nanopolish provides very convenient tools that help signal-level analyses. Poreplex provides a set of index files that are required to run the nanopolish commands. Add --nanopolish to a poreplex command line, then just skip nanopolish extract or nanopolish index commands in its tutorial, and proceed directly to the main steps.

Command line options

usage: poreplex -i DIR -o DIR [-c NAME] [--trim-adapter] [--keep-unsplit]
                [--barcoding] [--basecall] [--align INDEXFILE] [--live]
                [--live-delay SECONDS] [--fastq] [--fast5] [--symlink-fast5]
                [--nanopolish] [--dump-adapter-signals]
                [--dump-basecalled-events] [--dashboard]
                [--contig-aliases FILE] [-q] [-y] [-p COUNT] [--tmpdir DIR]
                [--batch-chunk SIZE] [--version] [-h]
Short option Long option Description
Data Settings
-i DIR --input DIR path to the directory with the input FAST5 files (required)
-o DIR --output DIR output directory path (Required)
-c NAME --config NAME path to signal processing configuration
Basic Processing Options
--trim-adapter trim 3′ adapter sequences from FASTQ outputs
--keep-unsplit don't remove unsplit reads fused of two or more RNAs in output
Optional Analyses
--barcoding sort barcoded reads into separate outputs
--basecall call the ONT albacore for basecalling on-the-fly
--align INDEXFILE align basecalled reads using minimap2 and create BAM files
Live Mode
--live monitor new files in the input directory
--live-delay SECONDS time to delay the start of analysis in live mode (default: 60)
Output Options
--fastq write to FASTQ files even when BAM files are produced
--fast5 link or copy FAST5 files to separate output directories
--symlink-fast5 create symbolic links to FAST5 files in output directories even when hard linking is possible
--nanopolish create a nanopolish readdb to enable access from nanopolish
--dump-adapter-signals dump adapter signal dumps for training
--dump-basecalled-events dump basecalled events to the output
User Interface
--dashboard show the full screen dashboard
--contig-aliases FILE path to a tab-separated text file for aliases to show as a contig names in the dashboard (see README)
-q --quiet suppress non-error messages
-y --yes suppress all questions
Pipeline Options
-p COUNT --parallel COUNT number of worker processes (default: 1)
--tmpdir DIR temporary directory for intermediate data
--batch-chunk SIZE number of files in a single batch (default: 128)
--version show program's version number and exit
-h --help show this help message and exit

Citing Poreplex

A pre-print is going to be uploaded soon.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

poreplex-0.2.tar.gz (245.2 kB view hashes)

Uploaded Source

Built Distribution

poreplex-0.2-py3-none-any.whl (251.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page