Splitting of sequence reads by internal adapter sequence search.
Project description
Read Fillet
Read fillet is a simple utility for splitting Oxford Nanopore Sequencing reads based on knowledge of adapter sequences. Its primary use case is for splitting chimeric reads into their component sub-reads.
Installation
Read Fillet can be installed from PyPI with:
pip install read_fillet
Usage
Read Fillet is run simply with:
read_fillet <fastq_directory>
To see more options run:
read_fillet --help
For each input fastq file found in the input directory, a file with an
additional _split
suffix will be output into the output directory, controlled
by the --output_dir
option. If not --output_dir
is given then the output
files are placed alongside in the input files.
The new *_split.fastq
will contain two new reads for each read that was
split, now with suffix _1
and _2
Reads which were not split will also be
added to the new file, so that *_split.fastq
can be used as a match for any
downstream analysis. For example the output may look like:
@<read_id> <remaining_headers>
<sequence>
+
<quality>
@<read_id>_1 <remaining_headers>
<sequence>
+
<quality>
@<read_id>_2 <remaining_headers>
<sequence>
+
<quality>
Algorithmic Details
The purpose of read_fillet is to split reads likely to be concatemers of independent reads. Such concatemers may arise from at least two mechanisms:
- chemical chimers arising from artifacts of molecule-biological steps used in the preparation of sequencing libraries,
- informatic chimers arising from failures of the algorithms used to process the primary sequencing data.
Regardless of the mechanism that has created chimeric reads, read_fillet attempts to split such reads into their component sub-reads using knowledge of the sequencing adapters used in library preparation. Concatemeric reads typically contain one or more matches to a sequencing adapter internal to their sequence; for example, commonly half-way through the read when library read-length is from a tightly centered distribution. The following simplified views indicate the types of error that may occur.
Here is a read, where >>
represents the adapter sequence and =
represent
bases from the organism of interest. The reverse-complement of the sequencing
adapter can typically be found at the end of a read, represented as <<
below:
>>=====<<
A simple chimeric read containing two sub-reads can be represented in this notation as:
A B
>>=====<<>>=====<<
On Oxford Nanopore sequencing platforms, and enzyme is used to unwind double-stranded DNA and thread a single strand through a nanopore. It is possible that after the first strand of a duplex has transited the nanopore, that the second strand is immediately captured since it is physically close to the pore. If the capture is particularly fast the sequencing device software does not detect a boundary between the signals measured for the first and second strand and subsequently produced a single conjoined read. In this case the sub-read B will be complementary to A. It is possible also that the sub-read B is distinct from A, being derived from a second DNA duplex unrelated to that which gave rise to read A. In this case, depending on the sequencing library, the sequences of A and B can be unrelated.
Read-fillet looks for matches to the sequence <<>>
and splits the original basecall
for the read into separate parts, for example the read:
Before:
read_id: 0ae195a2-6993-4a0b-afa8-bb834f4739e3
A B
>>=====<<>>=====<<
^--^
wil become:
read_id: 0ae195a2-6993-4a0b-afa8-bb834f4739e3_1
A
>>=====<<
read_id: 0ae195a2-6993-4a0b-afa8-bb834f4739e3_2
B
>>=====<<
after the splitting process. Note the _1
and _2
in the emitted read UUIDs.
Assessment
An additional program read_fillet_assess
is available to provide
an assessment of the veracity of the results provided by the main read_fillet
program. The assessment program uses alignment of reads to a reference sequence
to form knowledge of the true structure of reads (assuming the reference sequence
to be correct and, for example, not contain structural variants with respect to
the sequences sample).
The assessment program requires the addition dependencies: pomoxis
,
samtools
, and seqkit
. These are most easily obtained using conda (or mamba
as a faster alternative):
mamba create --name read_fillet -c bioconda seqkit samtools pomoxis python3.6
conda activate read_fillet
pip install read_fillet
The program examines supplementary alignments produced by the minimap2
aligner. The
alignments are checked to determine their overlap to both the read and the reference
sequence. Reads are grouped into one of the following classes:
- single_alignment_95%cov: the read is associated with a single alignment spanning >95% of the reads length,
- disjoint_with_gap: two alignments are found for the read with an unaligned, adapter-sized gap excluded from the alignments,
- disjoint_without_gap: similar to the former, without an adapter-sized gap,
- overlapping: two alignments are found which overlap each other with respect to the reference.
- read_gt_2_supplementary: the read is associated with more than two alignments.
These categories are not mutually exclusive, if a read belongs to multiple classes the priority of labelling is from top to bottom as written above.
To assess the veracity of the adapter-based read_fillet
read splitting the assessment
program counts reads belonging to with- and without-gap classes as follows:
disjoint_with_gap
read_fillet
split: true positiveread_fillet
unsplit: false negative
disjoint_without_gap
read_fillet
split: false positiveread_fillet
unsplit: true negative
Help
Licence and Copyright
© 2021- Oxford Nanopore Technologies Ltd.
read_fillet
is distributed under the terms of the Mozilla Public License 2.0.
Research Release
Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file read_fillet-0.1.3.tar.gz
.
File metadata
- Download URL: read_fillet-0.1.3.tar.gz
- Upload date:
- Size: 13.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.6.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0ab395ebcb6856b989c70138a8e2da4cd431898e47078b15ec51d3b139d9e8f9 |
|
MD5 | 1b1e4fd23d2e819ae4c48394b345ba0c |
|
BLAKE2b-256 | 7ce6f2b811445f6364e16e8914bdbc764e5a622f3ad6e6398904302df1c7dfb7 |