fastq-filter
A fast FASTQ filter program.
Fastq-filter correctly takes into account that quality scores are log scores when calculating the mean. It also provides an option to filter on average error rate directly.
A FASTQ quality of Q=30 stands for an average error rate of 0.001, Q=20 for 0.01 and Q=10 for 0.1. This is not very intuitive: Q=20 has 10 times more errors than Q=30, but the numbers 20 and 30 do little to convey this difference. Writing 0.01 and 0.001 makes it clear that these error rates are an order of magnitude apart.
It also means that phred scores cannot be naively averaged. Q=10 and Q=30 do not average to Q=20. The actual average error rate is (0.001 + 0.1) / 2 = 0.0505, roughly 1 in 20, whereas Q=20 means 0.01, or 1 in 100. Naive averaging therefore overestimates the quality by a factor of 5, which makes any tool that averages this way unreliable for quality filtering.
Unfortunately, many tools average phred scores naively. fastq-filter was written to provide a very fast filtering solution, so that correct filtering can be applied at a very low cost.
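The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example and are not part of fastq-filter's API. It uses the standard phred definition, Q = -10 * log10(p).

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a phred score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert an error probability back to a phred score."""
    return -10 * math.log10(p)


def naive_mean_quality(scores):
    """Arithmetic mean of the phred scores (the incorrect approach)."""
    return sum(scores) / len(scores)


def mean_error_rate(scores):
    """Mean of the per-base error probabilities (the correct approach)."""
    return sum(phred_to_error_rate(q) for q in scores) / len(scores)


scores = [10, 30]
naive = naive_mean_quality(scores)              # 20.0
avg_error = mean_error_rate(scores)             # about 0.0505
true_quality = error_rate_to_phred(avg_error)   # about 12.97, not 20

print(f"naive mean quality:   Q={naive:.1f}")
print(f"average error rate:   {avg_error:.4f}")
print(f"equivalent quality:   Q={true_quality:.2f}")
```

The quality that corresponds to the real average error rate is close to Q=13, far below the naive Q=20.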
Installation
With pip: pip install fastq-filter
For the latest development version: pip install git+https://github.com/LUMC/fastq-filter
With conda: conda install -c conda-forge -c bioconda fastq-filter
Usage
Single fastq files can be filtered with:
fastq-filter -e 0.001 -o output.fastq input.fastq
Multiple fastq files can be filtered with:
fastq-filter -e 0.001 -o r1_filtered.fastq.gz -o r2_filtered.fastq.gz r1.fastq.gz r2.fastq.gz
Fastq-filter ensures that the output files stay in sync. It is not limited to two inputs, so R1.fq, R2.fq and R3.fq can also be filtered together.
In the following list, ‘pair’ denotes a group of two or more FASTQ records that are evaluated together. When multiple FASTQ files are given, the filters behave as follows:
Average error rate: the average of the combined phred scores is used.
Median quality: the median of the combined phred scores is used.
Minimum length: at least one of the records in the pair must meet the minimum length.
Maximum length: none of the records in the pair may exceed the maximum length.
The rationale for the length filters is that R1 and R2 both sequence the same molecule, so the canonical length is that of the longest record.
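The pair rules above can be sketched in plain Python as follows. This is a hypothetical illustration, not fastq-filter's actual code: the function names, the inclusive comparisons, the Phred+33 offset and the handling of empty input are all assumptions made for this example.

```python
from typing import Sequence


def passes_min_length(sequences: Sequence[str], min_length: int) -> bool:
    """At least one record in the pair must reach the minimum length."""
    return any(len(seq) >= min_length for seq in sequences)


def passes_max_length(sequences: Sequence[str], max_length: int) -> bool:
    """No record in the pair may exceed the maximum length."""
    return all(len(seq) <= max_length for seq in sequences)


def passes_average_error_rate(
    qualities: Sequence[str],
    max_error_rate: float,
    phred_offset: int = 33,  # assumption: Phred+33 encoded quality strings
) -> bool:
    """Pool the quality strings of all records and average the error rates."""
    combined = "".join(qualities)
    if not combined:
        return False  # assumption: a pair without any bases is rejected
    total = sum(10 ** (-(ord(c) - phred_offset) / 10) for c in combined)
    return total / len(combined) <= max_error_rate


# R1 is long enough, so the pair passes the minimum length filter
# even though R2 is short.
print(passes_min_length(["ACGTACGT", "ACG"], 5))            # True
# R1 is too long, so the whole pair fails the maximum length filter.
print(passes_max_length(["ACGTA", "AC"], 4))                # False
# 'I' encodes Q=40 and '+' encodes Q=10 in Phred+33.
print(passes_average_error_rate(["IIII", "++++"], 0.001))   # False
```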
usage: fastq-filter [-h] [-o OUTPUT] [-l MIN_LENGTH] [-L MAX_LENGTH]
                    [-e AVERAGE_ERROR_RATE] [-q MEAN_QUALITY]
                    [-Q MEDIAN_QUALITY] [-c COMPRESSION_LEVEL] [--verbose]
                    [--quiet]
                    input [input ...]

Filter FASTQ files on various metrics.

positional arguments:
  input                 Input FASTQ files. Compression format automatically
                        detected. Use - for stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output FASTQ files. Compression format automatically
                        determined by file extension. Flag can be used
                        multiple times. An output must be given for each
                        input. Default: stdout.
  -l MIN_LENGTH, --min-length MIN_LENGTH
                        The minimum length for a read.
  -L MAX_LENGTH, --max-length MAX_LENGTH
                        The maximum length for a read.
  -e AVERAGE_ERROR_RATE, --average-error-rate AVERAGE_ERROR_RATE
                        The minimum average per base error rate.
  -q MEAN_QUALITY, --mean-quality MEAN_QUALITY
                        Average quality. Same as the '--average-error-rate'
                        option but specified with a phred score. I.e '-q 30'
                        is equivalent to '-e 0.001'.
  -Q MEDIAN_QUALITY, --median-quality MEDIAN_QUALITY
                        The minimum median phred score.
  -c COMPRESSION_LEVEL, --compression-level COMPRESSION_LEVEL
                        Compression level for the output files. Relevant when
                        output files have a .gz extension. Default: 2
  --verbose             Report stats on individual filters.
  --quiet               Turn of logging output.
Optimizations
fastq-filter uses the following optimizations to be fast:
Multiple filters can be applied simultaneously to minimize IO.
fastq-filter can be used in pipes to minimize IO.
The Python built-in filter function is used, which is a shorthand for Python code that would otherwise need to be interpreted.
The mean and median quality algorithms are implemented in C with bindings to Python.
The mean quality algorithm uses a lookup table, since FASTQ can encode only 94 distinct phred scores (0 through 93, the printable ASCII characters from '!' to '~'). This avoids a power calculation for every base when converting scores to probabilities.
The median quality algorithm implements a counting sort, which is very fast but only applicable to data with a small, known range of values. FASTQ qualities fit this requirement well, so the median can be calculated very quickly.
dnaio is used as the FASTQ parser. Its parser is written in Cython.
xopen is used to read and write files. It supports gzip-compressed files, which are opened using python-isal. python-isal reads gzip files 2 times faster and writes gzip files 5 times faster than the Python gzip module.
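The lookup table and counting sort ideas from the list above can be sketched in pure Python. This only illustrates the two techniques; the real implementation is written in C, and the names, the Phred+33 offset and the treatment of even-length input (averaging the two middle values) are assumptions made for this example.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoded qualities

# One precomputed error probability per ASCII code point, indexed by byte
# value. Entries below the offset are never used for valid FASTQ data.
ERROR_RATE_TABLE = [10 ** (-(c - PHRED_OFFSET) / 10) for c in range(128)]


def mean_error_rate(qualities: bytes) -> float:
    """Average error rate using table lookups instead of power calculations."""
    if not qualities:
        raise ValueError("empty quality string")
    return sum(ERROR_RATE_TABLE[c] for c in qualities) / len(qualities)


def median_quality(qualities: bytes) -> float:
    """Median phred score via counting sort: tally each value, then walk
    the tallies in order until the middle positions are reached."""
    if not qualities:
        raise ValueError("empty quality string")
    counts = [0] * 128
    for c in qualities:
        counts[c] += 1
    n = len(qualities)
    lower_rank = (n - 1) // 2  # 0-based rank of the lower middle element
    upper_rank = n // 2        # 0-based rank of the upper middle element
    seen = 0
    lower = upper = None
    for value, count in enumerate(counts):
        seen += count
        if lower is None and seen > lower_rank:
            lower = value
        if seen > upper_rank:
            upper = value
            break
    return (lower + upper) / 2 - PHRED_OFFSET


# '+' encodes Q=10, '5' encodes Q=20, '?' encodes Q=30 and 'I' encodes Q=40.
print(mean_error_rate(b"+?"))   # about 0.0505
print(median_quality(b"+5I"))   # 20.0
print(median_quality(b"+I"))    # 25.0
```

Because the tally array has a fixed size, the median is found in a single pass over the qualities plus a walk over at most 128 counters, without sorting the data itself.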