fastq-filter

A fast FASTQ filter program.

Fastq-filter correctly takes into account that quality scores are log scores when calculating the mean. It also provides an option to filter on average error rate directly.

A FASTQ quality of Q=30 stands for an error rate of 0.001, Q=20 for 0.01 and Q=10 for 0.1. This is not very intuitive: Q=20 has 10 times more errors than Q=30, though the numbers 20 and 30 do little to convey this difference. Using 0.01 and 0.001 correctly conveys that these error rates are an order of magnitude apart. This also means that phred scores cannot be naively averaged. Q=10 and Q=30 do not average to Q=20. The actual average error rate is (0.001 + 0.1) / 2 = 0.0505, roughly 1 in 20, whereas Q=20 means 0.01, or 1 in 100. Naive averaging therefore overestimates the quality by a factor of 5 in this example.

Unfortunately many tools do this. fastq-filter was written to provide a very fast filtering solution so the correct filtering can be applied at a very low cost.
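The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example and are not part of fastq-filter. It uses the standard phred relation p = 10^(-Q/10).

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a phred score to an error probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert an error probability back to a phred score: q = -10*log10(p)."""
    return -10 * math.log10(p)


def naive_mean_quality(phreds):
    """Arithmetic mean of the phred scores. This overestimates the quality."""
    return sum(phreds) / len(phreds)


def mean_error_rate(phreds):
    """Mean of the per-base error probabilities."""
    return sum(phred_to_error_rate(q) for q in phreds) / len(phreds)


phreds = [10, 30]
naive_q = naive_mean_quality(phreds)      # 20.0
true_rate = mean_error_rate(phreds)       # approximately 0.0505
true_q = error_rate_to_phred(true_rate)   # a little under Q=13

print(f"naive mean quality: Q={naive_q:.1f}")
print(f"average error rate: {true_rate:.4f} (Q={true_q:.2f})")
```

Expressed as a phred score, the average of Q=10 and Q=30 is just under Q=13, far below the naive Q=20.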

Installation

  • With pip: pip install fastq-filter

  • For the latest development version: pip install git+https://github.com/LUMC/fastq-filter

  • With conda: conda install -c conda-forge -c bioconda fastq-filter

Usage

Single fastq files can be filtered with:

fastq-filter -e 0.001 -o output.fastq input.fastq

Multiple fastq files can be filtered with:

fastq-filter -e 0.001 -o r1_filtered.fastq.gz -o r2_filtered.fastq.gz r1.fastq.gz r2.fastq.gz

Fastq-filter ensures the output files stay in sync. It is not limited to two inputs, so R1.fq, R2.fq and R3.fq can also be filtered together.

In the following section, ‘pair’ denotes a group of 2 or more FASTQ records that are evaluated together. When multiple FASTQ files are given, the filters behave as follows:

  • Average error rate: The average of the combined phred scores is used.

  • Median quality: The median of the combined phred scores is used.

  • Minimum length: At least one of the records of the pair must meet the minimum length.

  • Maximum length: None of the records in the pair may exceed the maximum length.

The rationale for the length filters is that R1 and R2 both sequence the same molecule, and the canonical length is the longer of the two.
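The pair rules above can be sketched in Python as follows. This is not fastq-filter's actual code. The function names are hypothetical, the length thresholds are assumed to be inclusive, "combined" is assumed to mean pooling the quality values of all records in the pair, and Phred+33 encoding is assumed.

```python
from typing import Sequence


def passes_min_length(sequences: Sequence[str], min_length: int) -> bool:
    """At least one record of the pair must reach the minimum length."""
    return any(len(seq) >= min_length for seq in sequences)


def passes_max_length(sequences: Sequence[str], max_length: int) -> bool:
    """No record of the pair may exceed the maximum length."""
    return all(len(seq) <= max_length for seq in sequences)


def combined_error_rate(qualities: Sequence[str], offset: int = 33) -> float:
    """Average error rate over the quality values of all records pooled together.

    Assumes Phred+33 encoded, non-empty quality strings.
    """
    phreds = [ord(char) - offset for qual in qualities for char in qual]
    return sum(10 ** (-q / 10) for q in phreds) / len(phreds)


# R1 is 8 bases long, R2 only 3.
pair = ["ACGTACGT", "ACG"]
print(passes_min_length(pair, 5))  # True: R1 is long enough
print(passes_max_length(pair, 5))  # False: R1 is too long

# '?' encodes Q=30 and '+' encodes Q=10 in Phred+33.
print(combined_error_rate(["??", "++"]))
```

With these rules a short R2 does not cause a pair to be discarded as long as R1 meets the minimum length, which matches the rationale that the longest record represents the molecule.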

usage: fastq-filter [-h] [-o OUTPUT] [-l MIN_LENGTH] [-L MAX_LENGTH]
                    [-e AVERAGE_ERROR_RATE] [-q MEAN_QUALITY]
                    [-Q MEDIAN_QUALITY] [-c COMPRESSION_LEVEL] [--verbose]
                    [--quiet]
                    input [input ...]

Filter FASTQ files on various metrics.

positional arguments:
  input                 Input FASTQ files. Compression format automatically
                        detected. Use - for stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output FASTQ files. Compression format automatically
                        determined by file extension. Flag can be used
                        multiple times. An output must be given for each
                        input. Default: stdout.
  -l MIN_LENGTH, --min-length MIN_LENGTH
                        The minimum length for a read.
  -L MAX_LENGTH, --max-length MAX_LENGTH
                        The maximum length for a read.
  -e AVERAGE_ERROR_RATE, --average-error-rate AVERAGE_ERROR_RATE
                        The minimum average per base error rate.
  -q MEAN_QUALITY, --mean-quality MEAN_QUALITY
                        Average quality. Same as the '--average-error-rate'
                        option but specified with a phred score. I.e '-q 30'
                        is equivalent to '-e 0.001'.
  -Q MEDIAN_QUALITY, --median-quality MEDIAN_QUALITY
                        The minimum median phred score.
  -c COMPRESSION_LEVEL, --compression-level COMPRESSION_LEVEL
                        Compression level for the output files. Relevant when
                        output files have a .gz extension. Default: 2
  --verbose             Report stats on individual filters.
  --quiet               Turn off logging output.

Optimizations

fastq-filter uses the following optimizations to be fast:

  • Multiple filters can be applied simultaneously to minimize IO.

  • fastq-filter can be used in pipes to minimize IO.

  • The Python built-in filter function is used, which is a shorthand for Python code that would otherwise need to be interpreted.

  • The mean and median quality algorithms are implemented in C with bindings to Python.

  • The mean quality algorithm uses a lookup table since there are only 94 possible phred scores (0 to 93) encoded in FASTQ. That saves a lot of power calculations to calculate the probabilities.

  • The median quality algorithm implements a counting sort, which is really fast but not applicable for generic data. Since FASTQ qualities are uniquely suited for a counting sort, median calculation can be performed very quickly.

  • dnaio is used as FASTQ parser. This parses the FASTQ files with a parser written in Cython.

  • xopen is used to read and write files. This allows for support of gzip compressed files which are opened using python-isal which reads gzip files 2 times faster and writes gzip files 5 times faster than the python gzip module implementation.
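The lookup table and the counting sort mentioned above can be illustrated in pure Python. The real implementations are written in C, so this is only a sketch of the two techniques. The names are hypothetical, Phred+33 encoding is assumed, input is assumed to be non-empty and valid, and averaging the two middle values for an even number of bases is an assumption rather than documented behavior.

```python
# Phred+33 encoding is assumed: ASCII 33 ('!') is Q0 and ASCII 126 ('~') is Q93.
PHRED_OFFSET = 33
MAX_PHRED = 93

# Precompute the error probability for every possible phred score once,
# so that no power calculation is needed per base.
ERROR_RATE_TABLE = [10 ** (-q / 10) for q in range(MAX_PHRED + 1)]


def average_error_rate(qualities: bytes) -> float:
    """Mean error rate of an ASCII-encoded quality string via table lookup."""
    total = 0.0
    for byte in qualities:
        total += ERROR_RATE_TABLE[byte - PHRED_OFFSET]
    return total / len(qualities)


def median_quality(qualities: bytes) -> float:
    """Median phred score via counting sort: tally each score, then walk the tallies."""
    counts = [0] * (MAX_PHRED + 1)
    for byte in qualities:
        counts[byte - PHRED_OFFSET] += 1
    n = len(qualities)
    # 0-based ranks of the middle element(s); they are equal when n is odd.
    lower_rank, upper_rank = (n - 1) // 2, n // 2
    lower = upper = None
    seen = 0
    for phred, count in enumerate(counts):
        seen += count
        if lower is None and seen > lower_rank:
            lower = phred
        if seen > upper_rank:
            upper = phred
            break
    return (lower + upper) / 2


# '+' is Q=10, '5' is Q=20 and '?' is Q=30 in Phred+33.
print(average_error_rate(b"+?"))
print(median_quality(b"+5?"))
```

The counting sort needs a single pass over the qualities plus a walk over at most 94 counters, so no comparison-based sorting of the quality string is required.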
