
Filter an Illumina FASTQ file based on index sequence

Project description

filter_illumina_index

Filter an Illumina FASTQ file based on index sequence.

Reads an Illumina FASTQ file and compares the sequence index in the sample number position of each sequence identifier to a supplied sequence index. Entries that match the supplied index are written to the filtered file (if specified) and entries that do not match are written to the unfiltered file (if specified). Displays the count of total, filtered and unfiltered reads, as well as the number of mismatches found across all reads. Matching can tolerate a set number of mismatches (-m parameter). Gzip compression is supported for input (detected from the file extension) and for output (enabled with the -c parameter).
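The comparison described above can be sketched in a few lines of Python. This is an illustration only, not the program's actual code: the function names index_from_header and count_mismatches are hypothetical, the read identifier is made up, and counting a length difference as extra mismatches is an assumption.

```python
def index_from_header(header: str) -> str:
    """Return the last colon-separated field of a FASTQ identifier line."""
    # Assumed layout:
    # @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    return header.strip().rsplit(":", 1)[-1]


def count_mismatches(observed: str, expected: str) -> int:
    """Count differing positions between two index sequences."""
    differing = sum(a != b for a, b in zip(observed, expected))
    # Assumption: any difference in length counts as additional mismatches.
    return differing + abs(len(observed) - len(expected))


header = "@SIM:1:FCX:1:15:6329:1045 1:N:0:GATCGTGA"  # made-up identifier
read_index = index_from_header(header)
print(read_index, count_mismatches(read_index, "GATCGTGT"))
```

A read like this one would be treated as a match only if at least one mismatch is tolerated.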

Specifying an empty index (-i "") enables 'passthrough' mode, in which all reads are directed to the filtered output file with no processing. Passthrough mode is useful when this program is part of a workflow that must also handle files without a valid Illumina index, because it allows all processing by this program to be skipped.
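The routing decision, including passthrough mode, might look like the following sketch. The function name route_read and its return values are hypothetical and chosen for illustration; the program's internals may differ.

```python
def route_read(read_index: str, wanted_index: str, max_mismatches: int = 0) -> str:
    """Return "filtered" or "unfiltered" for a single read."""
    if wanted_index == "":
        # Passthrough mode: skip the comparison and keep every read.
        return "filtered"
    mismatches = sum(a != b for a, b in zip(read_index, wanted_index))
    # Assumption: a length difference counts as additional mismatches.
    mismatches += abs(len(read_index) - len(wanted_index))
    return "filtered" if mismatches <= max_mismatches else "unfiltered"


print(route_read("GATCGTGA", "GATCGTGT"))                    # one mismatch, none allowed
print(route_read("GATCGTGA", "GATCGTGT", max_mismatches=1))  # one mismatch tolerated
print(route_read("1", ""))                                   # passthrough, index ignored
```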

For information on Illumina sequence identifiers in FASTQ files, see: http://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm

Usage details

usage: filter_illumina_index [-h] [--version] [-f FILTERED] [-u UNFILTERED] -i
                             INDEX [-m MISMATCHES] [-c] [-v]
                             inputfile

positional arguments:
  inputfile             Input FASTQ file, compression supported

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -f FILTERED, --filtered FILTERED
                        Output FASTQ file containing filtered (positive) reads
                        (default: None)
  -u UNFILTERED, --unfiltered UNFILTERED
                        Output FASTQ file containing unfiltered (negative)
                        reads (default: None)
  -m MISMATCHES, --mismatches MISMATCHES
                        Maximum number of mismatches to tolerate (default: 0)
  -c, --compressed      Compress output files (note: file extension not
                        modified) (default: False)
  -v, --verbose         Show verbose output (default: False)

required named arguments:
  -i INDEX, --index INDEX
                        Sequence index to filter for; if empty (i.e. "") then
                        program will run in "passthrough" mode with all reads
                        directed to filtered file with no processing (default:
                        None)
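As a rough illustration of the compression behaviour described in the options above, the sketch below opens input with gzip when the file name ends in .gz and compresses output only when asked. The helper names open_input and open_output are hypothetical, and the .gz check is an assumption about what "detected on the basis of file extension" means.

```python
import gzip


def open_input(path: str):
    """Open a FASTQ file for text reading, using gzip for names ending in .gz."""
    if path.endswith(".gz"):  # assumed extension check
        return gzip.open(path, "rt")
    return open(path, "rt")


def open_output(path: str, compressed: bool = False):
    """Open an output file for text writing; the path is used exactly as given."""
    if compressed:
        # Mirrors the -c note: the data is gzip-compressed, the name is unchanged.
        return gzip.open(path, "wt")
    return open(path, "wt")
```

Because the file extension is not modified, it is probably sensible to give compressed output files a name ending in .gz so that later tools recognise them.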

Example usage

The directory srv contains example reads in FASTQ and compressed FASTQ format. All reads have index GATCGTGT except for one read, which has a mismatch.

To test, run:

filter_illumina_index srv/example_reads.fastq --index GATCGTGT --filtered var/filtered_reads.fastq --unfiltered var/unfiltered_reads.fastq

This will process srv/example_reads.fastq, matching to index GATCGTGT with no mismatches allowed (default). Reads matching this index will be saved to var/filtered_reads.fastq and those not matching this index will be saved to var/unfiltered_reads.fastq. In addition, the following output will be displayed:

Total reads: 30
Filtered reads: 29
Unfiltered reads: 1
 Reads with 0 mismatches: 29
 Reads with 1 mismatches: 1
 Reads with 2 mismatches: 0
 Reads with 3 mismatches: 0
 Reads with 4 mismatches: 0
 Reads with 5 mismatches: 0
 Reads with 6 mismatches: 0
 Reads with 7 mismatches: 0
 Reads with 8 mismatches: 0
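To show how a report in this layout could be assembled, here is a minimal sketch that tallies mismatches with a Counter. The function name summarise is hypothetical, the index list is made up to mirror the counts above, and all indexes are assumed to have the same length as the supplied index.

```python
from collections import Counter


def summarise(read_indexes, wanted_index, max_mismatches=0):
    """Return report lines for a list of index strings."""
    tally = Counter()
    for read_index in read_indexes:
        # Assumes read_index and wanted_index have equal length.
        tally[sum(a != b for a, b in zip(read_index, wanted_index))] += 1
    total = sum(tally.values())
    filtered = sum(n for k, n in tally.items() if k <= max_mismatches)
    lines = [
        f"Total reads: {total}",
        f"Filtered reads: {filtered}",
        f"Unfiltered reads: {total - filtered}",
    ]
    # One row per possible mismatch count, from 0 up to the index length.
    lines += [
        f" Reads with {k} mismatches: {tally[k]}"
        for k in range(len(wanted_index) + 1)
    ]
    return lines


# Made-up data: 29 exact matches and one read with a single mismatch.
indexes = ["GATCGTGT"] * 29 + ["GATCGTGA"]
print("\n".join(summarise(indexes, "GATCGTGT")))
```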

Additional details

  • Author: Tet Woo Lee
  • Copyright: © 2018 Tet Woo Lee
  • Licence: GPLv3
  • Dependencies: Biopython, tested on v1.72

Change log

version 1.0.3 2020-01-04 : Added passthrough mode with empty index.

version 1.0.2 2018-12-19 : Shows statistics on number of mismatches found

version 1.0.1 2018-12-19 : Speed up number of mismatches calculation

version 1.0 2018-12-14 : Minor updates for PyPI and conda packaging

version 1.0.dev1 2018-12-13 : First working version
