Skip to main content
Join the official 2019 Python Developers SurveyStart the survey!

Filter a Illumina FASTQ file based on index sequence

Project description

# filter_illumina_index
## Filter a Illumina FASTQ file based on index sequence.

Reads a Illumina FASTQ file and compares the sequence index in the
`sample number` position of the sequence identifier to a supplied sequence
index. Entries that match the sequence index are filtered into the *filtered
file* (if any) and entries that don't match are filtered into the *unfiltered
file* (if any). Displays the count of total, filtered and unfiltered reads,
as well as the number of mismatches found across all reads. Matching tolerating
a certain number of mismatches (`-m` parameter), and gzip compression for input
(detected on the basis of file extension) and output (specified using `-c`
parameter) are supported.

For information on Illumina sequence identifiers in FASTQ files, see:

### Usage details

usage: filter_illumina_index [-h] [--version] [-f FILTERED] [-u UNFILTERED] -i
INDEX [-m MISMATCHES] [-c] [-v]

positional arguments:
inputfile Input FASTQ file, compression supported

optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-f FILTERED, --filtered FILTERED
Output FASTQ file containing filtered (positive) reads
(default: None)
Output FASTQ file containing unfiltered (negative)
reads (default: None)
Maximum number of mismatches to accept (default: 0)
-c, --compressed Compress output files (note: file extension not
modified) (default: False)
-v, --verbose Show verbose output (default: False)

required named arguments:
-i INDEX, --index INDEX
Sequence index to filter for (default: None)

### Example usage

The directory `srv` contains example reads in FASTQ and compressed FASTQ format with index `GATCGTGT` and one read with a mismatch.

To test, run:

`filter_illumina_index srv/example_reads.fastq --index GATCGTGT --filtered var/filtered_reads.fastq --unfiltered var/unfiltered_reads.fastq`

This will process `srv/example_reads.fastq`, matching to index `GATCGTGT` with no mismatches allowed (default). Reads matching this index will be saved to `var/filtered_reads.fastq` and those not matching this index will be saved to `var/unfiltered_reads.fastq`. In addition, the following output will be displayed:

Total reads: 30
Filtered reads: 29
Unfiltered reads: 1
Reads with 0 mismatches: 29
Reads with 1 mismatches: 1
Reads with 2 mismatches: 0
Reads with 3 mismatches: 0
Reads with 4 mismatches: 0
Reads with 5 mismatches: 0
Reads with 6 mismatches: 0
Reads with 7 mismatches: 0
Reads with 8 mismatches: 0


### Additional details

* Author: Tet Woo Lee
* Copyright: © 2018 Tet Woo Lee
* Licence: GPLv3
* Dependencies: Biopython, tested on v1.72

### Change log

version 1.0.2 2018-12-19
: Shows statistics on number of mismatches found

version 1.0.1 2018-12-19
: Speed up number of mismatches calculation

version 1.0 2018-12-14
: Minor updates for PyPi and conda packaging

version 1.0.dev1 2018-12-13
: First working version

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for filter-illumina-index, version 1.0.2
Filename, size File type Python version Upload date Hashes
Filename, size filter_illumina_index-1.0.2-py3-none-any.whl (30.2 kB) File type Wheel Python version py3 Upload date Hashes View hashes
Filename, size filter_illumina_index-1.0.2.tar.gz (17.1 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page