Filter a Illumina FASTQ file based on index sequence

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

filter_illumina_index

Filter a Illumina FASTQ file based on index sequence.

Reads a Illumina FASTQ file and compares the sequence index in the sample number position of the sequence identifier to a supplied sequence index. Entries that match the sequence index are filtered into the filtered file (if any) and entries that don't match are filtered into the unfiltered file (if any). Displays the count of total, filtered and unfiltered reads, as well as the number of mismatches found across all reads. Matching tolerating a certain number of mismatches (-m parameter), and compression for input and output are supported (detected on the basis of file extension).

Specifying an empty index, (-i "" or -i with no argument) enables 'passthrough' mode where all reads are directed to the output filtered file with no processing. Passthrough mode is useful if this program is part of a workflow that needs to be adapted to files that do not have a valid Illumina index, as it allows all processing of this program to be skipped.

Illumina index

For information on Illumina sequence identifiers in FASTQ files, see FASTQ Files from Illumina. This includes the following excerpt:

For the Undetermined FASTQ files only, the sequence observed in the index read 
is written to the FASTQ header in place of the sample number. This information 
can be useful for troubleshooting demultiplexing.

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

This script assumes the <sample number> value is the Illumina index. To ensure rapid processing, the value following the final colon(:) is used without confirming the sequence identifier conforms to the above format, or whether the index is actually a nucleotide sequence.

Usage details

usage: filter_illumina_index [-h] [--version] [-f FILTERED] [-u UNFILTERED] -i
                             [INDEX] [-m MISMATCHES] [-t THREADS]
                             [-l {1,2,3,4,5,6,7,8,9}] [-v]
                             inputfile

positional arguments:
  inputfile             Input FASTQ file, compression (`.gz`, `.bz2` and
                        `.xz`) supported

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -f FILTERED, --filtered FILTERED
                        Output FASTQ file containing filtered (positive)
                        reads; compression detected by extension (default:
                        None)
  -u UNFILTERED, --unfiltered UNFILTERED
                        Output FASTQ file containing unfiltered (negative)
                        reads; compression detected by extension (default:
                        None)
  -m MISMATCHES, --mismatches MISMATCHES
                        Maximum number of mismatches to tolerate (default: 0)
  -t THREADS, --threads THREADS
                        Number of threads to pass to `xopen` for each open
                        file; use 0 to turn off `pigz` use and rely on
                        `gzip.open` so no extra threads spawned. (default: 1)
  -l {1,2,3,4,5,6,7,8,9}, --compresslevel {1,2,3,4,5,6,7,8,9}
                        Compression level for writing gzip files; ignored if
                        gzip compression not used (default: 6)
  -v, --verbose         Increase logging verbosity, available levels 1 to 2
                        with `-v` to `-vv` (default: 0)

required named arguments:
  -i [INDEX], --index [INDEX]
                        Sequence index to filter for; if empty (i.e. no
                        argument or "") then program will run in "passthrough"
                        mode with all reads directed to filtered file with no
                        processing (default: None)

Example usage

The directory filter_illumina_index/tests/data/ (within the location where the package is installed) contains various test files, for example test_reads_GATCGTGT.fastq with 29 reads containing barcode GATCGTGT and one read with a 1-nt mismatch of AATCGTGT. Running this program with the following command:

filter_illumina_index filter_illumina_index/tests/data/test_reads_GATCGTGT.fastq --index GATCGTGT --filtered /tmp/filtered_reads.fastq --unfiltered /tmp/unfiltered_reads.fastq

will process that input file with no mismatches allowed (default) for the index GATCGTGT. The output files are /tmp/filtered_reads.fastq with 29 reads containing the GATCGTGT barcode, and /tmp/unfiltered_reads.fastq with 1 read containing the mismatched-barcode AATCGTGT. The following log to stdout will be displayed:

filter_illumina_index 1.0.5
Input file: filter_illumina_index/tests/data/test_reads_GATCGTGT.fastq
Filtering for sequence index: GATCGTGT
Max mismatches tolerated: 0
Output filtered file: /tmp/filtered_reads.fastq
Output unfiltered file: /tmp/unfiltered_reads.fastq
Total reads: 30
Filtered reads: 29
Unfiltered reads: 1
 Reads with 0 mismatches: 29
 Reads with 1 mismatches: 1
 Reads with 2 mismatches: 0
 Reads with 3 mismatches: 0
 Reads with 4 mismatches: 0
 Reads with 5 mismatches: 0
 Reads with 6 mismatches: 0
 Reads with 7 mismatches: 0
 Reads with 8 mismatches: 0
 Reads with >=9 mismatches: 0

Example usage with two indexes

Version 1.0.5 adds support for two indexes separated by a user-specified separator. Reads are filtered by the total number of mismatches in both indexes. Alternatively, either index can be left blank in which case all reads for that index will be passed through, and only the non-blank index will be used for filtering.

The test file test_reads_GATCGTGT+TCTATCCT.fastq contains two indexes GATCGTGT and TCTATCCT separated by a + character, with 24 reads containing no mismatches, and one read each with the following number of mismatches for the two barcodes: (1,0), (0,1), (1,1), (2,1), (1,2), (2,2).

Running this program with the following command:

filter_illumina_index filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq --index GATCGTGT --separator + --index2 TCTATCCT --filtered /tmp/filtered_reads.fastq --unfiltered /tmp/unfiltered_reads.fastq

will process that input file with no mismatches allowed (default) for the both indexes. The output files are /tmp/filtered_reads.fastq with 24 reads containing both the GATCGTGT and TCTATCCT barcodes, and /tmp/unfiltered_reads.fastq with 6 reads containing any mismatches. The following log to stdout will be displayed:

filter_illumina_index 1.0.5
Input file: filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq
Filtering for sequence index 1: GATCGTGT
Filtering for sequence index 2: TCTATCCT
Separator between index 1 and 2: +
Max mismatches tolerated: 0
Output filtered file: /tmp/filtered_reads.fastq
Output unfiltered file: /tmp/unfiltered_reads.fastq
Total reads: 30
Filtered reads: 24
Unfiltered reads: 6
 Reads with 0 mismatches: 24
 Reads with 1 mismatches: 2
 Reads with 2 mismatches: 1
 Reads with 3 mismatches: 2
 Reads with 4 mismatches: 1
 Reads with 5 mismatches: 0
 Reads with 6 mismatches: 0
 Reads with 7 mismatches: 0
 Reads with 8 mismatches: 0
 Reads with >=9 mismatches: 0

Alternatively, the following command could be used to filter by the second barcode only:

filter_illumina_index filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq --index --separator + --index2 TCTATCCT --filtered /tmp/filtered_reads.fastq --unfiltered /tmp/unfiltered_reads.fastq

Or (with "" used for --index):

filter_illumina_index filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq --index "" --separator + --index2 TCTATCCT --filtered /tmp/filtered_reads.fastq --unfiltered /tmp/unfiltered_reads.fastq

This will process that input file with no mismatches allowed (default) for the second TCTATCCT index. The first indexes (before the +) is ignored. The output files are /tmp/filtered_reads.fastq with 25 reads containing the TCTATCCT barcodes with no mismatches, and /tmp/unfiltered_reads.fastq with 5 reads containing any mismatches in this barcode. The following log to stdout will be displayed:

filter_illumina_index 1.0.5
Input file: filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq
Filtering for sequence index 1: (passthrough)
Filtering for sequence index 2: TCTATCCT
Separator between index 1 and 2: +
Max mismatches tolerated: 0
Output filtered file: /tmp/filtered_reads.fastq
Output unfiltered file: /tmp/unfiltered_reads.fastq
Total reads: 30
Filtered reads: 25
Unfiltered reads: 5
 Reads with 0 mismatches: 25
 Reads with 1 mismatches: 3
 Reads with 2 mismatches: 2
 Reads with 3 mismatches: 0
 Reads with 4 mismatches: 0
 Reads with 5 mismatches: 0
 Reads with 6 mismatches: 0
 Reads with 7 mismatches: 0
 Reads with 8 mismatches: 0
 Reads with >=9 mismatches: 0

A similar approach can be used to filter by the first barcode only using one of the following commands:

filter_illumina_index filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq --index GATCGTGT --separator + --index2 --filtered /tmp/filtered_reads.fastq --unfiltered /tmp/unfiltered_reads.fastq

Or:

filter_illumina_index filter_illumina_index/tests/data/test_reads_GATCGTGT+TCTATCCT.fastq --index GATCGTGT --separator + --index2 "" --filtered /tmp/filtered_reads.fastq --unfiltered /tmp/unfiltered_reads.fastq

Note that both --separator and --index2 must be provided, although with the latter left blank.

The program does not support filtering by a certain number of mismatches for each barcode, only by the total. This functionality, however, could be obtained using two passes and processing each barcode separately with passthrough for the other barcode.

Algorithm details

The barcode is read from the sequence number position of the sequence identifier using a very simplistic method to optimise performance: the field in the sequence identifier following the last colon (:) character is used, e.g.

@FAKE-SEQ:1:FAKE-FLOWCELL-ID:1:1:0:1 1:N:0:TGACCAAT
                                          ^         : last colon
                                           ^^^^^^^^ : taken
                                           TGACCAAT : barcode

If there is no colon in the sequence identifier, an exception is produced as no barcode can be found, except in passthrough mode (which does no actual processing and redirects all reads to the filtered file). The number of mismatches is simply the number of characters that differ between the provided index sequence and the barcode from the read. Any missing characters in either the provided index or the barcode are counted as mismatches. The comparison is performed at the start of the each set of characters with no provision for wildcards, insertions or deletions, e.g.

TGACCAAT index
TGACCAAT read barcode
         0 mismatches

TGACCAAT index
AGACCAAT read barcode
^        1 mismatch

TGACCAAT index
GACCAAT  read barcode
^^^ ^ ^^ 6 mismatches

TGACCAAT index
TGACCAAAA read barcode
       ^^ 2 mismatches

TGACCAAT index
NNNCCAAT read barcode
^^^      3 mismatches

NNNCCAAT index
TGACCAAT read barcode
^^^      3 mismatches

NNNCCAAT index
NNNCCAAT read barcode
         0 mismatches

TGACCAAT index
TGACCAATTGACCAATTGACCAAT read barcode
        ^^^^^^^^^^^^^^^^ 16 mismatches

Where there is a greater number of characters in the read barcode than the provided index, the number of mismatches is summarised as >=index length+1, i.e. the final entry above will be counted as >=9 mismatches.

When two barcodes are present, the sequence identifier field is split into two substrings at the first instance of the character(s) provided as the separator. Then each barcode is processed with the rules above with the total number of mismatches used to filter reads. If either barcode is left blank, then only the non-blank barcode is used.

File reading/writing and threading

This script uses the dnaio and xopen packages for reading/writing FASTQ files with compression support. The xopen package used for reading/writing compressed files spawns pigz processes to speed-up processing. The --threads parameter indicates the number of threads passed to the xopen() function. With --threads 1 there is still a small amount of multithreading as a different pigz process is spawned for each open file and up to 3 files are open at once (one input and two output), although this is largely throttled by the sequential nature of the script processing. To prevent these pigz processes being spawned, --threads 0 can be used, which causes a fallback to gzip.open at an additional performance cost (it is slower than pigz).

In order for pigz to be used, it must be installed on the system, otherwise a gzip process is used. The pigz package is available on conda in the conda-forge channel, so can easily be installed in the same conda environment as this program.

xopen supports automatic compression according to the file extension, with .gz, .bz2 and .xz supported. Only the first of these has been tested with this script but the others are expected to work without issue.

Additional details

Author: Tet Woo Lee
Copyright: © 2018-2020 Tet Woo Lee
Licence: GPLv3
Dependencies:
dnaio, tested with v0.4.1
xopen, tested with v0.9.0

Change log

version 1.0.5 2023-12-14 Update to allow two indexes separated by user-specified separator, and allowing passthrough for each index. Bugfix counting mismatches if >= max tracked.

version 1.0.4r2 2020-04-11
Minor changes to README.md only, version not incremented in program.

version 1.0.4 2020-04-11
Speed up and algorithm changes

Switch to dnaio over Biopython to improve speed (>3x faster + multi- threading support for compression)
Change mismatch calculation algorithm, now includes any characters missing in filter-by index or read index
Exception if no barcode detected outside of passthrough mode
Add unit tests

version 1.0.3.post2 2020-04-01
Improved output

Add version number to output
Show parameters in output
Allow no argument for passthrough mode

version 1.0.3.post1 2020-04-01
Minor bugfix

Bugfix: Bump version number in script

version 1.0.3 2020-04-01
Added passthrough mode with empty index

version 1.0.2 2018-12-19
Shows statistics on number of mismatches found

version 1.0.1 2018-12-19
Speed up number of mismatches calculation

version 1.0 2018-12-14
Minor updates for PyPi and conda packaging

version 1.0.dev1 2018-12-13
First working version

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

1.0.5

Dec 14, 2023

1.0.4

Apr 11, 2020

1.0.4.dev3 pre-release

Apr 11, 2020

1.0.4.dev2 pre-release

Apr 10, 2020

1.0.4.dev1 pre-release

Apr 10, 2020

1.0.3.post2

Apr 1, 2020

1.0.3.post1

Apr 1, 2020

1.0.3

Apr 1, 2020

1.0.2

Dec 19, 2018

1.0.1

Dec 19, 2018

1.0

Dec 14, 2018

1.0.dev1 pre-release

Dec 13, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

filter_illumina_index-1.0.5.tar.gz (39.1 kB view details)

Uploaded Dec 14, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

filter_illumina_index-1.0.5-py3-none-any.whl (85.1 kB view details)

Uploaded Dec 14, 2023 Python 3

File details

Details for the file filter_illumina_index-1.0.5.tar.gz.

File metadata

Download URL: filter_illumina_index-1.0.5.tar.gz
Upload date: Dec 14, 2023
Size: 39.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.5

File hashes

Hashes for filter_illumina_index-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`782475ffec4a0841f165894e491ab90c0cb7dbbeb18358265143d99ab5b3d53a`
MD5	`30d64b51a446011676db2acb6de74d16`
BLAKE2b-256	`75842aec7f86b11b5e8abfeaa3d318a32c952a4f1d4902335073bcf734000056`

See more details on using hashes here.

File details

Details for the file filter_illumina_index-1.0.5-py3-none-any.whl.

File metadata

Download URL: filter_illumina_index-1.0.5-py3-none-any.whl
Upload date: Dec 14, 2023
Size: 85.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.5

File hashes

Hashes for filter_illumina_index-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8fe0548d69e7db5b3d63ec6ef5803e3a584b7fb17356566eb3a153064ee5013a`
MD5	`026308ca9517452e7b4944e238f481b9`
BLAKE2b-256	`af76bc5d2a5da99cf097232bdf27eb1e44abfd28fab770cdf79bcae541d06ea7`

See more details on using hashes here.

filter-illumina-index 1.0.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

filter_illumina_index

Filter a Illumina FASTQ file based on index sequence.

Illumina index

Usage details

Example usage

Example usage with two indexes

Algorithm details

File reading/writing and threading

Additional details

Change log

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes