filter_classified_reads
Filter for reads from taxa of interest using Kraken2/Centrifuge classification results.
Free software: Apache Software License 2.0
Documentation: https://filter-classified-reads.readthedocs.io.
Features
- Filter for the union of reads classified to taxa of interest by Kraken2 and Centrifuge (by default, viral reads; taxid=10239)
- Output unclassified reads along with reads from taxa of interest, or exclude them with --exclude-unclassified
- Uses seqtk for quickly filtering reads and pbgzip for parallel block Gzip compression of output reads (it is recommended to install these dependencies with Conda)
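The union filtering described above can be sketched in a few lines. This is a simplified illustration only, not the package's actual implementation (which parses results into pandas DataFrames, as the log output below shows); it assumes Kraken2's per-read output layout, where column 2 is the read ID and column 3 is the assigned taxonomy ID.

```python
def classified_read_ids(results_tsv, target_taxids):
    """Return the set of read IDs assigned to any taxid in target_taxids.

    Simplified sketch assuming Kraken2's per-read output layout
    (column 1 = C/U flag, column 2 = read ID, column 3 = taxid).
    Centrifuge's per-read output puts the read ID in column 1 instead.
    """
    ids = set()
    with open(results_tsv) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] in target_taxids:
                ids.add(fields[1])
    return ids

# Union of hits from both classifiers (file names and taxid set are
# hypothetical here; the real tool resolves all taxids under 10239):
# viral = {"10239"}
# keep = (classified_read_ids("kraken2_results.tsv", viral)
#         | classified_read_ids("centrifuge_results.tsv", viral))
```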
Usage
Paired-end reads with classification results by both Kraken2 and Centrifuge
filter_classified_reads -i /path/to/reads/R1.fq \
-I /path/to/reads/R2.fq \
-o /path/to/reads/R1.filtered.fq.gz \
-O /path/to/reads/R2.filtered.fq.gz \
-k /path/to/kraken2/results.tsv \
-K /path/to/kraken2/kreport.tsv \
-c /path/to/centrifuge/results.tsv \
-C /path/to/centrifuge/kreport.tsv
Using test data in tests/data/:
$ filter_classified_reads -i tests/data/SRR8207674_1.viral_unclassified.seqtk_seed42_n10000.fastq.gz \
-I tests/data/SRR8207674_2.viral_unclassified.seqtk_seed42_n10000.fastq.gz \
-o r1.fq.gz \
-O r2.fq.gz \
-k tests/data/SRR8207674-kraken2_results.tsv \
-K tests/data/SRR8207674-kraken2_report.tsv \
-c tests/data/SRR8207674-centrifuge_results.tsv \
-C tests/data/SRR8207674-centrifuge_kreport.tsv
You should see the following log information:
2019-04-16 13:40:34,114 INFO: Parsing centrifuge results into DataFrame [in target_classified_reads.py:49]
2019-04-16 13:40:34,168 INFO: Parsed n=12281 centrifuge result records into DataFrame from "tests/data/SRR8207674-centrifuge_results.tsv" [in target_classified_reads.py:57]
2019-04-16 13:40:34,172 INFO: Parsed n=298 centrifuge Kraken-style report records into DataFrame from "tests/data/SRR8207674-centrifuge_kreport.tsv" [in target_classified_reads.py:60]
2019-04-16 13:40:34,177 INFO: Found 7129 unclassified reads from Centrifuge results [in target_classified_reads.py:65]
2019-04-16 13:40:34,242 INFO: Found 231 unique viral Taxonomy IDs [in target_classified_reads.py:98]
2019-04-16 13:40:34,245 INFO: Found 2181 target reads from centrifuge results [in target_classified_reads.py:101]
2019-04-16 13:40:34,245 INFO: Parsing kraken2 results into DataFrame [in target_classified_reads.py:49]
2019-04-16 13:40:34,289 INFO: Parsed n=20000 kraken2 result records into DataFrame from "tests/data/SRR8207674-kraken2_results.tsv" [in target_classified_reads.py:57]
2019-04-16 13:40:34,293 INFO: Parsed n=139 kraken2 Kraken-style report records into DataFrame from "tests/data/SRR8207674-kraken2_report.tsv" [in target_classified_reads.py:60]
2019-04-16 13:40:34,295 INFO: Found 1737 unclassified reads from Centrifuge results [in target_classified_reads.py:65]
2019-04-16 13:40:34,325 INFO: Found 26 unique viral Taxonomy IDs [in target_classified_reads.py:98]
2019-04-16 13:40:34,331 INFO: Found 8345 target reads from kraken2 results [in target_classified_reads.py:101]
2019-04-16 13:40:34,332 INFO: Found N=1701 common unclassified reads by all classification methods. [in cli.py:110]
2019-04-16 13:40:34,333 INFO: Total viral reads=8357 [in util.py:37]
2019-04-16 13:40:34,333 INFO: Centrifuge found n=12 target reads not found with Kraken2 [in util.py:38]
2019-04-16 13:40:34,333 INFO: Kraken2 found n=6176 target reads not found with Centrifuge [in util.py:40]
2019-04-16 13:40:34,338 INFO: N=1701 reads unclassified by both Centrifuge and Kraken2. [in util.py:62]
2019-04-16 13:40:34,345 INFO: Writing n=9999 filtered reads from "tests/data/SRR8207674_1.viral_unclassified.seqtk_seed42_n10000.fastq.gz" to "r1.fq.gz" [in cli.py:129]
2019-04-16 13:40:34,957 INFO: Writing n=9999 filtered reads from "tests/data/SRR8207674_2.viral_unclassified.seqtk_seed42_n10000.fastq.gz" to "r2.fq.gz" [in cli.py:134]
2019-04-16 13:40:35,459 INFO: Done! [in cli.py:137]
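The counts in this log are internally consistent: the reported total is the size of the union of the two classifiers' target-read sets, i.e. Kraken2's hits plus the Centrifuge hits that Kraken2 missed.

```python
kraken2_targets = 8345  # "Found 8345 target reads from kraken2 results"
centrifuge_only = 12    # "Centrifuge found n=12 target reads not found with Kraken2"

# |K ∪ C| = |K| + |C \ K|
total_viral = kraken2_targets + centrifuge_only
assert total_viral == 8357  # "Total viral reads=8357"
```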
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.2.0 (2019-09-23)
- Use seqtk subseq instead of screed for pulling reads of interest from input files
- Added external dependencies on seqtk (for pulling reads from input FASTQs) and pbgzip (for parallel block Gzip compression of the Gzipped FASTQ output)
- Removed the screed Python dependency
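The seqtk-based extraction can be approximated from Python as below. The helper name, the temporary ID-list file, and the use of gzip in place of pbgzip are illustrative assumptions, not the package's actual code; it assumes seqtk and gzip are on PATH.

```python
import subprocess
import tempfile

def extract_reads(fastq_path, read_ids, out_path):
    """Pull the named reads out of a FASTQ with `seqtk subseq`, gzip the result.

    Illustrative sketch only: the package itself uses pbgzip for
    parallel block compression instead of plain gzip.
    """
    # seqtk subseq takes a file listing one read ID per line
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(read_ids) + "\n")
        id_file = f.name
    with open(out_path, "wb") as out:
        seqtk = subprocess.Popen(
            ["seqtk", "subseq", fastq_path, id_file],
            stdout=subprocess.PIPE,
        )
        # stream seqtk's output straight into the compressor
        subprocess.run(["gzip", "-c"], stdin=seqtk.stdout,
                       stdout=out, check=True)
        seqtk.stdout.close()
        seqtk.wait()
```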
0.1.0 (2019-04-15)
First release on PyPI.
Project details
Download files
Source Distribution
File details
Details for the file filter_classified_reads-0.2.0.tar.gz.
File metadata
- Download URL: filter_classified_reads-0.2.0.tar.gz
- Upload date:
- Size: 1.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 34d386ac5ac8d53c24dad7e16056762f2ea6090505ea42d8e81a9e2af203519c
MD5 | 874906de011be0dc8a34cda7ae89df12
BLAKE2b-256 | 4fffd708158d9a8d07c0b7fb0481e7d17643570a8836b2807ca4efe6639c1c8b
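To check a downloaded archive against the hashes above, a streaming digest avoids loading the whole file into memory. The helper below is a generic sketch using Python's standard hashlib; the filename and expected digest in the comment are the ones from this release.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are never fully in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected SHA256 from the table above:
# assert file_digest("filter_classified_reads-0.2.0.tar.gz") == \
#     "34d386ac5ac8d53c24dad7e16056762f2ea6090505ea42d8e81a9e2af203519c"
```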