Filter for reads from taxa of interest using Kraken2/Centrifuge classification results

filter_classified_reads

Filter for reads from taxa of interest using Kraken2/Centrifuge classification results.

Features

  • Filter for the union of reads classified to taxa of interest by Kraken2 and Centrifuge (by default, viral reads (taxid=10239))

  • Output unclassified reads along with reads from taxa of interest, or exclude them with --exclude-unclassified

  • Uses seqtk for quickly filtering reads and pbgzip for parallel block Gzip compression of output reads (installing these dependencies with Conda is recommended)
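
The selection described in the first two features can be sketched with plain Python sets. This is only an illustration of the idea, not the package's actual code: the function name `select_reads` and its parameters are hypothetical, and the read IDs are made up.

```python
def select_reads(kraken2_target, centrifuge_target,
                 kraken2_unclassified, centrifuge_unclassified,
                 exclude_unclassified=False):
    """Return the read IDs to keep (hypothetical helper, for illustration).

    Target reads are the union of what either classifier assigned to the
    taxa of interest. Unclassified reads are kept only if both classifiers
    left them unclassified, unless exclude_unclassified is True.
    """
    target = set(kraken2_target) | set(centrifuge_target)
    if exclude_unclassified:
        return target
    common_unclassified = set(kraken2_unclassified) & set(centrifuge_unclassified)
    return target | common_unclassified


# Made-up read IDs, purely for demonstration
kept = select_reads(
    kraken2_target={"r1", "r2"},
    centrifuge_target={"r2", "r3"},
    kraken2_unclassified={"r4", "r5"},
    centrifuge_unclassified={"r5", "r6"},
)
print(sorted(kept))  # ['r1', 'r2', 'r3', 'r5']
```

Here r4 and r6 are dropped because only one of the two classifiers left them unclassified.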

Usage

Paired-end reads with classification results by both Kraken2 and Centrifuge

filter_classified_reads -i /path/to/reads/R1.fq \
                        -I /path/to/reads/R2.fq \
                        -o /path/to/reads/R1.filtered.fq.gz \
                        -O /path/to/reads/R2.filtered.fq.gz \
                        -k /path/to/kraken2/results.tsv \
                        -K /path/to/kraken2/kreport.tsv \
                        -c /path/to/centrifuge/results.tsv \
                        -C /path/to/centrifuge/kreport.tsv
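
The -K and -C inputs are Kraken-style reports, which list taxa as tab-separated rows with the taxonomy ID in the fifth column and a name indented two spaces per level in the sixth. One way to collect every taxonomy ID under a taxon of interest (such as Viruses, taxid=10239) from such a report is sketched below. The function `taxids_under` is hypothetical and the sample rows and counts are invented; the package itself may well do this differently (its log output mentions DataFrames).

```python
def taxids_under(report_lines, root_taxid):
    """Collect root_taxid and every taxid nested beneath it in a
    Kraken-style report (hypothetical helper, for illustration)."""
    found = set()
    root_depth = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        taxid = int(fields[4])
        name = fields[5]
        # Nesting depth is encoded as two leading spaces per level
        depth = (len(name) - len(name.lstrip(" "))) // 2
        if root_depth is None:
            if taxid == root_taxid:
                root_depth = depth
                found.add(taxid)
        elif depth > root_depth:
            found.add(taxid)
        else:
            break  # left the subtree of the taxon of interest
    return found


# Invented report rows: percent, clade reads, direct reads, rank, taxid, name
sample_report = [
    "50.00\t5\t5\tU\t0\tunclassified",
    "50.00\t5\t0\tR\t1\troot",
    "30.00\t3\t0\tD\t10239\t  Viruses",
    "30.00\t3\t1\tF\t11050\t    Flaviviridae",
    "20.00\t2\t2\tG\t11051\t      Flavivirus",
    "20.00\t2\t2\tD\t2\t  Bacteria",
]
print(sorted(taxids_under(sample_report, 10239)))  # [10239, 11050, 11051]
```

Reads whose assigned taxonomy ID falls in the returned set would then count as target reads.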

Using test data in tests/data/:

$ filter_classified_reads -i tests/data/SRR8207674_1.viral_unclassified.seqtk_seed42_n10000.fastq.gz \
                          -I tests/data/SRR8207674_2.viral_unclassified.seqtk_seed42_n10000.fastq.gz \
                          -o r1.fq.gz \
                          -O r2.fq.gz \
                          -k tests/data/SRR8207674-kraken2_results.tsv \
                          -K tests/data/SRR8207674-kraken2_report.tsv \
                          -c tests/data/SRR8207674-centrifuge_results.tsv \
                          -C tests/data/SRR8207674-centrifuge_kreport.tsv

You should see the following log information:

2019-04-16 13:40:34,114 INFO: Parsing centrifuge results into DataFrame [in target_classified_reads.py:49]
2019-04-16 13:40:34,168 INFO: Parsed n=12281 centrifuge result records into DataFrame from "tests/data/SRR8207674-centrifuge_results.tsv" [in target_classified_reads.py:57]
2019-04-16 13:40:34,172 INFO: Parsed n=298 centrifuge Kraken-style report records into DataFrame from "tests/data/SRR8207674-centrifuge_kreport.tsv" [in target_classified_reads.py:60]
2019-04-16 13:40:34,177 INFO: Found 7129 unclassified reads from Centrifuge results [in target_classified_reads.py:65]
2019-04-16 13:40:34,242 INFO: Found 231 unique viral Taxonomy IDs [in target_classified_reads.py:98]
2019-04-16 13:40:34,245 INFO: Found 2181 target reads from centrifuge results [in target_classified_reads.py:101]
2019-04-16 13:40:34,245 INFO: Parsing kraken2 results into DataFrame [in target_classified_reads.py:49]
2019-04-16 13:40:34,289 INFO: Parsed n=20000 kraken2 result records into DataFrame from "tests/data/SRR8207674-kraken2_results.tsv" [in target_classified_reads.py:57]
2019-04-16 13:40:34,293 INFO: Parsed n=139 kraken2 Kraken-style report records into DataFrame from "tests/data/SRR8207674-kraken2_report.tsv" [in target_classified_reads.py:60]
2019-04-16 13:40:34,295 INFO: Found 1737 unclassified reads from Centrifuge results [in target_classified_reads.py:65]
2019-04-16 13:40:34,325 INFO: Found 26 unique viral Taxonomy IDs [in target_classified_reads.py:98]
2019-04-16 13:40:34,331 INFO: Found 8345 target reads from kraken2 results [in target_classified_reads.py:101]
2019-04-16 13:40:34,332 INFO: Found N=1701 common unclassified reads by all classification methods. [in cli.py:110]
2019-04-16 13:40:34,333 INFO: Total viral reads=8357 [in util.py:37]
2019-04-16 13:40:34,333 INFO: Centrifuge found n=12 target reads not found with Kraken2 [in util.py:38]
2019-04-16 13:40:34,333 INFO: Kraken2 found n=6176 target reads not found with Centrifuge [in util.py:40]
2019-04-16 13:40:34,338 INFO: N=1701 reads unclassified by both Centrifuge and Kraken2. [in util.py:62]
2019-04-16 13:40:34,345 INFO: Writing n=9999 filtered reads from "tests/data/SRR8207674_1.viral_unclassified.seqtk_seed42_n10000.fastq.gz" to "r1.fq.gz" [in cli.py:129]
2019-04-16 13:40:34,957 INFO: Writing n=9999 filtered reads from "tests/data/SRR8207674_2.viral_unclassified.seqtk_seed42_n10000.fastq.gz" to "r2.fq.gz" [in cli.py:134]
2019-04-16 13:40:35,459 INFO: Done! [in cli.py:137]
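
The target-read counts in the log above are consistent with a simple union of the two classifiers' results. A short check, using only numbers copied from the log (the variable names are just for illustration):

```python
# Counts copied from the log output above
centrifuge_target = 2181   # target reads from centrifuge results
kraken2_target = 8345      # target reads from kraken2 results
centrifuge_only = 12       # found by Centrifuge but not Kraken2
kraken2_only = 6176        # found by Kraken2 but not Centrifuge
total_viral = 8357         # "Total viral reads"

# Reads found by both tools, computed from each side
shared_via_centrifuge = centrifuge_target - centrifuge_only
shared_via_kraken2 = kraken2_target - kraken2_only

# Size of the union by inclusion-exclusion
union_size = centrifuge_target + kraken2_target - shared_via_centrifuge

print(shared_via_centrifuge, shared_via_kraken2, union_size)  # 2169 2169 8357
```

Both sides agree on 2169 shared reads, and the union matches the reported total of 8357 viral reads.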

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.2.0 (2019-09-23)

  • Use seqtk subseq instead of screed for pulling reads of interest from files

  • Added external dependencies: seqtk for pulling reads from input FASTQs and pbgzip for parallel block Gzip compression of output FASTQs

  • Removed screed Python dependency

0.1.0 (2019-04-15)

  • First release on PyPI.
