filter_classified_reads
Filter for reads from taxa of interest using Kraken2/Centrifuge classification results.
Free software: Apache Software License 2.0
Documentation: https://filter-classified-reads.readthedocs.io.
Features
- Filter for the union of reads classified to taxa of interest by Kraken2 and Centrifuge (by default, viral reads; taxid=10239)
- Output unclassified reads along with reads from taxa of interest, or exclude them with --exclude-unclassified
- Uses seqtk for quickly filtering reads and pbgzip for parallel block Gzip compression of output reads (it is recommended to install these dependencies with Conda)
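The union filtering described above can be sketched in a few lines. This is a simplified illustration only, not the package's actual implementation (which parses results into pandas DataFrames, as the log output below shows); it assumes Kraken2's per-read output layout, where column 2 is the read ID and column 3 is the assigned taxonomy ID.

```python
def classified_read_ids(results_tsv, target_taxids):
    """Return the set of read IDs assigned to any taxid in target_taxids.

    Simplified sketch assuming Kraken2's per-read output layout
    (column 1 = C/U flag, column 2 = read ID, column 3 = taxid).
    Centrifuge's per-read output puts the read ID in column 1 instead.
    """
    ids = set()
    with open(results_tsv) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] in target_taxids:
                ids.add(fields[1])
    return ids

# Union of hits from both classifiers (file names and taxid set are
# hypothetical here; the real tool resolves all taxids under 10239):
# viral = {"10239"}
# keep = (classified_read_ids("kraken2_results.tsv", viral)
#         | classified_read_ids("centrifuge_results.tsv", viral))
```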
Usage
Paired-end reads with classification results by both Kraken2 and Centrifuge
filter_classified_reads -i /path/to/reads/R1.fq \
-I /path/to/reads/R2.fq \
-o /path/to/reads/R1.filtered.fq.gz \
-O /path/to/reads/R2.filtered.fq.gz \
-k /path/to/kraken2/results.tsv \
-K /path/to/kraken2/kreport.tsv \
-c /path/to/centrifuge/results.tsv \
-C /path/to/centrifuge/kreport.tsv
Using test data in tests/data/:
$ filter_classified_reads -i tests/data/SRR8207674_1.viral_unclassified.seqtk_seed42_n10000.fastq.gz \
-I tests/data/SRR8207674_2.viral_unclassified.seqtk_seed42_n10000.fastq.gz \
-o r1.fq.gz \
-O r2.fq.gz \
-k tests/data/SRR8207674-kraken2_results.tsv \
-K tests/data/SRR8207674-kraken2_report.tsv \
-c tests/data/SRR8207674-centrifuge_results.tsv \
-C tests/data/SRR8207674-centrifuge_kreport.tsv
You should see the following log information:
2019-04-16 13:40:34,114 INFO: Parsing centrifuge results into DataFrame [in target_classified_reads.py:49]
2019-04-16 13:40:34,168 INFO: Parsed n=12281 centrifuge result records into DataFrame from "tests/data/SRR8207674-centrifuge_results.tsv" [in target_classified_reads.py:57]
2019-04-16 13:40:34,172 INFO: Parsed n=298 centrifuge Kraken-style report records into DataFrame from "tests/data/SRR8207674-centrifuge_kreport.tsv" [in target_classified_reads.py:60]
2019-04-16 13:40:34,177 INFO: Found 7129 unclassified reads from Centrifuge results [in target_classified_reads.py:65]
2019-04-16 13:40:34,242 INFO: Found 231 unique viral Taxonomy IDs [in target_classified_reads.py:98]
2019-04-16 13:40:34,245 INFO: Found 2181 target reads from centrifuge results [in target_classified_reads.py:101]
2019-04-16 13:40:34,245 INFO: Parsing kraken2 results into DataFrame [in target_classified_reads.py:49]
2019-04-16 13:40:34,289 INFO: Parsed n=20000 kraken2 result records into DataFrame from "tests/data/SRR8207674-kraken2_results.tsv" [in target_classified_reads.py:57]
2019-04-16 13:40:34,293 INFO: Parsed n=139 kraken2 Kraken-style report records into DataFrame from "tests/data/SRR8207674-kraken2_report.tsv" [in target_classified_reads.py:60]
2019-04-16 13:40:34,295 INFO: Found 1737 unclassified reads from Centrifuge results [in target_classified_reads.py:65]
2019-04-16 13:40:34,325 INFO: Found 26 unique viral Taxonomy IDs [in target_classified_reads.py:98]
2019-04-16 13:40:34,331 INFO: Found 8345 target reads from kraken2 results [in target_classified_reads.py:101]
2019-04-16 13:40:34,332 INFO: Found N=1701 common unclassified reads by all classification methods. [in cli.py:110]
2019-04-16 13:40:34,333 INFO: Total viral reads=8357 [in util.py:37]
2019-04-16 13:40:34,333 INFO: Centrifuge found n=12 target reads not found with Kraken2 [in util.py:38]
2019-04-16 13:40:34,333 INFO: Kraken2 found n=6176 target reads not found with Centrifuge [in util.py:40]
2019-04-16 13:40:34,338 INFO: N=1701 reads unclassified by both Centrifuge and Kraken2. [in util.py:62]
2019-04-16 13:40:34,345 INFO: Writing n=9999 filtered reads from "tests/data/SRR8207674_1.viral_unclassified.seqtk_seed42_n10000.fastq.gz" to "r1.fq.gz" [in cli.py:129]
2019-04-16 13:40:34,957 INFO: Writing n=9999 filtered reads from "tests/data/SRR8207674_2.viral_unclassified.seqtk_seed42_n10000.fastq.gz" to "r2.fq.gz" [in cli.py:134]
2019-04-16 13:40:35,459 INFO: Done! [in cli.py:137]
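The counts in this log are internally consistent: the reported total is the size of the union of the two classifiers' target-read sets, i.e. Kraken2's hits plus the Centrifuge hits that Kraken2 missed.

```python
kraken2_targets = 8345  # "Found 8345 target reads from kraken2 results"
centrifuge_only = 12    # "Centrifuge found n=12 target reads not found with Kraken2"

# |K ∪ C| = |K| + |C \ K|
total_viral = kraken2_targets + centrifuge_only
assert total_viral == 8357  # "Total viral reads=8357"
```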
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.2.0 (2019-09-23)
- Use seqtk subseq instead of screed for pulling reads of interest from input files
- Added external dependencies on seqtk (for pulling reads from input FASTQs) and pbgzip (for parallel block Gzip compression of the Gzipped FASTQ output)
- Removed the screed Python dependency
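The seqtk-based extraction can be approximated from Python as below. The helper name, the temporary ID-list file, and the use of gzip in place of pbgzip are illustrative assumptions, not the package's actual code; it assumes seqtk and gzip are on PATH.

```python
import subprocess
import tempfile

def extract_reads(fastq_path, read_ids, out_path):
    """Pull the named reads out of a FASTQ with `seqtk subseq`, gzip the result.

    Illustrative sketch only: the package itself uses pbgzip for
    parallel block compression instead of plain gzip.
    """
    # seqtk subseq takes a file listing one read ID per line
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(read_ids) + "\n")
        id_file = f.name
    with open(out_path, "wb") as out:
        seqtk = subprocess.Popen(
            ["seqtk", "subseq", fastq_path, id_file],
            stdout=subprocess.PIPE,
        )
        # stream seqtk's output straight into the compressor
        subprocess.run(["gzip", "-c"], stdin=seqtk.stdout,
                       stdout=out, check=True)
        seqtk.stdout.close()
        seqtk.wait()
```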
0.1.0 (2019-04-15)
First release on PyPI.
Project details
Download files
Source Distribution
File details
Details for the file filter_classified_reads-0.2.0.tar.gz.
File metadata
- Download URL: filter_classified_reads-0.2.0.tar.gz
- Upload date:
- Size: 1.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 34d386ac5ac8d53c24dad7e16056762f2ea6090505ea42d8e81a9e2af203519c
MD5 | 874906de011be0dc8a34cda7ae89df12
BLAKE2b-256 | 4fffd708158d9a8d07c0b7fb0481e7d17643570a8836b2807ca4efe6639c1c8b
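To check a downloaded archive against the hashes above, a streaming digest avoids loading the whole file into memory. The helper below is a generic sketch using Python's standard hashlib; the filename and expected digest in the comment are the ones from this release.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are never fully in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected SHA256 from the table above:
# assert file_digest("filter_classified_reads-0.2.0.tar.gz") == \
#     "34d386ac5ac8d53c24dad7e16056762f2ea6090505ea42d8e81a9e2af203519c"
```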