Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.
Includes 33 bp k-mer SNV subtyping schemes for Salmonella enterica subsp. enterica serovars Heidelberg, Enteritidis, and Typhimurium genomes developed by Geneviève Labbé et al., and for S. enterica serovar Typhi adapted from Wong et al., Britto et al., Rahman et al., and Klemm et al.
Works on genome assemblies (FASTA files) or reads (FASTQ files)! Accepts Gzipped FASTA/FASTQ files as input!
Also includes a Mycobacterium tuberculosis lineage scheme adapted from Coll et al. by Daniel Kein.
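To make the idea of k-mer based subtyping concrete, here is a minimal sketch of looking up one scheme k-mer in an assembled contig on either strand. It is an illustration only, not biohansel's implementation: the function names are hypothetical, the flanking sequence is made up, and a plain substring search stands in for whatever multi-pattern search the tool uses internally.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_kmer(kmer: str, genome: str):
    """Return (position, strand) of the first exact match, or None.

    Hypothetical helper: searches the forward k-mer first, then its
    reverse complement, using a simple substring search.
    """
    genome = genome.upper()
    for strand, query in (("+", kmer.upper()), ("-", reverse_complement(kmer))):
        position = genome.find(query)
        if position != -1:
            return position, strand
    return None


# A 33 bp k-mer taken from the example output in this README,
# embedded in made-up flanking sequence.
KMER = "GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG"
contig = "ACGT" * 5 + KMER + "TTGACA"

print(find_kmer(KMER, contig))                      # forward-strand hit
print(find_kmer(KMER, reverse_complement(contig)))  # reverse-strand hit
```

A scheme contains many such k-mers, so a real implementation would search for all of them in a single pass rather than one at a time.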
Citation
If you find the biohansel tool useful, please cite as:
Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel. Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash. bioRxiv 2020.01.10.902056; doi: https://doi.org/10.1101/2020.01.10.902056
Read the Docs
More in-depth information on running and installing biohansel can be found on the biohansel readthedocs page.
Requirements and Dependencies
Each new build of biohansel is automatically tested on Linux using Continuous Integration. biohansel has been confirmed to work on Mac OSX (versions 10.13.5 Beta and 10.12.6) when installed with Conda.
These are the dependencies required for biohansel:
- Python (>=3.6)
- numpy >=1.12.1
- pandas >=0.20.1
- pyahocorasick >=1.1.6
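If you want to confirm that an existing environment satisfies these minimums before installing, a small check along the following lines may help. This is a sketch under assumptions: `version_tuple`, `meets_minimum` and `check_dependencies` are hypothetical helpers, the version comparison is deliberately simplistic, and `importlib.metadata` requires Python 3.8 or newer.

```python
import re

# Minimum versions as listed above.
MINIMUM_VERSIONS = {"numpy": "1.12.1", "pandas": "0.20.1", "pyahocorasick": "1.1.6"}


def version_tuple(version: str):
    """Turn '1.12.1' into (1, 12, 1), keeping at most three numeric parts."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically (simplified comparison)."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, is_ok)}."""
    from importlib import metadata  # standard library in Python 3.8+

    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
        else:
            report[package] = (installed, meets_minimum(installed, minimum))
    return report


for package, (installed, ok) in check_dependencies().items():
    print(f"{package}: installed={installed} ok={ok}")
```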
Installation
With Conda
Install biohansel from Bioconda with Conda (Conda installation instructions):
# setup Conda channels for Bioconda and Conda-Forge (https://bioconda.github.io/#set-up-channels)
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install biohansel
conda install bio_hansel
With pip from PyPI
Install biohansel from PyPI with pip:
pip install bio_hansel
With pip from GitHub
Or install the latest master branch version directly from GitHub:
pip install git+https://github.com/phac-nml/biohansel.git@master
Install into Galaxy (version >= 17.01)
Install biohansel from the main Galaxy toolshed:
https://toolshed.g2.bx.psu.edu/view/nml/biohansel/ba6a0af656a6
Usage
If you run hansel -h, you should see the following usage statement:
usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME]
[-M SCHEME_METADATA] [-p forward_reads reverse_reads]
[-i fasta_path genome_name] [-D INPUT_DIRECTORY]
[-o OUTPUT_SUMMARY] [-O OUTPUT_KMER_RESULTS]
[-S OUTPUT_SIMPLE_SUMMARY] [--force] [--json]
[--min-kmer-freq MIN_KMER_FREQ] [--min-kmer-frac MIN_KMER_FRAC]
[--max-kmer-freq MAX_KMER_FREQ]
[--low-cov-depth-freq LOW_COV_DEPTH_FREQ]
[--max-missing-kmers MAX_MISSING_KMERS]
[--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS]
[--low-cov-warning LOW_COV_WARNING]
[--max-intermediate-kmers MAX_INTERMEDIATE_KMERS]
[--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS] [-v]
[-V]
[F [F ...]]
BioHansel version 2.5.1: Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
Built-in schemes:
* heidelberg: Salmonella enterica spp. enterica serovar Heidelberg
* enteritidis: Salmonella enterica spp. enterica serovar Enteritidis
* typhimurium: Salmonella enterica spp. enterica serovar Typhimurium
* typhi: Salmonella enterica spp. enterica serovar Typhi
* tb_lineage: Mycobacterium tuberculosis
Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.
positional arguments:
F Input genome FASTA/FASTQ files (can be Gzipped)
optional arguments:
-h, --help show this help message and exit
-s SCHEME, --scheme SCHEME
Scheme to use for subtyping (built-in: "heidelberg",
"enteritidis", "typhi", "typhimurium", "tb_lineage";
OR user-specified: /path/to/user/scheme)
--scheme-name SCHEME_NAME
Custom user-specified SNP substyping scheme name
-M SCHEME_METADATA, --scheme-metadata SCHEME_METADATA
Scheme subtype metadata table (tab-delimited file with
".tsv" or ".tab" extension or CSV with ".csv"
extension format accepted; MUST contain column called
"subtype")
-p forward_reads reverse_reads, --paired-reads forward_reads reverse_reads
FASTQ paired-end reads
-i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name
input fasta file path AND genome name
-D INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
directory of input fasta files (.fasta|.fa|.fna) or
FASTQ files (paired FASTQ should have same basename
with "_\d\.(fastq|fq)" postfix to be automatically
paired) (files can be Gzipped)
-o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY
Subtyping summary output path (tab-delimited)
-O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS
Subtyping kmer matching output path (tab-delimited)
-S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY
Subtyping simple summary output path
--force Force existing output files to be overwritten
--json Output JSON representation of output files
--min-kmer-freq MIN_KMER_FREQ
Min k-mer freq/coverage
--min-kmer-frac MIN_KMER_FRAC
Proportion of k-mer required for detection (0.0 - 1)
--max-kmer-freq MAX_KMER_FREQ
Max k-mer freq/coverage
--low-cov-depth-freq LOW_COV_DEPTH_FREQ
Frequencies below this coverage are considered low
coverage
--max-missing-kmers MAX_MISSING_KMERS
Decimal proportion of maximum allowable missing kmers
before being considered an error. (0.0 - 1.0)
--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS
Minimum number of missing kmers to be considered an
ambiguous result
--low-cov-warning LOW_COV_WARNING
Overall kmer coverage below this value will trigger a
low coverage warning
--max-intermediate-kmers MAX_INTERMEDIATE_KMERS
Decimal proportion of maximum allowable missing kmers
to be considered an intermediate subtype. (0.0 - 1.0)
--max-degenerate-kmers MAX_DEGENERATE_KMERS
Maximum number of scheme k-mers allowed before
quitting with a usage warning. Default is 100000
-t THREADS, --threads THREADS
Number of parallel threads to run analysis (default=1)
-v, --verbose Logging verbosity level (-v == show warnings; -vvv ==
show debug info)
-V, --version show program's version number and exit
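When biohansel is driven from a Python pipeline, the options listed above can be assembled into an argument list. The sketch below only builds and prints the command; `build_hansel_command` is a hypothetical helper, and it uses only flags that appear in the usage statement (`-s`, `-t`, `-o`, `-O`, `-p`, `--force`).

```python
import shlex


def build_hansel_command(scheme, inputs=(), paired_reads=None,
                         summary="results.tab",
                         kmer_results="match_results.tab",
                         threads=1, force=False):
    """Build an argument list for the hansel command line (hypothetical helper)."""
    command = ["hansel", "-s", scheme, "-t", str(threads),
               "-o", summary, "-O", kmer_results]
    if force:
        command.append("--force")  # overwrite existing output files
    if paired_reads is not None:
        forward, reverse = paired_reads
        command += ["-p", forward, reverse]
    command += list(inputs)  # positional FASTA/FASTQ files
    return command


command = build_hansel_command("heidelberg", inputs=["/path/to/SRR1002850.fasta"])
print(" ".join(shlex.quote(arg) for arg in command))
# To actually run it: subprocess.run(command, check=True)
```

Passing a list rather than a single string to `subprocess.run` avoids shell quoting problems with file paths that contain spaces.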
Example Usage
Analysis of a single FASTA file
hansel -s heidelberg -vv -o results.tab -O match_results.tab /path/to/SRR1002850.fasta
Contents of results.tab:
sample scheme subtype all_subtypes kmers_matching_subtype are_subtypes_consistent inconsistent_subtypes n_kmers_matching_all n_kmers_matching_all_total n_kmers_matching_positive n_kmers_matching_positive_total n_kmers_matching_subtype n_kmers_matching_subtype_total file_path
SRR1002850 heidelberg 2.2.2.2.1.4 2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4 1037658-2.2.2.2.1.4; 2154958-2.2.2.2.1.4; 3785187-2.2.2.2.1.4 True 202 202 17 17 3 3 SRR1002850.fasta
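The summary is a plain tab-delimited file, so it can be read with Python's standard `csv` module. The sketch below parses a shortened copy of the row above (only four of the columns) and checks that `all_subtypes` equals the dotted prefixes of `subtype`. That prefix relationship is an observation from this example, not a documented guarantee, and `expand_subtype` and `read_summary` are hypothetical helper names.

```python
import csv
import io


def expand_subtype(subtype):
    """'2.2.1' -> ['2', '2.2', '2.2.1'] (each dotted prefix of the subtype)."""
    parts = subtype.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def read_summary(handle):
    """Read a tab-delimited summary into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Shortened copy of the example row; real files have more columns.
EXAMPLE = (
    "sample\tscheme\tsubtype\tall_subtypes\n"
    "SRR1002850\theidelberg\t2.2.2.2.1.4\t"
    "2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4\n"
)

for row in read_summary(io.StringIO(EXAMPLE)):
    observed = row["all_subtypes"].split("; ")
    matches_prefixes = observed == expand_subtype(row["subtype"])
    print(row["sample"], row["subtype"], matches_prefixes)
```

For a real run, replace the `io.StringIO(EXAMPLE)` handle with `open("results.tab", newline="")`.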
Contents of match_results.tab:
kmername stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen seq coverage is_trunc refposition subtype is_pos_kmer sample file_path scheme
775920-2.2.2.2 NODE_2_length_512016_cov_46.4737_ID_3 100.0 33 0 0 1 33 474875 474907 2.0000000000000002e-11 62.1 33 512016 GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG 1.0 False 775920 2.2.2.2 True SRR1002850 SRR1002850.fasta heidelberg
negative3305400-2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 276235 276267 2.0000000000000002e-11 62.1 33 427905 CATCGTGAAGCAGAACAGACGCGCATTCTTGCT 1.0 False negative3305400 2.1.1.1 False SRR1002850 SRR1002850.fasta heidelberg
negative3200083-2.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 170918 170950 2.0000000000000002e-11 62.1 33 427905 ACCCGGTCTACCGCAAAATGGAAAGCGATATGC 1.0 False negative3200083 2.1 False SRR1002850 SRR1002850.fasta heidelberg
negative3204925-2.2.3.1.5 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 175760 175792 2.0000000000000002e-11 62.1 33 427905 CTCGCTGGCAAGCAGTGCGGGTACTATCGGCGG 1.0 False negative3204925 2.2.3.1.5 False SRR1002850 SRR1002850.fasta heidelberg
negative3230678-2.2.2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 201513 201545 2.0000000000000002e-11 62.1 33 427905 AGCGGTGCGCCAAACCACCCGGAATGATGAGTG 1.0 False negative3230678 2.2.2.1.1.1 False SRR1002850 SRR1002850.fasta heidelberg
negative3233869-2.1.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 204704 204736 2.0000000000000002e-11 62.1 33 427905 CAGCGCTGGTATGTGGCTGCACCATCGTCATTA 1.0 False
[Next 196 lines omitted.]
Analysis of a single FASTQ readset
hansel -s heidelberg -vv -t 4 -o results.tab -O match_results.tab -p SRR5646583_forward.fastqsanger SRR5646583_reverse.fastqsanger
Contents of results.tab:
sample scheme subtype all_subtypes kmers_matching_subtype are_subtypes_consistent inconsistent_subtypes n_kmers_matching_all n_kmers_matching_all_total n_kmers_matching_positive n_kmers_matching_positive_total n_kmers_matching_subtype n_kmers_matching_subtype_total file_path
SRR5646583 heidelberg 2.2.1.1.1.1 2; 2.2; 2.2.1; 2.2.1.1; 2.2.1.1.1; 2.2.1.1.1.1 1983064-2.2.1.1.1.1; 4211912-2.2.1.1.1.1 True 202 202 20 20 2 2 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger
Contents of match_results.tab:
seq freq sample file_path kmername is_pos_kmer subtype refposition is_kmer_freq_okay scheme
ACGGTAAAAGAGGACTTGACTGGCGCGATTTGC 68 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 21097-2.2.1.1.1 True 2.2.1.1.1 21097 True heidelberg
AACCGGCGGTATTGGCTGCGGTAAAAGTACCGT 77 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 157792-2.2.1.1.1 True 2.2.1.1.1 157792 True heidelberg
CCGCTGCTTTCTGAAATCGCGCGTCGTTTCAAC 67 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 293728-2.2.1.1 True 2.2.1.1 293728 True heidelberg
GAATAACAGCAAAGTGATCATGATGCCGCTGGA 91 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 607438-2.2.1 True 2.2.1 607438 True heidelberg
CAGTTTTACATCCTGCGAAATGCGCAGCGTCAA 87 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 691203-2.2.1.1 True 2.2.1.1 691203 True heidelberg
CAGGAGAAAGGATGCCAGGGTCAACACGTAAAC 33 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 944885-2.2.1.1.1 True 2.2.1.1.1 944885 True heidelberg
[Next 200 lines omitted.]
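For reads, each k-mer row carries an observed frequency (`freq`) and an `is_kmer_freq_okay` flag, which relates to the `--min-kmer-freq` and `--max-kmer-freq` options. The sketch below shows one plausible reading of that flag, a simple range check. Both the inclusive bounds and the threshold values used here are assumptions chosen for illustration, not biohansel's documented defaults.

```python
def is_kmer_freq_okay(freq, min_freq, max_freq):
    """True if the observed k-mer frequency lies within [min_freq, max_freq].

    Assumption: bounds are treated as inclusive in this sketch.
    """
    return min_freq <= freq <= max_freq


# Frequencies copied from the example rows above (k-mer name -> freq).
observed = {
    "21097-2.2.1.1.1": 68,
    "157792-2.2.1.1.1": 77,
    "293728-2.2.1.1": 67,
    "607438-2.2.1": 91,
    "691203-2.2.1.1": 87,
    "944885-2.2.1.1.1": 33,
}

# Hypothetical thresholds, chosen only so that one k-mer is flagged.
MIN_FREQ, MAX_FREQ = 50, 1000

flagged = [name for name, freq in observed.items()
           if not is_kmer_freq_okay(freq, MIN_FREQ, MAX_FREQ)]
print(flagged)
```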
Analysis of all FASTA/FASTQ files in a directory
hansel -s heidelberg -vv --threads <n_cpu> -o results.tab -O match_results.tab -D /path/to/fastas_or_fastqs/
biohansel will only attempt to analyze the FASTA/FASTQ files within the specified directory and will not descend into any subdirectories!
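To preview which files a directory run might pick up, the following sketch lists a directory without recursion and groups paired FASTQ files by basename, using the `_\d\.(fastq|fq)` postfix described in the help text. It is an approximation rather than biohansel's own logic: the optional `.gz` suffix handling and the helper names are assumptions, and unpaired FASTQ files are simply ignored here.

```python
import re
from collections import defaultdict
from pathlib import Path

FASTA_RE = re.compile(r"\.(fasta|fa|fna)(\.gz)?$")
PAIRED_FASTQ_RE = re.compile(r"^(?P<base>.+)_\d\.(fastq|fq)(\.gz)?$")


def list_directory(directory):
    """Return file names directly inside `directory` (no subdirectories)."""
    return [path.name for path in Path(directory).iterdir() if path.is_file()]


def group_inputs(filenames):
    """Split file names into FASTA files and paired FASTQ files by basename."""
    fastas = []
    fastq_groups = defaultdict(list)
    for name in sorted(filenames):
        if FASTA_RE.search(name):
            fastas.append(name)
            continue
        match = PAIRED_FASTQ_RE.match(name)
        if match:
            fastq_groups[match.group("base")].append(name)
    # Keep only basenames with exactly two files (forward and reverse).
    pairs = {base: tuple(files) for base, files in fastq_groups.items()
             if len(files) == 2}
    return fastas, pairs


names = ["b.fasta", "s1_1.fastq", "s1_2.fastq", "s2_1.fq.gz", "notes.txt", "a.fna.gz"]
fastas, pairs = group_inputs(names)
print(fastas)
print(pairs)
```

For a real directory, call `group_inputs(list_directory("/path/to/fastas_or_fastqs/"))`.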
Metadata addition to analysis
Add subtype metadata to your analysis results with -M your-subtype-metadata.tsv:
hansel -s heidelberg \
-M your-subtype-metadata.tsv \
-o results.tab \
-O match_results.tab \
-D ~/your-reads-directory/
Your metadata table must contain a column named subtype, for example:
subtype | host_association | geoloc | genotype_alternative
---|---|---|---
1 | human | Canada | A
2 | cow | USA | B
biohansel accepts metadata table files with the following formats and extensions:
Format | Extension | Example Filename
---|---|---
Tab-delimited table/tab-separated values (TSV) | .tsv | my-metadata-table.tsv
Tab-delimited table/tab-separated values (TSV) | .tab | my-metadata-table.tab
Comma-separated values (CSV) | .csv | my-metadata-table.csv
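Before passing a metadata table to `-M`, it can be useful to check that the file extension is one of the accepted ones and that a `subtype` column is present. The sketch below does this with the standard `csv` module. `delimiter_for` and `read_metadata` are hypothetical helpers, and the check mirrors the rules stated above rather than biohansel's internal validation.

```python
import csv
import io

# Accepted extensions and the delimiter each implies.
DELIMITERS = {".tsv": "\t", ".tab": "\t", ".csv": ","}


def delimiter_for(filename):
    """Pick the field delimiter from the file extension."""
    for extension, delimiter in DELIMITERS.items():
        if filename.lower().endswith(extension):
            return delimiter
    raise ValueError(f"unsupported metadata file extension: {filename}")


def read_metadata(handle, filename):
    """Return {subtype: row_dict}; raise if the 'subtype' column is missing."""
    reader = csv.DictReader(handle, delimiter=delimiter_for(filename))
    if "subtype" not in (reader.fieldnames or []):
        raise ValueError('metadata table must contain a "subtype" column')
    return {row["subtype"]: row for row in reader}


# The example table from this section, as tab-separated text.
EXAMPLE_TSV = (
    "subtype\thost_association\tgeoloc\tgenotype_alternative\n"
    "1\thuman\tCanada\tA\n"
    "2\tcow\tUSA\tB\n"
)

metadata = read_metadata(io.StringIO(EXAMPLE_TSV), "my-metadata-table.tsv")
print(metadata["2"]["host_association"])
```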
Development
Get the latest development code using Git from GitHub:
git clone https://github.com/phac-nml/biohansel.git
cd biohansel/
git checkout development
# Create a virtual environment (virtualenv) for development
virtualenv -p python3 .venv
# Activate the newly created virtualenv
source .venv/bin/activate
# Install biohansel into the virtualenv in "editable" mode
pip install -e .
Run tests with pytest:
# In the biohansel/ root directory, install pytest for running tests
pip install pytest
# Run all tests in tests/ directory
pytest
# Or run a specific test module
pytest -s tests/test_qc.py
Legal
Copyright Government of Canada 2017
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Contact
Gary van Domselaar: gary.vandomselaar@canada.ca