
Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.


Includes 33 bp k-mer SNV subtyping schemes for Salmonella enterica subsp. enterica serovars Heidelberg, Enteritidis, and Typhimurium developed by Geneviève Labbé et al., and a scheme for serovar Typhi adapted from Wong et al. (https://www.nature.com/articles/ncomms12827).

Works on genome assemblies (FASTA files) or reads (FASTQ files)! Accepts Gzipped FASTA/FASTQ files as input!

Also includes a Mycobacterium tuberculosis lineage scheme adapted from Coll et al. (https://www.nature.com/articles/ncomms5812) by Daniel Kein.
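To make the idea of k-mer matching concrete, here is a minimal Python sketch of an exact-match search for one 33 bp k-mer on either strand of a sequence. It only illustrates the concept and is not biohansel's actual implementation; the helper names `reverse_complement` and `find_kmer` and the toy contig are assumptions made for this example. The k-mer itself is copied from the match_results.tab example in the Example Usage section.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_kmer(kmer: str, contig: str) -> int:
    """Return the 0-based start of an exact match on either strand, or -1."""
    pos = contig.find(kmer)
    if pos == -1:
        # Not found on the forward strand, so try the reverse complement
        pos = contig.find(reverse_complement(kmer))
    return pos


# 33 bp k-mer taken from the example match results in this README
kmer = "GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG"
# Toy contig built for illustration only (not real data)
contig = "AAAA" + kmer + "CCCC"

print(len(kmer))                # 33
print(find_kmer(kmer, contig))  # 4
```

A real scheme contains many k-mers, so a production tool would likely use a multi-pattern search rather than one `str.find` call per k-mer.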

Citation

If you find the biohansel tool useful, please cite as:

Rapid and robust genotyping of highly clonal bacterial pathogens using BioHansel, a SNP-based k-mer search pipeline. Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash. [Manuscript in preparation]

Read the Docs

More in-depth information on running and installing biohansel can be found on the biohansel readthedocs page.

Requirements and Dependencies

Each new build of biohansel is automatically tested on Linux using Continuous Integration. biohansel has also been confirmed to work on macOS (versions 10.13.5 Beta and 10.12.6) when installed with Conda.

These are the dependencies required for biohansel:

Installation

With Conda

Install biohansel from Bioconda with Conda (Conda installation instructions):

# setup Conda channels for Bioconda and Conda-Forge (https://bioconda.github.io/#set-up-channels)
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install biohansel
conda install bio_hansel

With pip from PyPI

Install biohansel from PyPI with pip:

pip install bio_hansel

With pip from GitHub

Or install the latest master branch version directly from GitHub:

pip install git+https://github.com/phac-nml/biohansel.git@master

Install into Galaxy (version >= 17.01)

Install biohansel from the main Galaxy toolshed:

https://toolshed.g2.bx.psu.edu/view/nml/biohansel/ba6a0af656a6

Usage

If you run hansel -h, you should see the following usage statement:

usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME]
              [-M SCHEME_METADATA] [-p forward_reads reverse_reads]
              [-i fasta_path genome_name] [-D INPUT_DIRECTORY]
              [-o OUTPUT_SUMMARY] [-O OUTPUT_KMER_RESULTS]
              [-S OUTPUT_SIMPLE_SUMMARY] [--force] [--json]
              [--min-kmer-freq MIN_KMER_FREQ]
              [--max-kmer-freq MAX_KMER_FREQ]
              [--low-cov-depth-freq LOW_COV_DEPTH_FREQ]
              [--max-missing-kmers MAX_MISSING_KMERS]
              [--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS]
              [--low-cov-warning LOW_COV_WARNING]
              [--max-intermediate-kmers MAX_INTERMEDIATE_KMERS]
              [--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS]
              [-v] [-V]
              [F [F ...]]

Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, Typhi, and Typhimurium subtyping. Also includes a Mycobacterium tuberculosis scheme called 'tb_lineage'.
Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.

positional arguments:
  F                     Input genome FASTA/FASTQ files (can be Gzipped)

optional arguments:
  -h, --help            show this help message and exit
  -s SCHEME, --scheme SCHEME
                        Scheme to use for subtyping (built-in: "heidelberg",
                        "enteritidis", "typhi", "typhimurium", "tb_lineage"; OR user-specified:
                        /path/to/user/scheme)
  --scheme-name SCHEME_NAME
                        Custom user-specified SNP substyping scheme name
  -M SCHEME_METADATA, --scheme-metadata scheme_metadata
                        Scheme subtype metadata table (.TSV format accepted;
                        must contain column called "subtype")
  -p forward_reads reverse_reads, --paired-reads forward_reads reverse_reads
                        FASTQ paired-end reads
  -i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name
                        fasta file path to genome name pair
  -D INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                        directory of input fasta files (.fasta|.fa|.fna) or
                        FASTQ files (paired FASTQ should have same basename
                        with "_\d\.(fastq|fq)" postfix to be automatically
                        paired) (files can be Gzipped)
  -o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY
                        Subtyping summary output path (tab-delimited)
  -O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS
                        Subtyping kmer matching output path (tab-delimited)
  -S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY
                        Subtyping simple summary output path
  --force               Force existing output files to be overwritten
  --json                Output JSON representation of output files
  --min-kmer-freq MIN_KMER_FREQ
                        Min k-mer freq/coverage
  --max-kmer-freq MAX_KMER_FREQ
                        Max k-mer freq/coverage
  --low-cov-depth-freq LOW_COV_DEPTH_FREQ
                        Frequencies below this coverage are considered low
                        coverage
  --max-missing-kmers MAX_MISSING_KMERS
                        Decimal proportion of maximum allowable missing kmers
                        before being considered an error. (0.0 - 1.0)
  --min-ambiguous-kmers MIN_AMBIGUOUS_KMERS
                        Minimum number of missing kmers to be considered an
                        ambiguous result
  --low-cov-warning LOW_COV_WARNING
                        Overall kmer coverage below this value will trigger a
                        low coverage warning
  --max-intermediate-kmers MAX_INTERMEDIATE_KMERS
                        Decimal proportion of maximum allowable missing kmers
                        to be considered an intermediate subtype. (0.0 - 1.0)
  --max-degenerate-kmers MAX_DEGENERATE_KMERS
                        Maximum number of scheme k-mers allowed before
                        quitting with a usage warning. Default is 100,000
  -t THREADS, --threads THREADS
                        Number of parallel threads to run analysis (default=1)
  -v, --verbose         Logging verbosity level (-v == show warnings; -vvv ==
                        show debug info)
  -V, --version         show program's version number and exit

Example Usage

Analysis of a single FASTA file

hansel -s heidelberg -vv -o results.tab -O match_results.tab /path/to/SRR1002850.fasta

Contents of results.tab:

sample  scheme  subtype all_subtypes    kmers_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_kmers_matching_all    n_kmers_matching_all_total  n_kmers_matching_positive   n_kmers_matching_positive_total n_kmers_matching_subtype    n_kmers_matching_subtype_total  file_path
SRR1002850  heidelberg  2.2.2.2.1.4 2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4  1037658-2.2.2.2.1.4; 2154958-2.2.2.2.1.4; 3785187-2.2.2.2.1.4   True        202 202 17  17  3   3   SRR1002850.fasta
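The summary table can be read with any tab-delimited parser. The sketch below uses Python's standard `csv` module on an in-memory copy of a few columns from the row above (the real file has more columns), and checks that the entries in all_subtypes form a nested lineage. The helper `is_nested_lineage` is a hypothetical name for this example and is not part of biohansel.

```python
import csv
import io

# A subset of the columns from results.tab, rebuilt as tab-delimited text
header = ["sample", "scheme", "subtype", "all_subtypes",
          "are_subtypes_consistent", "inconsistent_subtypes"]
row = ["SRR1002850", "heidelberg", "2.2.2.2.1.4",
       "2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4", "True", ""]
results_text = "\t".join(header) + "\n" + "\t".join(row) + "\n"


def is_nested_lineage(all_subtypes: str) -> bool:
    """True if each subtype is a dotted extension of the one before it."""
    levels = [s.strip() for s in all_subtypes.split(";") if s.strip()]
    return all(b.startswith(a + ".") for a, b in zip(levels, levels[1:]))


rows = list(csv.DictReader(io.StringIO(results_text), delimiter="\t"))
for record in rows:
    print(record["sample"], record["subtype"],
          is_nested_lineage(record["all_subtypes"]))
# prints: SRR1002850 2.2.2.2.1.4 True
```

To read a real file, replace `io.StringIO(results_text)` with `open("results.tab", newline="")`.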

Contents of match_results.tab:

kmername    stitle  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    qlen    slen    seq coverage    is_trunc    refposition subtype is_pos_kmer sample  file_path   scheme
775920-2.2.2.2  NODE_2_length_512016_cov_46.4737_ID_3   100.0   33  0   0   1   33  474875  474907  2.0000000000000002e-11  62.1    33  512016  GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG   1.0 False   775920  2.2.2.2 True    SRR1002850  SRR1002850.fasta   heidelberg
negative3305400-2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  276235  276267  2.0000000000000002e-11  62.1    33  427905  CATCGTGAAGCAGAACAGACGCGCATTCTTGCT   1.0 False   negative3305400 2.1.1.1 False   SRR1002850  SRR1002850.fasta   heidelberg
negative3200083-2.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  170918  170950  2.0000000000000002e-11  62.1    33  427905  ACCCGGTCTACCGCAAAATGGAAAGCGATATGC   1.0 False   negative3200083 2.1 False   SRR1002850  SRR1002850.fasta   heidelberg
negative3204925-2.2.3.1.5   NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  175760  175792  2.0000000000000002e-11  62.1    33  427905  CTCGCTGGCAAGCAGTGCGGGTACTATCGGCGG   1.0 False   negative3204925 2.2.3.1.5   False   SRR1002850  SRR1002850.fasta   heidelberg
negative3230678-2.2.2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  201513  201545  2.0000000000000002e-11  62.1    33  427905  AGCGGTGCGCCAAACCACCCGGAATGATGAGTG   1.0 False   negative3230678 2.2.2.1.1.1 False   SRR1002850  SRR1002850.fasta   heidelberg
negative3233869-2.1.1.1.1   NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  204704  204736  2.0000000000000002e-11  62.1    33  427905  CAGCGCTGGTATGTGGCTGCACCATCGTCATTA   1.0 False
[Next 196 lines omitted.]
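Judging only from the example rows above, k-mer names appear to follow the pattern `<refposition>-<subtype>`, with a `negative` prefix on the position for negative k-mers. The following Python sketch splits a name on that assumption; the function `parse_kmername` is hypothetical, and the naming convention is inferred from this README rather than taken from biohansel's documentation.

```python
NEGATIVE_PREFIX = "negative"


def parse_kmername(name: str):
    """Split a k-mer name into (position, subtype, is_positive).

    Assumes the '<refposition>-<subtype>' pattern seen in the example
    rows, with 'negative' prepended to the position for negative k-mers.
    """
    ref, _, subtype = name.partition("-")
    is_positive = not ref.startswith(NEGATIVE_PREFIX)
    if not is_positive:
        ref = ref[len(NEGATIVE_PREFIX):]
    return int(ref), subtype, is_positive


print(parse_kmername("775920-2.2.2.2"))
# (775920, '2.2.2.2', True)
print(parse_kmername("negative3305400-2.1.1.1"))
# (3305400, '2.1.1.1', False)
```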

Analysis of a single FASTQ readset

hansel -s heidelberg -vv -t 4 -o results.tab -O match_results.tab -p SRR5646583_forward.fastqsanger SRR5646583_reverse.fastqsanger

Contents of results.tab:

sample  scheme  subtype all_subtypes    kmers_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_kmers_matching_all    n_kmers_matching_all_total  n_kmers_matching_positive   n_kmers_matching_positive_total n_kmers_matching_subtype    n_kmers_matching_subtype_total  file_path
SRR5646583  heidelberg  2.2.1.1.1.1 2; 2.2; 2.2.1; 2.2.1.1; 2.2.1.1.1; 2.2.1.1.1.1  1983064-2.2.1.1.1.1; 4211912-2.2.1.1.1.1    True        202 202 20  20  2   2   SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger

Contents of match_results.tab:

seq freq    sample  file_path   kmername    is_pos_kmer subtype refposition is_kmer_freq_okay   scheme
ACGGTAAAAGAGGACTTGACTGGCGCGATTTGC   68  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    21097-2.2.1.1.1 True    2.2.1.1.1   21097   True    heidelberg
AACCGGCGGTATTGGCTGCGGTAAAAGTACCGT   77  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    157792-2.2.1.1.1    True    2.2.1.1.1   157792  True    heidelberg
CCGCTGCTTTCTGAAATCGCGCGTCGTTTCAAC   67  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    293728-2.2.1.1  True    2.2.1.1 293728  True    heidelberg
GAATAACAGCAAAGTGATCATGATGCCGCTGGA   91  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    607438-2.2.1    True    2.2.1   607438  True    heidelberg
CAGTTTTACATCCTGCGAAATGCGCAGCGTCAA   87  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    691203-2.2.1.1  True    2.2.1.1 691203  True    heidelberg
CAGGAGAAAGGATGCCAGGGTCAACACGTAAAC   33  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    944885-2.2.1.1.1    True    2.2.1.1.1   944885  True    heidelberg
[Next 200 lines omitted.]
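For read data, the freq column gives the observed count of each k-mer. As a rough illustration, this Python sketch averages the frequencies of the six rows shown above and lists k-mers that fall below a cutoff. The cutoff `EXAMPLE_MIN_FREQ` is a made-up value for this example and is not a biohansel default; the function `summarize` is likewise hypothetical.

```python
from statistics import mean

# (kmername, freq) pairs copied from the six example rows above
kmer_freqs = [
    ("21097-2.2.1.1.1", 68),
    ("157792-2.2.1.1.1", 77),
    ("293728-2.2.1.1", 67),
    ("607438-2.2.1", 91),
    ("691203-2.2.1.1", 87),
    ("944885-2.2.1.1.1", 33),
]

EXAMPLE_MIN_FREQ = 40  # hypothetical cutoff, for illustration only


def summarize(freqs, min_freq):
    """Return the mean frequency and the names of k-mers below min_freq."""
    average = mean(freq for _, freq in freqs)
    low = [name for name, freq in freqs if freq < min_freq]
    return average, low


average, low = summarize(kmer_freqs, EXAMPLE_MIN_FREQ)
print(average)  # 70.5
print(low)      # ['944885-2.2.1.1.1']
```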

Analysis of all FASTA/FASTQ files in a directory

hansel -s heidelberg -vv --threads <n_cpu> -o results.tab -O match_results.tab -D /path/to/fastas_or_fastqs/

biohansel will only attempt to analyze the FASTA/FASTQ files within the specified directory and will not descend into any subdirectories!
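The usage text states that paired FASTQ files in a directory should share a basename followed by a `_\d\.(fastq|fq)` postfix. The Python sketch below shows one way such pairing could work on a flat list of filenames. It is not biohansel's own code: the function `pair_fastqs` is hypothetical, and the optional `.gz` suffix in the pattern is an assumption based on the note that files can be Gzipped.

```python
import re
from collections import defaultdict

# Postfix from the usage text, plus an assumed optional ".gz" extension
PAIR_SUFFIX = re.compile(r"_\d\.(fastq|fq)(\.gz)?$")


def pair_fastqs(filenames):
    """Group FASTQ filenames by shared basename; keep only complete pairs."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = PAIR_SUFFIX.search(name)
        if match:
            # Everything before the "_1.fastq"-style postfix is the basename
            groups[name[:match.start()]].append(name)
    return {base: files for base, files in groups.items() if len(files) == 2}


files = ["s1_1.fastq", "s1_2.fastq", "s2_1.fq.gz", "s2_2.fq.gz",
         "assembly.fasta", "lonely_1.fastq"]
print(pair_fastqs(files))
# {'s1': ['s1_1.fastq', 's1_2.fastq'], 's2': ['s2_1.fq.gz', 's2_2.fq.gz']}
```

To mirror the non-recursive behaviour described above, the filename list could come from `os.listdir(directory)`, which does not descend into subdirectories.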

Metadata addition to analysis

Add subtype metadata to your analysis results with -M your-subtype-metadata.tsv:

hansel -s heidelberg -M your-subtype-metadata.tsv -o results.tab -O match_results.tab -D ~/your-reads-directory/

Your metadata table must contain a column named subtype, for example:

subtype host_association geoloc genotype_alternative
1 human Canada A
2 cow USA B

biohansel accepts metadata table files with the following formats and extensions:

Format                                            Extension   Example Filename
Tab-delimited table/tab-separated values (TSV)    .tsv        my-metadata-table.tsv
Tab-delimited table/tab-separated values (TSV)    .tab        my-metadata-table.tab
Comma-separated values (CSV)                      .csv        my-metadata-table.csv
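Before running an analysis, it can help to check that a metadata table is readable and has the required subtype column. The Python sketch below picks a delimiter from the file extension, as listed in the table above, and indexes rows by subtype. The function `read_metadata` is a hypothetical helper for this example and does not reflect how biohansel itself loads metadata.

```python
import csv
import io
import os

# Delimiter for each extension listed in the table above
DELIMITERS = {".tsv": "\t", ".tab": "\t", ".csv": ","}


def read_metadata(handle, filename):
    """Return {subtype: row} from a metadata table, validating its header."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in DELIMITERS:
        raise ValueError(f"unsupported metadata extension: {ext!r}")
    reader = csv.DictReader(handle, delimiter=DELIMITERS[ext])
    if "subtype" not in (reader.fieldnames or []):
        raise ValueError('metadata table needs a column called "subtype"')
    return {row["subtype"]: row for row in reader}


# The example table from this README, written as tab-delimited text
example = (
    "subtype\thost_association\tgeoloc\tgenotype_alternative\n"
    "1\thuman\tCanada\tA\n"
    "2\tcow\tUSA\tB\n"
)
metadata = read_metadata(io.StringIO(example), "your-subtype-metadata.tsv")
print(metadata["2"]["host_association"])  # cow
```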

Development

Get the latest development code using Git from GitHub:

git clone https://github.com/phac-nml/biohansel.git
cd biohansel/
git checkout development
# Create a virtual environment (virtualenv) for development
virtualenv -p python3 .venv
# Activate the newly created virtualenv
source .venv/bin/activate
# Install biohansel into the virtualenv in "editable" mode
pip install -e .

Run tests with pytest:

# In the biohansel/ root directory, install pytest for running tests
pip install pytest
# Run all tests in tests/ directory
pytest
# Or run a specific test module
pytest -s tests/test_qc.py

Contact

Gary van Domselaar: gary.vandomselaar@phac-aspc.gc.ca

Download files


Files for bio-hansel, version 2.4.0
Filename                   Size      File type   Python version
bio_hansel-2.4.0.tar.gz    67.2 kB   Source      None
