Species identification pipeline for both single species and metagenomic samples.
HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data
Introduction
haystac is a light-weight, fast, and user-friendly species identification tool. It evaluates the presence of a
particular species of interest in a metagenomic sample, and provides statistical support for the species assignment.
The method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set
of sequencing reads and a database of reference genomes. It works equally well with both modern and ancient DNA sequence
data.
Setup
haystac can be run on either macOS or Linux-based systems.
The recommended way to install haystac, and all of its dependencies, is via the mamba package manager (a fast
replacement for conda).
You can install haystac using conda; however, it will take significantly longer to install and analyses will run slower.
Install mamba
If you do not have either mamba or conda already installed, please refer to the install instructions for mambaforge.
If you have conda installed, but not mamba, then install mamba into the base environment:
conda install -n base -c conda-forge mamba
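To see which of the two cases above applies on a given machine, the check can be sketched in the shell. This is a minimal sketch, assuming a POSIX-compatible shell; the function name pick_installer is hypothetical and not part of haystac.

```shell
# Hypothetical helper (not part of haystac): report which package manager
# is available on the PATH, preferring mamba over conda.
pick_installer() {
    if command -v mamba >/dev/null 2>&1; then
        echo "mamba"
    elif command -v conda >/dev/null 2>&1; then
        echo "conda"
    else
        echo "none"
    fi
}

# Print the next step suggested by the instructions in this section.
case "$(pick_installer)" in
    mamba) echo "mamba found: proceed to install haystac" ;;
    conda) echo "conda found: run 'conda install -n base -c conda-forge mamba' first" ;;
    none)  echo "neither found: install mambaforge first" ;;
esac
```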
Install haystac
Then use mamba to install haystac into a new environment:
mamba create -c conda-forge -c bioconda -n haystac haystac
And activate the environment:
conda activate haystac
We recommend that you install haystac into a new environment to avoid dependency conflicts with other software.
Quick Start
haystac consists of three main modules:
- database: for building a database of reference genomes
- sample: for pre-processing of samples prior to analysis
- analyse: for analysing a sample against a database
1. Build a database
To begin using haystac, we first need to construct a database containing all species of interest to our study. In our
preprint, we show that haystac makes robust species identifications with genus-specific databases (for prokaryotes),
allowing for very fast hypothesis-driven analyses.
In this example, we will build a database containing all species in the Yersinia genus, by supplying haystac with a
simple NCBI search query.
haystac database \
--mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_db
To construct an NCBI search query for your area of interest, visit the NCBI Nucleotide database and use the search
feature to obtain a correctly formatted query string from the "Search details" box. This search query can be used
directly with haystac to automatically download and build a reference database based on the accession codes present in
the result set returned by the query.
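For genus-level databases like the Yersinia example, the same query pattern can be filled in from a shell variable. This is only a sketch: the function name build_query is hypothetical, and it assumes the query shown above is the right shape for your genus. For anything more specific, copy the query from the "Search details" box instead.

```shell
# Hypothetical helper (not part of haystac): fill a genus name into the
# query pattern used in the Yersinia example.
build_query() {
    printf '"%s"[Organism] AND "complete genome"[All Fields]\n' "$1"
}

# Dry run: print the command rather than executing it.
query="$(build_query Yersinia)"
echo "haystac database --mode build --query '$query' --output yersinia_db"
```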
For more exhaustive analyses, you can build a database containing the 5,681 species present in the RefSeq representative database of prokaryotic species by running:
haystac database \
--mode build \
--refseq-rep prokaryote_rep \
--output refseq_db
Note: Building a database this big is not recommended on a laptop computer.
2. Prepare a sample for analysis
The second step in using haystac is to prepare a sample for analysis.
In this example, we will download an aDNA library from Rasmussen et al. (2015), by giving haystac the SRA accession
code ERR1018966.
Most published genomics papers include a BioProject code (e.g. PRJEB10885), from which you can obtain SRA accessions
for each sequencing library.
haystac sample \
--sra ERR1018966 \
--output ERR1018966
To prepare a sample of your own, you will need either single-end or paired-end short read sequencing data in fastq
format.
For a paired-end library, you specify the location of the fastq files and the name of the output directory. You may
also choose to collapse overlapping mate pairs (e.g. for an aDNA library).
haystac sample \
--fastq-r1 /path/to/sample1_R1.fq.gz \
--fastq-r2 /path/to/sample1_R2.fq.gz \
--collapse True \
--output sample1
By default, haystac will scan the supplied library, identify adapter sequences, and automatically remove them.
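When several paired-end libraries sit in one directory, the paired-end command above can be generated once per library. The following is a minimal sketch, assuming mates are named <sample>_R1.fq.gz and <sample>_R2.fq.gz as in the example; the function name sample_cmd and the choice to name each output directory after the sample are illustrative assumptions, not haystac conventions.

```shell
# Hypothetical helper (not part of haystac): print the haystac sample
# command for one R1 file. Assumes the mate file differs only by R1 -> R2.
sample_cmd() {
    r1="$1"
    r2="${r1%_R1.fq.gz}_R2.fq.gz"          # swap the R1 suffix for R2
    name="$(basename "$r1" _R1.fq.gz)"     # sample name without the suffix
    echo "haystac sample --fastq-r1 $r1 --fastq-r2 $r2 --collapse True --output $name"
}

# Dry run: print one command per library so they can be reviewed first.
for r1 in /path/to/*_R1.fq.gz; do
    [ -e "$r1" ] || continue               # skip when the glob matches nothing
    sample_cmd "$r1"
done
```

Printing the commands instead of running them keeps the sketch safe to try; paths containing spaces would need extra quoting before the output could be executed.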
3. Analyse a sample against a database
The third step in using haystac is to perform an analysis of a sample against a database.
Here, we will use haystac to calculate the mean posterior abundance of all species in the Yersinia genus found within
the sample ERR1018966.
haystac analyse \
--mode abundances \
--database yersinia_db \
--sample ERR1018966 \
--output yersinia_ERR1018966
When the analysis is complete, there will be several new sub-folders in the output directory yersinia_ERR1018966/. To
determine if sample ERR1018966 contains Yersinia pestis (the bacterium that causes plague), we can consult the
spreadsheet containing the mean posterior abundance estimates for all species in the Yersinia database (i.e.,
yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv). From this, we can see that 3,266
reads were uniquely assigned to Yersinia pestis, with an overall abundance of 0.047%, and that the chi-squared test
indicates that the reads are spread evenly across the genome.
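The abundance file can also be inspected from the command line rather than in a spreadsheet. This is a minimal sketch: it assumes only that the file is tab-separated with a single header line, makes no assumption about column names, and the function name show_taxon is hypothetical.

```shell
# Hypothetical helper (not part of haystac): print the header row plus any
# rows that mention the given taxon (case-insensitive substring match).
show_taxon() {
    tsv="$1"
    taxon="$2"
    awk -v t="$taxon" 'NR == 1 || index(tolower($0), tolower(t))' "$tsv"
}

# Only run the lookup if the analysis output is present.
tsv="yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv"
if [ -f "$tsv" ]; then
    show_taxon "$tsv" pestis
fi
```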
Before we can confidently conclude that ERR1018966 contains ancient Yersinia pestis, we may want to perform a
damage pattern analysis.
haystac analyse \
--mode abundances \
--database yersinia_db \
--sample ERR1018966 \
--output yersinia_ERR1018966 \
--mapdamage True
User documentation
haystac has many features and potential uses, and we encourage you to use module help menus (e.g. haystac database --help)
to explore these options. The full user documentation is available here: https://haystac.readthedocs.io/en/master/
Reporting errors
haystac is under active development and we encourage you to report any issues you encounter via the GitHub issue
tracker.
Citation
A preprint describing haystac is available on bioRxiv:
Dimopoulos, E.A.*, Carmagnini, A.*, Velsko, I.M., Warinner, C., Larson, G., Frantz, L.A.F., Irving-Pease, E.K., 2020. HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. bioRxiv 2020.12.16.419085. https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1
License
MIT