Bakta: rapid & comprehensive annotation of bacterial genomes & plasmids
Bakta: Rapid & standardized annotation of bacterial genomes & plasmids
Contents
- Description
- Input/Output
- Examples
- Installation
- Annotation workflow
- Database
- Usage
- Citation
- FAQ
- Issues & Feature Requests
Description
TL;DR
Bakta is an offline tool dedicated to the rapid & standardized annotation of bacteria & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable (JSON) & bioinformatics standard file formats for automatic downstream analysis.
The annotation of microbial genomes is a diverse task comprising the structural & functional annotation of different feature types with distinct overlapping characteristics. Existing local annotation pipelines cover a broad range of microbial taxa, e.g. bacteria, archaea, viruses. To streamline and foster the expansion of supported feature types, Bakta is strictly dedicated to the annotation of bacteria and plasmids. To standardize annotations, Bakta uses a comprehensive & versioned annotation database utilizing UniProt's UniRef clusters enriched by cross-references and specialized niche databases.
Exact matches to known protein coding sequences (CDS), subsequently referred to as identical protein sequences (IPS), are identified via MD5 digests and annotated with database cross-references (dbxref) to:
- RefSeq (WP_*)
- UniRef100/UniRef90 (UniRef100_*/UniRef90_*)
- UniParc (UPI*)
By doing so, IPS allow the surveillance of distinct gene alleles and streamline comparative analysis. Also, posterior (external) annotations of putative & hypothetical protein sequences can be mapped back to existing CDS via these exact & stable identifiers (e.g. E. coli gene ymiA).
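The digest-based identification of exact matches can be sketched as follows. This is a minimal illustration, not Bakta's actual code: the function names, the sequence normalization (upper-casing, stripping a trailing stop symbol), the in-memory table and the placeholder identifiers are all assumptions made for the example.

```python
import hashlib

def protein_digest(sequence):
    """Return (length, MD5 hex digest) of an amino acid sequence."""
    # Assumed normalization: upper case, no whitespace, no trailing stop symbol
    seq = sequence.strip().upper().rstrip("*")
    return len(seq), hashlib.md5(seq.encode("ascii")).hexdigest()

def lookup(sequence, table):
    """Return stored cross-references for an exact sequence match, else None."""
    return table.get(protein_digest(sequence))

# Hypothetical stand-in for the database; identifiers are placeholders
known_proteins = {
    protein_digest("MKTAYIAKQR"): {
        "refseq": "WP_PLACEHOLDER",
        "uniref100": "UniRef100_PLACEHOLDER",
        "uniparc": "UPI_PLACEHOLDER",
    }
}

hit = lookup("mktayiakqr*", known_proteins)   # same protein, different notation
miss = lookup("MKTAYIAKQK", known_proteins)   # one residue differs, no match
```

Because a single differing residue changes the digest, such a lookup only ever reports 100% identical sequences, which is what makes the resulting identifiers stable.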
Unidentified remaining CDS are annotated via UniRef90 protein sequence clusters (PSC).
PSC & IPS are enriched by pre-annotated and stored information (GO, COG, EC).
Next to standard feature types (tRNA, tmRNA, rRNA, ncRNA, CRISPR, CDS, gaps) Bakta also detects and annotates:
- short ORFs (sORF) which are not predicted by tools like Prodigal
- ncRNA cis-regulatory regions distinct from ncRNA genes
- origins of replication/transfer (oriC, oriV, oriT)
Bakta can annotate a typical bacterial genome within minutes and hence fits the niche between large & computationally-demanding (online) pipelines and rapid, highly-customizable offline tools like Prokka. Indeed, Bakta is heavily inspired by Prokka (kudos to Torsten Seemann) and many command line options are mutually compatible for the sake of interoperability and user convenience. Hence, if it doesn't fit your needs, please try Prokka.
Input/Output
Input
Bakta accepts bacterial and plasmid assemblies (complete / draft) in (zipped) fasta format.
Further genome information and workflow customizations can be provided and set via a number of input parameters. For a full description, please have a look at the Usage section.
Replicon meta data table:
To fine-tune the details of each sequence in the input fasta file, Bakta accepts a replicon meta data table provided in tsv file format: --replicons <tsv-replicon-file>.
Thus, for example, complete replicons within partially completed draft assemblies can be marked & handled as such, e.g. detection & annotation of features spanning sequence edges.
Table format:
original locus id | new locus id | type | topology | name |
---|---|---|---|---|
old id | [new id / <empty>] | [chromosome / plasmid / contig / <empty>] | [circular / linear / <empty>] | name |
Thus, for each input sequence recognized via the original locus id, a new locus id, the replicon type and the topology as well as a name can be explicitly set.
Available shortcuts:
- chromosome: c
- plasmid: p
- circular: c
- linear: l
<empty> values (- / ``) will be replaced by defaults. If new locus id is empty, a new contig name will be autogenerated.
Defaults:
- type: contig
- topology: linear
Example:
original locus id | new locus id | type | topology | name |
---|---|---|---|---|
NODE_1 | chrom | chromosome | circular | - |
NODE_2 | p1 | plasmid | c | pXYZ1 |
NODE_3 | p2 | p | c | pXYZ2 |
NODE_4 | special-contig-name-xyz | - | - | |
NODE_5 | `` | - | - | |
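To make the shortcut and default rules concrete, here is a small sketch of how such a table could be read. It is an illustration only, not Bakta's parser: the function name, the returned dictionary layout and the assumption that the file has no header row are all hypothetical.

```python
import csv
import io

TYPE_SHORTCUTS = {"c": "chromosome", "p": "plasmid"}
TOPOLOGY_SHORTCUTS = {"c": "circular", "l": "linear"}
EMPTY = {"", "-"}  # both notations count as <empty>

def parse_replicons(handle):
    """Map each original locus id to its resolved replicon settings."""
    replicons = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pad short rows so that missing trailing columns count as empty
        cols = [col.strip() for col in row] + [""] * 5
        old_id, new_id, rtype, topology, name = cols[:5]
        rtype = TYPE_SHORTCUTS.get(rtype, rtype)
        topology = TOPOLOGY_SHORTCUTS.get(topology, topology)
        replicons[old_id] = {
            "new_id": None if new_id in EMPTY else new_id,  # None: autogenerate
            "type": "contig" if rtype in EMPTY else rtype,
            "topology": "linear" if topology in EMPTY else topology,
            "name": None if name in EMPTY else name,
        }
    return replicons

# The example table from above, written as tab separated values
example = io.StringIO(
    "NODE_1\tchrom\tchromosome\tcircular\t-\n"
    "NODE_2\tp1\tplasmid\tc\tpXYZ1\n"
    "NODE_3\tp2\tp\tc\tpXYZ2\n"
    "NODE_4\tspecial-contig-name-xyz\t-\t-\t\n"
    "NODE_5\t\t-\t-\t\n"
)
replicons = parse_replicons(example)
```

Note that the shortcut `c` is resolved per column, so it means chromosome in the type column and circular in the topology column.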
Output
Bakta provides detailed information on each annotated feature in a standardized machine-readable JSON file. In addition, the following standard file formats are supported:
- tsv: annotations as simple human-readable tab-separated values
- GFF3: annotations in GFF3 format
- GenBank: annotations in GenBank format
- fna: replicons/contigs as FASTA
- faa: CDS as FASTA
Examples
Simple:
$ bakta --db ~/db genome.fasta
Expert: verbose output, writing results to the results directory with the ecoli123 file prefix and eco634 locus tag, using an existing Prodigal training file, additional replicon information and 8 threads:
$ bakta --db ~/db --verbose --output results/ --prefix ecoli123 --locus-tag eco634 --prodigal-tf eco.tf --replicons replicon.tsv --threads 8 genome.fasta
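For annotating several genomes in a row, the command line can be assembled programmatically. The following is a sketch under assumptions: the helper name, the database path and the isolate file names are placeholders, and only options listed in the Usage section are used.

```python
import shlex

def bakta_command(genome, db, output, prefix=None, locus_tag=None,
                  replicons=None, threads=None, verbose=False):
    """Assemble the bakta argument list for a single genome."""
    cmd = ["bakta", "--db", str(db), "--output", str(output)]
    optional = [
        ("--prefix", prefix),
        ("--locus-tag", locus_tag),
        ("--replicons", replicons),
        ("--threads", threads),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if verbose:
        cmd.append("--verbose")
    cmd.append(str(genome))  # the genome is the positional argument
    return cmd

# Hypothetical batch; paths and names are placeholders
commands = [
    bakta_command(f"{name}.fasta", db="/data/bakta/db",
                  output=f"results/{name}", prefix=name, threads=8)
    for name in ("isolate_a", "isolate_b")
]
for cmd in commands:
    # To actually run a command: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Building an argument list instead of a single string avoids quoting problems with file names that contain spaces.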
Installation
Bakta can be installed via BioConda, Docker and Pip. To automatically install all required 3rd party dependencies, we strongly encourage using Conda. In all cases, a mandatory database must be downloaded.
BioConda
$ conda install -c conda-forge -c bioconda -c defaults bakta
Docker
We provide a shell script (bakta-docker.sh) wrapping all Docker related issues, e.g. volume mounting.
$ sudo docker pull oschwengers/bakta
$ sudo docker run oschwengers/bakta --help
$ bakta-docker.sh --help
Pip
- install Bakta per pip
- install 3rd party binaries (-> Dependencies)
$ python3 -m pip install --user bakta
Dependencies
Bakta requires Biopython (>=1.72), Xopen (0.9) and the following 3rd party executables which must be installed & executable:
- tRNAscan-SE (2.0.6) https://doi.org/10.1101/614032 http://lowelab.ucsc.edu/tRNAscan-SE
- Aragorn (1.2.38) http://dx.doi.org/10.1093/nar/gkh152 http://130.235.244.92/ARAGORN
- INFERNAL (1.1.2) https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtt509 http://eddylab.org/infernal
- PILER-CR (1.06) https://doi.org/10.1186/1471-2105-8-18 http://www.drive5.com/pilercr
- Prodigal (2.6.3) https://dx.doi.org/10.1186%2F1471-2105-11-119 https://github.com/hyattpd/Prodigal
- Hmmer (3.3.1) https://doi.org/10.1093/nar/gkt263 http://hmmer.org
- Diamond (2.0.2) https://doi.org/10.1038/nmeth.3176 https://github.com/bbuchfink/diamond
- Blast+ (2.7.1) https://www.ncbi.nlm.nih.gov/pubmed/2231712 https://blast.ncbi.nlm.nih.gov
On Ubuntu/Debian/Mint you can install these via:
$ sudo apt install aragorn infernal prodigal diamond-aligner ncbi-blast+
tRNAscan-SE must be installed manually, as v2.0 is not yet available via standard Ubuntu packages.
Mandatory database
Bakta requires a mandatory database which is publicly hosted at Zenodo. Further information is provided below.
$ wget <XYZ>/db.tar.gz
$ tar -xzf db.tar.gz
$ rm db.tar.gz
The db path can either be provided via parameter (--db) or environment variable (BAKTA_DB):
$ bakta --db <db-path> genome.fasta
$ export BAKTA_DB=<db-path>
$ bakta genome.fasta
Additionally, for a system-wide setup, the database can be copied to the Bakta base directory:
$ cp -r db/ <bakta-installation-dir>
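The three ways of providing the database location could be resolved as sketched below. The order of precedence (parameter first, then environment variable, then the `db` directory inside the installation directory) is an assumption for illustration; the function name and its arguments are hypothetical.

```python
import os
from pathlib import Path

def resolve_db_path(cli_db=None, environ=None, base_dir=None):
    """Return the first configured database path, or None if none is set."""
    environ = os.environ if environ is None else environ
    candidates = [cli_db, environ.get("BAKTA_DB")]
    if base_dir is not None:
        # System-wide setup: db copied into the installation directory
        candidates.append(Path(base_dir) / "db")
    for candidate in candidates:
        if candidate:
            return Path(candidate).expanduser()
    return None

# Assumed precedence: --db wins over BAKTA_DB, which wins over the base directory
chosen = resolve_db_path("/data/db", {"BAKTA_DB": "/env/db"}, "/opt/bakta")
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.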
Annotation workflow
RNAs
- tRNA genes: tRNAscan-SE 2.0
- tmRNA genes: Aragorn
- rRNA genes: Infernal vs. Rfam rRNA covariance models
- ncRNA genes: Infernal vs. Rfam ncRNA covariance models
- ncRNA cis-regulatory regions: Infernal vs. Rfam ncRNA covariance models
- CRISPR arrays: PILER-CR
Bakta distinguishes ncRNA genes and (regulatory) regions in order to enable the distinct handling thereof during the annotation process, i.e. feature overlap detection.
ncRNA gene types:
- sRNA
- antisense
- ribozyme
- antitoxin
ncRNA (regulatory) region types:
- riboswitch
- thermoregulator
- leader
- frameshift element
Coding sequences
The structural prediction is conducted via Prodigal and complemented by a custom detection of short open reading frames (sORF) < 30 aa.
To rapidly conduct a comprehensive annotation while also identifying known protein sequences with exact sequence matches, Bakta uses a comprehensive SQLite database comprising protein sequence digests and pre-annotations for millions of known protein sequences and clusters.
Conceptual terms:
- UPS: unique protein sequences identified via length and MD5 sequence digests (100% coverage & 100% sequence identity)
- IPS: identical protein sequences comprising representatives of UniProt's UniRef100 protein sequence clusters
- PSC: protein sequence clusters comprising representatives of UniProt's UniRef90 protein sequence clusters
CDS:
- Prediction via Prodigal
- Detection of UPS via MD5 digests and lookup of related IPS and PSC
- Homology search of remainder via Diamond vs. PSC
- Combination of available IPS & PSC information favouring more specific annotations and avoiding redundancy
CDS without IPS or PSC hits will be marked as hypothetical.
Additionally, all CDS without gene symbols or with product descriptions equal to hypothetical will be marked as hypothetical.
However, hypothetical CDS are included in the final annotation.
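The combination step and the hypothetical rule can be sketched as follows. This is a simplified reading of the rules above, not Bakta's implementation: the record layout (`gene`, `product`, `db_xrefs`), the function name and the exact wording of the hypothetical product string are assumptions.

```python
HYPOTHETICAL = "hypothetical protein"  # assumed product wording

def combine_annotations(ips=None, psc=None):
    """Merge PSC and IPS information, letting the more specific IPS win."""
    annotation = {"gene": None, "product": None, "db_xrefs": []}
    # Apply the PSC first and the IPS second so that IPS values override
    for source in (psc, ips):
        if not source:
            continue
        for key in ("gene", "product"):
            if source.get(key):
                annotation[key] = source[key]
        for xref in source.get("db_xrefs", []):
            if xref not in annotation["db_xrefs"]:  # avoid redundancy
                annotation["db_xrefs"].append(xref)
    product = annotation["product"]
    annotation["hypothetical"] = (
        (ips is None and psc is None)   # no hit at all
        or not annotation["gene"]       # no gene symbol
        or not product
        or product == HYPOTHETICAL      # uninformative product
    )
    if not product:
        annotation["product"] = HYPOTHETICAL
    return annotation

# Placeholder records for illustration only
merged = combine_annotations(
    ips={"gene": "geneA1", "product": "specific product", "db_xrefs": ["DB1:1"]},
    psc={"gene": "geneA", "product": "general product", "db_xrefs": ["DB2:2", "DB1:1"]},
)
```

Hypothetical CDS are only flagged here, never dropped, which mirrors the statement that they remain in the final annotation.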
sORFs:
- Custom detection & extraction of sORF with amino acid lengths < 30 aa
- Filter via strict feature type-dependent overlap filters with annotated features
- Detection of UPS via MD5 hashes and lookup of related IPS
- Homology search of remainder via Diamond vs. seed sequences of an sORF subset of UniProt's UniRef90 PSC
- Exclude sORF without sufficient annotation information
sORF not identified via IPS or PSC will be discarded.
Additionally, all sORF without gene symbols or with product descriptions equal to hypothetical will be discarded.
Due to the uncertain nature of sORF prediction, only those identified via IPS / PSC hits exhibiting proper gene symbols or product descriptions different from hypothetical will be included in the final annotation.
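The first step, extracting candidate sORFs shorter than 30 aa, might look like the sketch below. It is not Bakta's detection code: the start and stop codon sets (typical for translation table 11), the choice to report only the longest ORF per stop codon, and the function names are assumptions. The overlap filters and database lookups described above are not covered.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_sorfs(sequence, max_aa=30):
    """Return (start, end, strand, aa_length) for ORFs shorter than max_aa.

    Coordinates are 0-based, end-exclusive, on the forward strand and
    include the stop codon; aa_length excludes the stop codon.
    """
    seq = sequence.upper()
    n = len(seq)
    hits = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, n - 2, 3):
                codon = dna[pos:pos + 3]
                if codon in STOP_CODONS:
                    if orf_start is not None:
                        aa_length = (pos - orf_start) // 3
                        if aa_length < max_aa:
                            start, end = orf_start, pos + 3
                            if strand == "-":
                                # Map back to forward strand coordinates
                                start, end = n - end, n - start
                            hits.append((start, end, strand, aa_length))
                    orf_start = None
                elif orf_start is None and codon in START_CODONS:
                    orf_start = pos  # keep the most upstream start
    return hits

sorfs = find_sorfs("CCATGAAACCCTAGCC")
```

In this toy sequence the only hit is the ORF `ATG AAA CCC TAG` in the third forward frame, i.e. a three-residue peptide.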
Miscellaneous
- Gaps: in-mem detection & annotation of sequence gaps
- oriC/oriV/oriT: Blast+ (blastn) vs. MOB-suite oriT & DoriC oriC/oriV sequences. Annotations of ori regions take into account overlapping Blast+ hits and are conducted based on a majority vote heuristic.
Database
The Bakta database comprises a set of DNA & AA sequence databases as well as HMM & covariance models. At its core Bakta uses a compact SQLite db storing protein sequence digests, lengths, pre-annotations and dbxrefs of UPS, IPS and PSC from:
- UPS: UniParc / UniProtKB (192,795,177)
- IPS: UniProt UniRef100 (169,958,214)
- PSC: UniProt UniRef90 (77,128,011)
This allows the exact identification of protein sequences via MD5 digests & sequence lengths as well as the rapid subsequent lookup of related information. IPS & PSC have been comprehensively pre-annotated integrating annotations & dbxrefs from:
- NCBI nonredundant proteins ('WP_*' -> 139,330,543)
- NCBI COG db (80% cov / 90% id -> 1,893,080)
- GO terms (via IPS/PSC SwissProt entries)
- EC (via IPS/PSC SwissProt entries)
- NCBI AMRFinderPlus (IPS exact matches, PSC HMM hits reaching trusted cutoffs)
- ISFinder db (90% cov / 99% id -> 2,981)
Rfam covariance models:
- ncRNA: 750
- ncRNA cis-regulatory regions: 107
To pinpoint annotations and provide reproducible analysis, the database releases are SemVer versioned (w/o patch level), i.e. <major>.<minor>.
The db schema is represented by the <major> digit and automatically checked at runtime by Bakta in order to ensure compatibility. Content updates are tracked by the <minor> digit.
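A compatibility check along these lines could be as simple as the following sketch. The function names and the way the required schema version is passed in are assumptions; how Bakta stores and reads the version is not shown here.

```python
def parse_db_version(version):
    """Split a '<major>.<minor>' string into two integers."""
    major, minor = version.strip().split(".")
    return int(major), int(minor)

def is_compatible(db_version, required_major):
    """A database is usable if its schema (major) version matches."""
    major, _minor = parse_db_version(db_version)
    return major == required_major

# Content updates (minor) do not affect compatibility
ok = is_compatible("1.3", required_major=1)
outdated = is_compatible("1.0", required_major=2)
```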
All database releases (latest 1.0, 23 Gb zipped, 43 Gb unzipped) are hosted at Zenodo.
Usage
Usage:
bakta --help
usage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT] [--genus GENUS] [--species SPECIES] [--strain STRAIN] [--plasmid PLASMID] [--complete] [--prodigal-tf PRODIGAL_TF] [--translation-table {11,4}] [--gram {+,-,?}] [--locus LOCUS]
[--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--skip-trna] [--skip-tmrna] [--skip-rrna] [--skip-ncrna] [--skip-ncrna-region] [--skip-crispr] [--skip-cds] [--skip-sorf] [--skip-gap] [--skip-ori] [--help] [--verbose] [--threads THREADS]
[--tmp-dir TMP_DIR] [--version] [--citation]
<genome>
Comprehensive and rapid annotation of bacterial genomes.
positional arguments:
<genome> (Draft) genome in fasta format
Input / Output:
--db DB, -d DB Database path (default = <bakta_path>/db)
--min-contig-length MIN_CONTIG_LENGTH, -m MIN_CONTIG_LENGTH
Minimum contig size (default = 1)
--prefix PREFIX, -p PREFIX
Prefix for output files
--output OUTPUT, -o OUTPUT
Output directory (default = current working directory)
Organism:
--genus GENUS Genus name
--species SPECIES Species name
--strain STRAIN Strain name
--plasmid PLASMID Plasmid name
Annotation:
--complete All sequences are complete replicons (chromosome/plasmid[s])
--prodigal-tf PRODIGAL_TF
Path to existing Prodigal training file to use for CDS prediction
--translation-table {11,4}
Translation table to use: 11/4 (default = 11)
--gram {+,-,?} Gram type: +/-/? (default = '?')
--locus LOCUS Locus prefix (instead of 'contig')
--locus-tag LOCUS_TAG
Locus tag prefix
--keep-contig-headers
Keep original contig headers
--replicons REPLICONS, -r REPLICONS
Replicon information table (TSV)
Workflow:
--skip-trna Skip tRNA detection & annotation
--skip-tmrna Skip tmRNA detection & annotation
--skip-rrna Skip rRNA detection & annotation
--skip-ncrna Skip ncRNA detection & annotation
--skip-ncrna-region Skip ncRNA region detection & annotation
--skip-crispr Skip CRISPR array detection & annotation
--skip-cds Skip CDS detection & annotation
--skip-sorf Skip sORF detection & annotation
--skip-gap Skip gap detection & annotation
--skip-ori Skip oriC/oriT detection & annotation
General:
--help, -h Show this help message and exit
--verbose, -v Print verbose information
--threads THREADS, -t THREADS
Number of threads to use (default = number of available CPUs)
--tmp-dir TMP_DIR Location for temporary files (default = system dependent auto detection)
--version show program's version number and exit
--citation Print citation
Citation
A manuscript is in preparation. In the meantime, please cite:
Schwengers O., Goesmann A. (2020) Bakta: Rapid & standardized annotation of bacterial genomes & plasmids. GitHub https://github.com/oschwengers/bakta
Bakta takes advantage of many publicly available databases. If you find any of the data used within Bakta useful, please be sure to also credit the primary sources:
- UniProt: https://doi.org/10.1093/nar/gky1049
- RefSeq: https://doi.org/10.1093/nar/gkx1068
- Rfam: https://doi.org/10.1002/cpbi.51
- AMRFinder: https://doi.org/10.1128/AAC.00483-19
- ISFinder: https://doi.org/10.1093/nar/gkj014
- AntiFam: https://doi.org/10.1093/database/bas003
- Mob-suite: https://doi.org/10.1099/mgen.0.000206
- DoriC: https://doi.org/10.1093/nar/gky1014
- COG: https://doi.org/10.1093/bib/bbx117
FAQ
- Bakta is running too long without CPU load... why? Bakta takes advantage of an SQLite DB which results in high storage IO loads. If this DB is stored on a remote / network volume, the lookup of IPS/PSC annotations might take a long time. In these cases, please, consider moving the DB to a local volume/hard drive.
Issues and Feature Requests
If you run into any issues with Bakta, we'd be happy to hear about it!
Please execute bakta in verbose mode (-v) and do not hesitate to file an issue including as much information as possible:
- a detailed description of the issue
- command line output
- log file (<prefix>.log)
- result file (<prefix>.json) if possible
- a reproducible example of the issue with an input file that you can share if possible