Rapid and standardized annotation of bacterial genomes, MAGs and plasmids using protein structural information
Baktfold
Rapid & standardized protein annotation using structural information
Baktfold is a sensitive protein annotation tool that uses structural homology. While it was designed with bacterial genomes in mind to work in conjunction with Bakta (hence the name!), Baktfold also works well on archaea, plasmids and even eukaryotes.
Baktfold is similar to Phold but goes beyond phages.
Baktfold takes all hypothetical proteins from Bakta's output and uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a series of databases (SwissProt, AlphaFold Database non-singleton clusters, PDB and CATH).
Additionally, instead of using ProstT5, you can specify protein structures that you have pre-computed for your hypothetical proteins.
You can also specify custom databases to search against using --custom-db.
Baktfold is currently under active development. We welcome any and all feedback (especially bug reports) via Issues.
Google Colab Notebook
If you don't want to install Baktfold locally, you can run it without any code using the Google Colab notebook.
Install
Conda (recommended)
The best way to install Baktfold is using conda, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies. We highly recommend installing Conda via Miniforge:
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold
To utilise Baktfold with a GPU, a GPU-compatible version of PyTorch must be installed (the default is a CPU-only version):
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold pytorch=*=cuda*
If you have a Mac with M-series Apple Silicon, you may need to install a particular version of PyTorch to utilise GPU acceleration. The same is true if you use other non-NVIDIA GPUs (e.g. AMD). See this link for some more detail and further links.
Pip
You can also install Baktfold using Pip:
pip install baktfold
You will need to have Foldseek (ideally v10.941cd33) installed and available in the $PATH.
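Because a pip install does not bring Foldseek with it, a quick pre-flight check can confirm that the binary is reachable before you start a run. The sketch below uses only the Python standard library; the helper name find_tool is illustrative and is not part of Baktfold.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the path of an executable found on $PATH, or None if it is absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("foldseek")
    if path is None:
        print("foldseek not found on $PATH; install it before running Baktfold")
    else:
        print(f"foldseek found at {path}")
```

This only checks that the executable can be found; it does not verify which Foldseek version is installed.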
Source
You can install the latest version of Baktfold, with potentially untested and unreleased changes, into a conda environment as follows:
conda create -n baktfoldENV foldseek
conda activate baktfoldENV
git clone https://github.com/gbouras13/baktfold.git
cd baktfold
pip install .
baktfold --help
Database Installation
To download and install Baktfold's databases (use as many threads with -t as you can to speed up downloading):
baktfold install -d baktfold_db -t 8
If you have an NVIDIA GPU, you will need to format the database to allow it to use Foldseek-GPU with --foldseek-gpu.
Note: you can do this after downloading the database with the above command. It won't redownload the database; it only performs the relevant Foldseek database padding.
baktfold install -d baktfold_db --foldseek-gpu
Example - Bacteria
First, you need to run Bakta and use the resulting .json file as input for Baktfold. For bacteria or plasmids, we always recommend Bakta.
Running Baktfold on Bakta results (using a dummy test example JSON assembly.json file):
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db
# with Nvidia GPU
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db --foldseek-gpu
Running Baktfold on protein sequences (using a dummy test example Fasta .faa file):
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db
# with Nvidia GPU
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db --foldseek-gpu
Note that this can be any .faa. It does not have to be the output of Bakta.
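If you launch Baktfold from a Python pipeline, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the subcommands and flags shown in this README (run, proteins, -i, -o, -t, -d, -f, --foldseek-gpu); the function name build_baktfold_command and its defaults are hypothetical and not part of Baktfold.

```python
import shlex
from typing import List


def build_baktfold_command(
    subcommand: str,
    input_path: str,
    output_dir: str,
    database: str,
    threads: int = 8,
    force: bool = True,
    foldseek_gpu: bool = False,
) -> List[str]:
    """Assemble a baktfold command line as an argument list."""
    if subcommand not in {"run", "proteins"}:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    cmd = ["baktfold", subcommand, "-i", input_path, "-o", output_dir,
           "-t", str(threads), "-d", database]
    if force:
        cmd.append("-f")  # overwrite the output directory if it exists
    if foldseek_gpu:
        cmd.append("--foldseek-gpu")  # NVIDIA GPUs only, per the notes above
    return cmd


if __name__ == "__main__":
    cmd = build_baktfold_command(
        "run",
        "tests/test_data/assembly_bakta_output/assembly.json",
        "baktfold_output",
        "baktfold_db",
    )
    # Print the command; to execute it, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.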
Conversion wrapper commands
If you have not used Bakta to annotate your genome before running Baktfold, you have two choices: (1) annotate proteins only with baktfold proteins, or (2) if you have a GenBank format file, convert it to the Bakta .json format.
For the conversion, you have three options:
baktfold convert-prokka
- If you have used Prokka to annotate your genome, baktfold has a subcommand that will do the conversion for you, e.g.
baktfold convert-prokka -i prokka.gbk -o prokka.json
baktfold convert-euk
- This is an experimental feature for eukaryotes (protists, fungi etc.). You can try converting these with a subcommand.
- You will then need to pass --euk to baktfold run as well to make sure it can handle the different genomic features of eukaryotes, e.g.
baktfold convert-euk -i euk.gbk -o euk.json
genbank_to
- If neither of those works for you, you can try the genbank_to package, which can convert a GenBank file into the Bakta format JSON.
- You will need to install it separately (pip install genbank_to), then e.g.
genbank_to -g test.gbk --bakta-json test.json
Usage
The two most useful commands are baktfold run and baktfold proteins:
- baktfold run accepts a Bakta JSON file as input and, by default, annotates all hypothetical CDS and returns a variety of Bakta-like output formats. All other annotations are inherited from the Bakta output.
- baktfold proteins accepts a protein FASTA (.faa) file as input. It annotates all protein sequences and returns a variety of bakta_proteins-like output formats.
- baktfold predict and baktfold compare split baktfold run into the ProstT5 and Foldseek modules, while baktfold proteins-predict and baktfold proteins-compare do the same for baktfold proteins (useful if you have non-NVIDIA GPUs).
We recommend running Baktfold with a GPU if you can. If you do not have a GPU, Baktfold will still run, but the ProstT5 step will be fairly slow. If you have an NVIDIA GPU, you can also use the --foldseek-gpu parameter to accelerate Foldseek further.
Usage: baktfold [OPTIONS] COMMAND [ARGS]...
Main command line interface for baktfold.
Returns: None
Examples: >>> main_cli() None
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
Commands:
autotune Determines optimal batch size for 3Di prediction with...
citation Print the citation(s) for this tool
compare Runs Foldseek vs baktfold db
convert-euk (Experimental) Converts eukaryotic GenBank to Bakta...
convert-prokka Converts Prokka GenBank to Bakta format json
createdb Creates foldseek DB from AA FASTA and 3Di FASTA input...
install Installs ProstT5 model and baktfold database
predict Uses ProstT5 to predict 3Di tokens - GPU recommended
proteins baktfold proteins-predict then comapare all in one -...
proteins-compare Runs Foldseek vs baktfold db on proteins input
proteins-predict Runs ProstT5 on a multiFASTA input - GPU recommended
run baktfold predict then comapare all in one - GPU...
Usage: baktfold run [OPTIONS]
baktfold predict then comapare all in one - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Bakta Genbank format or
Bakta JSON format [required]
-o, --output PATH Output directory [default: output_baktfold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: baktfold]
-d, --database TEXT Specific path to installed baktfold database
-f, --force Force overwrites the output directory
--autotune Run autotuning to detect and automatically
use best batch size for your hardware.
Recommended only if you have a large dataset
(e.g. thousands of proteins), or else
autotuning will add rather than save runtime.
--batch-size INTEGER batch size for ProstT5. 1 is usually fastest.
[default: 1]
--cpu Use cpus only.
--omit-probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save-per-residue-embeddings Save the ProstT5 embeddings per resuide in a
h5 file
--save-per-protein-embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask-threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
-e, --evalue FLOAT Evalue threshold for Foldseek [default:
1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep-tmp-files Keep temporary intermediate files,
particularly the large foldseek_results.tsv
of all Foldseek hits
--max-seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra-sensitive Runs baktfold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra-foldseek-params TEXT Extra foldseek search params
--custom-db TEXT Path to custom database
--foldseek-gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--custom-annotations PATH Custom Foldseek DB annotations, 2 column tsv.
Column 1 matches the Foldseek headers, column
2 is the description.
--euk Eukaryotic input genome.
--fast Skips Foldseek search against AFDB Clusters.
-a, --all-proteins annotate all proteins (not just
hypotheticals)
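The --custom-annotations option listed above expects a two-column TSV in which column 1 matches the Foldseek headers and column 2 is the description. A minimal sketch of writing such a file from Python follows. It assumes the file has no header row, which the help text does not state, and the function name and example entries are hypothetical.

```python
import csv
from typing import Dict


def write_custom_annotations(annotations: Dict[str, str], path: str) -> None:
    """Write a two-column, tab-separated file: Foldseek header, then description."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for header, description in annotations.items():
            writer.writerow([header, description])


if __name__ == "__main__":
    # Hypothetical entries; the keys must match the headers in your own Foldseek database.
    write_custom_annotations(
        {
            "my_protein_001": "putative tail fibre protein",
            "my_protein_002": "putative DNA-binding regulator",
        },
        "custom_annotations.tsv",
    )
```

The resulting file would then be passed with --custom-annotations alongside the database given to --custom-db.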
Output
The majority of outputs match Bakta. Specifically, all the format compliant outputs match Bakta's.
The differences are:
- <prefix>.inference.tsv is different compared to Bakta. In Baktfold, this file gives a quick overview of the different Baktfold databases for which the query protein has a hit (if any).
For example:
ID             Length  Product                                 Swissprot         AFDBClusters             PDB       CATH
MEGJMNBEGN_27  162     HTH-type quorum-sensing regulator RhlR  swissprot_P54292  afdbclusters_A0A9E1VSB0  pdb_5l09  cath_3sztB01
MEGJMNBEGN_30  68      hypothetical protein
MEGJMNBEGN_70  94      hypothetical protein                                      afdbclusters_A0A1I3V7E0
- <prefix>_<database>_tophit.tsv files give the detailed Foldseek alignment information for each tophit found for each database.
For example:
query target bitscore fident evalue qStart qEnd qLen qCov tStart tEnd tLen tCov
MEGJMN_070 AF-A0A1I3V7E0-F1-model_v6 292 0.41 2.619e-06 1 91 93 0.97 1 95 99 0.95
- The full Foldseek search outputs are not kept by default (only tophits). You can keep the full Foldseek search TSVs using --keep-tmp-files. They will be called foldseek_results_<database>.tsv.
- baktfold_3di.fasta, which gives the 3Di tokens for each input CDS.
- baktfold_prostT5_3di_mean_probabilities.csv and baktfold_prostT5_3di_all_probabilities.json, which give some score of the confidence ProstT5 has in its predictions. You can disable the per-residue output with --omit-probs (the mean per-protein probabilities are always written).
- Baktfold does not have plotting functionality like Bakta (yet).
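The inference and tophit tables described above are straightforward to post-process. The sketch below assumes both files are tab-separated with the header rows shown in the examples; the function names and the filtering thresholds are hypothetical choices for illustration, not Baktfold defaults.

```python
import csv
import io
from typing import Dict, Iterable, List

DATABASE_COLUMNS = ["Swissprot", "AFDBClusters", "PDB", "CATH"]


def count_hits_per_database(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many query proteins have a hit in each database column."""
    counts = {column: 0 for column in DATABASE_COLUMNS}
    for row in csv.DictReader(lines, delimiter="\t"):
        for column in DATABASE_COLUMNS:
            # Missing trailing cells are read as None, empty cells as "".
            if (row.get(column) or "").strip():
                counts[column] += 1
    return counts


def filter_tophits(
    lines: Iterable[str], max_evalue: float = 1e-3, min_qcov: float = 0.8
) -> List[Dict[str, str]]:
    """Keep tophit rows at or below an E-value cutoff and at or above a query coverage cutoff."""
    kept = []
    for row in csv.DictReader(lines, delimiter="\t"):
        if float(row["evalue"]) <= max_evalue and float(row["qCov"]) >= min_qcov:
            kept.append(row)
    return kept


if __name__ == "__main__":
    inference = (
        "ID\tLength\tProduct\tSwissprot\tAFDBClusters\tPDB\tCATH\n"
        "MEGJMNBEGN_30\t68\thypothetical protein\t\t\t\t\n"
        "MEGJMNBEGN_70\t94\thypothetical protein\t\tafdbclusters_A0A1I3V7E0\t\t\n"
    )
    print(count_hits_per_database(io.StringIO(inference)))

    tophits = (
        "query\ttarget\tbitscore\tfident\tevalue\tqStart\tqEnd\tqLen\tqCov\ttStart\ttEnd\ttLen\ttCov\n"
        "MEGJMN_070\tAF-A0A1I3V7E0-F1-model_v6\t292\t0.41\t2.619e-06\t1\t91\t93\t0.97\t1\t95\t99\t0.95\n"
    )
    for hit in filter_tophits(io.StringIO(tophits)):
        print(hit["query"], hit["target"], hit["evalue"])
```

For real runs, pass an open file handle for <prefix>.inference.tsv or a <prefix>_<database>_tophit.tsv file in place of the in-memory strings.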
Conceptual terms
Baktfold inherits annotations and related conceptual terms from Bakta, so we kindly refer readers to Bakta's readme. In addition, Baktfold introduces one conceptual term:
PSTC: protein structure clusters. These comprise structure-based annotations to any of Baktfold's databases.
Citations
A manuscript describing Baktfold is in preparation.
Please be sure to cite the following core dependencies. Citing all the bioinformatics tools that you use helps us, and in turn helps you get better bioinformatics tools:
- Foldseek - (https://github.com/steineggerlab/foldseek) van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023), doi:10.1038/s41587-023-01773-0
- ProstT5 - (https://github.com/mheinzinger/ProstT5) Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics (2024) doi:10.1101/2023.07.23.550085
Please also consider citing these databases where relevant:
- AFDB/SwissProt - Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, Sreenath Nair, Milot Mirdita, Jingi Yeo, Oleg Kovalevskiy, Kathryn Tunyasuvunakool, Agata Laydon, Augustin Žídek, Hamish Tomlinson, Dhavanthi Hariharan, Josh Abrahamson, Tim Green, John Jumper, Ewan Birney, Martin Steinegger, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences, Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D368–D375, https://doi.org/10.1093/nar/gkad1011
- CATH - Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH--a hierarchic classification of protein domain structures. Structure. 1997 Aug 15;5(8):1093-108. doi: 10.1016/s0969-2126(97)00260-8. PMID: 9309224.
- PDB - H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The Protein Data Bank (2000) Nucleic Acids Research 28: 235-242 https://doi.org/10.1093/nar/28.1.235