Pipeline for searching and aligning contact maps for proteins, then running DeepFRI's GCN.
🍳 Metagenomic-DeepFRI
A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.
🔍 Overview
Proteins perform most of the work of living cells. The amino acid sequence and structural features of a protein determine a wide range of functions: from binding specificity and mechanical stability to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by advances in low-cost sequencing technologies and computational methods (e.g., gene prediction), has created a pressing need for efficient software to annotate protein databases. Metagenomic-DeepFRI addresses this need by building upon efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and makes DeepFRI run 2-12 times faster.
📋 Pipeline stages
- Search for proteins similar to the query in PDB and the supplied FoldComp databases with MMseqs2.
- Find the best alignment among the MMseqs2 hits using PyOpal.
- Align the target protein contact map to the query protein with unknown structure.
- Run DeepFRI with the structure if it was found in a database; otherwise run DeepFRI with the sequence only.
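The third stage can be pictured as copying contacts from the database hit onto the query positions they are aligned with. The sketch below is illustrative only and is not the code mDeepFRI uses: the function name `transfer_contact_map`, the gapped-string alignment representation, and the treatment of unaligned query residues (they keep only a self-contact) are all assumptions made for the example.

```python
def transfer_contact_map(query_aln, target_aln, target_contacts):
    """Project a target contact map onto the query through a pairwise alignment.

    query_aln / target_aln: equal-length aligned strings, '-' marks a gap.
    target_contacts: square 0/1 matrix (list of lists) over ungapped target residues.
    Returns a square 0/1 matrix over ungapped query residues.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")

    # Map each query residue index to the target residue it is aligned with.
    q_to_t = {}
    qi = ti = 0
    for q_char, t_char in zip(query_aln, target_aln):
        if q_char != "-" and t_char != "-":
            q_to_t[qi] = ti
        if q_char != "-":
            qi += 1
        if t_char != "-":
            ti += 1

    n = qi  # number of residues in the ungapped query
    query_contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        query_contacts[i][i] = 1  # every residue is in contact with itself
        for j in range(i + 1, n):
            # Only pairs where both residues are aligned inherit a contact.
            if i in q_to_t and j in q_to_t:
                contact = target_contacts[q_to_t[i]][q_to_t[j]]
                query_contacts[i][j] = query_contacts[j][i] = contact
    return query_contacts
```

With the alignment `AC-DE` / `ACGD-`, the query residue `E` has no aligned partner, so its row contains only the diagonal entry.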
🛠️ Built With
🔧 Installation
- Clone repo locally
git clone https://github.com/bioinf-mcb/Metagenomic-DeepFRI
cd Metagenomic-DeepFRI
- Set up conda environment
conda env create --name deepfri --file environment.yml
conda activate deepfri
- Show help message
mDeepFRI --help
💡 Usage
1. Prepare structural database
The PDB database will be automatically downloaded and installed during the first run of mDeepFRI. You can download additional databases from the FoldComp website. The app was tested with afdb_swissprot_v4. You can use different databases, but be mindful that computation time can increase sharply with the size of the database.
2. Download models
Two versions of the models are available:
- v1.0 - the original version from the DeepFRI publication.
- v1.1 - a version fine-tuned on AlphaFold models and Gene Ontology UniProt annotations.
To download the models, run the command:
mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}
3. Predict protein function & capture log
mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt
The logging module writes its output to stderr, so use 2> to redirect it to a file.
Other available parameters are listed by the command mDeepFRI --help.
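When the pipeline is launched from a script rather than a shell, the same redirection can be done by handing a file to the child process as its stderr. The following is a minimal sketch using Python's standard `subprocess` module; the helper name `run_with_log` is hypothetical and not part of mDeepFRI.

```python
import subprocess


def run_with_log(cmd, log_path):
    """Run `cmd` and write its stderr stream to `log_path`.

    Equivalent in spirit to `cmd 2> log_path` in a shell.
    Returns the exit code of the command.
    """
    with open(log_path, "w") as log_file:
        completed = subprocess.run(cmd, stderr=log_file, check=False)
    return completed.returncode


# Hypothetical usage with the command from this section:
# run_with_log(
#     ["mDeepFRI", "predict-function",
#      "-i", "/path/to/protein/sequences",
#      "-d", "/path/to/foldcomp/database/",
#      "-w", "/path/to/deepfri/weights/folder",
#      "-o", "/output_path"],
#     "log.txt",
# )
```

Standard output is left untouched, so only the log messages end up in the file.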
✅ Results
The output folder will contain:
- {database_name}.search_results.tsv
- metadata_skipped_ids_due_to_length.json - too long or too short queries (DeepFRI is designed to predict the function of proteins in the range of 60-1000 aa).
- query.mmseqsDB + index from the MMseqs2 search.
- results.tsv - the final output from the DeepFRI model.
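To see in advance which queries are likely to land in metadata_skipped_ids_due_to_length.json, a FASTA file can be partitioned by sequence length. This is a rough sketch, not the filter mDeepFRI applies internally: the function names are hypothetical, and treating both bounds of the 60-1000 aa range as inclusive is an assumption.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the identifier.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def split_by_length(records, min_len=60, max_len=1000):
    """Partition records into (kept, skipped) by sequence length.

    Bounds are inclusive here; that choice is an assumption.
    """
    kept, skipped = {}, {}
    for seq_id, seq in records.items():
        bucket = kept if min_len <= len(seq) <= max_len else skipped
        bucket[seq_id] = seq
    return kept, skipped
```

Calling `split_by_length(read_fasta(open("proteins.fasta")))` on a hypothetical file returns the sequences inside the supported range and those outside it.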
Example output (results.tsv):
Protein | GO_term/EC_numer | Score | Annotation | Neural_net | DeepFRI_mode | DB_hit | DB_name | Identity |
---|---|---|---|---|---|---|---|---|
MIP_00215364 | GO:0016798 | 0.218 | hydrolase activity, acting on glycosyl bonds | gcn | mf | MIP_00215364 | mip_rosetta_hq | 0.933 |
1GVH_1 | GO:0009055 | 0.217 | electron transfer activity | gnn | mf | AF-P24232-F1-model_v4 | afdb_swissprot_v4 | 1.0 |
unaligned | 3.2.1.- | 0.215 | 3.2.1.- | cnn | ec | nan | nan | nan |
This is an example of protein annotation with the AlphaFold database.
- Protein - the name of the protein from the FASTA file.
- GO_term/EC_numer - predicted GO term or EC number (dependent on mode)
- Score - DeepFRI score, reflecting the model's confidence in the prediction. See the publication for details.
- Annotation - annotation from ontology
- Neural_net - type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). The GCN is employed when structural information is available in the database, which generally allows more confident predictions.
- DeepFRI_mode - prediction mode:
  - mf = molecular_function
  - bp = biological_process
  - cc = cellular_component
  - ec = enzyme_commission
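The columns described above make it straightforward to post-process results.tsv. The sketch below assumes the file is tab-separated with a header row containing the `Score` and `DeepFRI_mode` column names shown in the example table; the function name `top_predictions` and the default threshold of 0.2 are illustrative choices, not recommendations from the authors.

```python
import csv
import io


def top_predictions(tsv_text, min_score=0.2, mode=None):
    """Return result rows with Score >= min_score, highest score first.

    tsv_text: contents of a results.tsv file (assumed tab-separated).
    mode: optional DeepFRI_mode value to keep, e.g. "mf" or "ec".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        if float(row["Score"]) < min_score:
            continue
        if mode is not None and row["DeepFRI_mode"] != mode:
            continue
        rows.append(row)
    rows.sort(key=lambda r: float(r["Score"]), reverse=True)
    return rows
```

For a file on disk, pass `open("results.tsv").read()` as `tsv_text`; each returned row is a dictionary keyed by column name.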
⚙️ Features
1. Prediction modes
The Gene Ontology (GO) contains three subontologies, defined by their root nodes:
- Molecular Function (MF)
- Biological Process (BP)
- Cellular Component (CC)
Additionally, Metagenomic-DeepFRI v1.0 is able to predict Enzyme Commission (EC) numbers.
By default, the tool makes predictions in all 4 categories. To select only some of them, pass the parameter -p or --processing-modes several times, e.g.:
mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp
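When the command is assembled programmatically, repeated options such as -p (and -d, described in the next section) are easiest to generate from lists. The helper below is a hypothetical convenience wrapper, not part of mDeepFRI; it only uses the options that appear in this README and returns an argument list suitable for `subprocess.run`.

```python
VALID_MODES = ("mf", "bp", "cc", "ec")


def build_predict_command(inputs, databases, weights, output, modes=None):
    """Build the argument list for `mDeepFRI predict-function`.

    databases: list of database paths, emitted as repeated -d options.
    modes: optional list of prediction modes, emitted as repeated -p options.
    """
    cmd = ["mDeepFRI", "predict-function", "-i", inputs]
    for database in databases:
        cmd += ["-d", database]
    cmd += ["-w", weights, "-o", output]
    for mode in modes or []:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        cmd += ["-p", mode]
    return cmd
```

Leaving `modes` unset produces a command without -p, which according to the text above runs all four categories.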
2. Hierarchical database search
Different databases carry different levels of evidence. For example, PDB structures are experimentally determined, so they are considered the highest-quality data. New proteins are therefore queried against PDB first. Computational predictions differ in quality, e.g. AlphaFold predictions are often more accurate than ESMFold predictions. We provide an opportunity to search multiple databases in a hierarchical manner. For example, if you want to search an AlphaFold database first and then an ESMFold database, you can pass the parameter -d or --databases several times, e.g.:
mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
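One way to picture a hierarchical search is as a first-hit-wins lookup over an ordered list of databases. The sketch below is a conceptual illustration under that assumption and does not reproduce mDeepFRI's internals; the function name and the dictionary-based "hits" are hypothetical stand-ins for real search results.

```python
def hierarchical_lookup(query_ids, databases):
    """Assign each query to its hit in the highest-priority database.

    databases: ordered list of (db_name, hits), highest priority first,
               where hits maps a query id to the id of its best hit.
    Returns {query_id: (db_name, hit_id)}; queries without any hit are absent.
    """
    assigned = {}
    for db_name, hits in databases:
        for query_id in query_ids:
            # A query already matched in an earlier database is not reassigned.
            if query_id not in assigned and query_id in hits:
                assigned[query_id] = (db_name, hits[query_id])
    return assigned
```

Queries that remain unassigned after all databases have been tried correspond to the sequence-only case described in the pipeline stages.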
3. Temporary files
The first run of mDeepFRI with a database will create temporary files needed by the pipeline. If you don't want to keep them for the next run, add the --remove-intermediate flag.
4. CPU / GPU utilization
If the threads argument is provided, the app will parallelize certain steps (alignment, contact map alignment, functional annotation).
GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI will automatically use it for prediction. If not, the model will run on CPUs.
Technical tip: A single instance of DeepFRI on a GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.
🔖 Citations
Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:
- Gligorijević et al. "Structure-based protein function prediction using graph convolutional networks" Nat. Commun. (2021) https://doi.org/10.1038/s41467-021-23303-9
- Steinegger & Söding "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets" Nat. Biotechnol. (2017) https://doi.org/10.1038/nbt.3988
- Kim, Mirdita & Steinegger "Foldcomp: a library and format for compressing and indexing large protein structure sets" Bioinformatics (2023) https://doi.org/10.1093/bioinformatics/btad153
- Maranga et al. "Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method" mSystems (2023) https://doi.org/10.1128/msystems.01178-22
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the bug in a simple, easily reproducible situation.
🏗️ Contributing
Contributions are more than welcome! See CONTRIBUTING.md for more details.
📋 Changelog
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
⚖️ License
This library is provided under the 3-Clause BSD License.