Pipeline for searching and aligning contact maps for proteins, and function prediction with DeepFRI.
🍳 Metagenomic-DeepFRI 
A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.
🔍 Overview
Metagenomic-DeepFRI is a high-performance pipeline for annotating protein sequences with Gene Ontology (GO) terms using DeepFRI, a deep learning model for functional protein annotation.
Protein function prediction is increasingly important as sequencing technologies generate vast numbers of novel sequences. Metagenomic-DeepFRI combines:
- Structure information from FoldComp databases (AlphaFold, ESMFold, PDB, etc.)
- Sequence-based predictions using DeepFRI's neural networks
- Fast searches with MMseqs2 for database alignment
- Significant speedup of 2-12× compared to standard DeepFRI implementation.
📋 Pipeline stages
- Search for proteins similar to the query in PDB and the supplied FoldComp databases with MMseqs2.
- Find the best alignment among MMseqs2 hits using PyOpal.
- Align the target protein contact map to the query protein with unknown structure.
- Run DeepFRI with the structure if found in the database, otherwise run DeepFRI with sequence only.
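The contact map alignment step can be illustrated with a minimal sketch. This is not mDeepFRI's actual implementation: the function name transfer_contact_map, the gapped-string alignment format, and the choice to give unaligned query residues only a self-contact are assumptions made for illustration.

```python
def transfer_contact_map(query_aln, target_aln, target_cmap):
    """Project a target contact map onto query residues via a pairwise alignment.

    query_aln / target_aln: equal-length aligned sequences with '-' for gaps.
    target_cmap: square 0/1 matrix (list of lists) indexed by target residue.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")

    # Map each aligned query residue index to its target residue index.
    query_to_target = {}
    q_idx = t_idx = 0
    for q_char, t_char in zip(query_aln, target_aln):
        if q_char != "-" and t_char != "-":
            query_to_target[q_idx] = t_idx
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1

    query_len = q_idx
    # Assumption: every residue is in contact with itself; unaligned
    # query residues receive no other contacts.
    cmap = [[1 if i == j else 0 for j in range(query_len)] for i in range(query_len)]

    # Copy contacts between pairs of query residues that are both aligned.
    for qi, ti in query_to_target.items():
        for qj, tj in query_to_target.items():
            if target_cmap[ti][tj]:
                cmap[qi][qj] = 1
    return cmap


# Toy example: query "ACDE" aligned to target "ACGE" with one gap in each.
target_contacts = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]
query_contacts = transfer_contact_map("AC-DE", "ACG-E", target_contacts)
```

In this toy case the third query residue (D) has no aligned partner, so it keeps only its diagonal entry, while contacts between aligned residues are copied from the target.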
📦 Requirements
- Python: >= 3.11, < 3.13 (tested with 3.11)
- Dependencies: Automatically installed via pip
🔧 Installation
1. Install from PyPI. Installation might take a few minutes due to download of MMseqs2 binaries.
pip install mdeepfri
2. Run and view the help message.
mDeepFRI --help
💡 Usage
1. Prepare structural database
1.1 Existing FoldComp databases
The PDB database will be automatically downloaded and installed
during the first run of mDeepFRI.
The PDB suffers from formatting inconsistencies, so around 10% of alignments
against PDB structures will fail; each failure is reported as a WARNING.
We suggest coupling PDB search with predicted databases, as it massively
improves the structural coverage of the protein universe.
A good protein structure allows DeepFRI to annotate the function in more detail.
However, the sequence branch of the model has the largest weight,
thus even if the predicted structure is erroneous,
it will have a minor effect on the prediction.
The details can be found in the original DeepFRI manuscript, Fig. 2A.
You can download additional databases from the FoldComp website.
During the first run, FASTA sequences will be extracted
from the FoldComp database, and an MMseqs2 database will be created and indexed.
You can use different databases, but be mindful
that computation time might increase substantially with the size of the database.
Tested databases:
- afdb_swissprot
- afdb_swissprot_v4
- afdb_rep_v4
- afdb_rep_dark_v4
- afdb_uniprot_v4
- esmatlas
- esmatlas_v2023_02
- highquality_clust30
ATTENTION: Please do not rename downloaded databases.
FoldComp has certain inconsistencies in the way FASTA sequences are extracted (example),
so the pipeline was tweaked for each database.
If the database you need does not work, please report it in
issues.
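Because the pipeline relies on the original database names, it can help to check a path before starting a long run. The sketch below is a hypothetical helper, not part of mDeepFRI; it assumes the database is passed as a path whose final component is the unmodified database name, and it uses the names listed as tested in this README.

```python
from pathlib import Path

# Database names listed as tested in this README.
TESTED_DATABASES = {
    "afdb_swissprot",
    "afdb_swissprot_v4",
    "afdb_rep_v4",
    "afdb_rep_dark_v4",
    "afdb_uniprot_v4",
    "esmatlas",
    "esmatlas_v2023_02",
    "highquality_clust30",
}


def is_tested_database(db_path):
    """Return True if the last path component is a tested database name."""
    return Path(db_path).name in TESTED_DATABASES


# Hypothetical paths, for illustration only.
for path in ("/data/foldcomp/afdb_swissprot_v4", "/data/foldcomp/my_swissprot"):
    status = "tested" if is_tested_database(path) else "untested or renamed"
    print(f"{path}: {status}")
```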
ATTENTION: Database creation is a very sensitive step that
relies on external software.
If the pipeline is interrupted during this step, the databases might be corrupted.
If you are not sure about the state of your database,
rerun the pipeline with the --overwrite flag;
it will rerun the database creation process.
1.2. Custom FoldComp database
To use a personal collection of structures,
you will have to create a custom FoldComp database.
To do so, download a FoldComp executable and run the following command:
foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]
2. Download models
Two versions of the models are available:
- v1.0 - the original version from the DeepFRI publication.
- v1.1 - a version fine-tuned on AlphaFold models and machine-generated Gene Ontology UniProt annotations. You can read details about v1.1 in the ISMB 2023 presentation by Pawel Szczerbiak.
To download the models, run:
mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}
3. Predict protein function & capture log
mDeepFRI predict-function -i /path/to/protein/sequences \
-d /path/to/foldcomp/database/ \
-w /path/to/deepfri/weights/folder \
-o /output_path > log.txt
Other available parameters are listed by mDeepFRI --help.
✅ Results
The output folder will contain several files from different stages of the pipeline:
Main Output Files
- results.tsv - Primary output file containing all functional predictions from the DeepFRI model.
- alignment_summary.tsv - Summary of alignment statistics for each query protein, showing which queries were successfully aligned to database structures.
- database_search/ - Directory containing individual search results for each database queried:
  - {database_name}_results.tsv - One file per database searched (e.g., pdb100_230517_results.tsv, afdb_swissprot_v4_results.tsv)
- prediction_matrix_*.tsv - Detailed prediction matrices for each ontology mode:
  - prediction_matrix_bp.tsv - Biological Process predictions
  - prediction_matrix_cc.tsv - Cellular Component predictions
  - prediction_matrix_ec.tsv - Enzyme Commission predictions
  - prediction_matrix_mf.tsv - Molecular Function predictions
  These files contain raw prediction scores for every protein × GO term combination and can be very large (>50MB).
- query.mmseqsDB + associated index files - MMseqs2 database created from input query sequences.
Primary Output Format (results.tsv)
The main output file contains the following columns:
- protein - Name of the protein from the input FASTA file.
- network_type - Type of neural network used for prediction:
  - gcn (Graph Convolutional Network) - Used when structural information is available from database alignment, providing more confident predictions.
  - cnn (Convolutional Neural Network) - Used when no proteins above the similarity cutoff (50% identity by default) are found.
- prediction_mode - Ontology category: mf (Molecular Function), bp (Biological Process), cc (Cellular Component), or ec (Enzyme Commission).
- go_term - Predicted GO term identifier or EC number.
- score - DeepFRI confidence score for the prediction. Higher scores indicate greater confidence. See the DeepFRI publication for details.
- go_name - Human-readable annotation from the Gene Ontology or EC nomenclature.
- aligned - Boolean indicating whether the query was successfully aligned to a database structure (True/False).
- target_id - Identifier of the matched database entry (e.g., 3al6_D for PDB chain). Empty if no hit was found.
- db_name - Name of the database where the match was found (e.g., pdb100_230517, afdb_swissprot_v4).
- query_identity - Sequence identity between query and hit, expressed as a fraction (0.0-1.0 scale). Empty if no hit was found.
- query_coverage - Proportion of query sequence covered by the alignment (0.0-1.0 scale).
- target_coverage - Proportion of target sequence covered by the alignment (0.0-1.0 scale).
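As a minimal sketch of working with these columns, the snippet below filters results.tsv rows by score using only the Python standard library. The helper name confident_predictions, the 0.5 threshold, and the sample rows are illustrative assumptions; the sample uses only a subset of the columns, and a suitable score cutoff depends on your use case.

```python
import csv
import io


def confident_predictions(handle, min_score=0.5, network_type=None):
    """Yield rows of a results.tsv-style file with score >= min_score.

    If network_type is given ("gcn" or "cnn"), keep only rows predicted
    by that network.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if float(row["score"]) < min_score:
            continue
        if network_type is not None and row["network_type"] != network_type:
            continue
        yield row


# Made-up rows for illustration only; a real file has more columns.
SAMPLE = (
    "protein\tnetwork_type\tprediction_mode\tgo_term\tscore\tgo_name\n"
    "prot_1\tgcn\tmf\tGO:0005524\t0.91\tATP binding\n"
    "prot_1\tgcn\tcc\tGO:0005737\t0.22\tcytoplasm\n"
    "prot_2\tcnn\tmf\tGO:0003677\t0.64\tDNA binding\n"
)

for row in confident_predictions(io.StringIO(SAMPLE), min_score=0.5):
    print(row["protein"], row["go_term"], row["score"], row["go_name"])

# With a real output folder, open the file instead of the sample string:
# with open("/output_path/results.tsv", newline="") as handle:
#     rows = list(confident_predictions(handle, network_type="gcn"))
```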
⚙️ Features
1. Prediction modes
The GO ontology contains three subontologies, defined by their root nodes:
- Molecular Function (MF)
- Biological Process (BP)
- Cellular Component (CC)
- Additionally, Metagenomic-DeepFRI v1.0 can predict Enzyme Commission (EC) numbers.
By default, the tool makes predictions in all 4 categories. To select only
some of them, pass the parameter -p or --processing-modes multiple times, e.g.:
mDeepFRI predict-function -i /path/to/protein/sequences \
-d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder \
-o /output_path -p mf -p bp
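When launching the tool from a script, the repeated -p flag can be assembled programmatically. The sketch below is a hypothetical wrapper (build_predict_command is not part of mDeepFRI) that uses only the flags shown in this README and validates the mode names before building the argument list.

```python
import shlex

# Mode names as listed in this README.
VALID_MODES = ("mf", "bp", "cc", "ec")


def build_predict_command(input_path, databases, weights, output, modes=()):
    """Build an argument list for `mDeepFRI predict-function`.

    Repeatable flags (-d, -p) are emitted once per value.
    """
    for mode in modes:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown prediction mode: {mode!r}")

    cmd = ["mDeepFRI", "predict-function", "-i", input_path]
    for database in databases:
        cmd += ["-d", database]
    cmd += ["-w", weights, "-o", output]
    for mode in modes:
        cmd += ["-p", mode]
    return cmd


cmd = build_predict_command(
    "/path/to/protein/sequences",
    ["/path/to/foldcomp/database/"],
    "/path/to/deepfri/weights/folder",
    "/output_path",
    modes=["mf", "bp"],
)
print(shlex.join(cmd))

# To actually run it (requires mDeepFRI to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run rather than a shell string avoids quoting problems with paths that contain spaces.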
2. Hierarchical database search
Different databases have different levels of evidence. For example, PDB
structures are real experimental structures and are considered highest quality
data. New proteins are first queried against PDB. Computational predictions
differ by quality (e.g., AlphaFold predictions are often more accurate than
ESMFold). You can search multiple databases hierarchically for flexibility.
For example, to search AlphaFold first, then ESMFold, pass the parameter -d
or --databases multiple times:
mDeepFRI predict-function -i /path/to/protein/sequences \
-d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ \
-w /path/to/deepfri/weights/folder -o /output_path
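The idea of hierarchical search can be sketched as "take the best hit from the first database, in priority order, that has one above the cutoff". The function below is an illustrative assumption, not mDeepFRI's code: the data layout, the name first_hit, and the target identifiers are hypothetical, and the 0.5 cutoff mirrors the default 50% identity mentioned in the results section.

```python
def first_hit(query_id, hits_by_db, db_order, min_identity=0.5):
    """Return (db_name, target_id, identity) from the highest-priority
    database that has a hit at or above min_identity, or None.

    hits_by_db: {db_name: {query_id: [(target_id, identity), ...]}}
    db_order: database names, highest priority first.
    """
    for db_name in db_order:
        hits = hits_by_db.get(db_name, {}).get(query_id, [])
        passing = [hit for hit in hits if hit[1] >= min_identity]
        if passing:
            target_id, identity = max(passing, key=lambda hit: hit[1])
            return db_name, target_id, identity
    return None


# Hypothetical search results, for illustration only.
hits_by_db = {
    "pdb100_230517": {"prot_1": [("target_a", 0.42)]},
    "afdb_swissprot_v4": {"prot_1": [("target_b", 0.80), ("target_c", 0.65)]},
}
db_order = ["pdb100_230517", "afdb_swissprot_v4"]

print(first_hit("prot_1", hits_by_db, db_order))
print(first_hit("prot_2", hits_by_db, db_order))
```

Here prot_1 has only a weak PDB hit, so the search falls through to the next database; prot_2 has no hits anywhere and would be predicted from sequence alone.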
3. Temporary files
The first run of mDeepFRI with a given database will create temporary files
needed for the pipeline. If you don't want to keep them for the next run, add
the --remove-intermediate flag.
4. Skipping prediction matrices
By default, mDeepFRI writes detailed prediction matrix files
(prediction_matrix_*.tsv) containing raw scores for every protein × GO term
combination. These files can be very large (>50MB each). If you only need the
final results.tsv file and want to save disk space, use the --skip-matrix
flag:
mDeepFRI predict-function -i /path/to/protein/sequences \
-d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder \
-o /output_path --skip-matrix
5. CPU / GPU utilization
If the threads argument is provided, the app will parallelize certain steps
(alignment, contact map alignment, functional annotation). GPUs are often used
to speed up neural networks. If CUDA is installed, mDeepFRI will automatically
use it for prediction. Otherwise, the model will run on CPUs.
Technical tip: A single instance of DeepFRI on GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.
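To check which device is likely to be used, you can ask onnxruntime which execution providers it reports. The sketch below uses onnxruntime.get_available_providers(); the helper choose_provider is hypothetical and only mirrors the "CUDA if available, otherwise CPU" behaviour described above. Note that a provider can be listed and still fail to initialize if the CUDA or cuDNN libraries cannot be loaded.

```python
def choose_provider(available):
    """Prefer CUDA when it is reported as available, else fall back to CPU."""
    if "CUDAExecutionProvider" in available:
        return "CUDAExecutionProvider"
    return "CPUExecutionProvider"


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed in this environment")
    else:
        providers = ort.get_available_providers()
        print("Reported providers:", providers)
        print("Preferred provider:", choose_provider(providers))
```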
Troubleshooting GPU usage:
If onnxruntime cannot find CUDA libraries despite them being installed,
you might see errors like:
[W:onnxruntime:Default, onnxruntime_pybind_state.cc:1013 CreateExecutionProviderFactoryInstance]
Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*.
To fix this, add the library paths to LD_LIBRARY_PATH.
If you installed nvidia-* packages via pip,
you can dynamically find and export the paths:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(python -c 'import os, nvidia.cudnn, nvidia.cublas, nvidia.cuda_runtime; libs=[nvidia.cudnn, nvidia.cublas, nvidia.cuda_runtime]; print(":".join([os.path.join(m.__path__[0], "lib") for m in libs]))')
🔖 Citations
Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:
- Gligorijević et al. "Structure-based protein function prediction using graph convolutional networks" Nat. Comms. (2021). https://doi.org/10.1038/s41467-021-23303-9
- Steinegger & Söding "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets" Nat. Biotechnol. (2017) https://doi.org/10.1038/nbt.3988
- Kim, Mirdita & Steinegger "Foldcomp: a library and format for compressing and indexing large protein structure sets" Bioinformatics (2023) https://doi.org/10.1093/bioinformatics/btad153
- Maranga et al. "Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method" mSystems (2023) https://doi.org/10.1128/msystems.01178-22
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
🏗️ Contributing
Contributions are more than welcome! See CONTRIBUTING.md for more details.
📋 Changelog
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
⚖️ License
This library is provided under the 3-Clause BSD License.