
Pipeline for searching and aligning contact maps for proteins, and function prediction with DeepFRI.


🍳 Metagenomic-DeepFRI


A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Metagenomic-DeepFRI is a high-performance pipeline for annotating protein sequences with Gene Ontology (GO) terms using DeepFRI, a deep learning model for functional protein annotation.

Protein function prediction is increasingly important as sequencing technologies generate vast numbers of novel sequences. Metagenomic-DeepFRI combines:

  • Structure information from FoldComp databases (AlphaFold, ESMFold, PDB, etc.)
  • Sequence-based predictions using DeepFRI's neural networks
  • Fast searches with MMseqs2 for database alignment
  • A 2-12× speedup over the standard DeepFRI implementation

📋 Pipeline stages

  1. Search for proteins similar to the query in the PDB and in the supplied FoldComp databases with MMseqs2.
  2. Find the best alignment among the MMseqs2 hits using PyOpal.
  3. Align the contact map of the target protein to the query protein, whose structure is unknown.
  4. Run DeepFRI with the structure if found in the database, otherwise run DeepFRI with sequence only.



📦 Requirements

  • Python: >= 3.11, < 3.13 (tested with 3.11)
  • Dependencies: Automatically installed via pip
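
The version constraint above can be checked before installing. The snippet below is a minimal sketch, not part of mDeepFRI; the helper name `is_supported` is hypothetical, and the bounds are copied from the requirement stated above.

```python
import sys


def is_supported(version=None):
    """Return True if the interpreter version satisfies >= 3.11, < 3.13.

    `version` is any (major, minor, ...) tuple; defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version)[:2]
    return (3, 11) <= major_minor < (3, 13)


if not is_supported():
    print("Warning: this Python version is outside the range listed for mdeepfri.")
```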

🔧 Installation

1. Install from PyPI. Installation might take a few minutes because the MMseqs2 binaries are downloaded.

pip install mdeepfri

2. Run and view the help message.

mDeepFRI --help

💡 Usage

1. Prepare structural database

1.1 Existing FoldComp databases

The PDB database is downloaded and installed automatically during the first run of mDeepFRI. The PDB suffers from formatting inconsistencies, so around 10% of PDB alignments fail and are reported with a WARNING. We suggest coupling the PDB search with databases of predicted structures, as this massively improves structural coverage of the protein universe. A good protein structure allows DeepFRI to annotate function in more detail. However, the sequence branch of the model carries the largest weight, so even an erroneous predicted structure has only a minor effect on the prediction. Details can be found in the original manuscript, fig. 2A.

You can download additional databases from the website. During the first run, FASTA sequences are extracted from the FoldComp database, and an MMseqs2 database is created and indexed. You can use different databases, but be mindful that computation time might increase exponentially with the size of the database.

Tested databases:

  • afdb_swissprot
  • afdb_swissprot_v4
  • afdb_rep_v4
  • afdb_rep_dark_v4
  • afdb_uniprot_v4
  • esmatlas
  • esmatlas_v2023_02
  • highquality_clust30

ATTENTION: Please do not rename downloaded databases. FoldComp has certain inconsistencies in the way FASTA sequences are extracted (example), so the pipeline was tweaked for each database. If the database you need does not work, please report it in the issues.

ATTENTION: Database creation is a very sensitive step that relies on external software. If the pipeline is interrupted during this step, the databases might be corrupted. If you are not sure about your database, rerun the pipeline with the --overwrite flag, which reruns the database creation process.

1.2. Custom FoldComp database

To use a personal database of structures, you have to create a custom FoldComp database. To do so, download a FoldComp executable and run the following command:

foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]

2. Download models

Two model versions are available:

  • v1.0 - the original version from the DeepFRI publication.
  • v1.1 - a version fine-tuned on AlphaFold models and machine-generated Gene Ontology UniProt annotations. You can read details about v1.1 in the ISMB 2023 presentation by Pawel Szczerbiak.

To download the models, run:

mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences \
-d /path/to/foldcomp/database/ \
-w /path/to/deepfri/weights/folder \
-o /output_path > log.txt

Run mDeepFRI --help to see the other available parameters.
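
When the pipeline is launched from a script, the same command can be assembled programmatically. The following is a minimal sketch, assuming `mDeepFRI` is on the PATH; the helper names `build_predict_command` and `run_with_log` are hypothetical and not part of the package. It uses only flags documented in this README (-i, -d, -w, -o, -p, --skip-matrix) and, like the shell example, writes standard output to a log file.

```python
import subprocess
from typing import List, Sequence


def build_predict_command(
    sequences: str,
    databases: Sequence[str],
    weights: str,
    output: str,
    modes: Sequence[str] = (),
    skip_matrix: bool = False,
) -> List[str]:
    """Assemble the argument list for `mDeepFRI predict-function`."""
    cmd = ["mDeepFRI", "predict-function", "-i", sequences]
    for db in databases:  # -d is repeated once per database, in search order
        cmd += ["-d", db]
    cmd += ["-w", weights, "-o", output]
    for mode in modes:  # -p is repeated once per prediction mode
        cmd += ["-p", mode]
    if skip_matrix:
        cmd.append("--skip-matrix")
    return cmd


def run_with_log(cmd: List[str], log_path: str = "log.txt") -> int:
    """Run the command, writing stdout to log_path; return the exit code."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=False).returncode


# Example (paths are placeholders):
# run_with_log(build_predict_command("proteins.faa", ["/path/to/db"], "/path/to/weights", "out"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.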

✅ Results

The output folder will contain several files from different stages of the pipeline:

Main Output Files

  1. results.tsv - Primary output file containing all functional predictions from the DeepFRI model.

  2. alignment_summary.tsv - Summary of alignment statistics for each query protein, showing which queries were successfully aligned to database structures.

  3. database_search/ - Directory containing individual search results for each database queried:

    • {database_name}_results.tsv - One file per database searched (e.g., pdb100_230517_results.tsv, afdb_swissprot_v4_results.tsv)

  4. prediction_matrix_*.tsv - Detailed prediction matrices for each ontology mode:

    • prediction_matrix_bp.tsv - Biological Process predictions
    • prediction_matrix_cc.tsv - Cellular Component predictions
    • prediction_matrix_ec.tsv - Enzyme Commission predictions
    • prediction_matrix_mf.tsv - Molecular Function predictions

    These files contain raw prediction scores for every protein × GO term combination and can be very large (>50MB).

  5. query.mmseqsDB + associated index files - MMseqs2 database created from input query sequences.

Primary Output Format (results.tsv)

The main output file contains the following columns:

  • protein - Name of the protein from the input FASTA file.
  • network_type - Type of neural network used for prediction:
    • gcn (Graph Convolutional Network) - Used when structural information is available from database alignment, providing more confident predictions.
    • cnn (Convolutional Neural Network) - Used when no database protein above the similarity cutoff (50% identity by default) is found, so the prediction relies on sequence only.
  • prediction_mode - Ontology category: mf (Molecular Function), bp (Biological Process), cc (Cellular Component), or ec (Enzyme Commission).
  • go_term - Predicted GO term identifier or EC number.
  • score - DeepFRI confidence score for the prediction. Higher scores indicate greater confidence. See the DeepFRI publication for details.
  • go_name - Human-readable annotation from the Gene Ontology or EC nomenclature.
  • aligned - Boolean indicating whether the query was successfully aligned to a database structure (True/False).
  • target_id - Identifier of the matched database entry (e.g., 3al6_D for PDB chain). Empty if no hit was found.
  • db_name - Name of the database where the match was found (e.g., pdb100_230517, afdb_swissprot_v4).
  • query_identity - Sequence identity between query and hit, expressed as a fraction (0.0-1.0 scale). Empty if no hit was found.
  • query_coverage - Proportion of query sequence covered by the alignment (0.0-1.0 scale).
  • target_coverage - Proportion of target sequence covered by the alignment (0.0-1.0 scale).

⚙️ Features

1. Prediction modes

The Gene Ontology contains three subontologies, defined by their root nodes:

  • Molecular Function (MF)
  • Biological Process (BP)
  • Cellular Component (CC)

Additionally, Metagenomic-DeepFRI v1.0 can predict Enzyme Commission (EC) numbers.

By default, the tool makes predictions in all 4 categories. To select only some of them, pass the -p or --processing-modes parameter once per mode, e.g.:

mDeepFRI predict-function -i /path/to/protein/sequences \
  -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder \
  -o /output_path -p mf -p bp

2. Hierarchical database search

Different databases carry different levels of evidence. For example, PDB structures are experimentally determined and are considered the highest-quality data, so new proteins are first queried against the PDB. Computational predictions vary in quality (e.g., AlphaFold predictions are often more accurate than ESMFold predictions). You can search multiple databases hierarchically. For example, to search AlphaFold first, then ESMFold, pass the -d or --databases parameter multiple times:

mDeepFRI predict-function -i /path/to/protein/sequences \
  -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ \
  -w /path/to/deepfri/weights/folder -o /output_path
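
The idea of a hierarchical search can be sketched in a few lines. This is an illustration of one plausible reading of the description above (each query is assigned to the first database, in the given order, that yields a hit), not the pipeline's actual implementation. The function `hierarchical_search` and the `search` callable are hypothetical stand-ins for the MMseqs2 step.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def hierarchical_search(
    queries: Sequence[str],
    databases: Sequence[str],
    search: Callable[[str, str], Optional[str]],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Assign each query to the first database (in order) that returns a hit.

    `search(db, query)` is a hypothetical callable returning a hit ID or None.
    Returns (assigned, unmatched), where assigned maps query -> (db, hit).
    """
    assigned: Dict[str, Tuple[str, str]] = {}
    remaining = list(queries)
    for db in databases:
        still_unmatched = []
        for query in remaining:
            hit = search(db, query)
            if hit is not None:
                assigned[query] = (db, hit)
            else:
                still_unmatched.append(query)
        remaining = still_unmatched  # only unmatched queries go to the next database
    return assigned, remaining


# Toy lookup tables standing in for real search results.
TOY_HITS = {
    "pdb": {"q1": "3al6_D"},
    "afdb": {"q1": "AF-A", "q2": "AF-B"},
}
assigned, unmatched = hierarchical_search(
    ["q1", "q2", "q3"], ["pdb", "afdb"], lambda db, q: TOY_HITS[db].get(q)
)
print(assigned)
print(unmatched)
```

In this toy run, q1 keeps its hit from the first database even though the second database also contains a match, and q3 remains unmatched, which would correspond to a sequence-only prediction.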

3. Temporary files

The first run of mDeepFRI with a database creates temporary files needed by the pipeline. If you don't want to keep them for the next run, add the --remove-intermediate flag.

4. Skipping prediction matrices

By default, mDeepFRI writes detailed prediction matrix files (prediction_matrix_*.tsv) containing raw scores for every protein × GO term combination. These files can be very large (>50MB each). If you only need the final results.tsv file and want to save disk space, use the --skip-matrix flag:

mDeepFRI predict-function -i /path/to/protein/sequences \
  -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder \
  -o /output_path --skip-matrix

5. CPU / GPU utilization

If the threads argument is provided, the app parallelizes certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. Metagenomic-DeepFRI handles this automatically: if CUDA is installed, mDeepFRI uses it for prediction; otherwise, the model runs on CPUs.

Technical tip: A single instance of DeepFRI on a GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.

Troubleshooting GPU usage: If onnxruntime cannot find CUDA libraries despite them being installed, you might see errors like:

[W:onnxruntime:Default, onnxruntime_pybind_state.cc:1013 CreateExecutionProviderFactoryInstance]
Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*.

To fix this, add the library paths to LD_LIBRARY_PATH. If you installed nvidia-* packages via pip, you can dynamically find and export the paths:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(python -c 'import os, nvidia.cudnn, nvidia.cublas, nvidia.cuda_runtime; libs=[nvidia.cudnn, nvidia.cublas, nvidia.cuda_runtime]; print(":".join([os.path.join(m.__path__[0], "lib") for m in libs]))')
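
As a first diagnostic, you can ask onnxruntime which execution providers the installed build offers. The snippet below is a minimal sketch; the helper names `detect_providers` and `preferred_provider` are hypothetical. Note that `onnxruntime.get_available_providers()` reports what the installed build supports, so a listed CUDAExecutionProvider does not by itself confirm that the CUDA and cuDNN libraries can be loaded. If CUDA is missing from the list, a CPU-only onnxruntime build is probably installed.

```python
def detect_providers():
    """Return the execution providers of the installed onnxruntime, or [] if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return list(ort.get_available_providers())


def preferred_provider(available):
    """Pick CUDA when the build offers it, otherwise fall back to CPU."""
    if "CUDAExecutionProvider" in available:
        return "CUDAExecutionProvider"
    return "CPUExecutionProvider"


providers = detect_providers()
print("Available providers:", providers)
print("Would prefer:", preferred_provider(providers))
```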

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.

