
Pipeline for searching and aligning contact maps for proteins, and function prediction with DeepFRI.


🍳 Metagenomic-DeepFRI

A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Proteins perform most of the work of living cells. Amino acid sequence and structural features of proteins determine a wide range of functions: from binding specificity and conferring mechanical stability, to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by advances in low-cost sequencing technologies and computational methods (e.g. gene prediction), has created a pressing need for efficient software to annotate protein databases. Metagenomic-DeepFRI addresses this need by building on efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and speeds up DeepFRI 2- to 12-fold.

📋 Pipeline stages

  1. Search for proteins similar to the query in the PDB and in the supplied FoldComp databases with MMseqs2.
  2. Find the best alignment among the MMseqs2 hits using PyOpal.
  3. Align target protein contact map to query protein with unknown structure.
  4. Run DeepFRI with the structure if found in the database, otherwise run DeepFRI with sequence only.
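
Stage 3 can be illustrated with a minimal sketch. It assumes the pairwise alignment is available as two gapped strings of equal length and the target contact map as a square 0/1 matrix. The function name `transfer_contact_map` is hypothetical and is not part of mDeepFRI; the actual implementation may handle gaps differently. Here, query residues aligned to a gap keep only their self-contact on the diagonal.

```python
def transfer_contact_map(query_aln, target_aln, target_cmap):
    """Project a target contact map onto query coordinates via an alignment."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")

    # Map each query residue index to the target residue index it is aligned with.
    q_to_t = {}
    q_idx = t_idx = 0
    for q_char, t_char in zip(query_aln, target_aln):
        if q_char != "-" and t_char != "-":
            q_to_t[q_idx] = t_idx
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1

    if t_idx != len(target_cmap):
        raise ValueError("contact map size does not match the target sequence")

    # Start from an identity matrix: every residue is in contact with itself.
    n = q_idx
    query_cmap = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

    # Copy contacts only between pairs of residues that are both aligned.
    for i, ti in q_to_t.items():
        for j, tj in q_to_t.items():
            if target_cmap[ti][tj]:
                query_cmap[i][j] = 1
    return query_cmap


# Toy example: the target has one extra residue (G), the query one extra (E).
target_cmap = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
for row in transfer_contact_map("AC-DE", "ACGD-", target_cmap):
    print(row)
```

In the toy example the contact between the first and last target residues is carried over to the query, while the contact involving the unaligned target residue is dropped.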


🛠️ Built With

🔧 Installation

  1. Install the package from PyPI.
pip install mdeepfri
  2. Run and view the help message.
mDeepFRI --help

💡 Usage

1. Prepare structural database

1.1 Existing FoldComp databases

The PDB database is downloaded and installed automatically during the first run of mDeepFRI. Because the PDB suffers from formatting inconsistencies, around 10% of structures fail during PDB alignment; each failure is reported as a WARNING. We suggest coupling the PDB search with databases of predicted structures, as this massively improves structural coverage of the protein universe. A good protein structure allows DeepFRI to annotate function in more detail. However, the sequence branch of the model carries the largest weight, so even an erroneous predicted structure has only a minor effect on the prediction. Details can be found in the original manuscript, Fig. 2A.

You can download additional databases from the website. During the first run, FASTA sequences are extracted from the FoldComp database, and an MMseqs2 database is created and indexed. You can use different databases, but be mindful that computation time might increase exponentially with the size of the database.

Tested databases:

  • afdb_swissprot
  • afdb_swissprot_v4
  • afdb_rep_v4
  • afdb_rep_dark_v4
  • afdb_uniprot_v4
  • esmatlas
  • esmatlas_v2023_02
  • highquality_clust30

ATTENTION: Please do not rename downloaded databases. FoldComp has certain inconsistencies in the way FASTA sequences are extracted (example), so the pipeline was tweaked for each database. If the database you need does not work, please report it in the issues and we will add it as soon as possible. Sorry for the inconvenience.

ATTENTION: Database creation is a very sensitive step that relies on external software. If the pipeline is interrupted during this step, the databases might be corrupted. If you are not sure about your database, rerun the pipeline with the --overwrite flag, which reruns the database creation process.

1.2 Custom FoldComp database

To use a personal database of structures, you have to create a custom FoldComp database. Download a FoldComp executable and run the following command:

foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]
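
When the compression step is scripted, the synopsis above can be assembled programmatically. The following is a minimal sketch that uses only the arguments shown in the synopsis; the helper name `foldcomp_compress_command` and the example paths are hypothetical, and the sketch assumes `-t` takes the number of threads.

```python
import shlex


def foldcomp_compress_command(source, destination=None, threads=None):
    """Build the argument list for `foldcomp compress` from the synopsis."""
    cmd = ["foldcomp", "compress"]
    if threads is not None:
        # Optional: -t number (assumed here to be the thread count)
        cmd += ["-t", str(threads)]
    cmd.append(str(source))  # required: <dir|tar(.gz)>
    if destination is not None:
        cmd.append(str(destination))  # optional: [<dir|tar|db>]
    return cmd


# Hypothetical paths, for illustration only.
cmd = foldcomp_compress_command("my_structures/", "my_structures_out", threads=4)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a single string avoids shell quoting problems with paths that contain spaces.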

2. Download models

Two versions of the models are available:

  • v1.0 - the original version from the DeepFRI publication.
  • v1.1 - a version fine-tuned on AlphaFold models and machine-generated Gene Ontology UniProt annotations. Details about v1.1 are available in the ISMB 2023 presentation by Pawel Szczerbiak.

To download the models, run:

mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt

The logging module writes its output to stderr, so use 2> to redirect it to a file. Run mDeepFRI --help to see the other available parameters.
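
The same invocation can be driven from Python, for example inside a workflow script. This is a minimal sketch that uses only the flags documented in this README (`-i`, `-d`, `-w`, `-o`, `-p`); the helper names `build_predict_command` and `run_with_log` are hypothetical. The second helper redirects stderr to a file, which mirrors the `2> log.txt` redirection.

```python
import shlex
import subprocess

VALID_MODES = {"mf", "bp", "cc", "ec"}


def build_predict_command(sequences, databases, weights, output, modes=()):
    """Assemble the `mDeepFRI predict-function` argument list."""
    unknown = set(modes) - VALID_MODES
    if unknown:
        raise ValueError(f"unknown processing modes: {sorted(unknown)}")
    cmd = ["mDeepFRI", "predict-function", "-i", str(sequences)]
    for db in databases:  # -d may be repeated, one flag per database
        cmd += ["-d", str(db)]
    cmd += ["-w", str(weights), "-o", str(output)]
    for mode in modes:  # -p may be repeated, one flag per mode
        cmd += ["-p", mode]
    return cmd


def run_with_log(cmd, log_path):
    """Run a command, writing its stderr (the log) to log_path."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stderr=log, check=False).returncode


cmd = build_predict_command(
    "/path/to/protein/sequences",
    ["/path/to/foldcomp/database/"],
    "/path/to/deepfri/weights/folder",
    "/output_path",
)
print(shlex.join(cmd))
# To execute: exit_code = run_with_log(cmd, "log.txt")
```

A non-zero exit code returned by `run_with_log` signals that the log file is worth inspecting.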

✅ Results

The output folder will contain:

  1. {database_name}.search_results.tsv
  2. query.mmseqsDB + index from the MMseqs2 search.
  3. results.tsv - the final output from the DeepFRI model.

Example output (results.tsv)

| Protein | GO_term/EC_numer | Score | Annotation | Neural_net | DeepFRI_mode | DB_hit | DB_name | Identity |
|---|---|---|---|---|---|---|---|---|
| MIP_00215364 | GO:0016798 | 0.218 | hydrolase activity, acting on glycosyl bonds | gcn | mf | MIP_00215364 | mip_rosetta_hq | 0.933 |
| 1GVH_1 | GO:0009055 | 0.217 | electron transfer activity | gnn | mf | AF-P24232-F1-model_v4 | afdb_swissprot_v4 | 1.0 |
| unaligned | 3.2.1.- | 0.215 | 3.2.1.- | cnn | ec | nan | nan | nan |

This is an example of protein annotation with the AlphaFold database.

  • Protein - the name of the protein from the FASTA file.
  • GO_term/EC_numer - predicted GO term or EC number (dependent on mode)
  • Score - DeepFRI score, translates to model confidence in prediction. Details in publication.
  • Annotation - annotation from ontology
  • Neural_net - type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). The GCN is used when structural information is available in the database, which generally allows more confident predictions. When no database protein is above the similarity cut-off (50% identity by default), the CNN is used.
  • DeepFRI_mode:
    mf = molecular_function
    bp = biological_process
    cc = cellular_component
    ec = enzyme_commission
    
  • DB_hit - name of the hit in the database. Empty if no hit was found.
  • DB_name - name of the database. Empty if no hit was found.
  • Identity - sequence identity between query and hit. Empty if no hit was found.
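
The columns described above can be loaded with the Python standard library. This is a minimal sketch, assuming results.tsv is tab-separated with the header shown in the example output; the function name `load_results` and the `min_score` threshold are illustrative. Both an empty field and the string `nan` are treated as "no hit", since the example output uses `nan`.

```python
import csv
import io


def load_results(handle, min_score=0.0):
    """Read DeepFRI results and keep rows with Score >= min_score."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Score"] = float(row["Score"])
        # An empty field or "nan" means no database hit was found.
        identity = row["Identity"]
        row["Identity"] = None if identity in ("", "nan") else float(identity)
        if row["Score"] >= min_score:
            rows.append(row)
    return rows


# The example output from this README, rebuilt as tab-separated text.
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ["Protein", "GO_term/EC_numer", "Score", "Annotation", "Neural_net",
         "DeepFRI_mode", "DB_hit", "DB_name", "Identity"],
        ["MIP_00215364", "GO:0016798", "0.218",
         "hydrolase activity, acting on glycosyl bonds", "gcn", "mf",
         "MIP_00215364", "mip_rosetta_hq", "0.933"],
        ["1GVH_1", "GO:0009055", "0.217", "electron transfer activity", "gnn",
         "mf", "AF-P24232-F1-model_v4", "afdb_swissprot_v4", "1.0"],
        ["unaligned", "3.2.1.-", "0.215", "3.2.1.-", "cnn", "ec",
         "nan", "nan", "nan"],
    ]
)

for row in load_results(io.StringIO(example), min_score=0.217):
    print(row["Protein"], row["GO_term/EC_numer"], row["Score"], row["Identity"])
```

For a real run, replace the `io.StringIO` object with `open("results.tsv", newline="")`.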

⚙️ Features

1. Prediction modes

The GO ontology contains three subontologies, defined by their root nodes:

  • Molecular Function (MF)
  • Biological Process (BP)
  • Cellular Component (CC)

Additionally, Metagenomic-DeepFRI v1.0 is able to predict Enzyme Commission (EC) numbers. By default, the tool makes predictions in all 4 categories. To select only some of them, pass the parameter -p or --processing-modes several times, e.g.:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp

2. Hierarchical database search

Different databases provide different levels of evidence. For example, PDB structures are real experimental structures, so they are considered to be the data of highest quality. Therefore, new proteins are first queried against the PDB. Computational predictions differ in quality, e.g. AlphaFold predictions are often more accurate than ESMFold predictions. We provide an opportunity to search multiple databases in a hierarchical manner. For example, if you want to search the AlphaFold database first and then ESMFold, you can pass the parameter -d or --databases several times, e.g.:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
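
The idea behind the hierarchical search can be sketched as follows. The sketch assumes the search results of each database are already available as a dictionary from query ID to `(hit_id, identity)`, and that databases are consulted in the order given. The function name `first_structure_hit`, the data layout, and the toy identifiers are hypothetical; the 0.5 threshold reflects the default 50% identity cut-off mentioned earlier.

```python
def first_structure_hit(query_id, databases, min_identity=0.5):
    """Return (db_name, hit_id, identity) from the first database with a usable hit.

    `databases` is an ordered list of (db_name, hits) pairs, highest
    level of evidence first. Returns None when no database has a hit at or
    above `min_identity`, in which case only the sequence would be used.
    """
    for db_name, hits in databases:
        hit = hits.get(query_id)
        if hit is None:
            continue
        hit_id, identity = hit
        if identity >= min_identity:
            return db_name, hit_id, identity
    return None


# Toy, made-up search results ordered by level of evidence.
databases = [
    ("pdb", {"query_1": ("hit_A", 0.92)}),
    ("alphafold", {"query_1": ("hit_B", 0.99), "query_2": ("hit_C", 0.71)}),
    ("esmfold", {"query_2": ("hit_D", 0.80), "query_3": ("hit_E", 0.31)}),
]

for query in ("query_1", "query_2", "query_3"):
    print(query, first_structure_hit(query, databases))
```

Note that `query_1` resolves to the PDB hit even though the AlphaFold hit has a higher identity: in this sketch the order of the databases, not the identity, decides which hit wins.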

3. Temporary files

The first run of mDeepFRI with a database creates temporary files needed by the pipeline. If you don't want to keep them for the next run, add the flag --remove-intermediate.

4. CPU / GPU utilization

If the threads argument is provided, the app parallelizes certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI automatically uses it for prediction; otherwise, the model runs on CPUs. Technical tip: a single instance of DeepFRI on a GPU requires 2 GB of VRAM, so every currently available GPU with CUDA support should be able to run the model.

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.

