Pipeline for searching and aligning contact maps for proteins, then running DeepFRI's GCN.

🍳 Metagenomic-DeepFRI

A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Proteins perform most of the work of living cells. The amino acid sequence and structural features of proteins determine a wide range of functions: from binding specificity and mechanical stability to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by advances in low-cost sequencing technologies and computational methods (e.g., gene prediction), has created a pressing need for efficient software to annotate protein databases. Metagenomic-DeepFRI addresses this need by building on efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and speeds up DeepFRI by a factor of 2 to 12.

📋 Pipeline stages

  1. Search for proteins similar to the query in PDB and the supplied FoldComp databases with MMSeqs2.
  2. Find the best alignment among the MMSeqs2 hits using PyOpal.
  3. Align the target protein's contact map to the query protein, whose structure is unknown.
  4. Run DeepFRI with the structure if one was found in a database; otherwise run DeepFRI with the sequence only.
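
Stage 3 can be pictured as carrying contacts from the database structure over to the query through the pairwise alignment. The sketch below illustrates that idea under simplifying assumptions and is not the package's implementation: the alignment is given as two equal-length gapped strings, the contact map is a nested list of 0/1 values, and a query residue aligned to a gap keeps only its self-contact. The function name `transfer_contact_map` and the gap handling are hypothetical.

```python
def transfer_contact_map(query_aln, target_aln, target_cmap):
    """Map a target contact map onto query residues via a gapped alignment.

    Hypothetical helper for illustration; gap handling is an assumption.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")

    # For every query residue, record the index of the target residue it is
    # aligned to, or None when it faces a gap in the target.
    query_to_target = []
    target_pos = 0
    for q_char, t_char in zip(query_aln, target_aln):
        t_index = None
        if t_char != "-":
            t_index = target_pos
            target_pos += 1
        if q_char != "-":
            query_to_target.append(t_index)

    n = len(query_to_target)
    # Start from the identity matrix: each residue contacts itself.
    query_cmap = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    # Copy a contact whenever both query residues have an aligned partner.
    for i, ti in enumerate(query_to_target):
        for j, tj in enumerate(query_to_target):
            if ti is not None and tj is not None:
                query_cmap[i][j] = target_cmap[ti][tj]
    return query_cmap


# Toy data: a 4-residue target aligned to a 4-residue query with one gap each.
toy_target_cmap = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]
for row in transfer_contact_map("MK-LV", "MKAL-", toy_target_cmap):
    print(row)
```

In the toy call, the last query residue has no aligned target residue, so its row and column contain only the self-contact.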

🛠️ Built With

🔧 Installation

  1. Clone the repository locally
git clone https://github.com/bioinf-mcb/Metagenomic-DeepFRI
cd Metagenomic-DeepFRI
  2. Set up the conda environment
conda env create --name deepfri --file environment.yml
conda activate deepfri
  3. Show the help message
mDeepFRI --help

💡 Usage

1. Prepare structural database

The PDB database is downloaded and installed automatically during the first run of mDeepFRI. You can download additional databases from the website. The app was tested with afdb_swissprot_v4. You can use other databases, but be aware that computation time might increase substantially with the size of the database.

2. Download models

Two versions of the models are available:

  • v1.0 - the original version from the DeepFRI publication.
  • v1.1 - a version fine-tuned on AlphaFold models and Gene Ontology UniProt annotations.

To download the models, run:
mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt

The logging module writes to stderr, so use 2> to redirect the log to a file. Other available parameters are listed by mDeepFRI --help.

✅ Results

The output folder will contain:

  1. {database_name}.search_results.tsv
  2. query.mmseqsDB + index from MMSeqs2 search.
  3. results.tsv - a final output from the DeepFRI model.

Example output (results.tsv)

| Protein | GO_term/EC_numer | Score | Annotation | Neural_net | DeepFRI_mode | DB_hit | DB_name | Identity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIP_00215364 | GO:0016798 | 0.218 | hydrolase activity, acting on glycosyl bonds | gcn | mf | MIP_00215364 | mip_rosetta_hq | 0.933 |
| 1GVH_1 | GO:0009055 | 0.217 | electron transfer activity | gnn | mf | AF-P24232-F1-model_v4 | afdb_swissprot_v4 | 1.0 |
| unaligned | 3.2.1.- | 0.215 | 3.2.1.- | cnn | ec | nan | nan | nan |

This is an example of protein annotation with the AlphaFold database.

  • Protein - the name of the protein from the FASTA file.
  • GO_term/EC_numer - predicted GO term or EC number (depending on the mode).
  • Score - DeepFRI score, which reflects the model's confidence in the prediction. Details are in the publication.
  • Annotation - annotation from the ontology.
  • Neural_net - type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). The GCN is used when structural information is available in the database, which generally allows more confident predictions.
  • DeepFRI_mode:
    mf = molecular_function
    bp = biological_process
    cc = cellular_component
    ec = enzyme_commission
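
The columns described above can be filtered with the standard library alone. The following is a minimal sketch, assuming results.tsv is tab-separated with a header row matching the example output; the helper name `load_confident_predictions` and the 0.2 threshold are illustrative choices, not values recommended by the authors.

```python
import csv
import io


def load_confident_predictions(handle, min_score=0.5, mode=None):
    """Return rows of a results.tsv-like file with Score >= min_score.

    Hypothetical helper; `mode` optionally restricts rows to one DeepFRI_mode.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        if mode is not None and row["DeepFRI_mode"] != mode:
            continue
        if float(row["Score"]) >= min_score:
            selected.append(row)
    return selected


# Two rows copied from the example output, joined with tabs (assumed layout).
example = io.StringIO(
    "Protein\tGO_term/EC_numer\tScore\tAnnotation\tNeural_net\t"
    "DeepFRI_mode\tDB_hit\tDB_name\tIdentity\n"
    "1GVH_1\tGO:0009055\t0.217\telectron transfer activity\tgnn\tmf\t"
    "AF-P24232-F1-model_v4\tafdb_swissprot_v4\t1.0\n"
    "unaligned\t3.2.1.-\t0.215\t3.2.1.-\tcnn\tec\tnan\tnan\tnan\n"
)
hits = load_confident_predictions(example, min_score=0.2, mode="mf")
print([(row["Protein"], row["GO_term/EC_numer"]) for row in hits])
```

For a real run, replace the in-memory example with `open("results.tsv", newline="")` pointing at the file in your output folder.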
    

⚙️ Features

1. Prediction modes

The GO ontology contains three subontologies, defined by their root nodes:

  • Molecular Function (MF)
  • Biological Process (BP)
  • Cellular Component (CC)

Additionally, Metagenomic-DeepFRI v1.0 can predict Enzyme Commission (EC) numbers. By default, the tool makes predictions in all 4 categories. To select only some of them, pass the -p or --processing-modes parameter several times, e.g.:
mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp
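
When the command is assembled from a script, repeating -p once per mode is easy to get wrong. The sketch below builds the argument list from the flags shown in this README (-i, -d, -w, -o, -p); the helper name `build_predict_command` is hypothetical, and the mode check only mirrors the four modes listed above.

```python
import shlex

VALID_MODES = {"mf", "bp", "cc", "ec"}


def build_predict_command(sequences, databases, weights, output, modes=()):
    """Assemble an `mDeepFRI predict-function` call as an argument list.

    Hypothetical helper; it only formats arguments and runs nothing.
    """
    command = ["mDeepFRI", "predict-function", "-i", str(sequences)]
    for database in databases:
        command += ["-d", str(database)]  # one -d per database
    command += ["-w", str(weights), "-o", str(output)]
    for mode in modes:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown processing mode: {mode}")
        command += ["-p", mode]  # one -p per selected mode
    return command


cmd = build_predict_command(
    "/path/to/protein/sequences",
    ["/path/to/foldcomp/database/"],
    "/path/to/deepfri/weights/folder",
    "/output_path",
    modes=["mf", "bp"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, stderr=log_file, check=True)` with `log_file` opened for writing, which captures the log in the same way as the 2> redirection.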

2. Hierarchical database search

Different databases carry different levels of evidence. PDB structures are determined experimentally, so they are considered the highest-quality data; new proteins are therefore queried against PDB first. Computational predictions vary in quality; for example, AlphaFold predictions are often more accurate than ESMFold predictions. Multiple databases can be searched in a hierarchical manner. For example, to search the AlphaFold database first and then ESMFold, pass the -d or --databases parameter several times:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
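
The effect of the database order can be sketched as a first-match lookup. This is an illustration under stated assumptions rather than the tool's code: it assumes a query that finds a hit in a higher-ranked database is not looked up in lower-ranked ones, and it represents search results as plain dictionaries. The function name `assign_hits`, the database labels, and the hit identifiers are all hypothetical.

```python
def assign_hits(queries, ranked_databases):
    """Give each query the hit from the highest-ranked database that has one.

    `ranked_databases` is a list of (name, {query: hit}) pairs ordered from
    highest to lowest priority. Hypothetical helper for illustration.
    """
    assignments = {}
    for query in queries:
        assignments[query] = None  # no structure: sequence-only prediction
        for db_name, hits in ranked_databases:
            if query in hits:
                assignments[query] = (db_name, hits[query])
                break  # stop at the first database with a hit
    return assignments


# Made-up search results, ordered PDB first, then AlphaFold, then ESMFold.
ranked = [
    ("pdb", {"query_1": "pdb_hit_a"}),
    ("alphafold", {"query_1": "af_hit_a", "query_2": "af_hit_b"}),
    ("esmfold", {"query_2": "esm_hit_b", "query_3": "esm_hit_c"}),
]
for name, hit in assign_hits(["query_1", "query_2", "query_3", "query_4"], ranked).items():
    print(name, hit)
```

Here query_1 keeps its PDB hit even though AlphaFold also has one, and query_4, which has no hit anywhere, falls back to sequence-only prediction.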

3. Temporary files

The first run of mDeepFRI with a given database creates temporary files needed by the pipeline. If you don't want to keep them for the next run, add the --remove-intermediate flag.

4. CPU / GPU utilization

If the threads argument is provided, the app parallelizes certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI automatically uses it for prediction; otherwise, the model runs on CPUs. Technical tip: a single instance of DeepFRI on GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker to report it or ask a question. If you are filing a bug report, please include as much information as you can about the issue, and try to reproduce the bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.
