Pipeline for searching and aligning protein contact maps, then running DeepFRI's GCN.

🍳 Metagenomic-DeepFRI

A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Proteins perform most of the work of living cells. The amino acid sequence and structural features of a protein determine a wide range of functions: from binding specificity and mechanical stability to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by low-cost sequencing technologies and computational methods (e.g., gene prediction), has created a pressing need for efficient software to annotate protein databases. Metagenomic-DeepFRI addresses this need by building on efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and improves the runtime of DeepFRI by 2-12 times!

📋 Pipeline stages

  1. Search for proteins similar to the query in PDB and the supplied FoldComp databases with MMseqs2.
  2. Find the best alignment among the MMseqs2 hits using PyOpal.
  3. Align the target protein's contact map to the query protein, whose structure is unknown.
  4. Run DeepFRI with the structure if one was found in a database; otherwise run DeepFRI with the sequence only.
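
Stage 3 can be pictured as re-indexing the target's contact map through the pairwise alignment. The sketch below is only an illustration, not mDeepFRI's implementation: the function name `project_contact_map`, the gapped-string alignment format, and the choice to give unaligned query residues no contacts other than with themselves are all assumptions made for the example.

```python
def project_contact_map(query_aln, target_aln, target_cmap):
    """Re-index a target contact map into query coordinates.

    query_aln / target_aln: equal-length gapped strings ('-' marks a gap).
    target_cmap: square list of lists (0/1), indexed by ungapped target position.
    Hypothetical helper, for illustration only.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")

    # Collect (query_index, target_index) for columns where both have a residue.
    pairs = []
    qi = ti = 0
    for q, t in zip(query_aln, target_aln):
        if q != "-" and t != "-":
            pairs.append((qi, ti))
        if q != "-":
            qi += 1
        if t != "-":
            ti += 1

    n_query = qi
    # Assumption: each residue contacts itself; unaligned query residues
    # receive no other contacts.
    cmap = [[1 if i == j else 0 for j in range(n_query)] for i in range(n_query)]
    for qa, ta in pairs:
        for qb, tb in pairs:
            if target_cmap[ta][tb]:
                cmap[qa][qb] = 1
    return cmap


# Toy example: the target has an extra residue (G), the query has an extra residue (D).
target_cmap = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]
query_cmap = project_contact_map("AC-DE", "ACG-E", target_cmap)
```

In the toy example the query residue D has no aligned partner, so its row keeps only the self-contact, while the A-E contact of the target is carried over to the query.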

🛠️ Built With

🔧 Installation

  1. Clone the repository locally
git clone https://github.com/bioinf-mcb/Metagenomic-DeepFRI
cd Metagenomic-DeepFRI
  2. Set up the conda environment
conda env create --name deepfri --file environment.yml
conda activate deepfri
  3. Show the help message
mDeepFRI --help

💡 Usage

1. Prepare structural database

The PDB database is downloaded and installed automatically during the first run of mDeepFRI. You can download additional databases from the FoldComp website. The app was tested with afdb_swissprot_v4. You can use other databases, but be aware that computation time can grow substantially with the size of the database.

2. Download models

Two versions of the models are available:

  • v1.0 - the original version from the DeepFRI publication.
  • v1.1 - a version fine-tuned on AlphaFold models and Gene Ontology UniProt annotations.

To download the models, run:
mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt

The logging module writes its output to stderr, so use 2> to redirect it to a file. Run mDeepFRI --help to see the other available parameters.
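
When the pipeline is launched from a Python script rather than a shell, the same redirection can be done with the standard subprocess module. The following is a minimal sketch: the helper names `build_predict_command` and `run_with_log` are hypothetical, and only the flags shown in the command above (-i, -d, -w, -o) are used.

```python
import subprocess


def build_predict_command(sequences, databases, weights, output):
    """Assemble the argument list for `mDeepFRI predict-function`."""
    cmd = ["mDeepFRI", "predict-function", "-i", str(sequences)]
    for database in databases:  # -d may be given more than once
        cmd += ["-d", str(database)]
    cmd += ["-w", str(weights), "-o", str(output)]
    return cmd


def run_with_log(cmd, log_path="log.txt"):
    """Run the command and write its stderr (the log) to log_path."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stderr=log, check=False).returncode


cmd = build_predict_command(
    "/path/to/protein/sequences",
    ["/path/to/foldcomp/database/"],
    "/path/to/deepfri/weights/folder",
    "/output_path",
)
# run_with_log(cmd)  # uncomment once mDeepFRI is installed and the paths exist
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.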

✅ Results

The output folder will contain:

  1. {database_name}.search_results.tsv
  2. metadata_skipped_ids_due_to_length.json - queries that are too long or too short (DeepFRI is designed to predict the function of proteins in the range of 60-1000 aa).
  3. query.mmseqsDB + index from the MMseqs2 search.
  4. results.tsv - the final output from the DeepFRI model.
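
To anticipate which queries will end up in metadata_skipped_ids_due_to_length.json, the length check can be reproduced before a run. This is a sketch under stated assumptions: the function name `split_by_length` is hypothetical, and treating both 60 and 1000 as inclusive bounds is a guess, since the text above only gives the range.

```python
MIN_LEN, MAX_LEN = 60, 1000  # range quoted above; inclusive bounds are an assumption


def split_by_length(sequences, min_len=MIN_LEN, max_len=MAX_LEN):
    """Split {id: sequence} into sequences within range and skipped lengths."""
    kept, skipped = {}, {}
    for seq_id, seq in sequences.items():
        if min_len <= len(seq) <= max_len:
            kept[seq_id] = seq
        else:
            skipped[seq_id] = len(seq)
    return kept, skipped


# Toy sequences with hypothetical identifiers.
sequences = {
    "too_short": "M" * 30,
    "in_range": "M" * 250,
    "too_long": "M" * 1500,
}
kept, skipped = split_by_length(sequences)
```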

Example output (results.tsv)

| Protein | GO_term/EC_numer | Score | Annotation | Neural_net | DeepFRI_mode | DB_hit | DB_name | Identity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIP_00215364 | GO:0016798 | 0.218 | hydrolase activity, acting on glycosyl bonds | gcn | mf | MIP_00215364 | mip_rosetta_hq | 0.933 |
| 1GVH_1 | GO:0009055 | 0.217 | electron transfer activity | gcn | mf | AF-P24232-F1-model_v4 | afdb_swissprot_v4 | 1.0 |
| unaligned | 3.2.1.- | 0.215 | 3.2.1.- | cnn | ec | nan | nan | nan |

This is an example of protein annotation with the AlphaFold database.

  • Protein - the name of the protein from the FASTA file.
  • GO_term/EC_numer - predicted GO term or EC number (dependent on mode)
  • Score - DeepFRI score, which reflects the model's confidence in the prediction. See the publication for details.
  • Annotation - annotation from ontology
  • Neural_net - type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). The GCN is used when structural information is available in the database, which generally allows more confident predictions.
  • DeepFRI_mode:
    mf = molecular_function
    bp = biological_process
    cc = cellular_component
    ec = enzyme_commission
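
The columns described above can be read with Python's standard csv module. This is a minimal sketch that assumes results.tsv is tab-separated with the header row shown in the example output; the function name `load_predictions` and the score threshold are hypothetical.

```python
import csv
import io


def load_predictions(handle, min_score=0.0):
    """Read results.tsv rows as dicts, keeping those with Score >= min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["Score"]) >= min_score]


# Two rows copied from the example output, joined with tabs.
header = ["Protein", "GO_term/EC_numer", "Score", "Annotation", "Neural_net",
          "DeepFRI_mode", "DB_hit", "DB_name", "Identity"]
rows = [
    ["MIP_00215364", "GO:0016798", "0.218",
     "hydrolase activity, acting on glycosyl bonds", "gcn", "mf",
     "MIP_00215364", "mip_rosetta_hq", "0.933"],
    ["unaligned", "3.2.1.-", "0.215", "3.2.1.-", "cnn", "ec", "nan", "nan", "nan"],
]
example = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"

confident = load_predictions(io.StringIO(example), min_score=0.216)
structure_based = [row["Protein"] for row in confident if row["Neural_net"] == "gcn"]
```

For a real run, replace the in-memory example with `open("results.tsv", newline="")`.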
    

⚙️ Features

1. Prediction modes

The GO ontology contains three subontologies, defined by their root nodes:

  • Molecular Function (MF)
  • Biological Process (BP)
  • Cellular Component (CC)

Additionally, Metagenomic-DeepFRI v1.0 is able to predict Enzyme Commission (EC) numbers. By default, the tool makes predictions in all 4 categories. To select only some of them, pass the parameter -p or --processing-modes several times, e.g.:
mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp
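
When the command line is generated by a script, the repeated -p options can be produced from a list of modes. A small sketch follows; the helper name `mode_flags` is hypothetical, and the accepted values are taken from the DeepFRI_mode codes listed in the results section (mf, bp, cc, ec).

```python
VALID_MODES = ("mf", "bp", "cc", "ec")  # mode codes as listed in the results section


def mode_flags(modes):
    """Turn ['mf', 'bp'] into ['-p', 'mf', '-p', 'bp'], rejecting unknown codes."""
    flags = []
    for mode in modes:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
        flags += ["-p", mode]
    return flags


flags = mode_flags(["mf", "bp"])
```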

2. Hierarchical database search

Different databases carry different levels of evidence. For example, PDB structures are experimentally determined, so they are considered the highest-quality data. New proteins are therefore queried against PDB first. Computational predictions differ in quality, e.g. AlphaFold predictions are often more accurate than ESMFold predictions. We provide an opportunity to search multiple databases in a hierarchical manner. For example, if you want to search the AlphaFold database first and then ESMFold, you can pass the parameter -d or --databases several times, e.g.:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
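
The idea of a hierarchical search can be sketched as "the first database in the list that has a hit wins". The code below illustrates that idea only. How mDeepFRI resolves hits internally is not described here, and the function name, the data layout, and the query and ESMFold entry identifiers are hypothetical.

```python
def first_structural_hit(query_id, ordered_hits):
    """Return the hit from the highest-priority database that contains the query.

    ordered_hits: list of (db_name, {query_id: (target_id, identity)}) pairs,
    ordered from most to least trusted.
    """
    for db_name, hits in ordered_hits:
        hit = hits.get(query_id)
        if hit is not None:
            target_id, identity = hit
            return {"DB_name": db_name, "DB_hit": target_id, "Identity": identity}
    return None  # no structure found: sequence-only prediction


ordered_hits = [
    ("pdb", {}),
    ("afdb_swissprot_v4", {"query_A": ("AF-P24232-F1-model_v4", 1.0)}),
    ("esmfold_db", {"query_A": ("esm_entry_1", 0.9), "query_B": ("esm_entry_2", 0.8)}),
]
hit_a = first_structural_hit("query_A", ordered_hits)
hit_b = first_structural_hit("query_B", ordered_hits)
hit_c = first_structural_hit("query_C", ordered_hits)
```

Here query_A is found in both predicted-structure databases, and the earlier one in the list is reported.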

3. Temporary files

The first run of mDeepFRI with a database creates temporary files needed by the pipeline. If you don't want to keep them for the next run, add the flag --remove-intermediate.

4. CPU / GPU utilization

If the threads argument is provided, the app parallelizes certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI automatically uses it for prediction; otherwise, the model runs on CPUs. Technical tip: a single instance of DeepFRI on a GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.
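
The 2 GB figure gives a rough upper bound on how many DeepFRI instances fit on one card. The arithmetic is sketched below; the function name is hypothetical, and the estimate ignores memory used by other processes or by the CUDA runtime itself.

```python
def max_gpu_instances(vram_gb, per_instance_gb=2.0):
    """Upper bound on concurrent instances, given about 2 GB of VRAM per instance."""
    if vram_gb < 0 or per_instance_gb <= 0:
        raise ValueError("memory sizes must be positive")
    return int(vram_gb // per_instance_gb)


instances_on_8gb = max_gpu_instances(8)
instances_on_24gb = max_gpu_instances(24)
```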

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.

Download files

Source Distribution

  • mDeepFRI-1.1.2.tar.gz (30.8 kB) - Source

Built Distributions

  • mDeepFRI-1.1.2-cp312-cp312-macosx_11_0_arm64.whl (3.2 MB) - CPython 3.12, macOS 11.0+ ARM64
  • mDeepFRI-1.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB) - CPython 3.11, manylinux: glibc 2.17+ x86-64
  • mDeepFRI-1.1.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.9 MB) - CPython 3.11, manylinux: glibc 2.17+ ARM64
  • mDeepFRI-1.1.2-cp311-cp311-macosx_11_0_arm64.whl (3.1 MB) - CPython 3.11, macOS 11.0+ ARM64
  • mDeepFRI-1.1.2-cp311-cp311-macosx_10_9_x86_64.whl (3.2 MB) - CPython 3.11, macOS 10.9+ x86-64
  • mDeepFRI-1.1.2-cp310-cp310-macosx_11_0_arm64.whl (3.1 MB) - CPython 3.10, macOS 11.0+ ARM64
  • mDeepFRI-1.1.2-cp39-cp39-macosx_11_0_arm64.whl (3.2 MB) - CPython 3.9, macOS 11.0+ ARM64
  • mDeepFRI-1.1.2-cp38-cp38-macosx_11_0_arm64.whl (3.2 MB) - CPython 3.8, macOS 11.0+ ARM64

File details

Details for the file mDeepFRI-1.1.2.tar.gz.

File metadata

  • Download URL: mDeepFRI-1.1.2.tar.gz
  • Upload date:
  • Size: 30.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for mDeepFRI-1.1.2.tar.gz
Algorithm Hash digest
SHA256 c29298539f2b92c00b0d855f50cceae25c33aaf2d7f926807f3b8d0c0d8f4759
MD5 7a61ccac6dfe5fc5bcc7d4c8e63427e3
BLAKE2b-256 9556879d3c34ae438f98557e79c35cb9cb4cfd324f3cedc972551bdae367c853

See more details on using hashes here.
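
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch; the helper names `sha256_of` and `matches_sha256` are hypothetical, and the path in the commented usage line assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with the published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, once the source distribution has been downloaded:
# matches_sha256(
#     "mDeepFRI-1.1.2.tar.gz",
#     "c29298539f2b92c00b0d855f50cceae25c33aaf2d7f926807f3b8d0c0d8f4759",
# )
```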

File details

Details for the file mDeepFRI-1.1.2-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b93f2d05723d0ad6afc28391b25c8271abd19b3bd599e4f0d14d1accb1e40810
MD5 e62e7e3b2a0caf261ce0ad7610bc562e
BLAKE2b-256 576acc3a1764f5b2c88b5697c8edc69e070cd029bcbd96cc3f24c507c5a592d0

File details

Details for the file mDeepFRI-1.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4b3cc87bd8887ab11fa10f9c7114e95ba306c1302b1a51ad3b14ba2e738eb748
MD5 135513688828f32b7dc99313e176243b
BLAKE2b-256 6863c15ca8071600272e336bc259bfa70a33b6770ae8f3e8e95410ae0067fb30

File details

Details for the file mDeepFRI-1.1.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d6fc826497ee5bd98c4be5f1f844e2b0d799c11b603d14c3ec03298f9a6c9d40
MD5 2a8fa461e68dab80e508f7a252549266
BLAKE2b-256 f7b7f6c35f94dcef886d053b9a3ef7b51c8d607463c5e750dae8c1f5ea87b40b

File details

Details for the file mDeepFRI-1.1.2-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9021de8e686b6faf291c8e07699bfb956a9ec878cbd2a8bd45872057ee103f26
MD5 41a89dd26740d6af8cee291586512931
BLAKE2b-256 17dc68cad182822e2da4cc2717acad612fd4faa3098d88eff5f081b97180274a

File details

Details for the file mDeepFRI-1.1.2-cp311-cp311-macosx_10_9_x86_64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 197ee8cb7119413f92a38f578cc5be22299eb992dd11ade936a6e3b481318851
MD5 cc67f88f4edddc2a89365d9b8db942a5
BLAKE2b-256 7432470ac0628173eee3c529cb700e72b3a72747630a9b475e91234156d55c00

File details

Details for the file mDeepFRI-1.1.2-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6a6a495f066843349859a981324f392140c081403d45279550393e58682e53e6
MD5 cfcdb9917542ca462afa97eb72e2ae69
BLAKE2b-256 3782b4cf652ad164eb1ec6d08285af05cbfdf996c10b502bc4bfb99f5df0f08f

File details

Details for the file mDeepFRI-1.1.2-cp39-cp39-macosx_11_0_arm64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 abe170ec0fae6ce3e600a6b2871dd6fb303745fe9d26fc50e476db88f376e483
MD5 35d56a956a3ae8c98d65bbc8bfe2706f
BLAKE2b-256 7b4bbc5d9c203c5c7b0115e7ff1b7724b0c45c0f86bf2ae5b0d19a8bb949f0bd

File details

Details for the file mDeepFRI-1.1.2-cp38-cp38-macosx_11_0_arm64.whl.

File hashes

Hashes for mDeepFRI-1.1.2-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ca9ea38d205ce0fe22fd5d5e7e48bf1ec9c92152ca73860fbd9d3ee9ea1ccf22
MD5 09bfeb8a21cbaa2195ff59a1dfa90ee0
BLAKE2b-256 d485ee6cd6ff2c88663db3dc95e48459f5055a052e3f8b52ff4feb478ba1c732
