Pipeline for searching and aligning contact maps for proteins, then running DeepFRI's GCN.

🍳 Metagenomic-DeepFRI

A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Proteins perform most of the work of living cells. The amino acid sequence and structural features of a protein determine a wide range of functions: from binding specificity and mechanical stability to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by advances in low-cost sequencing technologies and computational methods (e.g., gene prediction), has created a pressing need for efficient software to annotate protein databases. Metagenomic-DeepFRI addresses this need by building on efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and runs 2-12 times faster than DeepFRI.

📋 Pipeline stages

  1. Search for proteins similar to the query in PDB and the supplied FoldComp databases with MMseqs2.
  2. Find the best alignment among the MMseqs2 hits using PyOpal.
  3. Align the target protein's contact map to the query protein of unknown structure.
  4. Run DeepFRI with the structure if one was found in a database; otherwise run DeepFRI with the sequence only.
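
Stage 3 can be pictured as copying contacts from the target structure onto the query through the pairwise alignment. The following is a minimal sketch of that idea, not the code mDeepFRI runs: the function name, the gapped-string alignment format, and the choice to give unaligned query residues only a self-contact are all assumptions made for illustration.

```python
def transfer_contact_map(query_aln, target_aln, target_contacts):
    """Project a target contact map onto the query through a pairwise alignment.

    query_aln / target_aln: equal-length aligned sequences, '-' marks a gap.
    target_contacts: square 0/1 matrix (list of lists) over target residues.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")

    # Map each query residue index to the target residue aligned with it.
    query_to_target = {}
    q_idx = t_idx = 0
    for q_char, t_char in zip(query_aln, target_aln):
        if q_char != "-" and t_char != "-":
            query_to_target[q_idx] = t_idx
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1

    n_query = q_idx
    # Start from an identity matrix so that unaligned query residues
    # keep only a self-contact (an assumption of this sketch).
    query_contacts = [[int(i == j) for j in range(n_query)] for i in range(n_query)]

    # Copy every contact between two aligned residues from the target.
    for i, ti in query_to_target.items():
        for j, tj in query_to_target.items():
            query_contacts[i][j] = target_contacts[ti][tj]
    return query_contacts
```

Columns where either sequence has a gap contribute no mapping, so an insertion in the query ends up with no structural contacts in the projected map.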

🔧 Installation

  1. Clone the repo locally
git clone https://github.com/bioinf-mcb/Metagenomic-DeepFRI
cd Metagenomic-DeepFRI
  2. Set up the conda environment
conda env create --name deepfri --file environment.yml
conda activate deepfri
  3. Show the help message
mDeepFRI --help

💡 Usage

1. Prepare structural database

The PDB database is downloaded and installed automatically during the first run of mDeepFRI. You can download additional databases from the FoldComp website. The app was tested with afdb_swissprot_v4. You can use other databases, but be mindful that computation time might increase exponentially with the size of the database.

2. Download models

Two versions of the models are available:

  • v1.0 - the original version from the DeepFRI publication.
  • v1.1 - a version fine-tuned on AlphaFold models and Gene Ontology UniProt annotations.

To download the models, run:
mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt

The logging module writes to stderr, so use 2> to redirect the log to a file. Run mDeepFRI --help to see the other available parameters.
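
When the prediction is launched from a script rather than a shell, the same stderr redirection can be done with Python's subprocess module. This is a sketch under assumptions: the helper names build_predict_command and run_with_log are hypothetical, and only the flags shown in the command above are used.

```python
import subprocess


def build_predict_command(sequences, database, weights, output):
    """Assemble the predict-function call using only the flags shown above."""
    return [
        "mDeepFRI", "predict-function",
        "-i", str(sequences),
        "-d", str(database),
        "-w", str(weights),
        "-o", str(output),
    ]


def run_with_log(command, log_path):
    """Run the command and write its stderr (where the log goes) to log_path."""
    with open(log_path, "w") as log_file:
        completed = subprocess.run(command, stderr=log_file, check=False)
    return completed.returncode
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.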

✅ Results

The output folder will contain:

  1. {database_name}.search_results.tsv
  2. query.mmseqsDB and its index from the MMseqs2 search.
  3. results.tsv - the final output of the DeepFRI model.

Example output (results.tsv)

| Protein | GO_term/EC_numer | Score | Annotation | Neural_net | DeepFRI_mode | DB_hit | DB_name | Identity |
|---|---|---|---|---|---|---|---|---|
| MIP_00215364 | GO:0016798 | 0.218 | hydrolase activity, acting on glycosyl bonds | gcn | mf | MIP_00215364 | mip_rosetta_hq | 0.933 |
| 1GVH_1 | GO:0009055 | 0.217 | electron transfer activity | gnn | mf | AF-P24232-F1-model_v4 | afdb_swissprot_v4 | 1.0 |
| unaligned | 3.2.1.- | 0.215 | 3.2.1.- | cnn | ec | nan | nan | nan |

This is an example of protein annotation with the AlphaFold database.

  • Protein - the name of the protein from the FASTA file.
  • GO_term/EC_numer - predicted GO term or EC number (dependent on mode)
  • Score - DeepFRI score, translates to model confidence in prediction. Details in publication.
  • Annotation - annotation from ontology
  • Neural_net - type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). The GCN is used when structural information is available in the database and generally gives more confident predictions.
  • DeepFRI_mode:
    mf = molecular_function
    bp = biological_process
    cc = cellular_component
    ec = enzyme_commission
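
Because results.tsv is a tab-separated file, it can be post-processed with Python's standard csv module. The sketch below assumes the header row matches the example output above; the helper names and the min_score cut-off are hypothetical and carry no recommendation about which score threshold to use.

```python
import csv
import io


def load_results(handle):
    """Read a results.tsv stream into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def confident_hits(rows, min_score):
    """Keep rows whose DeepFRI score is at least min_score."""
    return [row for row in rows if float(row["Score"]) >= min_score]


# Two rows copied from the example output, joined with tabs.
SAMPLE = (
    "Protein\tGO_term/EC_numer\tScore\tAnnotation\tNeural_net\t"
    "DeepFRI_mode\tDB_hit\tDB_name\tIdentity\n"
    "MIP_00215364\tGO:0016798\t0.218\t"
    "hydrolase activity, acting on glycosyl bonds\tgcn\tmf\t"
    "MIP_00215364\tmip_rosetta_hq\t0.933\n"
    "unaligned\t3.2.1.-\t0.215\t3.2.1.-\tcnn\tec\tnan\tnan\tnan\n"
)

rows = load_results(io.StringIO(SAMPLE))
# Proteins annotated with the structure-based network.
structure_based = [row["Protein"] for row in rows if row["Neural_net"] == "gcn"]
```

For a real run, replace io.StringIO(SAMPLE) with an open file handle on results.tsv in the output folder.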
    

⚙️ Features

1. Prediction modes

The GO ontology contains three subontologies, defined by their root nodes:

  • Molecular Function (MF)
  • Biological Process (BP)
  • Cellular Component (CC)

Additionally, Metagenomic-DeepFRI v1.0 can predict Enzyme Commission (EC) numbers. By default, the tool makes predictions in all four categories. To select only some of them, pass the -p or --processing-modes parameter several times, e.g.:
mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp
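
If the command line is assembled programmatically, the repeated -p flags can be generated from a list of modes. This is a small illustrative helper, not part of mDeepFRI: the name mode_flags is hypothetical, and the accepted values are the four mode abbreviations listed in this README.

```python
VALID_MODES = ("mf", "bp", "cc", "ec")


def mode_flags(modes):
    """Expand e.g. ["mf", "bp"] into ["-p", "mf", "-p", "bp"]."""
    flags = []
    for mode in modes:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
        flags += ["-p", mode]
    return flags
```

An empty list yields no flags, which leaves the tool on its default of predicting in all four categories.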

2. Hierarchical database search

Different databases carry different levels of evidence. For example, PDB structures are experimentally determined and are therefore considered the highest-quality data, so new proteins are first queried against PDB. Computational predictions vary in quality: AlphaFold predictions are often more accurate than ESMFold predictions. You can search multiple databases hierarchically. For example, to search the AlphaFold database first and then ESMFold, pass the -d or --databases parameter once per database, in priority order:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
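
The hierarchical lookup can be thought of as "first database with a hit wins". The sketch below illustrates that rule with plain dictionaries; it is a simplified model rather than mDeepFRI's implementation, and the function name, the data layout, and the placeholder hit names are assumptions.

```python
def first_hit(query_id, databases):
    """Return (database_name, hit) from the first database containing query_id.

    databases: ordered list of (name, hits) pairs, highest priority first,
    where hits maps a query ID to its best hit in that database.
    """
    for name, hits in databases:
        if query_id in hits:
            return name, hits[query_id]
    # No structure found anywhere: the sequence-only model would be used.
    return None, None


# Toy data: PDB is searched first, then the databases in the order given.
databases = [
    ("pdb", {}),
    ("afdb_swissprot_v4", {"1GVH_1": "AF-P24232-F1-model_v4"}),
    ("esmfold_db", {"1GVH_1": "placeholder_hit_a", "query_2": "placeholder_hit_b"}),
]
```

Here 1GVH_1 resolves to the AlphaFold entry even though the lower-priority database also contains a hit for it.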

3. Temporary files

The first run of mDeepFRI with a database creates temporary files needed by the pipeline. If you don't want to keep them for the next run, add the --remove-intermediate flag.

4. CPU / GPU utilization

If the threads argument is provided, the app parallelizes certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI automatically uses it for prediction; otherwise the model runs on CPUs. Technical tip: a single instance of DeepFRI on a GPU requires 2 GB of VRAM, so every currently available GPU with CUDA support should be able to run the model.
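
The 2 GB figure gives a rough upper bound on how many DeepFRI instances a single GPU could host. The helper below is only back-of-the-envelope arithmetic with a hypothetical name; it ignores memory used by other processes and by the CUDA runtime.

```python
VRAM_PER_INSTANCE_GB = 2  # figure quoted above for one DeepFRI instance


def max_gpu_instances(total_vram_gb):
    """Upper bound on DeepFRI instances that fit in the given VRAM."""
    return int(total_vram_gb // VRAM_PER_INSTANCE_GB)
```

For example, a card with 8 GB of VRAM gives a bound of four instances.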

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.
