
Graph Network for protein-protein interface including language model features

Project description

deeprank-gnn-esm



For details, refer to our publication: https://academic.oup.com/bioinformaticsadvances/article/4/1/vbad191/7511844

For a detailed protocol on using the deeprank-gnn-esm software, refer to: https://arxiv.org/abs/2407.16375

Installation

pip install deeprank-gnn-esm

CPU only

To avoid downloading the heavy CUDA libraries (~3GB), install the CPU-only torch first:

pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
pip install deeprank-gnn-esm

GPU support

GPU support is included automatically: the default PyPI torch wheel bundles CUDA. If your system requires a specific CUDA version, install torch first:

# example for CUDA 12.1
pip install torch --extra-index-url https://download.pytorch.org/whl/cu121
pip install deeprank-gnn-esm

Check pytorch.org for the right CUDA version for your system.

Usage

As a scoring function

We provide a command-line interface for deeprank-gnn-esm that makes it easy to score protein-protein complexes:

$ deeprank-gnn-esm-predict -h
usage: deeprank-gnn-esm-predict [-h] pdb_file chain_id_1 chain_id_2 num_cores

positional arguments:
  pdb_file    Path to the PDB file.
  chain_id_1  First chain ID.
  chain_id_2  Second chain ID.
  num_cores   Number of cores 

optional arguments:
  -h, --help  show this help message and exit

Example: scoring the 1B6C complex

# download it
$ wget https://files.rcsb.org/view/1B6C.pdb -q

$ deeprank-gnn-esm-predict 1B6C.pdb A B 1
 2023-06-28 06:08:21,889 predict:64 INFO - Setting up workspace - /home/deeprank-gnn-esm/1B6C-gnn_esm_pred_A_B
 2023-06-28 06:08:21,945 predict:72 INFO - Renumbering PDB file.
 2023-06-28 06:08:22,294 predict:104 INFO - Reading sequence of PDB 1B6C.pdb
 2023-06-28 06:08:22,423 predict:131 INFO - Generating embedding for protein sequence.
 2023-06-28 06:08:22,423 predict:132 INFO - ################################################################################
 2023-06-28 06:08:32,447 predict:138 INFO - Transferred model to GPU
 2023-06-28 06:08:32,450 predict:147 INFO - Read /home/1B6C-gnn_esm_pred_A_B/all.fasta with 2 sequences
 2023-06-28 06:08:32,459 predict:157 INFO - Processing 1 of 1 batches (2 sequences)
 2023-06-28 06:08:36,462 predict:200 INFO - ################################################################################
 2023-06-28 06:08:36,470 predict:205 INFO - Generating graph, using 79 processors
 Graphs added to the HDF5 file
 Embedding added to the /home/1B6C-gnn_esm_pred_A_B/graph.hdf5 file
 2023-06-28 06:09:03,345 predict:220 INFO - Graph file generated: /home/deeprank-gnn-esm/1B6C-gnn_esm_pred_A_B/graph.hdf5
 2023-06-28 06:09:03,345 predict:226 INFO - Predicting fnat of protein complex.
 2023-06-28 06:09:03,345 predict:234 INFO - Using device: cuda:0
 # ...
 2023-06-28 06:09:07,794 predict:280 INFO - Predicted fnat for 1B6C between chainA and chainB: 0.359
 2023-06-28 06:09:07,803 predict:290 INFO - Output written to /home/deeprank-gnn-esm/1B6C-gnn_esm_pred/GNN_esm_prediction.csv

From the output above you can see that the predicted fnat for the 1B6C complex is 0.359; this value is also written to the GNN_esm_prediction.csv file.

The command above will generate a folder in the current working directory, containing the following:

1B6C-gnn_esm_pred_A_B
├── 1B6C.pdb                   #input pdb file
├── all.fasta                  #fasta sequence for the pdb input
├── 1B6C.A.pt                  #esm-2 embedding for chainA in protein 1B6C
├── 1B6C.B.pt                  #esm-2 embedding for chainB in protein 1B6C
├── graph.hdf5                 #input protein graph in hdf5 format
├── GNN_esm_prediction.hdf5    #prediction output in hdf5 format
└── GNN_esm_prediction.csv     #prediction output in csv format
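For downstream analysis, GNN_esm_prediction.csv can be parsed with Python's standard csv module. The column names used below (model, fnat) are assumptions for illustration only; inspect your own output file for the actual header.

```python
import csv
import io

# Illustrative only: the exact column layout of GNN_esm_prediction.csv
# is an assumption here; check your own output file to confirm.
sample = io.StringIO(
    "model,fnat\n"
    "1B6C,0.359\n"
)

reader = csv.DictReader(sample)
scores = {row["model"]: float(row["fnat"]) for row in reader}
print(scores)
```

In practice you would pass the real file path to `open()` instead of the inline `StringIO` sample.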

As a framework

Note about input pdb files

To ensure the mapping between interface residues and ESM-2 embeddings is correct, make sure that, for every chain, the residue numbering in the PDB file is continuous and starts at residue 1.

We provide a script (scripts/pdb_renumber.py) that performs this renumbering.
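The renumbering step can be sketched in a few lines of standard-library Python. This is only an illustration of the idea; scripts/pdb_renumber.py is the supported tool, and unlike this sketch it also handles details such as insertion codes.

```python
# Minimal sketch (stdlib only) of continuous per-chain renumbering.
# Assumes fixed-column PDB format: chain ID at column 22, resSeq at
# columns 23-26; insertion codes are ignored here.
def renumber_pdb_lines(lines):
    """Remap residue numbers so each chain starts at 1 and counts up."""
    mapping = {}   # (chain, original resSeq) -> new number
    counters = {}  # chain -> last number assigned
    out = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 26:
            chain = line[21]
            key = (chain, line[22:26])
            if key not in mapping:
                counters[chain] = counters.get(chain, 0) + 1
                mapping[key] = counters[chain]
            line = line[:22] + f"{mapping[key]:>4}" + line[26:]
        out.append(line)
    return out

demo = renumber_pdb_lines([
    "ATOM      1  N   MET A  10",
    "ATOM      2  CA  MET A  10",
    "ATOM      3  N   ALA A  11",
    "ATOM      4  N   GLY B   5",
])
```

Here residues 10 and 11 of chain A become 1 and 2, and residue 5 of chain B becomes 1, satisfying the numbering requirement above.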

Generate esm-2 embeddings for your protein

  • To generate FASTA sequences from PDB files, use the get_fasta.py script:

    usage: get_fasta.py [-h] pdb_file_path chain_id1 chain_id2
    
    positional arguments:
      pdb_file_path  Path to the directory containing PDB files
      chain_id1      Chain ID for the first sequence
      chain_id2      Chain ID for the second sequence
    
    options:
      -h, --help         show this help message and exit
    
    
    python scripts/get_fasta.py tests/data/pdb/1ATN/ A B
    
  • To generate embeddings in bulk from the combined FASTA file, use the extract.py script provided with the ESM-2 package:

    $ python esm_2_installation_location/scripts/extract.py \
        esm2_t33_650M_UR50D \
        all.fasta \
        tests/data/embedding/1ATN/ \
        --repr_layers 0 32 33 \
        --include mean per_tok
    

    Replace 'esm_2_installation_location' with your ESM-2 installation location, 'all.fasta' with the FASTA file generated above, and 'tests/data/embedding/1ATN/' with the output folder for the ESM-2 embeddings.
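The FASTA-extraction step can be approximated with a standard-library sketch that reads one residue per CA atom. The shipped get_fasta.py is the supported route; the helper name `chain_sequence` below is hypothetical, and non-standard residues fall back to "X".

```python
# Sketch only: read one residue per CA atom from fixed-column ATOM records.
# Assumes atom name at columns 13-16, resName at 18-20, chain ID at 22.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def chain_sequence(lines, chain_id):
    """Return the one-letter sequence of one chain."""
    seq = []
    for line in lines:
        # one residue per CA atom keeps alternate atoms from duplicating it
        if (line.startswith("ATOM")
                and line[12:16].strip() == "CA"
                and line[21] == chain_id):
            seq.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(seq)

atoms = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "ATOM      3  CA  ALA A   2",
    "ATOM      4  CA  GLY B   1",
]
print(chain_sequence(atoms, "A"), chain_sequence(atoms, "B"))
```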

Generate graph

  • Example code to generate residue graphs in hdf5 format:

    from deeprank_gnn.GraphGenMP import GraphHDF5
    
    pdb_path = "tests/data/pdb/1ATN/"
    pssm_path = "tests/data/pssm/1ATN/"
    embedding_path = "tests/data/embedding/1ATN/"
    nproc = 20
    outfile = "1ATN_residue.hdf5"
    
    GraphHDF5(
        pdb_path = pdb_path,
        pssm_path = pssm_path,
        embedding_path = embedding_path,
        graph_type = "residue",
        outfile = outfile,
        nproc = nproc,    #number of cores to use
        tmpdir="./tmpdir")
    
  • Example code to add continuous or binary targets to the hdf5 file

    import h5py
    import random

    # open the graph file in read/write mode and attach target values
    hdf5_file = h5py.File('1ATN_residue.hdf5', "r+")
    for mol in hdf5_file.keys():
        fnat = random.random()  # placeholder: use the real target value here
        bin_class = [1 if fnat > 0.3 else 0]
        hdf5_file.create_dataset(f"/{mol}/score/binclass", data=bin_class)
        hdf5_file.create_dataset(f"/{mol}/score/fnat", data=fnat)
    hdf5_file.close()
    
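To sanity-check the targets you just wrote, you can read them back with h5py. The snippet below writes one mock entry to a temporary file and reads it back; the entry name "1ATN_1w" and the value 0.42 are illustrative only, not names the software produces.

```python
import os
import tempfile

import h5py

# Write one mock entry mirroring the target layout used above
# ("/<mol>/score/fnat" and "/<mol>/score/binclass"), then read it back.
path = os.path.join(tempfile.mkdtemp(), "targets_check.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("/1ATN_1w/score/fnat", data=0.42)
    f.create_dataset("/1ATN_1w/score/binclass", data=[1])

with h5py.File(path, "r") as f:
    fnat = f["1ATN_1w/score/fnat"][()]       # scalar dataset
    binclass = f["1ATN_1w/score/binclass"][:]  # 1-element array
print(fnat, binclass)
```

The same read pattern, pointed at your real 1ATN_residue.hdf5, confirms the targets landed where the training code expects them.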

Use pre-trained models to predict

  • Example code to use pre-trained deeprank-gnn-esm model

    from deeprank_gnn.ginet import GINet
    from deeprank_gnn.NeuralNet import NeuralNet
    
    database_test = "1ATN_residue.hdf5"
    gnn = GINet
    target = "fnat"
    edge_attr = ["dist"]
    threshold = 0.3
    pretrained_model = 'deeprank-GNN-esm/paper_pretrained_models/scoring_of_docking_models/gnn_esm/treg_yfnat_b64_e20_lr0.001_foldall_esm.pth.tar'
    node_feature = ["type", "polarity", "bsa", "charge", "embedding"]
    device_name = "cuda:0"
    num_workers = 10
    
    model = NeuralNet(
        database_test,
        gnn,
        device_name = device_name,
        edge_feature = edge_attr,
        node_feature = node_feature,
        target = target,
        num_workers = num_workers,
        pretrained_model = pretrained_model,
        threshold = threshold)
    
    model.test(hdf5 = "tmpdir/GNN_esm_prediction.hdf5")
    

Download files

Download the file for your platform.

Source Distribution

deeprank_gnn_esm-1.0.1.tar.gz (635.1 kB)

Uploaded Source

Built Distribution


deeprank_gnn_esm-1.0.1-py3-none-any.whl (631.1 kB)

Uploaded Python 3

File details

Details for the file deeprank_gnn_esm-1.0.1.tar.gz.

File metadata

  • Download URL: deeprank_gnn_esm-1.0.1.tar.gz
  • Upload date:
  • Size: 635.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for deeprank_gnn_esm-1.0.1.tar.gz:

  • SHA256: bb4feea0762c3b0838d0676e69b51c52abd985d16fb2ff5931c18f42e66918ec
  • MD5: c6e2ca01f5710e888091efc96bb69602
  • BLAKE2b-256: f41ea6191282bc3da30ae60ddd923898b74ab11fb2713132eecc0361eaed4165


Provenance

The following attestation bundles were made for deeprank_gnn_esm-1.0.1.tar.gz:

Publisher: publish.yml on haddocking/deeprank-gnn-esm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file deeprank_gnn_esm-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for deeprank_gnn_esm-1.0.1-py3-none-any.whl:

  • SHA256: e8a770760009463632d4e7a95cdb65294dcd5444ccd2bafc3a2f4784196461ce
  • MD5: 6ec75d0e741d4ae5effadf8197a5d9ca
  • BLAKE2b-256: 315d358462d3371e763436170c56d977df20fe64c76a62aa44a163e072d4c81e


Provenance

The following attestation bundles were made for deeprank_gnn_esm-1.0.1-py3-none-any.whl:

Publisher: publish.yml on haddocking/deeprank-gnn-esm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
