PandaDock - Molecular Docking with GNN Scoring
SE(3)-Equivariant GNN Scoring for Molecular Docking
Overview
PandaDock v4.0 features an SE(3)-equivariant Graph Neural Network (GNN) scoring function whose predictions correlate with experimental binding affinities at Pearson R = 0.88 on PDBbind, R = 0.82 on the ULVSH test split, and R = 0.81 on the BindingDB test set. The hybrid docking workflow combines traditional pose generation with GNN rescoring of the generated poses.
Key Features
- PandaDock-GNN: SE(3)-equivariant scoring achieving Pearson R = 0.88 on PDBbind
- Hybrid Docking: Combined pose generation + GNN rescoring (recommended workflow)
- Universal Rescorer: Rescore poses from ANY docking tool (Vina, Glide, GOLD, etc.)
- Vina-Style Scoring: AutoDock Vina empirical weights as default scoring
- Multi-Task Learning: Joint pKd/pEC50 regression + activity classification
- Heterogeneous Graphs: Separate protein/ligand node types with interaction edges
- Specialized Modes: Flexible, metal coordination, and tethered docking
Benchmark Performance
PDBbind v2020 Refined Set (5,316 complexes)
| Metric | Value |
|---|---|
| Pearson R | 0.88 |
| Spearman R | 0.88 |
| RMSE | 0.93 pK units |
| MAE | 0.68 pK units |
| Within 1.0 pK | 77.5% |
| Within 1.5 pK | 90.5% |
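The metrics in the table above are standard regression statistics. As a minimal sketch of how they can be computed from paired experimental and predicted pK values (the function name and the sample numbers are illustrative, not PandaDock code or PandaDock results):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return Pearson R, RMSE, MAE and the fraction of predictions within 1.0 pK."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    return {
        "pearson_r": cov / math.sqrt(var_t * var_p),
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "within_1.0": sum(abs(e) <= 1.0 for e in errors) / n,
    }

# Illustrative values only
experimental = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.0, 6.5, 9.5]
metrics = regression_metrics(experimental, predicted)
print(metrics)
```

"Within 1.5 pK" follows the same pattern with the threshold changed to 1.5.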
ULVSH Dataset (942 compounds, 10 protein targets)
| Method | Type | Pearson R | N |
|---|---|---|---|
| PandaDock-GNN (test) | ML Scoring | 0.82 | 95 |
| PandaDock-GNN (full) | ML Scoring | 0.67 | 942 |
| VM2 | ULVSH Baseline | 0.15 | 942 |
| PM6 | ULVSH Baseline | 0.08 | 939 |
| Hyde | ULVSH Baseline | 0.02 | 942 |
| Gnina | ULVSH Baseline | 0.01 | 941 |
BindingDB Dataset (8,891 protein-ligand complexes)
| Training Configuration | Test Pearson R | Test RMSE | N (train) |
|---|---|---|---|
| BindingDB Only | 0.81 | - | 7,113 |
| BindingDB + ULVSH | 0.79 | 0.96 | 7,866 |
| BindingDB + ULVSH + PDBbind | 0.49 | 1.37 | 12,118 |
Note: Combined training with PDBbind shows reduced performance due to affinity scale differences (pKd vs pEC50). For best results, train on datasets with compatible affinity measurements.
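Both pKd and pEC50 are negative base-10 logarithms of a molar concentration, but they measure different quantities (binding versus functional response), which is why mixing them can hurt training. A small sketch, assuming hypothetical record fields `type`, `value`, and `unit`, that converts concentrations to pK and flags a mixed dataset:

```python
import math

UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def to_pk(value, unit="nM"):
    """Convert a concentration to pK = -log10(concentration in mol/L)."""
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])

def measurement_types(records):
    """Return the sorted set of measurement types present in the records."""
    return sorted({record["type"] for record in records})

# Illustrative records only; the field names are assumptions for this sketch
records = [
    {"id": "a", "type": "Kd", "value": 10.0, "unit": "nM"},
    {"id": "b", "type": "EC50", "value": 2.5, "unit": "uM"},
]
pk_values = {r["id"]: to_pk(r["value"], r["unit"]) for r in records}
mixed = len(measurement_types(records)) > 1
if mixed:
    print("Warning: dataset mixes measurement types:", measurement_types(records))
```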
Key Results:
- PandaDock-GNN achieves R = 0.88 on PDBbind (5,316 complexes)
- R = 0.81 on BindingDB test set (889 complexes)
- 5.5x higher Pearson R than the best ULVSH baseline (0.82 on the 95-compound test split vs. 0.15 for VM2 on the full set)
- Activity classification AUC = 0.94 on ULVSH test set
Installation
Prerequisites
- Python 3.8 or higher
- Conda package manager (recommended for RDKit)
Basic Installation
# Clone repository
git clone https://github.com/pritampanda15/PandaDock.git
cd PandaDock
# Create conda environment with RDKit
conda create -n pandadock python=3.10
conda activate pandadock
conda install -c conda-forge rdkit
# Install PandaDock
pip install -e .
GNN Installation (Recommended)
# Install PyTorch and PyTorch Geometric for GNN support
pip install -e ".[gnn]"
# Or manually:
pip install torch torch-geometric torch-scatter torch-sparse
For detailed installation instructions, see INSTALL.md.
Quick Start
Download Pre-trained Model (Recommended)
Get started immediately with the pre-trained model:
# Download the pre-trained model (~82 MB)
pandadock gnn download-model
# Model is saved to models/pandadock_gnn_v4.pt
Hybrid Docking (Recommended)
The hybrid workflow combines traditional pose generation with GNN rescoring for best accuracy:
# Using pre-trained model
pandadock hybrid -r protein.pdb -l ligand.sdf \
--center 10 20 30 --box 20 20 20 \
-m models/pandadock_gnn_v4.pt \
-o results/
# Or train your own model first
pandadock gnn train -d ULVSH/ -o models/ --epochs 100
pandadock hybrid -r protein.pdb -l ligand.sdf \
--center 10 20 30 --box 20 20 20 \
-m models/best_model.pt \
-o results/
Traditional Docking
# Simple docking with Vina-style scoring
pandadock dock -r protein.pdb -l ligand.sdf \
--center 10 20 30 --box 20 20 20 \
-o results/
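The `--center` and `--box` values define the search box in angstroms. If a reference ligand is available, one common way to pick them is to center the box on the ligand and pad its extent. The sketch below assumes that convention; `box_from_coords` and the 5 Å padding are hypothetical choices, not the behavior of pandadock-gridbox:

```python
def box_from_coords(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around the given atoms.

    center is the midpoint of the coordinate extent on each axis;
    size is the extent plus `padding` angstroms on both sides.
    """
    axes = list(zip(*coords))  # [(x1, x2, ...), (y1, ...), (z1, ...)]
    center = tuple((max(a) + min(a)) / 2 for a in axes)
    size = tuple((max(a) - min(a)) + 2 * padding for a in axes)
    return center, size

# Illustrative reference-ligand coordinates
ligand_coords = [(8.0, 18.0, 28.0), (12.0, 22.0, 32.0), (10.0, 20.0, 30.0)]
center, size = box_from_coords(ligand_coords)

# Format the values as arguments for the dock/hybrid commands shown above
args = "--center {:g} {:g} {:g} --box {:g} {:g} {:g}".format(*center, *size)
print(args)
```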
GNN Prediction Only
# Predict binding affinity for a pre-docked complex
pandadock gnn predict -m model.pt -p protein.mol2 -l ligand.mol2
Universal Rescorer (NEW)
Rescore poses from ANY docking tool using the GNN:
# Rescore poses from AutoDock Vina
pandadock gnn rescore -m model.pt -r receptor.pdb -p vina_out.sdf -o ranked.csv
# Rescore poses from pandadock-flex
pandadock gnn rescore -m model.pt -r protein.pdb -p flex_poses.sdf --output-sdf ranked.sdf
# Rescore poses from Glide, GOLD, or any other tool
pandadock gnn rescore -m model.pt -r protein.pdb -p docked_poses.sdf
Compare Against Baselines
# Benchmark GNN against all baseline methods
pandadock gnn compare -m model.pt -d ULVSH/ -o comparison/
Commands
Core Commands
| Command | Description |
|---|---|
| pandadock dock | Traditional docking with Vina-style scoring |
| pandadock hybrid | Hybrid docking with GNN rescoring (recommended) |
GNN Commands
| Command | Description |
|---|---|
| pandadock gnn download-model | Download pre-trained model (~82 MB) |
| pandadock gnn train | Train GNN model on dataset (ULVSH, PDBbind, or combined) |
| pandadock gnn predict | Predict binding affinity for a single complex |
| pandadock gnn rescore | Universal rescorer for poses from ANY docking tool |
| pandadock gnn benchmark | Benchmark model performance on test set |
| pandadock gnn compare | Compare against baseline scoring methods |
Specialized Docking
| Command | Description |
|---|---|
| pandadock-flex | Flexible/induced-fit docking |
| pandadock-metal | Metal coordination docking |
| pandadock-tethered | Constrained docking near reference |
Utility Tools
| Command | Description |
|---|---|
| pandadock-prepare | Prepare ligands (add H, generate 3D) |
| pandadock-gridbox | Generate grid box configurations |
| pandadock-report | Generate analysis reports |
Universal GNN Rescorer
The pandadock gnn rescore command allows you to rescore docked poses from any docking software using the SE(3)-equivariant GNN:
Supported Input
- AutoDock Vina output (SDF/PDBQT converted to SDF)
- Glide poses (SDF)
- GOLD poses (SDF)
- pandadock-flex flexible docking poses
- pandadock-metal metal coordination poses
- pandadock-tethered constrained poses
- Any multi-conformer SDF file
Usage
pandadock gnn rescore -m model.pt -r receptor.pdb -p poses.sdf [OPTIONS]
Options:
-m, --model PATH Trained GNN model checkpoint (required)
-r, --receptor PATH Receptor PDB or MOL2 file (required)
-p, --poses PATH Multi-conformer SDF with poses (required)
-o, --output PATH Output CSV with ranked poses (default: rescored_poses.csv)
--output-sdf PATH Output SDF with GNN scores as properties
--site-radius FLOAT Binding site extraction radius (default: 10 Å)
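The `--site-radius` option controls how much of the receptor is kept around the ligand. A minimal sketch of radius-based site extraction, assuming the site is defined as every protein atom within the radius of any ligand atom (the exact criterion PandaDock uses is not stated here, and the atom dictionaries are a hypothetical representation):

```python
import math

def extract_binding_site(protein_atoms, ligand_coords, radius=10.0):
    """Keep protein atoms lying within `radius` angstroms of any ligand atom."""
    return [
        atom
        for atom in protein_atoms
        if any(math.dist(atom["xyz"], xyz) <= radius for xyz in ligand_coords)
    ]

# Illustrative coordinates only
protein_atoms = [
    {"name": "CA", "xyz": (0.0, 0.0, 0.0)},
    {"name": "CB", "xyz": (9.0, 0.0, 0.0)},
    {"name": "CG", "xyz": (25.0, 0.0, 0.0)},
]
ligand_coords = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.0)]

site = extract_binding_site(protein_atoms, ligand_coords)
print([atom["name"] for atom in site])
```

A larger radius gives the model more context at the cost of a bigger graph.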
Example Workflow
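# Step 1: Run docking with your preferred tool (AutoDock Vina writes PDBQT output)
vina --receptor protein.pdbqt --ligand ligand.pdbqt \
--center_x 10 --center_y 20 --center_z 30 \
--size_x 20 --size_y 20 --size_z 20 \
--out poses.pdbqt
# Convert the PDBQT poses to a multi-conformer SDF, e.g. with Open Babel
obabel poses.pdbqt -O poses.sdf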
# Step 2: Rescore with PandaDock-GNN
pandadock gnn rescore -m model.pt -r protein.pdb -p poses.sdf \
-o ranked.csv --output-sdf ranked.sdf
# Output CSV columns:
# pose_name, pose_index, gnn_pKd, gnn_energy, activity_prob, predicted_active, gnn_rank
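The ranked CSV can be post-processed with any tabular tool. As a small sketch using only the standard library, the snippet below parses the columns listed above and picks the top-ranked pose. The sample rows are made up for illustration, and the `True`/`False` formatting of `predicted_active` is an assumption:

```python
import csv
import io

# Made-up rows in the documented column layout
SAMPLE = """pose_name,pose_index,gnn_pKd,gnn_energy,activity_prob,predicted_active,gnn_rank
pose_1,0,6.4,-8.7,0.81,True,2
pose_2,1,7.1,-9.7,0.93,True,1
pose_3,2,5.2,-7.1,0.35,False,3
"""

def load_ranked_poses(handle):
    """Read the rescoring CSV and return rows sorted by gnn_rank (1 = best)."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        row["gnn_pKd"] = float(row["gnn_pKd"])
        row["activity_prob"] = float(row["activity_prob"])
        row["gnn_rank"] = int(row["gnn_rank"])
    return sorted(rows, key=lambda row: row["gnn_rank"])

# With a real file, use: open("ranked.csv", newline="") instead of io.StringIO
poses = load_ranked_poses(io.StringIO(SAMPLE))
best = poses[0]
print(best["pose_name"], best["gnn_pKd"])
```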
Output SDF Properties
When using --output-sdf, each molecule gets these properties:
- GNN_pKd - Predicted pKd/pKi value
- GNN_Energy - Predicted binding energy (kcal/mol)
- GNN_Activity - Activity probability (0-1)
- GNN_Rank - Rank based on GNN score (1 = best)
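This section does not say how GNN_Energy is obtained from GNN_pKd. For orientation, the standard thermodynamic relation between a dissociation constant and binding free energy is ΔG = -RT ln(10) · pKd, about -1.36 kcal/mol per pK unit at 298.15 K. The sketch below applies that textbook relation; treating it as PandaDock's conversion is an assumption:

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol * K)

def pkd_to_free_energy(pkd, temperature=298.15):
    """Binding free energy in kcal/mol from pKd via dG = -R * T * ln(10) * pKd."""
    return -GAS_CONSTANT * temperature * math.log(10.0) * pkd

def free_energy_to_pkd(delta_g, temperature=298.15):
    """Inverse of pkd_to_free_energy."""
    return -delta_g / (GAS_CONSTANT * temperature * math.log(10.0))

energy = pkd_to_free_energy(7.0)
print(round(energy, 2))  # roughly -9.55 kcal/mol for pKd = 7
```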
GNN Architecture
PandaDock-GNN uses an SE(3)-equivariant heterogeneous graph neural network:
Input: Protein-Ligand Complex
|
+-- MOL2/PDB/SDF Parser --> Atom coordinates, types, charges
|
+-- Graph Builder --> HeteroData graph
| - Protein nodes (56 features)
| - Ligand nodes (56 features)
| - Interaction edges (23 features, 5 Å cutoff)
|
+-- EGNN Layers x 6 (SE(3)-equivariant message passing)
| - Coordinate updates preserve symmetry
| - Edge attention mechanism
|
+-- Attention Pooling --> Graph-level representation
|
+-- Prediction Heads
- pKd/pEC50 regression
- Activity classification (sigmoid)
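To illustrate what "coordinate updates preserve symmetry" means, here is a simplified pure-Python sketch of an EGNN-style coordinate update in the spirit of Satorras et al. (2021): each atom moves along the difference vectors to the other atoms, scaled by a weight that depends only on the squared distance. In a real layer the weight comes from a learned network over node and edge features; the fixed `weight` function here is a stand-in, and none of this is PandaDock's actual code.

```python
import math

def egnn_coord_update(coords, edge_weight):
    """x_i' = x_i + sum over j != i of (x_i - x_j) * edge_weight(|x_i - x_j|^2)."""
    updated = []
    for i, xi in enumerate(coords):
        delta = [0.0, 0.0, 0.0]
        for j, xj in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            w = edge_weight(sum(d * d for d in diff))  # rotation-invariant scalar
            for k in range(3):
                delta[k] += w * diff[k]
        updated.append(tuple(a + d for a, d in zip(xi, delta)))
    return updated

def rotate_z(point, angle):
    """Rotate a point about the z axis."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def transform(point, angle=0.7, shift=(1.0, -2.0, 0.5)):
    """Apply a rigid motion: rotation about z followed by a translation."""
    return tuple(a + b for a, b in zip(rotate_z(point, angle), shift))

def weight(squared_distance):
    """Stand-in for the learned edge network."""
    return 0.1 / (1.0 + squared_distance)

coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 1.0)]

# Equivariance check: update-then-transform equals transform-then-update
update_then_move = [transform(p) for p in egnn_coord_update(coords, weight)]
move_then_update = egnn_coord_update([transform(p) for p in coords], weight)
max_gap = max(
    abs(u - v)
    for p, q in zip(update_then_move, move_then_update)
    for u, v in zip(p, q)
)
print(max_gap)
```

Because the weights depend only on distances and the updates only on difference vectors, rotating or translating the complex rotates or translates the output in the same way, so predicted affinities do not depend on the input pose's frame of reference.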
Node Features (56 dims):
- Atom type one-hot (10)
- SYBYL atom type (16)
- Partial charge (1)
- Hybridization (4)
- Aromaticity, H-bond donor/acceptor (4)
- Residue type (20, protein only)
- Backbone flag (1)
Edge Features (23 dims):
- Distance (1)
- Gaussian RBF expansion (16)
- Bond type one-hot (4)
- Interaction type flags (2)
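A Gaussian RBF expansion turns a single distance into a smooth vector that is easier for a network to use than the raw scalar. The sketch below assembles a 23-dimensional edge vector matching the layout above. The evenly spaced centers between 0 and the 5 Å cutoff, the width, and the function names are assumptions for illustration:

```python
import math

def gaussian_rbf(distance, num_centers=16, cutoff=5.0):
    """Expand a distance into Gaussian basis values with centers on [0, cutoff]."""
    step = cutoff / (num_centers - 1)
    gamma = 1.0 / (step * step)  # width tied to the center spacing (assumed)
    return [
        math.exp(-gamma * (distance - k * step) ** 2) for k in range(num_centers)
    ]

def edge_features(distance, bond_type_index=None, interaction_flags=(0, 0)):
    """Concatenate distance (1) + RBF (16) + bond one-hot (4) + flags (2)."""
    bond_one_hot = [0.0] * 4
    if bond_type_index is not None:
        bond_one_hot[bond_type_index] = 1.0
    flags = [float(flag) for flag in interaction_flags]
    return [distance] + gaussian_rbf(distance) + bond_one_hot + flags

# A non-bonded protein-ligand contact at 2.0 angstroms with one flag set
features = edge_features(2.0, bond_type_index=None, interaction_flags=(1, 0))
rbf = gaussian_rbf(2.0)
print(len(features), rbf.index(max(rbf)))
```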
Scoring Functions
| Function | Description | Use Case |
|---|---|---|
| vina | AutoDock Vina empirical scoring (default) | General docking |
| physics_based | Lennard-Jones + electrostatics | Detailed energy analysis |
Output Files
Dock Command
docking_output/
+-- complex1.pdb, complex2.pdb, ... # Protein-ligand complexes
+-- pose1.pdb, pose2.pdb, ... # Ligand poses only
+-- docking_results.json # Complete results with energies
+-- interaction_analysis.json # Detailed interactions
+-- binding_affinities.png # Affinity distribution
Hybrid Command
hybrid_output/
+-- hybrid_results.csv # Rankings with GNN + Vina scores
+-- pose_1_pec50_X.XX.pdb # Top poses with pEC50 in filename
+-- complex_1.pdb, ... # Protein-ligand complexes
Rescore Command
rescored_poses.csv # Ranked poses with GNN scores
ranked.sdf (optional) # SDF with GNN properties
Training Your Own GNN Model
PandaDock supports training on three dataset formats: ULVSH, PDBbind, and BindingDB. For detailed dataset preparation instructions, see the Dataset Preparation Guide.
Dataset Requirements
| Dataset | Format | Key Files |
|---|---|---|
| ULVSH | Directory | vitro.tsv + protein.mol2, ligand.mol2, site.mol2 per compound |
| PDBbind | Directory | INDEX_refined_data.2020 + {pdb}_pocket.pdb, {pdb}_ligand.mol2 |
| BindingDB | TSV file | TSV with complex_id, protein_file, ligand_file, pK columns |
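Before training on a BindingDB-style TSV, it can help to check that the required columns are present and that the pK values parse as numbers. A minimal sketch using the column names from the table above; the file paths in the sample are hypothetical:

```python
import csv
import io

REQUIRED_COLUMNS = ("complex_id", "protein_file", "ligand_file", "pK")

def read_manifest(handle):
    """Parse a tab-separated manifest and validate its columns and pK values."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = []
    for row in reader:
        row["pK"] = float(row["pK"])  # raises ValueError on non-numeric entries
        rows.append(row)
    return rows

# Hypothetical two-row manifest
SAMPLE = (
    "complex_id\tprotein_file\tligand_file\tpK\n"
    "cplx_001\tstructures/cplx_001_protein.pdb\tstructures/cplx_001_ligand.sdf\t7.2\n"
    "cplx_002\tstructures/cplx_002_protein.pdb\tstructures/cplx_002_ligand.sdf\t5.9\n"
)
manifest = read_manifest(io.StringIO(SAMPLE))
print(len(manifest), manifest[0]["complex_id"])
```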
Single Dataset Training
# Train on ULVSH (942 compounds, 10 targets)
pandadock gnn train -d ULVSH/ -o models/ --epochs 100
# Train on PDBbind (5,316 complexes)
pandadock gnn train -p PDBbind/ -o models/ --epochs 100
# Train on BindingDB (custom TSV file)
pandadock gnn train -b bindingdb_affinity.tsv -o models/ --epochs 100
Combined Dataset Training (Recommended)
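Combining datasets can improve generalization when the affinity measurements are compatible. Use --balanced to prevent larger datasets from dominating. In the BindingDB benchmark above, adding PDBbind to BindingDB + ULVSH lowered the test Pearson R from 0.79 to 0.49, so check held-out performance before adopting a combined model: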
# BindingDB + ULVSH (recommended for screening)
pandadock gnn train -b bindingdb.tsv -d ULVSH/ -o models/ \
--balanced --epochs 100
# ULVSH + PDBbind (recommended for structure-based)
pandadock gnn train -d ULVSH/ -p PDBbind/ -o models/ \
--balanced --epochs 200
# All three datasets
pandadock gnn train -d ULVSH/ -p PDBbind/ -b bindingdb.tsv -o models/ \
--balanced --epochs 200 --batch-size 32
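One common way to balance sampling across datasets of different sizes is to weight each sample inversely to the size of its source dataset, so that every dataset contributes the same expected share of each batch. The sketch below shows that idea; whether `--balanced` works exactly this way is an assumption, and the dataset sizes are illustrative:

```python
import random
from collections import Counter

def balanced_weights(dataset_labels):
    """Per-sample weights giving every dataset the same total sampling mass."""
    counts = Counter(dataset_labels)
    return [1.0 / (len(counts) * counts[label]) for label in dataset_labels]

# Illustrative imbalance: 80 samples from one dataset, 20 from another
labels = ["bindingdb"] * 80 + ["ulvsh"] * 20
weights = balanced_weights(labels)

# Draw with replacement using the weights; both datasets appear about equally
rng = random.Random(0)
draws = rng.choices(labels, weights=weights, k=10000)
share = Counter(draws)["ulvsh"] / len(draws)
print(round(share, 2))
```

Without weighting, the smaller dataset would make up only 20% of the draws in this example.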
Training Options
| Option | Default | Description |
|---|---|---|
| --epochs | 100 | Number of training epochs |
| --batch-size | 32 | Batch size (reduce if out of memory) |
| --hidden-dim | 256 | Hidden layer dimension |
| --num-layers | 6 | Number of EGNN layers |
| --balanced | off | Balance sampling across datasets |
| --patience | 20 | Early stopping patience |
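Early stopping with a patience value usually means that training halts once the validation loss has failed to improve for that many consecutive epochs, and the best checkpoint seen so far is kept. A minimal sketch of that logic, assuming this conventional definition (the function name is hypothetical and the losses are made up):

```python
def run_with_patience(val_losses, patience=20):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stale_epochs = loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return best_epoch, epoch  # stop early
    return best_epoch, len(val_losses) - 1

# Made-up validation curve
losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
stopped = run_with_patience(losses, patience=3)
full_run = run_with_patience(losses, patience=20)
print(stopped, full_run)
```

With a patience of 3 the run stops at epoch 4 and misses the later improvement at epoch 5, which is why a larger patience is a safer default for noisy validation curves.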
Benchmark on Test Set
pandadock gnn benchmark -m models/best_model.pt -d ULVSH/ -o results/
Examples
See the examples/ directory:
- examples/basic_docking/ - Simple docking workflow
- examples/flexible_docking/ - Induced-fit docking
- examples/metal_docking/ - Metalloprotein docking
Documentation
Full documentation is available at pandadock.readthedocs.io.
Citation
If you use PandaDock in your research, please cite:
@article{panda2024pandadock,
title={PandaDock: SE(3)-Equivariant Graph Neural Network Scoring for Molecular Docking},
author={Panda, Pritam Kumar},
journal={bioRxiv},
year={2024},
note={Manuscript in preparation}
}
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
PandaDock is released under the MIT License. See LICENSE for details.
Contact
- Author: Pritam Kumar Panda
- Affiliation: Stanford University
- Email: pritampanda@stanford.edu
- GitHub: @pritampanda15
Acknowledgments
PandaDock builds upon excellent open-source projects:
- AutoDock Vina (scoring function inspiration)
- PyTorch and PyTorch Geometric (GNN framework)
- RDKit (molecular handling)
- E(n)-Equivariant GNN (Satorras et al. 2021)
Star this repository if you find it useful!