Skip to main content

A 3D Protein Pocket Detection & Classification Pipeline using geometry and electrostatic methods

Project description

BSPDA: Binding-Site(s) Predictor by Diego and Alberto

Authors: Diego Maquieira Vidal and Alberto Domingo Gómez

Introduction

Welcome to BSPDA (our PYT-SBI project). The aim of this pipeline is to predict the position of binding sites within the 3D structure of a given protein model.

While the primary focus is geometric prediction (locating surface cavities where a ligand could bind), BSPDA also performs complex chemical analysis. Using Solvent Accessible Surface Area (SASA) and studying partial charges, the pipeline suggests the type of ligand involved and lists the specific amino acids lining the pocket.

PocketPicker tool was used as the main reference to build BSPDA [1].

This tutorial will walk you through the modular architecture of the pipeline.


Prerequisites

Before running this tutorial, ensure you have propperly installed the package.

# Import the packages needed for this markdown.
import numpy as np
from rdkit import Chem
# Import the modular classes of the BSPDA pipeline
from structure import ProteinStructure
from geometry import GridGenerator
from detector import PocketDetector
from classifier import PocketClassifier
from io import Exporter

Step 1: Setup and Data Preparation

First, it is needed to download the target protein and process it. PDB files often contain "noise" for the methods of this pipeline, such as crystallized water molecules, buffer ions or existing ligands.

If hetero-atoms are not eliminated, the geometric algorithm will view the pockets as "already full" and fail to detect them.

print("1. Initializing and Cleaning Protein Structure...")

# Define output directory and target PDB ID
output_dir = "eval_results/1bid"
pdb_target = "1bid"

# Initialize the Structure object
struct = ProteinStructure(pdb_id=pdb_target, work_dir=output_dir)

# Download the PDB and load it into RDKit
struct.downloadPDB()

# Strip water, ions, and co-crystallized ligands to leave empty cavities
struct.strip_htm()

# Calculate Van der Waals radii for all remaining protein atoms
ptable = Chem.GetPeriodicTable()
vdw_radii = np.array([ptable.GetRvdw(atom.GetAtomicNum()) for atom in struct.mol.GetAtoms()])

print(f"Successfully loaded {pdb_target} with {struct.coords.shape[0]} heavy atoms.")

Step 2: 3D Grid Generation & Flood Fill

Before searching pockets, empty spaces need to be defined. The GridGenerator creates a 3D bounding box around the protein and fills it with a voxel grid (1.0 Å resolution). It then measures the distance from every voxel to the nearest protein atom.

Using a 3D flood-fill algorithm, it distinguishes between the infinite bulk solvent outside the protein and the deep clefts/internal cavities.

print("2. Generating 3D Grid and Isolating Surface Probes...")

# Initialize grid generator with protein coordinates and VDW radii
grid_gen = GridGenerator(struct.coords, vdw_radii, resolution=1.0, padding=8.0)

# Apply flood-fill to find points that are empty, connected to the surface, but near the protein
surface_probes = grid_gen.get_valid_surface_points()

print(f"Generated {len(surface_probes)} valid surface probe points.")

Step 3: Ray-Casting Scoring & Clustering

Not all surface space is a binding pocket. To determine how "deep" or "buried" a point is, the PocketDetector casts 30 uniformly distributed rays outward from each probe.

If a ray hits a protein atom, it counts as a hit. Probes with high hit counts (high "buriedness") are deep inside clefts. Deep probes are filtered and grouped into spatial clusters.

print("3. Scoring Buriedness and Clustering Deep Pockets...")

detector = PocketDetector(struct.coords)

# Calculate ray-casting scores for all surface probes
detector.score_probes(surface_probes)

# Filter probes that are deeply buried (Score > 18 out of 30)
deep_mask = detector.scores > 18
deep_probes = surface_probes[deep_mask]
deep_scores = detector.scores[deep_mask]

# Group neighboring deep probes into distinct clusters (Minimum volume: 40 ų)
clusters = detector.cluster_pockets(surface_probes)

print(f"Detected {len(clusters)} distinct binding pockets.")

Step 4: Pocket(s) Classification (SASA-Based)

Once the geometric boundaries of the pockets have been defined, their biochemistry will be studied.

The PocketClassifier utilizes freesasa to calculate the exposed surface area of the atoms lining the pocket. By combining this with RDKit's Gasteiger partial charges, one can calculate the percentage of unpolar, positive, and negative surface area.

Based on these established literature standards, pockets are classified regarding their biological and chemical roles. For instance, high localized percentages of negative or positive surface area dictate the binding behavior of specific co-factors, metal ions, or highly polar parts of larger ligands [2]. Furthermore, empirical NMR screening data highlights that the physical volume and hydrophobic properties of a cavity are the primary indicators of its druggability, as they are required to stabilize the aromatic rings common in pharmacology [4]. In this way, valid drug pockets must meet minimum volume thresholds (typically larger than 30-40 ų, with primary druggable clefts often exceeding 500 ų), while smaller or highly charged cavities tend to be related to the transient binding of ions or certain solvent molecules [4]. Ultimately, this study of spatial volume, surface area, and atomic partial charges tries to use the most important features used in recent methodologies [3].

print("4. Analyzing Surface Chemistry and Lining Residues...\n")

# Initialize the classifier (this automatically computes Gasteiger charges and SASA)
classifier = PocketClassifier(struct.mol)

# Analyze all geometric clusters
analyzed_pockets = classifier.analyze_clusters(deep_probes, deep_scores, clusters)

# Print out the detailed report for each pocket
for p in analyzed_pockets:
    print(f"=== POCKET {p.pocket_id} ===")
    print(f"  Classification:  {p.classification}")
    print(f"  Volume:          {p.volume} A^3")
    print(f"  Chemical Makeup: {p.p_unp:.1f}% Unpolar | {p.p_pos:.1f}% Positive | {p.p_neg:.1f}% Negative")
    print(f"  Lining Residues: {', '.join(p.lining_residues)}\n")

Step 5: Output Generation & PyMOL Visualization

The Exporter class maps the deep probes back into a standard PDB file format, cleverly storing the "buriedness score" in the B-factor column so we can apply color gradients.

It also automatically generates four distinct .pml scripts to view the protein surface, electrostatics, and pocket grids in PyMOL.

print("5. Exporting PDB and PyMOL Visualization Scripts...")

pdb_filename = "scored_pockets.pdb"
pdb_out_path = f"{output_dir}/{pdb_filename}"

# Save the synthetic pocket probes into a unified PDB file
Exporter.save_pdb(struct.mol, analyzed_pockets, pdb_out_path)

# Generate the 4 PyMOL visualization permutations
Exporter.write_pymol_scripts(output_dir, pdb_filename, len(analyzed_pockets))

print("Pipeline Complete! Check your output directory for the PyMOL scripts.")

Visualizing Results

Navigate to your output directory and open any of the generated scripts in PyMOL:

  1. 1_grey_surface_gradient_dots.pml: standard view, pocket depth shown via heat map.
  2. 2_grey_surface_discrete_dots.pml: standard view, pockets colored by unique ID.
  3. 3_electro_surface_gradient_dots.pml: protein colored by charge (red/blue), depth heat map.
  4. 4_electro_surface_discrete_dots.pml: protein colored by charge, pockets colored by ID.

1-Line Execution in the terminal

For batch validations (like the PocketPicker validation script), it is not needed to write step by step the steps above. One can utilize the orchestration function defined in __init__.py:

from pocket_finder import run_pipeline

# Runs steps 1-5 automatically and returns the list of PocketProperties dataclasses
pockets = run_pipeline(input_pdb="structure_files/pdb1bid.ent", output_dir="results_1bid")

References

[1] Weisel, M., Proschak, E., & Schneider, G. (2007). PocketPicker: Analysis of ligand binding-sites with shape descriptors. Chemistry Central Journal, 1(1), 7. https://doi.org/10.1186/1752-153X-1-7

[2] Dundas, J., Adamian, L., & Liang, J. (2011). Structural signatures of enzyme binding pockets from order-independent surface alignment: A study of metalloendopeptidase and NAD binding proteins. Journal of Molecular Biology, 406(5), 713–729. https://doi.org/10.1016/j.jmb.2010.12.005

[3] Wang, D. D., Wu, W., & Wang, R. (2024). Structure-based, deep-learning models for protein–ligand binding affinity prediction. Journal of Cheminformatics, 16, 2. https://doi.org/10.1186/s13321-023-00795-9

[4] Hajduk, P. J., Huth, J. R., & Fesik, S. W. (2005). Druggability indices for protein targets derived from NMR-based screening data. Journal of Medicinal Chemistry, 48(7), 2518–2525. https://doi.org/10.1021/jm049131r

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bspda-0.1.0.tar.gz (17.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bspda-0.1.0-py3-none-any.whl (20.5 kB view details)

Uploaded Python 3

File details

Details for the file bspda-0.1.0.tar.gz.

File metadata

  • Download URL: bspda-0.1.0.tar.gz
  • Upload date:
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for bspda-0.1.0.tar.gz
Algorithm Hash digest
SHA256 627841580407e087f92024a40118ae4f709710cf93eecfab4661ecf123fb665c
MD5 3b1b9183b75fa4cdc4b8c679f6bd3f94
BLAKE2b-256 e79b51027a89382ea0252f5a23fe94e38a2e33536889de582d6073deb1aebea7

See more details on using hashes here.

File details

Details for the file bspda-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: bspda-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for bspda-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0243b9f8d35bce274a782d9c08cab03b421addd5ae5a8b06062ddbabc361cbc2
MD5 0ec95bfe2877f370454cefad2e77de50
BLAKE2b-256 f54cc3535a44e5eee9902d55edbddce576b18438bd8d47cb3484a323876b9f25

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page