A 3D Protein Pocket Detection & Classification Pipeline using geometry and electrostatic methods
Project description
BSPDA: Binding-Site(s) Predictor by Diego and Alberto
Authors: Diego Maquieira Vidal and Alberto Domingo Gómez
Introduction
Welcome to BSPDA (our PYT-SBI project). The aim of this pipeline is to predict the position of binding sites within the 3D structure of a given protein model.
While the primary focus is geometric prediction (locating surface cavities where a ligand could bind), BSPDA also performs complex chemical analysis. Using Solvent Accessible Surface Area (SASA) and studying partial charges, the pipeline suggests the type of ligand involved and lists the specific amino acids lining the pocket.
PocketPicker tool was used as the main reference to build BSPDA [1].
This tutorial will walk you through the modular architecture of the pipeline.
Prerequisites
Before running this tutorial, ensure you have propperly installed the package.
pip install BSPDA
Step 1: Setup and Data Preparation
First, it is needed to download the target protein and process it. PDB files often contain "noise" for the methods of this pipeline, such as crystallized water molecules, buffer ions or existing ligands.
If hetero-atoms are not eliminated, the geometric algorithm will view the pockets as "already full" and fail to detect them.
print("1. Initializing and Cleaning Protein Structure...")
import BSPDA as bp
import Chem
import numpy as np
# Define output directory and target PDB ID
output_dir = "eval_results/1bid"
input_pdb = "1bid"
# Initialize the Structure object
struct = bp.ProteinStructure(pdb_id=input_pdb, work_dir=output_dir)
# After initialization, we can download its PDB (This will generate its RDKit Mol object as well)
struct.downloadPDB()
# Strip water, ions, and co-crystallized ligands to leave empty cavities
struct.strip_htm()
# Calculate Van der Waals radii for all remaining protein atoms
ptable = Chem.GetPeriodicTable()
vdw_radii = np.array([ptable.GetRvdw(atom.GetAtomicNum()) for atom in struct.mol.GetAtoms()])
print(f"Successfully loaded {pdb_target} with {struct.coords.shape[0]} heavy atoms.")
Step 2: 3D Grid Generation & Flood Fill
Before searching pockets, empty spaces need to be defined.
The GridGenerator creates a 3D bounding box around the protein and fills it with a voxel grid (1.0 Å resolution). It then measures the distance from every voxel to the nearest protein atom.
Using a 3D flood-fill algorithm, it distinguishes between the infinite bulk solvent outside the protein and the deep clefts/internal cavities.
print("2. Generating 3D Grid and Isolating Surface Probes...")
# Initialize grid generator with protein coordinates and VDW radii
grid_gen = bp.GridGenerator(struct.coords, vdw_radii)
# Apply flood-fill to find points that are empty, connected to the surface, but near the protein
surface_probes = grid_gen.get_valid_surface_points()
print(f"Generated {len(surface_probes)} valid surface probe points.")
Step 3: Ray-Casting Scoring & Clustering
Not all surface space is a binding pocket. To determine how "deep" or "buried" a point is, the PocketDetector casts 30 uniformly distributed rays outward from each probe.
If a ray hits a protein atom, it counts as a hit. Probes with high hit counts (high "buriedness") are deep inside clefts. Deep probes are filtered and grouped into spatial clusters.
print("3. Scoring Buriedness and Clustering Deep Pockets...")
detector = bp.PocketDetector(struct.coords)
# Calculate ray-casting scores for all surface probes
detector.score_probes(surface_probes)
# Filter probes that are deeply buried (Score > 18 out of 30)
# This step is not neccesary because because the threashold applied is 18 by default
deep_mask = detector.scores > 18
deep_probes = surface_probes[deep_mask]
deep_scores = detector.scores[deep_mask]
# Group neighboring deep probes into distinct clusters (Minimum volume: 40 ų)
clusters = detector.cluster_pockets(surface_probes)
print(f"Detected {len(clusters)} distinct binding pockets.")
Step 4: Pocket(s) Classification (SASA-Based)
Once the geometric boundaries of the pockets have been defined, their biochemistry will be studied.
The PocketClassifier utilizes freesasa to calculate the exposed surface area of the atoms lining the pocket. By combining this with RDKit's Gasteiger partial charges, one can calculate the percentage of unpolar, positive, and negative surface area.
Based on these established literature standards, pockets are classified regarding their biological and chemical roles. For instance, high localized percentages of negative or positive surface area dictate the binding behavior of specific co-factors, metal ions, or highly polar parts of larger ligands [2]. Furthermore, empirical NMR screening data highlights that the physical volume and hydrophobic properties of a cavity are the primary indicators of its druggability, as they are required to stabilize the aromatic rings common in pharmacology [4]. In this way, valid drug pockets must meet minimum volume thresholds (typically larger than 30-40 ų, with primary druggable clefts often exceeding 500 ų), while smaller or highly charged cavities tend to be related to the transient binding of ions or certain solvent molecules [4]. Ultimately, this study of spatial volume, surface area, and atomic partial charges tries to use the most important features used in recent methodologies [3].
print("4. Analyzing Surface Chemistry and Lining Residues...\n")
# Initialize the classifier (this automatically computes Gasteiger charges and SASA)
classifier = bp.PocketClassifier(struct.mol)
# Analyze all geometric clusters
analyzed_pockets = classifier.analyze_clusters(deep_probes, deep_scores, clusters)
# Print out the detailed report for each pocket
for p in analyzed_pockets:
print(f"=== POCKET {p.pocket_id} ===")
print(f" Classification: {p.classification}")
print(f" Volume: {p.volume} A^3")
print(f" Chemical Makeup: {p.p_unp:.1f}% Unpolar | {p.p_pos:.1f}% Positive | {p.p_neg:.1f}% Negative")
print(f" Lining Residues: {', '.join(p.lining_residues)}\n")
Step 5: Output Generation & PyMOL Visualization
The Exporter class maps the deep probes back into a standard PDB file format, cleverly storing the "buriedness score" in the B-factor column so we can apply color gradients.
It also automatically generates four distinct .pml scripts to view the protein surface, electrostatics, and pocket grids in PyMOL.
import os
print("5. Exporting PDB and PyMOL Visualization Scripts...")
pdb_filename = f"scored_pockets_{input_pdb}.pdb"
pdb_out_path = os.path.join(output_dir, pdb_filename)
# Save the synthetic pocket probes into a unified PDB file
bp.Exporter.save_pdb(struct.mol, analyzed_pockets, pdb_out_path)
# Generate the 4 scenes PyMOL visualization script
Exporter.write_pymol_script(output_dir, pdb_filename, len(analyzed_pockets), input_pdb)
print("Pipeline Complete! Check your output directory for the PyMOL scripts.")
Visualizing Results
Navigate to your output directory and open the generated PyMol script:
F1 grey_surface_gradient: standard view, pocket depth shown via heat map.F2 grey_surface_discrete: standard view, pockets colored by unique ID.F3 electro_surface_gradient: protein colored by charge (red/blue), depth heat map.F4 electro_surface_discrete: protein colored by charge, pockets colored by ID.
pymol render_pockets_1BID.pml
1-Programatic use
For batch validations (like the PocketPicker validation script), it is not needed to write step by step the steps above. One can utilize the orchestration functions defined in __init__.py, for example, a list of PDB ids can be provided.
import BSPDA
pbd_list = ["1BID","1BIB"]
for id, pockets in BSPDA.get_multiple_protein_pockets(pdb_list).items():
for p in pockets:
print(f" [POCKET {id} {p.pocket_id}] {p.classification}")
print(f" Vol: {p.volume} A^3 | Unp: {p.p_unp:.1f}% | Pos: {p.p_pos:.1f}% | Neg: {p.p_neg:.1f}%")
2-Command line use
This program can be run with a single PDB id in bash, though, an output dir can be added using -o name
find-pockets 1BID
# Or
find-pockets 1BID -o out
2-Command line use
This program can be run with a single PDB id in bash, though, an output dir can be added using -o name
find-pockets 1BID
# Or
find-pockets 1BID -o out
Step 6: Pipeline Validation
To prove the efficacy and robustness of the BSPDA pipeline, its predictions against two structural datasets were predicted: - Large Organic Ligands: both Bound (Holo) and Unbound (Apo) protein conformations were chosen to test the algorithm's resilience to induced-fit structural changes. This dataset was the same used by Weisel et al. 2007 [1]. - Metal & Inorganic Ions: evaluated across both Cations (positive) and Anions (negative) to prove the accuracy of our Gasteiger partial-charge chemical classification.
The following script loads the pre-computed prediction results, calculates top-ranking success rates, and generates a visual performance breakdown.
Evaluation Methodology: To mathematically determine if a predicted pocket successfully identified the true binding site, coordinates of the co-crystallized ligand/ion were extracted from the reference PDB file. We then calculated the minimum Euclidean distance between the centroid of each predicted pocket and the target atoms. A prediction was classified as a "Hit" (SUCCESS) if this distance was $\le$ 4.0 Å (following the method used by Weisel et al. 2007) [1].
Furthermore, the predicted pockets were ranked sequentially by their geometric volume (where Rank 1 represents the largest cavity). By tracking the Top-1 and Top-3 success rates, it is evaluated not only if the algorithm found the pocket, but how accurately it prioritized the true binding site over random surface indentations. In the case of ions, it is expected that they are not binded to the largest pockest.
The following script loads the pre-computed prediction results, calculates these top-ranking success rates, and generates a visual performance breakdown.
import import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# 1. Load and clean the data
ligands_df = pd.read_csv("validation_data/bspda_validation_results.csv")
ions_df = pd.read_csv("validation_data/bspda_validation_ions_results.csv")
# Convert Hit_Rank to numeric
ligands_df['Hit_Rank'] = pd.to_numeric(ligands_df['Hit_Rank'], errors='coerce')
ions_df['Hit_Rank'] = pd.to_numeric(ions_df['Hit_Rank'], errors='coerce')
# --- Categorize the Ions Dynamically ---
def categorize_ion_by_name(ion_name):
name = str(ion_name).upper()
if any(suffix in name for suffix in ['IDE', 'ATE', 'ITE']):
return 'Anion'
return 'Cation'
ions_df['Ion_Category'] = ions_df['Target_Ion_Name'].apply(categorize_ion_by_name)
# Rename 'Complex' to 'Bound (Complex)' for prettier charts
ligands_df['Type'] = ligands_df['Type'].replace({'Complex': 'Bound (Complex)'})
# Split data for plotting successful hits
succ_lig = ligands_df[ligands_df['Success'] == 'SUCCESS']
succ_ion = ions_df[ions_df['Success'] == 'SUCCESS']
# 2. Calculate and print text statistics
def print_detailed_stats(title, df, split_col):
print(f"{'='*50}")
print(f" {title.upper()} STATISTICS")
print(f"{'='*50}")
# Iterate through the sub-categories (e.g., Bound vs Unbound)
for category in df[split_col].unique():
cat_df = df[df[split_col] == category]
total = len(cat_df)
if total == 0: continue
succ_df = cat_df[cat_df['Success'] == 'SUCCESS']
fail_df = cat_df[cat_df['Success'] == 'FAILED']
print(f"--- [ {category.upper()} ] ---")
print(f"Total Evaluated : {total}")
print(f"Total Hits (SUCCESS) : {len(succ_df)} ({(len(succ_df)/total)*100:.1f}%)")
print(f"Total Misses (FAILED) : {len(fail_df)} ({(len(fail_df)/total)*100:.1f}%)\n")
top_1 = len(succ_df[succ_df['Hit_Rank'] == 1])
top_3 = len(succ_df[succ_df['Hit_Rank'] <= 3])
print(f"Top-1 Success Rate : {top_1} ({(top_1/total)*100:.1f}%)")
print(f"Top-3 Success Rate : {top_3} ({(top_3/total)*100:.1f}%)\n")
print(f"Predicted Pocket Types (Successful Hits):")
type_counts = succ_df['Predicted_Type'].value_counts()
for p_type, count in type_counts.items():
if p_type != "N/A":
print(f" -> {p_type}: {count} pockets")
print("\n")
# Print the broken-down stats
print_detailed_stats("Large Ligands", ligands_df, "Type")
print_detailed_stats("Metal & Inorganic Ions", ions_df, "Ion_Category")
# 3. Generate visualizations (2x3 Grid)
sns.set_theme(style="whitegrid")
fig, axes = plt.subplots(2, 3, figsize=(22, 13))
fig.suptitle("BSPDA Pipeline Validation: Detailed Structural & Chemical Breakdown", fontsize=18, fontweight='bold', y=0.98)
# Color Palettes
ligand_palette = {'Bound (Complex)': '#3498db', 'Unbound': '#e74c3c'} # Blue / Red
ion_palette = {'Cation': '#9b59b6', 'Anion': '#2ecc71'} # Purple / Green
# --- ROW 1: ORGANIC LIGANDS (Bound vs Unbound) ---
sns.countplot(data=ligands_df, x='Success', hue='Type', palette=ligand_palette, ax=axes[0, 0], order=['SUCCESS', 'FAILED'])
axes[0, 0].set_title('Ligands: Success by Conformation', fontsize=13, fontweight='bold')
axes[0, 0].set_ylabel('Number of Targets')
if not succ_lig.empty:
sns.histplot(data=succ_lig, x='Hit_Rank', hue='Type', palette=ligand_palette, multiple='dodge', discrete=True, ax=axes[0, 1])
axes[0, 1].set_xticks(range(1, int(succ_lig['Hit_Rank'].max()) + 2))
axes[0, 1].set_title('Ligands: Rank Distributions', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('Predicted Rank (1 = Largest Pocket)')
if not succ_lig.empty and 'Predicted_Type' in succ_lig.columns:
sns.countplot(data=succ_lig, y='Predicted_Type', hue='Type', palette=ligand_palette, ax=axes[0, 2], order=succ_lig['Predicted_Type'].value_counts().index)
axes[0, 2].set_title('Ligands: Chemical Classification', fontsize=13, fontweight='bold')
axes[0, 2].set_ylabel('')
# --- ROW 2: METAL IONS (Cations vs Anions) ---
sns.countplot(data=ions_df, x='Success', hue='Ion_Category', palette=ion_palette, ax=axes[1, 0], order=['SUCCESS', 'FAILED'])
axes[1, 0].set_title('Ions: Success by Charge', fontsize=13, fontweight='bold')
axes[1, 0].set_ylabel('Number of Targets')
if not succ_ion.empty:
sns.histplot(data=succ_ion, x='Hit_Rank', hue='Ion_Category', palette=ion_palette, multiple='dodge', discrete=True, ax=axes[1, 1])
axes[1, 1].set_xticks(range(1, int(succ_ion['Hit_Rank'].max()) + 2))
axes[1, 1].set_title('Ions: Rank Distributions', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('Predicted Rank (1 = Largest Pocket)')
if not succ_ion.empty and 'Predicted_Type' in succ_ion.columns:
clean_succ_ion = succ_ion[succ_ion['Predicted_Type'] != 'N/A']
if not clean_succ_ion.empty:
sns.countplot(data=clean_succ_ion, y='Predicted_Type', hue='Ion_Category', palette=ion_palette, ax=axes[1, 2], order=clean_succ_ion['Predicted_Type'].value_counts().index)
axes[1, 2].set_title('Ions: Chemical Classification', fontsize=13, fontweight='bold')
axes[1, 2].set_ylabel('')
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.savefig("validation_data/pipeline_validation.png") # Make sure to save it!
plt.show()
Binding Site Prediction Results
For the large ligands, the algorithm successfully identifies the correct binding sites, achieving an 80.0% and 75.6% success rate for bound (holo) and unbound (apo) structures, respectively. These metrics closely mirror the performance of the original PocketPicker method, confirming the accurate recreation of its core algorithm [1]. When the algorithm fails to predict a binding site, it is largely due to the geometric nature of the target. For instance, in targets like 1IVD and 1FBP, the ligand binds to a very shallow surface depression that fails to reach the required buriedness threshold. A similar issue occurs in 3MTH, which exemplifies the effect of overall protein size on prediction accuracy: because 3MTH is a very small protein, a large portion of its ligand remains exposed to the solvent, yielding a low buriedness score. Furthermore, the size distribution of the predicted pockets demonstrates that the correct binding site is almost always ranked as the largest cleft on the protein, aligning perfectly with established literature [2]. However, the chemical classification step underperformed. The strict thresholds provided to the algorithm failed to properly categorize the binding sites, with only 25% and 20% of the bound and unbound targets, respectively, classified as "Primary Binding Sites". This is likely due to the volume and surface area limits being set too high; just as with the buriedness metric, many highly functional pockets do not need to be massive to facilitate proper ligand binding.
For metal and inorganic ions, the current approach proved highly limited. While the initial expectation was that the algorithm could cleanly separate cations and anions into discrete electrostatic anchor sites, the pipeline yielded very poor predictive results with only two successful hits. This failure stems primarily from the buriedness metric, which is highly problematic to apply to ions. Ions frequently bind to extremely shallow, solvent-exposed surface patches that fail to register as deep cavities. Conversely, some ions rely on binding sites deeply embedded within the inaccessible core of the protein making them virtually invisible to a standard surface-scanning probe. Furthermore, the percentage-based charge classifier proved ineffective, as it overwhelmingly defaulted to classifying the pockets as cation sites. Despite these failures, a positive observation from the two successful hits is that the algorithm accurately assessed their volume; rather than predicting them as massive clefts, it correctly ranked them among the smallest pockets on the protein.
Conclusions and Future Perspectives
In conclusion, while this pipeline is highly capable of detecting large ligand binding sites, it is currently not viable for ion prediction. The reliance on purely geometric buriedness limits the detection of shallow sites, a limitation previously noted in the original PocketPicker study. This is probably due to the complexities of multi-subunit protein interfaces and shallow surface clefts.
A highly promising point from this exercise is the algorithm's strong ability to identify pockets in unbound (apo) structures. Future reformulations of this algorithm should investigate these apo results to determine if the successfully (or unsuccessfully) identified targets share specific characteristics, such as distinct catalytic mechanisms, allosteric binding actions, or specific ligand functions.
Moving forward, several improvements and limitations must be considered for the pipeline: - Focus and Energetics: the algorithm should remain focused on large ligands. Future iterations should move beyond pure geometry by incorporating energetic perspectives, such as protein frustration analysis and rotamer library sampling, to better understand pocket flexibility. - Chemical Classification: instead of relying on strict percentage-based surface charges, the classifier should be updated to also analyze and consider the specific amino acid composition of the pocket to determine its biological type, mainly based on bibliographical information or frequencies derived from a given dataset. - Conformational Ensembles: to account for protein flexibility, the pipeline should evaluate multiple structural conformations (i.e.: ensembles) of the same protein rather than relying on a single static snapshot. - Nucleic Acid Limitations: currently, RDKit does not handle DNA and RNA ideally in this context, often failing to separate nucleic acids from the protein structure. Consequently, predicting pockets on transcription factors or DNA-binding proteins is not recommended for the current build and requires specific parsing improvements. - Pocket Splitting: a known limitation of grid-based clustering is that large, continuous pockets can be artifactually split into multiple smaller sub-pockets if there is a spatial bottleneck, as observed with UMP at 1BID structure.
To resolve the issues of shallow pockets and artificial splitting, we propose implementing a dynamic parameter in future versions. By evaluating a target across a wide range of buriedness limits and scanning for spatial convergence between the resulting grids, the algorithm could generate a much more robust and unified definition of the binding pockets.
References
[1] Weisel, M., Proschak, E., & Schneider, G. (2007). PocketPicker: Analysis of ligand binding-sites with shape descriptors. Chemistry Central Journal, 1(1), 7. https://doi.org/10.1186/1752-153X-1-7
[2] Dundas, J., Adamian, L., & Liang, J. (2011). Structural signatures of enzyme binding pockets from order-independent surface alignment: A study of metalloendopeptidase and NAD binding proteins. Journal of Molecular Biology, 406(5), 713–729. https://doi.org/10.1016/j.jmb.2010.12.005
[3] Wang, D. D., Wu, W., & Wang, R. (2024). Structure-based, deep-learning models for protein–ligand binding affinity prediction. Journal of Cheminformatics, 16, 2. https://doi.org/10.1186/s13321-023-00795-9
[4] Hajduk, P. J., Huth, J. R., & Fesik, S. W. (2005). Druggability indices for protein targets derived from NMR-based screening data. Journal of Medicinal Chemistry, 48(7), 2518–2525. https://doi.org/10.1021/jm049131r
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file bspda-0.3.0.tar.gz.
File metadata
- Download URL: bspda-0.3.0.tar.gz
- Upload date:
- Size: 35.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
76bc15bd8f33f6e1472855cb865e6696ac45ef0ff8ea19ff38d24897e63e787e
|
|
| MD5 |
fcae125c43c6ad0173bc82458d446c3d
|
|
| BLAKE2b-256 |
8f1b6906a32a6ad0c7b2a416bbdf1b9e7a5c7172b7d01eeaebd8431ef65014b5
|
File details
Details for the file bspda-0.3.0-py3-none-any.whl.
File metadata
- Download URL: bspda-0.3.0-py3-none-any.whl
- Upload date:
- Size: 30.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7992bfeee748d8272a509c83504b608f629ea53b9e2a330e86cab80fcaefec55
|
|
| MD5 |
ab8af9bac2dbec732f20e7e82a479624
|
|
| BLAKE2b-256 |
5bcdaf2eff3cd1efe2845f0dc26be79ab4e4d6a8123eb491ed352ff963e7a83e
|