
Inverse virtual screening — dock one ligand against a whole protein library via GNINA.


WTFDTB — High-Throughput Inverse Virtual Screening

Target Fishing: Dock a single small-molecule ligand against a library of protein structures using ML-based pocket detection (P2Rank) and CNN-rescored docking (GNINA).

Python 3.10+ | License: MIT | Status: Alpha


What Is This?

Traditional virtual screening docks many ligands against one protein target. WTFDTB flips this: it docks one ligand against many proteins to answer the question — "What targets does this drug bind?"

This is called inverse virtual screening (or target fishing), and it's essential for:

  • Drug repurposing — finding new uses for existing drugs
  • Off-target prediction — identifying potential side effects
  • Polypharmacology — understanding multi-target drug activity
  • Natural product target deconvolution — identifying targets for bioactive compounds

WTFDTB automates the entire workflow from a raw ligand file to a ranked CSV of protein targets with interaction fingerprints — no manual intervention needed.


Pipeline Architecture

The pipeline runs in 5 sequential phases:

  ┌──────────────┐    ┌────────────────────┐    ┌──────────────────┐
  │  1. Ligand   │───▶│  2. Receptor       │───▶│  3. Pocket       │
  │     Prep     │    │     Curation       │    │     Detection    │
  │              │    │     (parallel)      │    │                  │
  │ Dimorphite-DL│    │ PDBFixer + PDB2PQR │    │     P2Rank       │
  │ RDKit + Meeko│    │ + PROPKA + Meeko   │    │     (Java ML)    │
  └──────────────┘    └────────────────────┘    └──────────────────┘
                                                         │
         ┌───────────────────────────────────────────────┘
         ▼
  ┌──────────────────┐    ┌──────────────────────┐
  │  4. Docking      │───▶│  5. Post-Docking      │
  │     (parallel)   │    │     Analysis           │
  │                  │    │                        │
  │     GNINA        │    │ ProLIF + Pandas        │
  │  (CNN-rescored)  │    │ Filter → Rank → CSV   │
  └──────────────────┘    └──────────────────────┘

Phase Details

| Phase | Module | Tools | What It Does |
| --- | --- | --- | --- |
| 1. Ligand Prep | ligand_prep.py | Dimorphite-DL, RDKit, Meeko | Enumerate protonation states at target pH, generate 3D conformer (ETKDGv3 + MMFF94), produce PDBQT with Gasteiger charges |
| 2. Receptor Curation | receptor_curation.py | PDBFixer, PDB2PQR, PROPKA, pdb-tools | Download PDB from RCSB, strip HETATM/water, repair missing heavy atoms, protonate at target pH; parallelised across all targets |
| 3. Pocket Detection | pocket_detection.py | P2Rank (Java) | ML-based druggable pocket prediction with no template bias; detects all candidate binding sites per protein |
| 4. Docking | docking.py | GNINA (C++) | CNN-rescored molecular docking for each pocket × ligand combination, parallelised with ProcessPoolExecutor |
| 5. Post-Docking | post_dock.py | ProLIF, Pandas | Compute interaction fingerprints (H-bond, hydrophobic, π-stacking, salt bridge), apply CNNscore filter, rank by Vina affinity with CNNaffinity as tie-breaker, export CSV |

Installation

Option A: Conda / Mamba (Recommended)

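# Create and activate an environment, then install WTFDTB from a source checkout.
# GNINA, P2Rank and Java are not installed by these commands; see External Dependencies.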
mamba create -n wtfdtb python=3.12
mamba activate wtfdtb
pip install -e .

Option B: From Source (Development)

git clone https://github.com/ChandraguptSharma07/WTFDTB.git
cd WTFDTB
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
pip install -e ".[dev]"

External Dependencies

These binaries must be available on PATH:

| Tool | Purpose | Install |
| --- | --- | --- |
| GNINA | CNN-rescored docking engine | github.com/gnina/gnina or mamba install gnina |
| P2Rank | ML pocket detection | github.com/rdk/p2rank (requires Java ≥ 11) |
| Java ≥ 11 | Required by P2Rank | mamba install openjdk |

Set PRANK_HOME to the P2Rank installation directory if it's not on your PATH:

export PRANK_HOME=/path/to/p2rank_2.4.2
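
A minimal sketch of how the P2Rank launcher could be located from Python. The function name `find_prank` and the lookup order (PRANK_HOME first, then PATH) are assumptions made for illustration; the package's own lookup logic may differ. The launcher script is assumed to be named `prank`, as in the P2Rank distribution.

```python
import os
import shutil
from pathlib import Path


def find_prank(env=None):
    """Return the path to the P2Rank launcher, or None if it cannot be found.

    Hypothetical helper: checks $PRANK_HOME/prank first, then falls back to PATH.
    """
    env = os.environ if env is None else env

    # An explicit PRANK_HOME is used only if it really contains the launcher.
    home = env.get("PRANK_HOME")
    if home:
        candidate = Path(home) / "prank"
        if candidate.is_file():
            return str(candidate)

    # Otherwise look for a `prank` executable on PATH.
    return shutil.which("prank", path=env.get("PATH"))
```

Calling `find_prank()` before a long screening run makes a missing installation fail fast instead of partway through the pipeline.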

Quick Start

Basic Usage

# Screen aspirin against three example targets (one PDB ID per line)
echo "1EQG
2HZI
3K5V" > targets.txt

wtfdtb screen \
  --ligand aspirin.sdf \
  --targets targets.txt \
  --output results.csv

Using PDB IDs from a Text File

# targets.txt — one PDB ID per line
wtfdtb screen \
  --ligand my_compound.smi \
  --targets targets.txt \
  --output hits.csv \
  --ph 7.4 \
  --exhaustiveness 8 \
  --workers 4

Using a Directory of PDB Files

# Directory containing .pdb files
wtfdtb screen \
  --ligand drug.sdf \
  --targets ./protein_library/ \
  --output results.csv
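
The two forms of --targets shown above (a text file of PDB IDs or a directory of .pdb files) could be told apart roughly as follows. This is a sketch only: `load_targets` is a hypothetical name, and skipping blank lines and `#` comments and upper-casing the IDs are assumptions, not documented behaviour.

```python
from pathlib import Path


def load_targets(targets):
    """Return local .pdb paths (directory input) or PDB IDs (text-file input)."""
    path = Path(targets)

    # Directory: every .pdb file inside it is a target, in a stable order.
    if path.is_dir():
        return sorted(str(p) for p in path.glob("*.pdb"))

    # Text file: one PDB ID per line; blank lines and '#' comments are skipped
    # (an assumption made for this sketch).
    ids = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.append(line.upper())
    return ids
```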

SMILES Input

The ligand can be a .smi file with SMILES notation:

echo "CC(=O)Oc1ccccc1C(=O)O aspirin" > aspirin.smi
wtfdtb screen --ligand aspirin.smi --targets targets.txt -o results.csv
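
For reference, a .smi line is conventionally a SMILES string followed by optional whitespace-separated name text. The sketch below splits such a line; `parse_smi_line` is a hypothetical helper, not part of the package's documented API.

```python
def parse_smi_line(line):
    """Split one .smi line into (smiles, name); the name may be empty."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty SMILES line")
    smiles = parts[0]
    name = parts[1] if len(parts) > 1 else ""
    return smiles, name
```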

CLI Reference

wtfdtb screen [OPTIONS]
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --ligand, -l | Path | required | Input ligand file (.sdf, .mol, .mol2, .smi) |
| --targets, -t | Path | required | Protein target library: directory of .pdb files or text file of PDB IDs |
| --output, -o | Path | results.csv | Output CSV path for ranked docking results |
| --ph | float | 7.4 | Physiological pH for ligand and receptor protonation |
| --box-size | int | 25 | Side length (Å) of the cubic docking search box |
| --cnn-model | str | default | GNINA CNN model (default, dense, or path to weights) |
| --cnn-score-threshold | float | 0.5 | Minimum CNNscore (0–1) to accept a pose |
| --min-interactions | int | 1 | Minimum protein-ligand interactions to keep a pose (0 = no filter) |
| --workers, -w | int | CPU count | Parallel workers for receptor curation and docking |
| --exhaustiveness | int | 8 | GNINA search exhaustiveness (higher = slower, more thorough) |
| --verbosity | int | 1 | Logging: 0 = quiet, 1 = normal, 2 = debug |
| --version, -v | | | Show version and exit |

Output Format

The output CSV is primarily ranked by Vina affinity (ascending = tighter predicted binding in kcal/mol), with CNNaffinity (pKd) used to break ties:

| Column | Description |
| --- | --- |
| rank | Overall rank (1 = best predicted binder) |
| pdb_id | Target protein PDB ID |
| pocket | Binding pocket name (from P2Rank) |
| pose_rank | Pose rank within this pocket (from GNINA) |
| cnn_score | GNINA CNN confidence score (0–1, higher = more accurate pose) |
| cnn_affinity | GNINA CNN-predicted binding affinity (pKd, higher = tighter) |
| vina_affinity | AutoDock Vina scoring function affinity (kcal/mol, lower = tighter) |
| hbond | Number of hydrogen bonds (donor + acceptor) |
| hydrophobic | Number of hydrophobic contacts |
| pi_stacking | Number of π-stacking / cation-π interactions |
| salt_bridge | Number of salt bridges (anionic + cationic) |
| total_interactions | Sum of all interaction types |

Example output (illustrative values; rows are ordered by vina_affinity, and cnn_affinity is a positive pKd):

rank,pdb_id,pocket,pose_rank,cnn_score,cnn_affinity,vina_affinity,hbond,hydrophobic,pi_stacking,salt_bridge,total_interactions
1,1EQG,pocket3,1,0.89,7.2,-6.5,3,4,1,0,8
2,1EQG,pocket7,1,0.82,6.5,-6.1,2,2,1,0,5
3,2HZI,pocket1,2,0.76,6.8,-5.9,2,3,0,1,6
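
Because one protein can appear several times (once per pocket and pose), a common first step is to keep only the best-ranked row per target. The sketch below does this with the standard library; `best_hit_per_target` is a hypothetical helper and relies only on the rank and pdb_id columns described above.

```python
import csv


def best_hit_per_target(handle):
    """Map each pdb_id to its best-ranked row from an open results CSV."""
    best = {}
    for row in csv.DictReader(handle):
        pdb_id = row["pdb_id"]
        # Lower rank number means a better predicted binder.
        if pdb_id not in best or int(row["rank"]) < int(best[pdb_id]["rank"]):
            best[pdb_id] = row
    return best


# Usage (assumes results.csv was produced by `wtfdtb screen`):
# with open("results.csv", newline="") as fh:
#     for pdb_id, row in best_hit_per_target(fh).items():
#         print(pdb_id, row["pocket"], row["vina_affinity"])
```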

Project Structure

WTFDTB/
├── pyproject.toml               # Package metadata, dependencies, entry point
├── recipe/
│   └── meta.yaml                # Bioconda / Conda-Forge recipe
├── src/
│   └── wtfdtb/
│       ├── __init__.py           # Version string
│       ├── cli.py                # Typer CLI — screen command + all flags
│       ├── ligand_prep.py        # Phase 1: SMILES/SDF → protonated 3D PDBQT
│       ├── receptor_curation.py  # Phase 2: PDB → cleaned, protonated receptor
│       ├── pocket_detection.py   # Phase 3: P2Rank ML pocket prediction
│       ├── docking.py            # Phase 4: GNINA CNN-rescored docking
│       ├── post_dock.py          # Phase 5: ProLIF interactions + ranking
│       ├── pipeline.py           # Orchestrator: wires Phases 1–5
│       └── utils.py              # PDB fetcher, logging, shared helpers
├── tests/
│   └── ...
└── README.md

Tech Stack

| Layer | Tool | Purpose |
| --- | --- | --- |
| CLI | Typer | Type-hinted CLI with auto-generated --help |
| Ligand Protonation | Dimorphite-DL | pH-dependent protonation state enumeration |
| Cheminformatics | RDKit | 3D conformer generation (ETKDGv3), MMFF94 minimisation |
| PDBQT Generation | Meeko | Gasteiger charges, torsion tree for AutoDock-family |
| PDB Parsing | Biopython | REMARK 465 parsing for quality gating |
| PDB Cleaning | pdb-tools | Strip HETATM, waters, alternate conformations |
| Structure Repair | PDBFixer (OpenMM) | Model missing heavy atoms |
| Receptor Protonation | PDB2PQR + PROPKA | Rigorous pKa-based protonation |
| Pocket Detection | P2Rank | ML-based pocket prediction (Java) |
| Docking | GNINA | CNN-rescored docking (reported to improve pose ranking over AutoDock Vina scoring) |
| Interaction Fingerprints | ProLIF | H-bond, hydrophobic, π-stacking, salt bridge detection |
| Data | Pandas | Filtering, ranking, CSV export |
| Parallelism | concurrent.futures | ProcessPoolExecutor for receptors + docking |

How It Works (In Detail)

Phase 1: Ligand Preparation

  1. Read input ligand (SMILES string or SDF/MOL file)
  2. Enumerate physiological protonation states at the target pH using Dimorphite-DL
  3. Generate 3D coordinates using RDKit's ETKDGv3 algorithm
  4. Energy-minimise with the MMFF94 force field
  5. Convert to PDBQT format (Gasteiger charges + torsion tree) via Meeko

Phase 2: Receptor Curation

For each protein target (downloaded from RCSB or provided as local PDB):

  1. Strip all HETATM records and water molecules using pdb-tools
  2. Repair missing heavy atoms using PDBFixer (OpenMM)
  3. Assign protonation states at the target pH (--ph, default 7.4) using PDB2PQR with PROPKA
  4. Write the curated receptor PDB

This phase runs in parallel across all targets using ProcessPoolExecutor.
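
To make step 1 concrete, here is a standard-library stand-in for the HETATM/water stripping. The pipeline itself uses pdb-tools for this; `strip_hetatm` is a hypothetical simplification that relies on crystallographic waters being stored as HETATM records in PDB files.

```python
def strip_hetatm(lines):
    """Return PDB lines without HETATM records (ligands, ions, waters)."""
    kept = []
    after_hetatm = False
    for line in lines:
        record = line[:6].strip()  # columns 1-6 hold the record name
        if record == "HETATM":
            after_hetatm = True
            continue
        # Drop the anisotropic record that belongs to a removed HETATM.
        if record == "ANISOU" and after_hetatm:
            continue
        if record == "ATOM":
            after_hetatm = False
        kept.append(line)
    return kept
```

This sketch leaves CONECT records and alternate conformations untouched; a real curation step would need to handle those as well.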

Phase 3: Pocket Detection

  1. Run P2Rank on all curated receptors in batch mode
  2. Parse P2Rank output to extract binding pocket centers (X, Y, Z coordinates)
  3. Each pocket defines a docking search box for Phase 4

P2Rank uses machine learning (random forests on surface features) to detect druggable pockets without requiring known binding site templates.
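
Step 2 can be sketched as below. It assumes P2Rank's per-structure predictions CSV has a header with name, center_x, center_y and center_z columns padded with spaces; verify this against the P2Rank version in use. `read_pocket_centers` is a hypothetical helper name.

```python
import csv


def read_pocket_centers(handle):
    """Return [(pocket_name, x, y, z), ...] from an open P2Rank predictions CSV."""
    reader = csv.reader(handle)
    # Header cells are space-padded, so strip them before use.
    header = [cell.strip() for cell in next(reader)]
    pockets = []
    for row in reader:
        if not row:
            continue
        record = dict(zip(header, (cell.strip() for cell in row)))
        pockets.append(
            (
                record["name"],
                float(record["center_x"]),
                float(record["center_y"]),
                float(record["center_z"]),
            )
        )
    return pockets
```

Each returned centre, combined with --box-size, defines one cubic search box for docking.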

Phase 4: Molecular Docking

For each (receptor, pocket) combination:

  1. Build GNINA command-line arguments with pocket center and box size
  2. Run GNINA with CNN rescoring enabled
  3. Parse output SDF to extract per-pose CNNscore, CNNaffinity, and Vina affinity

This phase runs in parallel using ProcessPoolExecutor. GNINA rescores docking poses with convolutional neural networks trained on protein-ligand complexes; its authors report better pose ranking than the Vina scoring function alone (McNutt et al., 2021).
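
The three steps above can be sketched as follows. The GNINA flags used (-r, -l, --center_x/y/z, --size_x/y/z, --exhaustiveness, --cnn_scoring, -o) are standard GNINA options, and the SDF tags minimizedAffinity, CNNscore and CNNaffinity are the ones GNINA writes per pose. The helper names and the exact flag set WTFDTB passes are assumptions.

```python
import re

# Matches SDF data headers such as "> <CNNscore>" or ">  <CNNscore>".
_TAG = re.compile(r"^>\s*<([^>]+)>")


def build_gnina_command(receptor, ligand, center, out_sdf, box_size=25, exhaustiveness=8):
    """Build (but do not run) a GNINA command for one receptor pocket."""
    cx, cy, cz = center
    return [
        "gnina",
        "-r", str(receptor),
        "-l", str(ligand),
        "--center_x", f"{cx:.3f}",
        "--center_y", f"{cy:.3f}",
        "--center_z", f"{cz:.3f}",
        "--size_x", str(box_size),
        "--size_y", str(box_size),
        "--size_z", str(box_size),
        "--exhaustiveness", str(exhaustiveness),
        "--cnn_scoring", "rescore",
        "-o", str(out_sdf),
    ]


def parse_pose_scores(sdf_text):
    """Extract per-pose scores from the text of a GNINA output SDF."""
    poses = []
    for block in sdf_text.split("$$$$"):
        lines = block.splitlines()
        tags = {}
        for i, line in enumerate(lines[:-1]):
            match = _TAG.match(line)
            if match:
                tags[match.group(1)] = lines[i + 1].strip()  # value is on the next line
        if "CNNscore" in tags:
            poses.append(
                {
                    "cnn_score": float(tags["CNNscore"]),
                    "cnn_affinity": float(tags["CNNaffinity"]),
                    "vina_affinity": float(tags["minimizedAffinity"]),
                }
            )
    return poses
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting.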

Phase 5: Post-Docking Analysis

  1. CNNscore filter: Discard poses below the threshold (default 0.5)
  2. Interaction profiling: Use ProLIF to compute protein-ligand interaction fingerprints (H-bonds, hydrophobic contacts, π-stacking, salt bridges, cation-π)
  3. Interaction filter: Discard poses with fewer interactions than --min-interactions
  4. Ranking: Sort remaining poses by Vina affinity (kcal/mol, ascending) then CNN affinity (pKd, descending)
  5. Export: Write ranked results to CSV
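
The filtering and ranking steps above amount to the following, shown with plain dictionaries instead of a Pandas DataFrame. `filter_and_rank` is a hypothetical name; the keys mirror the output CSV columns.

```python
def filter_and_rank(poses, cnn_score_threshold=0.5, min_interactions=1):
    """Filter poses, then rank by Vina affinity (ascending) and CNN affinity (descending)."""
    kept = [
        pose
        for pose in poses
        if pose["cnn_score"] >= cnn_score_threshold
        and pose["total_interactions"] >= min_interactions
    ]
    # More negative Vina affinity ranks first; ties go to the higher pKd.
    kept.sort(key=lambda pose: (pose["vina_affinity"], -pose["cnn_affinity"]))
    # Attach a 1-based rank without mutating the input dictionaries.
    return [dict(pose, rank=i) for i, pose in enumerate(kept, start=1)]
```

With min_interactions=0 the interaction filter is effectively disabled, matching the CLI description of that flag.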

Development

Setup

git clone https://github.com/ChandraguptSharma07/WTFDTB.git
cd WTFDTB
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Running Tests

pytest

Code Quality

ruff check src/
ruff format src/

Building the Conda Package

conda build recipe/

Supported Platforms

| Platform | Status | Notes |
| --- | --- | --- |
| Linux x86_64 | ✅ Supported | Primary platform. GNINA binary available via conda-forge. |
| macOS | ⚠️ Partial | Python pipeline works; GNINA must be compiled from source. |
| Windows (WSL) | ⚠️ Partial | Works through Windows Subsystem for Linux. |

Citation

If you use WTFDTB in your research, please cite:

@software{wtfdtb2025,
  title  = {WTFDTB: High-Throughput Inverse Virtual Screening},
  author = {Chandragupt Sharma},
  year   = {2025},
  url    = {https://github.com/ChandraguptSharma07/WTFDTB}
}

And the key tools in the pipeline:

  • GNINA: McNutt et al. J. Cheminformatics 13, 43 (2021)
  • P2Rank: Krivák & Hoksza. J. Cheminformatics 10, 39 (2018)
  • ProLIF: Bouysset & Fiorucci. J. Cheminformatics 13, 72 (2021)
  • RDKit: rdkit.org

License

MIT — see LICENSE for details.
