End-to-End AI Drug Discovery Pipeline powered by DrugCLIP
BioTarget: End-to-End AI Drug Discovery Pipeline 🧬💊
BioTarget is an open-source CLI pipeline that accelerates the early stages of the AI drug-discovery workflow. It links target discovery, 3D protein structure prediction, deep-learning-based contrastive molecular screening, and physics- and CNN-based docking into a single framework.
The pipeline leverages DrugCLIP (a dual-encoder graph-text architecture) to act as a generative filter for toxicity and therapeutic intent, and gnina for structure-aware binding affinity predictions.
🎯 The Pipeline Architecture
BioTarget executes a 5-stage workflow designed for rapid, iterative drug discovery:
1. Stage A: Disease $\rightarrow$ Target Ranking
Retrieves and ranks disease-relevant protein targets by querying extensive biomedical knowledge graphs.
- Sources: Open Targets Platform, DisGeNET, STRING, Reactome.
- Methodology: Ranks protein targets via heterogeneous Graph Neural Networks (GNN) and biological pathway evidence mapping.
2. Stage B: Protein Structure Generation
Fetches or predicts the 3D conformation of the selected target proteins.
- Primary: Experimental structures (PDB).
- Generator: OpenFold-3 for de novo prediction of variants, mutants, or unmapped isoforms.
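The fetch-or-predict decision in Stage B can be sketched as a small resolver. This is an illustration only, not BioTarget's own code: `StructureSource` and `resolve_structure` are hypothetical names, and the download URL follows the public RCSB file-server pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructureSource:
    """Where the 3D structure for a target should come from (hypothetical type)."""
    engine: str               # "pdb" for experimental, "openfold3" for predicted
    location: Optional[str]   # download URL for experimental structures, else None


def resolve_structure(pdb_id: Optional[str]) -> StructureSource:
    """Prefer an experimental PDB entry; fall back to prediction otherwise."""
    if pdb_id:
        # RCSB serves PDB-format files under this URL pattern.
        url = f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"
        return StructureSource(engine="pdb", location=url)
    # No experimental entry known (variant, mutant, unmapped isoform):
    # the sequence would be handed to the structure predictor instead.
    return StructureSource(engine="openfold3", location=None)
```

Keeping the decision separate from the download and folding steps makes it easy to test without network access or a GPU.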
3. Stage C: Generative AI & Candidate Extraction
Instead of blindly docking massive lookup libraries (like ChEMBL), BioTarget employs a highly optimized generative filtering approach.
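- DrugCLIP Guidance: 3D conformers for thousands of virtual compounds are generated in parallel across CPU cores. DrugCLIP then encodes a textual representation of the disease and keeps the molecules whose embeddings align best with it, geometrically and semantically, as a candidate pool 10× the size of the requested ligand count (the top 100 for --top-ligands 10 in the example run).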
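The selection step can be read as ranking molecules by cosine similarity between a disease text embedding and each molecule embedding. The sketch below assumes that reading. `select_top_candidates` and `pool_factor` are hypothetical names, and the three-dimensional toy vectors stand in for real DrugCLIP embeddings, whose API is not shown here.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_candidates(
    text_emb: Sequence[float],
    mol_embs: Dict[str, Sequence[float]],
    top_ligands: int,
    pool_factor: int = 10,
) -> List[str]:
    """Return the SMILES whose embeddings align best with the disease text."""
    pool_size = top_ligands * pool_factor
    ranked = sorted(
        mol_embs,
        key=lambda smiles: cosine(text_emb, mol_embs[smiles]),
        reverse=True,
    )
    return ranked[:pool_size]


# Toy embeddings standing in for real DrugCLIP outputs.
disease = [1.0, 0.0, 0.0]
pool = {
    "CCO": [0.9, 0.1, 0.0],
    "c1ccccc1": [0.0, 1.0, 0.0],
    "CC(=O)O": [0.5, 0.5, 0.0],
}
best = select_top_candidates(disease, pool, top_ligands=2, pool_factor=1)
```

With the default `pool_factor` of 10, a request for 10 ligands keeps a pool of 100, which matches the pool size reported in the example output.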
4. Stage D: Multi-Objective Binding & Toxicity Evaluation
Evaluates candidates simultaneously for efficacy (physics/CNN docking) and safety (latent space contrastive geometry).
- Binding Evaluation (gnina): Generates 3D structure-data files (.sdf) via RDKit and calls the actual gnina subprocess. Evaluates ligand-receptor binding affinity using Convolutional Neural Networks on voxelized binding sites.
- Toxicity Penalty (DrugCLIP): Computes a semantic embedding for clinical failure and calculates the normalized cosine similarity against the ligand's structural embedding.
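As a rough illustration of the toxicity penalty, the sketch below scores a ligand embedding against a "clinical failure" embedding. How BioTarget normalizes the similarity is not stated, so clamping the result to [0, 1] is an assumption of this example, and `toxicity_penalty` is a hypothetical name.

```python
import math
from typing import Sequence


def toxicity_penalty(ligand_emb: Sequence[float], failure_emb: Sequence[float]) -> float:
    """Cosine similarity to the 'clinical failure' embedding, clamped to [0, 1].

    Clamping negative similarities to zero is an assumption of this sketch.
    """
    dot = sum(a * b for a, b in zip(ligand_emb, failure_emb))
    norm = math.hypot(*ligand_emb) * math.hypot(*failure_emb)
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, dot / norm))
```

A ligand whose embedding points away from the failure concept gets a penalty of zero, while one that points in the same direction approaches one.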
5. Stage E: Ranking & Reporting
- Final Ranking: $\mathcal{S}_{final} = \mathcal{S}_{binding} - (0.5 \cdot \mathcal{S}_{tox})$
Aggregates hits, flags highly toxic compounds (⚠️), and outputs a ranked manifest of candidate SMILES ready for Molecular Dynamics (MD) refinement via OpenMM.
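The ranking formula is simple enough to sketch directly. In the example below, `Hit`, `final_score`, and `rank_hits` are hypothetical names. The 0.5 weight comes from the formula above, while the cutoff used to flag a compound as highly toxic is an assumed value, since the actual threshold is not documented here.

```python
from typing import List, NamedTuple, Tuple

TOX_WEIGHT = 0.5        # weight from the ranking formula
HIGH_TOX_CUTOFF = 0.5   # hypothetical flag threshold, not stated in the source


class Hit(NamedTuple):
    smiles: str
    binding: float  # normalized binding score, S_binding
    tox: float      # toxicity penalty, S_tox


def final_score(hit: Hit) -> float:
    """S_final = S_binding - 0.5 * S_tox."""
    return hit.binding - TOX_WEIGHT * hit.tox


def rank_hits(hits: List[Hit]) -> List[Tuple[float, bool, Hit]]:
    """Sort by final score (best first) and flag high-toxicity compounds."""
    scored = [(final_score(h), h.tox > HIGH_TOX_CUTOFF, h) for h in hits]
    return sorted(scored, key=lambda row: row[0], reverse=True)


hits = [
    Hit("CCO", binding=0.87, tox=0.72),
    Hit("CCN", binding=0.91, tox=0.20),
]
ranking = rank_hits(hits)
```

Because the penalty is subtracted rather than used as a hard filter, a strong binder with a high toxicity score still appears in the manifest, only further down and with a flag.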
🚀 Installation & Setup
BioTarget operates as the primary orchestration CLI and relies on the standalone drugclip package for multi-modal embedding.
# Clone the repository
git clone https://github.com/your-org/biotarget.git
cd biotarget
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies (requires drugclip pip package)
pip install -r requirements.txt
pip install git+https://github.com/your-org/drugclip.git
Note: For Stage D to execute its highest-accuracy CNN scoring, ensure the gnina binary is compiled and accessible in your system $PATH.
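One way to check this before a run is to look the binary up on `PATH` and assemble the docking call separately from executing it. The helper names below are hypothetical. The flags `-r`, `-l`, `--autobox_ligand`, and `-o` are taken from gnina's command-line interface and are worth confirming against `gnina --help` for the installed version.

```python
import shutil
from typing import List


def gnina_available() -> bool:
    """True when a `gnina` executable can be found on PATH."""
    return shutil.which("gnina") is not None


def build_gnina_command(
    receptor: str, ligand: str, reference_ligand: str, out: str
) -> List[str]:
    """Assemble a docking call whose search box is fitted around a reference ligand."""
    return [
        "gnina",
        "-r", receptor,
        "-l", ligand,
        "--autobox_ligand", reference_ligand,
        "-o", out,
    ]


cmd = build_gnina_command("receptor.pdb", "ligand.sdf", "reference.sdf", "docked.sdf")
if not gnina_available():
    print("gnina not found on PATH; Stage D CNN scoring would be unavailable")
```

Passing the command as a list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with file paths.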
🔬 Running the BioTarget Pipeline
The pipeline is invoked via the unified biotarget.py orchestrator.
To execute the end-to-end pipeline for a specific disease:
python biotarget.py run full \
--disease "Alzheimer" \
--target-model hetero-gnn \
--structure-engine openfold3 \
--binding-engine gnina \
--top-targets 3 \
--top-ligands 10
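For readers who want to script around the orchestrator, the invocation above maps naturally onto nested `argparse` subcommands. The parser below is a guess at that shape, not the actual contents of biotarget.py. The flag names come from the command above, and the defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring `biotarget.py run full ...`."""
    parser = argparse.ArgumentParser(prog="biotarget.py")
    commands = parser.add_subparsers(dest="command", required=True)
    run = commands.add_parser("run")
    modes = run.add_subparsers(dest="mode", required=True)
    full = modes.add_parser("full")
    full.add_argument("--disease", required=True)
    # Defaults below are assumptions made for this sketch.
    full.add_argument("--target-model", default="hetero-gnn")
    full.add_argument("--structure-engine", default="openfold3")
    full.add_argument("--binding-engine", default="gnina")
    full.add_argument("--top-targets", type=int, default=3)
    full.add_argument("--top-ligands", type=int, default=10)
    return parser


args = build_parser().parse_args(
    ["run", "full", "--disease", "Alzheimer", "--top-targets", "3", "--top-ligands", "10"]
)
```

Hyphenated flags become underscored attributes, so `--top-ligands` is read back as `args.top_ligands`.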
Example Output
[Stage A] Disease -> Protein Target Ranking
[*] Querying Open Targets & DisGeNET for 'Alzheimer'...
[*] Found 3 highly ranked targets.
[Stage B] Protein Structure Generation
[*] Using engine: openfold3
[*] Folding GBA (P04062) with OpenFold-3...
[Stage C] Generative AI: De Novo Candidate Generation
[*] Generating 3000 de novo molecular structures...
[*] Generating 3D conformers for the generative pool using 64 CPU cores...
[*] Using DrugCLIP to guide selection of the top 100 generated candidates...
[*] Successfully finalized 10x generative candidate pool (N=100).
[Stage D] Binding Evaluation (gnina) & Toxicity Filtering (DrugCLIP)
[*] Loaded Target Receptor: GBA from Stage B (/runs/structures/GBA_openfold3.pdb)
[*] Computing Toxicity penalties for 100 candidates via DrugCLIP...
[*] Executing 'gnina' structure-aware docking & CNN scoring on 100 candidates...
[Stage E] Reporting
=====================================================================================
BIOTARGET PIPELINE FINAL RESULTS FOR: 'Alzheimer'
=====================================================================================
Rank | Final | Gnina (pK_d) | Tox Penalty | SMILES
-------------------------------------------------------------------------------------
#1 | 0.9944 | 9.4457 (0.99) | 0.0000 OK | CCC1(C(C)(C)C)CCOC1=O...
#2 | 0.8108 | 8.9903 (0.91) | 0.2005 OK | COc1ccccc1N=C(S)N(CCN1CCOCC1)Cc1ccc...
#3 | 0.7631 | 9.2345 (0.96) | 0.3852 OK | CCOC(=O)C1CCCN(c2c(NCCCN(C)Cc3ccccc...
#4 | 0.5101 | 8.8713 (0.87) | 0.7225 ⚠️ HIGH | CCCC(N=C(S)NCC1CCCO1)C12CC3CC(CC(C3...
🛠 Model Extensibility (The Roadmap)
While this framework establishes the AI-driven core, it is intentionally modular to support the integration of downstream biophysics tools:
- Generative Expansion: Swapping the simulated candidate subset for an active autoregressive/diffusion generative model to perform closed-loop optimization.
- MD Refinement: Automated hand-off of the top $K$ hits to OpenMM for physical stability analysis and short MD relaxation.
File details
Details for the file biotarget-0.1.0.tar.gz.
File metadata
- Download URL: biotarget-0.1.0.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ed9ea95ca9e3dd951784c9f9008b36db8b8019814cdf945d9caa2f811c088fd |
| MD5 | 5c15cbf327b86c0605cddd339cade62e |
| BLAKE2b-256 | a3869a2307402001a70134eac569db8e154fce7e51bd6d41ea21038898612880 |
File details
Details for the file biotarget-0.1.0-py3-none-any.whl.
File metadata
- Download URL: biotarget-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b71c5fe952ee2b8395e4441d59db7f1b13531b936c61b4938e9ec5542ec5d67a |
| MD5 | dfee96f93534a29287111f202dd8f235 |
| BLAKE2b-256 | 72bb9deb2695a545aa451d68bad44b25f0bdbcf4ba5e6bfbc8335fa5fecff9f4 |
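A downloaded file can be checked against the published SHA256 digest with the standard library alone. The helper names below are hypothetical, and the sketch assumes the file has already been saved locally.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare a local file against the digest published for it."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

For example, `verify_download(Path("biotarget-0.1.0.tar.gz"), "<digest from the table>")` returns `True` only when the local copy matches the published digest.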