ChEMBL-PDB Linker

Link ChEMBL bioactivity data with PDB structural information to create a curated dataset of protein-ligand pairs that have both activity measurements and 3D co-crystal structures. The result is similar to PDBbind but is derived directly from ChEMBL and the PDB.

Features

  • Validated protein-ligand pairs: Ensures each PDB structure contains BOTH the target protein AND the ligand
  • Protein-level linking: Connect ChEMBL targets to PDB structures via UniProt IDs
  • Ligand-level linking: Match ChEMBL compounds to PDB ligands via InChIKey
  • RCSB Search API integration: Efficient bulk queries for ligand-to-structure mappings
  • Configurable filters: Activity types (IC50, Ki, Kd, EC50), confidence scores, structure resolution
  • Fully reproducible: End-to-end pipeline from raw data to curated output
  • Parquet output: Efficient, compressed output format

Expected Output

Running the full pipeline produces:

  • ~98,500 validated protein-ligand pairs with bioactivity data
  • ~9,000 unique compounds
  • ~14,700 unique PDB structures
  • ~1,300 unique target proteins

Installation

# Clone the repository
git clone https://github.com/HFooladi/chembl-pdb-linker.git
cd chembl-pdb-linker

# Option 1: Install with uv (recommended)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

# Option 2: Install with pip
python -m venv .venv
source .venv/bin/activate
pip install -e .

Dependencies

  • Python >= 3.9
  • pandas, pyarrow
  • rdkit
  • httpx, requests
  • typer, pyyaml, tqdm

Quick Start

# Run complete pipeline (downloads ~5GB, takes ~30-60 min)
chembl-pdb-linker run

The output is saved to data/curated/bioactivity_pdb_linked.parquet.

Usage

Command Line

# Run complete pipeline
chembl-pdb-linker run

# Or run steps separately:
chembl-pdb-linker download          # Download ChEMBL, SIFTS, and PDB ligand data
chembl-pdb-linker link              # Link datasets with protein-ligand validation
chembl-pdb-linker extract           # Generate final curated dataset

# Show statistics
chembl-pdb-linker stats

# Use custom config
chembl-pdb-linker run --config my_config.yaml

Python API

from chembl_pdb_linker import Pipeline, Config

# Load config and create pipeline
config = Config.default()
pipeline = Pipeline(config)

# Run complete pipeline
output_path = pipeline.run()

# Or run steps separately
pipeline.download(chembl_version="34")
linked_df = pipeline.link()
output_path = pipeline.extract()

# Get statistics
stats = pipeline.get_statistics()

How It Works

Pipeline Overview

ChEMBL Database ──┬─→ Extract activities (confidence=9) ──┐
                  │                                        │
SIFTS Mapping ────┼─→ UniProt ↔ PDB mapping ──────────────┼──→ Protein-level linking
                  │                                        │
PDB Ligands ──────┼─→ Ligand code ↔ InChIKey mapping ─────┤
                  │                                        │
RCSB Search API ──┴─→ Ligand code → PDB structures ───────┴──→ Validated pairs
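
The "Ligand code → PDB structures" step can be pictured as one RCSB Search API request per ligand code. The sketch below only builds the JSON payload and does not send it. The helper name build_ligand_query is hypothetical, and the attribute name is an assumption about the RCSB search schema rather than something taken from this package, so check it against the RCSB attribute reference before relying on it.

```python
import json

# Endpoint as listed in config/default.yaml (pdb.rcsb_search_api).
RCSB_SEARCH_API = "https://search.rcsb.org/rcsbsearch/v2/query"

# Assumed attribute name; verify it against the RCSB Search API schema.
LIGAND_ATTRIBUTE = "rcsb_nonpolymer_entity_container_identifiers.nonpolymer_comp_id"


def build_ligand_query(ligand_code: str) -> dict:
    """Build a payload asking for all PDB entries that contain a ligand code."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": LIGAND_ATTRIBUTE,
                "operator": "exact_match",
                "value": ligand_code.upper(),
            },
        },
        # Return entry-level identifiers (PDB IDs), not paginated.
        "return_type": "entry",
        "request_options": {"return_all_hits": True},
    }


payload = build_ligand_query("atp")
print(json.dumps(payload, indent=2))
```

Under these assumptions the payload would be sent as a JSON POST to RCSB_SEARCH_API, for example with httpx.post(RCSB_SEARCH_API, json=payload), and the PDB IDs read from the identifier field of each item in the response's result_set.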

Key Algorithm: Protein-Ligand Pair Validation

The critical step is validating each linked pair in three stages:

  1. Protein matching: ChEMBL target → UniProt ID → PDB structures (via SIFTS)
  2. Ligand matching: ChEMBL compound → InChIKey → PDB ligand codes
  3. Validation: For each potential match, verify the PDB structure contains BOTH the protein AND the ligand
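
The three steps above amount to a set intersection per candidate pair. The following is a minimal sketch, not the package's implementation: the function name validate_pairs, the two lookup dictionaries, and all identifiers in the toy data are hypothetical.

```python
def validate_pairs(
    candidate_pairs: list[tuple[str, str]],
    uniprot_to_pdb: dict[str, set[str]],
    ligand_to_pdb: dict[str, set[str]],
) -> list[tuple[str, str, str]]:
    """Keep (uniprot_id, ligand_code, pdb_id) only when the structure has both."""
    validated = []
    for uniprot_id, ligand_code in candidate_pairs:
        # Step 1: structures containing the protein (SIFTS-style mapping).
        protein_structures = uniprot_to_pdb.get(uniprot_id, set())
        # Step 2: structures containing the ligand (ligand code lookup).
        ligand_structures = ligand_to_pdb.get(ligand_code, set())
        # Step 3: a pair is valid only for structures present in both sets.
        for pdb_id in sorted(protein_structures & ligand_structures):
            validated.append((uniprot_id, ligand_code, pdb_id))
    return validated


# Toy data with made-up identifiers.
uniprot_to_pdb = {"P11111": {"1ABC", "2DEF"}, "P22222": {"3GHI"}}
ligand_to_pdb = {"LG1": {"2DEF", "3GHI"}, "LG2": {"9XYZ"}}
candidates = [("P11111", "LG1"), ("P22222", "LG2"), ("P22222", "LG1")]

for row in validate_pairs(candidates, uniprot_to_pdb, ligand_to_pdb):
    print(row)
```

In the toy data the pair ("P22222", "LG2") is dropped: the protein and the ligand each appear in some structure, but never in the same one.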

Configuration

Edit config/default.yaml to customize:

chembl:
  confidence_score: 9          # Filter for single-protein targets
  activity_types:              # Activity types to include
    - IC50
    - Ki
    - Kd
    - EC50
  standard_units:
    - nM

pdb:
  max_resolution: 3.5          # Maximum structure resolution (Å)
  rcsb_search_api: "https://search.rcsb.org/rcsbsearch/v2/query"

linking:
  use_inchikey: true           # Match ligands by InChIKey
  inchikey_connectivity_only: false  # false: match on the full InChIKey; true: match on the first 14 chars only
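
To illustrate what the inchikey_connectivity_only option changes, here is a small sketch of how a match key could be derived from an InChIKey. The function name inchikey_match_key is hypothetical and is not part of the package's API. The first 14 characters of an InChIKey encode connectivity, so matching on them alone ignores stereochemistry and related layers.

```python
def inchikey_match_key(inchikey: str, connectivity_only: bool = False) -> str:
    """Return the string used to compare a ChEMBL compound with a PDB ligand."""
    key = inchikey.strip().upper()
    # The first 14 characters are the connectivity block of the InChIKey.
    return key[:14] if connectivity_only else key


# L-alanine and alanine without defined stereochemistry.
l_alanine = "QNAYBMKLOCPYGJ-REOHSTBHSA-N"
alanine_no_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

full_match = inchikey_match_key(l_alanine) == inchikey_match_key(alanine_no_stereo)
loose_match = inchikey_match_key(l_alanine, True) == inchikey_match_key(
    alanine_no_stereo, True
)
print("full InChIKey match:", full_match)
print("connectivity-only match:", loose_match)
```

Connectivity-only matching links more compounds to PDB ligands, at the cost of pairing an activity measured on one stereoisomer with a structure that may contain another.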

Output Schema

The final Parquet file (data/curated/bioactivity_pdb_linked.parquet) contains:

Column           Description
chembl_id        ChEMBL compound ID
smiles           Canonical SMILES
inchikey         Standard InChIKey
uniprot_id       Target UniProt accession
target_name      Target protein name
activity_type    IC50, Ki, Kd, or EC50
activity_value   Activity value
activity_unit    Unit (typically nM)
pchembl          pChEMBL value (-log10 of the molar activity value)
pdb_id           PDB structure ID
pdb_ligand_code  3-letter HET code
resolution       Structure resolution (Å)
rcsb_url         Download URL for structure
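
As a worked example of how the pchembl column relates to activity_value when the unit is nM, the sketch below applies the -log10(M) definition from the table. The function name pchembl_from_nm is hypothetical, and rounding to two decimals is an assumption made for display; the values in the actual dataset come from ChEMBL and are not recomputed here.

```python
import math


def pchembl_from_nm(value_nm: float) -> float:
    """Convert an activity value in nM to -log10 of the molar value."""
    if value_nm <= 0:
        raise ValueError("activity value must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return round(9.0 - math.log10(value_nm), 2)


for value_nm in (1.0, 10.0, 250.0):
    print(f"{value_nm:g} nM -> pChEMBL {pchembl_from_nm(value_nm):.2f}")
```

Each tenfold drop in the activity value raises pChEMBL by one unit, so a 10 nM measurement corresponds to a pChEMBL of 8.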

Data Sources

Comparison to PDBbind

Aspect            PDBbind              ChEMBL-PDB Linker
Size              ~20K complexes       ~98K pairs
Affinity source   Literature curation  ChEMBL database
Structure source  PDB                  PDB
Update frequency  Annual               On-demand
Reproducibility   Manual               Fully automated

Citation

If you use this tool in your research, please cite:

@software{chembl_pdb_linker,
  author = {Fooladi, Hosein},
  title = {ChEMBL-PDB Linker: Linking Bioactivity Data with 3D Structures},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/HFooladi/chembl-pdb-linker}
}

Please also cite the underlying data sources:

  • ChEMBL: Zdrazil, B., et al. (2024). The ChEMBL Database in 2023. Nucleic Acids Research, 52(D1), D1180-D1192.
  • PDB: Berman, H.M., et al. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235-242.
  • SIFTS: Dana, J.M., et al. (2019). SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Research, 47(D1), D482-D489.

License

MIT
