A Python package to extract chemical, biochemical, and bioactivity data from public databases like ORD, ChEMBL and PubChem.
biochemical-data-connectors
biochemical-data-connectors is a Python package for extracting chemical, biochemical, and bioactivity data from public databases like ChEMBL, PubChem, BindingDB, IUPHAR/BPS Guide to PHARMACOLOGY, and the Open Reaction Database (ORD).
Overview
biochemical-data-connectors provides a simple and consistent interface for querying major cheminformatics and bioinformatics databases for compounds. It is designed as a modular, reusable tool for researchers and developers in computational chemistry and drug discovery, and it supports rapid curation of high-quality datasets for machine learning and analysis.
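As a rough illustration of what a "consistent interface" across data sources can look like, the sketch below defines an abstract base class and a toy in-memory implementation. It is not the package's actual code: `BaseBioactivesSketch`, `InMemoryConnector`, and `collect` are hypothetical names, and only the method name `get_bioactive_compounds` and the `bioactivity_measure` / `bioactivity_threshold` parameters are borrowed from the Quick Start section of this README.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# (SMILES, measure type, value in nM): a hypothetical record shape for this sketch.
Record = Tuple[str, str, float]


class BaseBioactivesSketch(ABC):
    """Hypothetical stand-in for a shared connector interface."""

    def __init__(self, bioactivity_measure: str = "Kd",
                 bioactivity_threshold: float = 1000.0):
        self.bioactivity_measure = bioactivity_measure
        self.bioactivity_threshold = bioactivity_threshold  # in nM

    @abstractmethod
    def get_bioactive_compounds(self, target_uniprot_id: str) -> List[str]:
        """Return SMILES strings for compounds active against the target."""


class InMemoryConnector(BaseBioactivesSketch):
    """Toy data source backed by a dict instead of a remote API."""

    def __init__(self, records: Dict[str, List[Record]], **kwargs):
        super().__init__(**kwargs)
        self._records = records

    def get_bioactive_compounds(self, target_uniprot_id: str) -> List[str]:
        # Keep only records of the requested measure at or below the threshold.
        return [
            smiles
            for smiles, measure, value_nm in self._records.get(target_uniprot_id, [])
            if measure == self.bioactivity_measure
            and value_nm <= self.bioactivity_threshold
        ]


def collect(connectors: List[BaseBioactivesSketch],
            target_uniprot_id: str) -> List[str]:
    """Query every connector through the same method and merge unique SMILES."""
    seen: Dict[str, None] = {}  # dict keeps first-seen order
    for connector in connectors:
        for smiles in connector.get_bioactive_compounds(target_uniprot_id):
            seen.setdefault(smiles, None)
    return list(seen)


source_a = InMemoryConnector({"P00533": [("CCO", "Kd", 50.0), ("CCN", "IC50", 10.0)]})
source_b = InMemoryConnector({"P00533": [("CCO", "Kd", 20.0),
                                         ("c1ccccc1", "Kd", 5000.0),
                                         ("CCC", "Kd", 900.0)]})
merged = collect([source_a, source_b], "P00533")
print(merged)
```

Because the calling code depends only on the abstract method, a new data source can be added without changing `collect`.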
Key Features
- Bioactive Compounds
  - Unified Interface: A single, easy-to-use abstract base class for fetching bioactives for a given target.
  - Multiple Data Sources: Includes concrete connectors for major public databases:
    - ChEMBL (ChemblBioactivesExtractor)
    - PubChem (PubChemBioactivesExtractor)
    - BindingDB (BindingDbBioactivesConnector)
    - IUPHAR/BPS Guide to PHARMACOLOGY (IUPHARBioactivesConnector)
  - Powerful Filtering: Filter compounds by bioactivity type (e.g., Kd, IC50) and potency value.
  - Efficient Fetching: Uses concurrency to fetch data from APIs efficiently.
- Chemical Reactions
  - Local ORD Processing: Includes a connector (OpenReactionDatabaseConnector) to efficiently process a local copy of the Open Reaction Database.
  - Reaction Role Correction: Uses RDKit to automatically correct and reassign reactant/product roles from the source data, improving data quality.
  - Robust SMILES Extraction: Canonicalizes and validates SMILES strings for both reactants and products to ensure high-quality, standardized output.
  - Memory-Efficient Processing: Employs a generator-based extraction method, allowing for iteration over massive reaction datasets with a low memory footprint.
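The generator-based, low-memory idea mentioned above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's ORD connector: `iter_records` and `count_with_products` are hypothetical helpers, and the input is assumed to be gzip-compressed text with one reaction SMILES per line, which is a simplification chosen for this example and not the ORD's own storage format.

```python
import gzip
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(data_dir: Path, pattern: str = "*.txt.gz") -> Iterator[str]:
    """Yield one non-empty line at a time from every matching gzip file."""
    for path in sorted(data_dir.rglob(pattern)):
        # Files are opened one by one and read lazily, so only the current
        # line needs to be held in memory.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def count_with_products(records: Iterable[str]) -> int:
    """Count 'reactants>>products' strings whose product side is non-empty."""
    total = 0
    for record in records:
        parts = record.split(">>")
        if len(parts) == 2 and parts[1]:
            total += 1
    return total
```

A caller would typically loop with `for record in iter_records(Path("some/dir")):` and process each record as it arrives; the directory path here is a placeholder.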
Installation
You can install the package with pip:
pip install biochemical-data-connectors
Quick Start
Here is a simple example of how to retrieve all compounds from ChEMBL with a measured Kd of less than or equal to 1000 nM for the EGFR protein (UniProt ID: P00533).
import logging
from biochemical_data_connectors import ChEMBLConnector
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# 1. Instantiate the connector for the desired database
chembl_connector = ChEMBLConnector(
    bioactivity_measure='Kd',
    bioactivity_threshold=1000.0,  # in nM
    logger=logger
)
# 2. Specify the target's UniProt ID
target_uniprot_id = "P00533" # EGFR
# 3. Get the bioactive compounds
print(f"Fetching bioactive compounds for {target_uniprot_id} from ChEMBL...")
smiles_list = chembl_connector.get_bioactive_compounds(target_uniprot_id)
# 4. Print the results
if smiles_list:
    print(f"\nFound {len(smiles_list)} compounds.")
    print("First 5 compounds:")
    for smiles in smiles_list[:5]:
        print(smiles)
else:
    print("No compounds found matching the criteria.")
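To run the same kind of query for several targets, one option is a thread pool from the standard library. The sketch below is an assumption-laden illustration and does not show how the package implements its own concurrency: `fetch_many` and `fake_fetch` are hypothetical, and `fake_fetch` returns canned data in place of a network call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fetch_many(
    fetch: Callable[[str], List[str]],
    uniprot_ids: Iterable[str],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Call `fetch` once per UniProt ID on a thread pool, keyed by ID."""
    ids = list(uniprot_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so zip pairs each ID
        # with its own result even if the calls finish out of order.
        return dict(zip(ids, pool.map(fetch, ids)))


def fake_fetch(uniprot_id: str) -> List[str]:
    """Stand-in for a network call; returns canned SMILES (illustrative data)."""
    canned = {"P00533": ["CCO", "CCN"], "P04626": ["c1ccccc1"]}
    return canned.get(uniprot_id, [])


results = fetch_many(fake_fetch, ["P00533", "P04626", "Q00000"])
for uniprot_id, smiles_list in results.items():
    print(uniprot_id, len(smiles_list))
```

In real use, `fetch` could be a bound method such as `chembl_connector.get_bioactive_compounds`, provided the connector is safe to call from several threads at once, which this README does not state. Any exception raised inside `fetch` propagates out of `fetch_many`.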
Package Structure
biochemical-data-connectors/
├── pyproject.toml
├── requirements-dev.txt
├── src/
│   └── biochemical_data_connectors/
│       ├── __init__.py
│       ├── constants.py
│       ├── models.py
│       ├── connectors/
│       │   ├── __init__.py
│       │   ├── ord_connectors.py
│       │   └── bioactive_compounds/
│       │       ├── __init__.py
│       │       ├── base_bioactives_connector.py
│       │       ├── bindingdb_bioactives_connector.py
│       │       ├── chembl_bioactives_connector.py
│       │       ├── iuphar_bioactives_connector.py
│       │       └── pubchem_bioactives_connector.py
│       └── utils/
│           ├── __init__.py
│           ├── files_utils.py
│           ├── iter_utils.py
│           ├── standardization_utils.py
│           └── api/
│               ├── __init__.py
│               ├── base_api.py
│               ├── bindingbd_api.py
│               ├── chembl_api.py
│               ├── iuphar_api.py
│               ├── mappings.py
│               └── pubchem_api.py
├── tests/
│   └── ...
└── README.md
License
This project is licensed under the terms of the MIT License.
Download files
File details
Details for the file biochemical_data_connectors-3.2.2.tar.gz.
File metadata
- Download URL: biochemical_data_connectors-3.2.2.tar.gz
- Upload date:
- Size: 27.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dcc588bc2ad3fbec1d0c7d836d971b5fbd464b5b49e354ada73a761da0933a68 |
| MD5 | 7049bd37b79a2b1dc3239267815d90ab |
| BLAKE2b-256 | 5ae459e91a717041deb544cdeb8b3cfcecee31a938ea3978300c45013af28ea8 |
Provenance
The following attestation bundles were made for biochemical_data_connectors-3.2.2.tar.gz:
Publisher: publish-to-pypi.yml on c-vandenberg/biochemical-data-connectors
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: biochemical_data_connectors-3.2.2.tar.gz
  - Subject digest: dcc588bc2ad3fbec1d0c7d836d971b5fbd464b5b49e354ada73a761da0933a68
- Sigstore transparency entry: 349948115
- Sigstore integration time:
- Permalink: c-vandenberg/biochemical-data-connectors@44c906acd51b7350c3f282ca054b05c4c79b3e06
- Branch / Tag: refs/tags/v3.2.2
- Owner: https://github.com/c-vandenberg
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@44c906acd51b7350c3f282ca054b05c4c79b3e06
- Trigger Event: release
File details
Details for the file biochemical_data_connectors-3.2.2-py3-none-any.whl.
File metadata
- Download URL: biochemical_data_connectors-3.2.2-py3-none-any.whl
- Upload date:
- Size: 38.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0b31f8beda60056c6ad9063bc3b1be62f13caa69517ee1bfcedf2b89c373db42 |
| MD5 | cfb89ee0faacedd708d765db60c1a121 |
| BLAKE2b-256 | 9a2aa79d736199db9eb4afed367e3132f0ca8e87b238084a9262a8b19ec73935 |
Provenance
The following attestation bundles were made for biochemical_data_connectors-3.2.2-py3-none-any.whl:
Publisher: publish-to-pypi.yml on c-vandenberg/biochemical-data-connectors
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: biochemical_data_connectors-3.2.2-py3-none-any.whl
  - Subject digest: 0b31f8beda60056c6ad9063bc3b1be62f13caa69517ee1bfcedf2b89c373db42
- Sigstore transparency entry: 349948136
- Sigstore integration time:
- Permalink: c-vandenberg/biochemical-data-connectors@44c906acd51b7350c3f282ca054b05c4c79b3e06
- Branch / Tag: refs/tags/v3.2.2
- Owner: https://github.com/c-vandenberg
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@44c906acd51b7350c3f282ca054b05c4c79b3e06
- Trigger Event: release