A Python package to extract chemical, biochemical, and bioactivity data from public databases like ORD, ChEMBL and PubChem.

biochemical-data-connectors

biochemical-data-connectors is a Python package for extracting chemical, biochemical, and bioactivity data from public databases like ChEMBL, PubChem, BindingDB, and the Open Reaction Database (ORD).

Overview

biochemical-data-connectors provides a simple and consistent interface for querying major cheminformatics and bioinformatics databases for compounds. It is designed to be a modular and reusable tool for researchers and developers in cheminformatics and drug discovery.
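
To make the idea of a consistent interface concrete, here is a minimal sketch of how a shared abstract base class might look. The class names BioactivesConnectorSketch and InMemoryConnector are hypothetical and are not part of the package; only the get_bioactive_compounds method name and the bioactivity_measure / bioactivity_threshold parameters are taken from the Quick Start example. The toy subclass reads from a dict rather than a real database.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class BioactivesConnectorSketch(ABC):
    """Hypothetical stand-in for a shared connector interface (not the package's class)."""

    def __init__(self, bioactivity_measure: str = "Kd", bioactivity_threshold: float = 1000.0):
        self.bioactivity_measure = bioactivity_measure
        self.bioactivity_threshold = bioactivity_threshold  # assumed to be in nM

    @abstractmethod
    def get_bioactive_compounds(self, target_uniprot_id: str) -> List[str]:
        """Return SMILES strings for compounds active against the target."""


class InMemoryConnector(BioactivesConnectorSketch):
    """Toy connector backed by a dict, standing in for a real database client."""

    def __init__(self, records: Dict[str, List[Tuple[str, str, float]]], **kwargs):
        super().__init__(**kwargs)
        self._records = records  # {uniprot_id: [(smiles, measure, value_nm), ...]}

    def get_bioactive_compounds(self, target_uniprot_id: str) -> List[str]:
        # Keep only records of the requested measure at or below the threshold.
        return [
            smiles
            for smiles, measure, value_nm in self._records.get(target_uniprot_id, [])
            if measure == self.bioactivity_measure and value_nm <= self.bioactivity_threshold
        ]


# Made-up illustration data, not real measurements.
records = {"P00533": [("CCO", "Kd", 50.0), ("CCN", "Kd", 5000.0), ("CCC", "IC50", 10.0)]}
connector = InMemoryConnector(records, bioactivity_measure="Kd", bioactivity_threshold=1000.0)
result = connector.get_bioactive_compounds("P00533")
print(result)
```

Because every connector exposes the same method, calling code would not need to change when one data source is swapped for another.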

Key Features

  1. Bioactive Compounds
    • Unified Interface: A single, easy-to-use abstract base class for fetching bioactives for a given target.
    • Multiple Data Sources: Includes concrete connectors for major public databases:
      1. ChEMBL (ChEMBLBioactivesExtractor)
      2. PubChem (PubChemBioactivesExtractor)
    • Powerful Filtering: Filter compounds by bioactivity type (e.g., Kd, IC50) and potency value.
    • Efficient Fetching: Uses concurrency to fetch data from APIs efficiently.
  2. Chemical Reactions
    • Local ORD Processing: Includes a connector (OpenReactionDatabaseConnector) to efficiently process a local copy of the Open Reaction Database.
    • Reaction Role Correction: Uses RDKit to automatically correct and reassign reactant/product roles from the source data, improving data quality.
    • Robust SMILES Extraction: Canonicalizes and validates SMILES strings for both reactants and products to ensure high-quality, standardized output.
    • Memory-Efficient Processing: Employs a generator-based extraction method, allowing for iteration over massive reaction datasets with a low memory footprint.
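
Two of the features above, concurrent fetching and generator-based processing, can be illustrated with the standard library alone. The sketch below is not the package's implementation: fetch_all, iter_valid_reactions, and fake_fetch are hypothetical names, the network call is replaced by a local function, and the reaction check is a simple format test rather than the RDKit-based validation the package describes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List


def fetch_all(ids: List[str], fetch_one: Callable[[str], Dict], max_workers: int = 4) -> List[Dict]:
    """Fetch one record per ID concurrently; results keep the order of `ids`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, ids))


def iter_valid_reactions(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield lines shaped like reaction SMILES (reactants>agents>products)."""
    for line in lines:
        candidate = line.strip()
        # A reaction SMILES has exactly two '>' separators; skip everything else.
        if candidate.count(">") == 2:
            yield candidate


def fake_fetch(compound_id: str) -> Dict:
    # Hypothetical stand-in for a network call to a database API.
    return {"id": compound_id, "kd_nm": float(len(compound_id)) * 100.0}


fetched = fetch_all(["A", "BB", "CCC"], fake_fetch)
reactions = list(iter_valid_reactions(["CCO.CC(=O)O>>CC(=O)OCC\n", "", "not a reaction\n"]))
print(fetched)
print(reactions)
```

Since iter_valid_reactions is a generator, it can be fed an open file handle and consumed one reaction at a time, so the full dataset never has to sit in memory.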

Installation

Install the package from PyPI with pip:

pip install biochemical-data-connectors
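
To confirm that the installation succeeded, one option is to query the installed distribution metadata from Python. This is a small sketch using the standard library's importlib.metadata; the helper name installed_version is an assumption introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("biochemical-data-connectors")
print(found if found else "biochemical-data-connectors is not installed")
```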

Quick Start

Here is a simple example of how to retrieve all compounds from ChEMBL with a measured Kd of less than or equal to 1000 nM for the EGFR protein (UniProt ID: P00533).

import logging
from biochemical_data_connectors import ChEMBLConnector

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# 1. Instantiate the connector for the desired database
chembl_connector = ChEMBLConnector(
    bioactivity_measure='Kd',
    bioactivity_threshold=1000.0,  # in nM
    logger=logger
)

# 2. Specify the target's UniProt ID
target_uniprot_id = "P00533"  # EGFR

# 3. Get the bioactive compounds
print(f"Fetching bioactive compounds for {target_uniprot_id} from ChEMBL...")
smiles_list = chembl_connector.get_bioactive_compounds(target_uniprot_id)

# 4. Print the results
if smiles_list:
    print(f"\nFound {len(smiles_list)} compounds.")
    print("First 5 compounds:")
    for smiles in smiles_list[:5]:
        print(smiles)
else:
    print("No compounds found matching the criteria.")
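
The threshold in the example is expressed in nM, whereas potencies are often reported on a negative log10 molar scale (pKd, pIC50). As a worked conversion, 1000 nM is 1 µM, which corresponds to a pKd of 6. The helper below is a hypothetical illustration of that arithmetic and is not part of the package.

```python
import math


def nm_to_p_value(value_nm: float) -> float:
    """Convert a potency in nM to its negative log10 molar value (e.g. pKd)."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    # value in mol/L = value_nm * 1e-9, so -log10(value) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


threshold_p = nm_to_p_value(1000.0)  # 1000 nM = 1 µM
print(f"pKd threshold: {threshold_p:.2f}")
```

A compound meets the "Kd of at most 1000 nM" criterion exactly when its pKd is at least this value, since a lower Kd means a higher pKd.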

Package Structure

biochemical-data-connectors/
├── pyproject.toml
├── requirements-dev.txt
├── src/
│   └── biochemical_data_connectors/
│       ├── __init__.py
│       ├── constants.py
│       ├── models.py
│       ├── connectors/
│       │   ├── __init__.py
│       │   ├── ord_connectors.py
│       │   └── bioactive_compounds/
│       │       ├── __init__.py
│       │       ├── base_bioactives_connector.py
│       │       ├── bindingdb_bioactives_connector.py
│       │       ├── chembl_bioactives_connector.py
│       │       └── pubchem_bioactives_connector.py
│       └── utils/
│           ├── __init__.py
│           ├── files_utils.py
│           ├── iter_utils.py
│           ├── standardization_utils.py
│           └── api/
│               ├── __init__.py
│               ├── base_api.py
│               ├── bindingdb_api.py
│               ├── chembl_api.py
│               ├── mappings.py
│               └── pubchem_api.py
├── tests/
│   └── ...
└── README.md

License

This project is licensed under the terms of the MIT License.
