Common processing functionality for the ChEBI ontology

python-chebi-utils

Common processing functionality for the ChEBI ontology: download versioned data files, build an ontology graph, extract molecules, assemble labeled datasets, and generate stratified train/validation/test splits.

Installation

pip install chebi-utils

For development (includes pytest and ruff):

pip install -e ".[dev]"

Features

Download ChEBI data files

from chebi_utils import download_chebi_obo, download_chebi_sdf

obo_path = download_chebi_obo(version=248, dest_dir="data/")   # downloads chebi.obo
sdf_path = download_chebi_sdf(version=248, dest_dir="data/")   # downloads chebi.sdf.gz

A specific ChEBI release version (e.g. 230, 245, 248) must be provided. Files are fetched from the EBI FTP server. Versions below 245 are automatically fetched from the legacy archive path.
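
The version cutoff described above can be sketched as a small URL-selection helper. This is only an illustration: the two URL templates, the constant names, and the function `chebi_file_url` are hypothetical and are not taken from chebi-utils; only the rule that versions below 245 use a legacy path comes from the text.

```python
# Hypothetical URL templates; the real paths used by chebi-utils are not shown here.
CURRENT_TEMPLATE = "https://example.org/chebi/archive/rel{version}/{filename}"
LEGACY_TEMPLATE = "https://example.org/chebi/legacy/rel{version}/{filename}"

# Versions below this number are served from the legacy path (per the text above).
LEGACY_CUTOFF = 245


def chebi_file_url(version: int, filename: str) -> str:
    """Return a download URL for one file of a given ChEBI release (sketch)."""
    if version < 1:
        raise ValueError("version must be a positive integer")
    template = LEGACY_TEMPLATE if version < LEGACY_CUTOFF else CURRENT_TEMPLATE
    return template.format(version=version, filename=filename)
```

Keeping the cutoff in one named constant means the branch only has to change in one place if the archive layout changes again.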

Build the ChEBI ontology graph

from chebi_utils import build_chebi_graph

graph = build_chebi_graph("chebi.obo")
# networkx.DiGraph — nodes are string ChEBI IDs (e.g. "1" for CHEBI:1)
# node attributes: name, smiles, subset
# edge attribute:  relation  ("is_a", "has_part", …)

Obsolete terms are excluded automatically. xref: lines are stripped before parsing to work around known fastobo compatibility issues in some ChEBI releases.
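
The xref-stripping step could look roughly like the following. It is a minimal sketch that assumes xref entries always start at the beginning of a line; `strip_xref_lines` is a hypothetical name, and the library's actual preprocessing may differ.

```python
def strip_xref_lines(obo_text: str) -> str:
    """Drop every line that starts with 'xref:' and keep all other lines unchanged."""
    kept = [
        line
        for line in obo_text.splitlines(keepends=True)
        if not line.startswith("xref:")
    ]
    return "".join(kept)
```

Because the other lines are passed through byte for byte, the term stanzas stay intact and only the problematic cross-references disappear before the parser sees the file.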

To obtain only the is_a hierarchy as a subgraph:

from chebi_utils.obo_extractor import get_hierarchy_subgraph

hierarchy = get_hierarchy_subgraph(graph)

Extract molecules

from chebi_utils import extract_molecules

molecules = extract_molecules("chebi.sdf.gz")
# DataFrame columns: chebi_id, name, inchi, inchikey, smiles, charge, mass, mol, …
# mol column contains RDKit Mol objects (None when parsing fails)

Both plain .sdf and gzip-compressed .sdf.gz files are supported. Molecules that cannot be parsed are excluded from the returned DataFrame.
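
One common way to support both file types is to choose the opener from the file suffix. The helper below is a sketch of that idea using only the standard library; `open_sdf_text` is a hypothetical name, and the UTF-8 encoding is an assumption rather than documented chebi-utils behavior.

```python
import gzip
from pathlib import Path


def open_sdf_text(path):
    """Open an SDF file for text reading, decompressing when the name ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        # gzip.open in "rt" mode yields decoded text, like a regular text file.
        return gzip.open(path, mode="rt", encoding="utf-8")
    return path.open(mode="r", encoding="utf-8")
```

The returned object can be used in a `with` block in both cases, so downstream parsing code does not need to know whether the input was compressed.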

Build a labeled dataset

from chebi_utils import build_labeled_dataset

dataset, labels = build_labeled_dataset(graph, molecules, min_molecules=50)
# dataset — DataFrame with columns: chebi_id, mol, <label1>, <label2>, …
#            one boolean column per selected ontology class
# labels  — sorted list of ChEBI IDs selected as label classes

Each molecule is assigned to every label class that it belongs to directly or through a chain of is_a relationships. Only classes with at least min_molecules descendant molecules are kept as labels.
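
The label selection described above can be illustrated in plain Python. This sketch assumes the is_a hierarchy is given as a `parents` mapping from a node to its direct is_a parents, and that a molecule counts only toward its ancestors, not toward its own ID; both function names are hypothetical and the library's implementation on the networkx graph may differ.

```python
from collections import defaultdict


def ancestors_via_is_a(node, parents):
    """Return every class reachable from `node` by following is_a edges upward."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def select_labels(molecule_ids, parents, min_molecules):
    """Return (sorted label classes, members per label) for the given threshold."""
    members = defaultdict(set)
    for mol_id in molecule_ids:
        # Assumption: a molecule counts toward each of its is_a ancestors.
        for cls in ancestors_via_is_a(mol_id, parents):
            members[cls].add(mol_id)
    labels = sorted(cls for cls, mols in members.items() if len(mols) >= min_molecules)
    return labels, {cls: members[cls] for cls in labels}
```

For example, with `parents = {"m1": ["A"], "m2": ["A"], "m3": ["B"], "A": ["C"], "B": ["C"]}` and `min_molecules=2`, class `B` is dropped because only `m3` descends from it, while `A` and `C` are kept.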

Generate stratified train/val/test splits

from chebi_utils import create_multilabel_splits

splits = create_multilabel_splits(dataset, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
train_df = splits["train"]
val_df   = splits["val"]
test_df  = splits["test"]

Columns 0 and 1 (chebi_id, mol) are treated as metadata; all remaining columns are treated as binary label columns. When multiple label columns are present, MultilabelStratifiedShuffleSplit from the iterative-stratification package is used; for a single label column, StratifiedShuffleSplit from scikit-learn is used.
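
The column convention and the choice of splitter can be summarized in a short sketch. The helper names `partition_columns` and `splitter_name` are hypothetical, and the code only returns the name of the splitter class rather than calling scikit-learn or iterative-stratification.

```python
def partition_columns(columns):
    """Split column names into (metadata, labels): the first two are metadata."""
    columns = list(columns)
    if len(columns) < 3:
        raise ValueError(
            "expected two metadata columns followed by at least one label column"
        )
    return columns[:2], columns[2:]


def splitter_name(label_columns):
    """Name the splitter class that matches the number of label columns."""
    if len(label_columns) > 1:
        return "MultilabelStratifiedShuffleSplit"
    return "StratifiedShuffleSplit"
```

Because the partition is positional, reordering the DataFrame columns before splitting would change which columns are treated as labels.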

Running Tests

pytest tests/ -v

Linting

ruff check .
ruff format --check .

CI/CD

A GitHub Actions workflow (.github/workflows/ci.yml) automatically runs ruff linting and the full test suite on every push and pull request across Python 3.10, 3.11, and 3.12.

Download files

Source Distribution

chebi_utils-0.1.1.tar.gz (18.6 kB)

Uploaded Source

Built Distribution

chebi_utils-0.1.1-py3-none-any.whl (12.2 kB)

Uploaded Python 3

File details

Details for the file chebi_utils-0.1.1.tar.gz.

File metadata

  • Download URL: chebi_utils-0.1.1.tar.gz
  • Size: 18.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chebi_utils-0.1.1.tar.gz
Algorithm Hash digest
SHA256 d35c8e4819129c6c7354179062ffb5d8b201bed325e4fcd1b4d692cf449bd7be
MD5 430d8de794b29c32a8e04c3d43167ca7
BLAKE2b-256 f8e50ce466ac944930bf056ede9ebce30ad99419916feee2c06b65609f3141d1

Provenance

The following attestation bundles were made for chebi_utils-0.1.1.tar.gz:

Publisher: python-publish.yml on ChEB-AI/python-chebi-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chebi_utils-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: chebi_utils-0.1.1-py3-none-any.whl
  • Size: 12.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chebi_utils-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f6392a2cb23abe98235ac9a7568c6a7cde5988f16dee952650633312edab20e6
MD5 bb56b9a749a3ba1141d12d0e1c2432cd
BLAKE2b-256 89cf7edbec949fe33326d4f958c0c13aff7155201377a5f59c477fa3662f4e50

Provenance

The following attestation bundles were made for chebi_utils-0.1.1-py3-none-any.whl:

Publisher: python-publish.yml on ChEB-AI/python-chebi-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
