Data ingestion, caching, and parquet materialization for the Refua drug discovery ecosystem.

refua-data

refua-data is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.

What it provides

  • A built-in catalog of useful drug-discovery datasets.
  • Dataset-aware download pipeline with cache reuse and metadata tracking.
  • Pluggable cache backend architecture (filesystem cache by default).
  • API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
  • HTTP conditional refresh support (ETag / Last-Modified) when enabled.
  • Incremental parquet materialization (chunked processing + partitioned parquet parts).
  • CLI for listing, fetching, and materializing datasets.
  • Query interface for filtered row access from materialized parquet datasets.
  • Source health checks via validate-sources for CI and environment diagnostics.
  • Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.

Included datasets

The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including ZINC, ChEMBL, UniProt, openFDA, and the Human Protein Atlas.

  1. zinc15_250k (ZINC)
  2. zinc15_tranche_druglike_instock (ZINC tranche)
  3. zinc15_tranche_druglike_agent (ZINC tranche)
  4. zinc15_tranche_druglike_wait_ok (ZINC tranche)
  5. zinc15_tranche_druglike_boutique (ZINC tranche)
  6. zinc15_tranche_druglike_annotated (ZINC tranche)
  7. tox21
  8. bbbp
  9. bace
  10. clintox
  11. sider
  12. hiv
  13. muv
  14. esol
  15. freesolv
  16. lipophilicity
  17. pcba
  18. openfda_drug_event_serious
  19. proteinatlas_human_proteome
  20. chembl_activity_ki_human
  21. chembl_activity_ic50_human
  22. chembl_activity_kd_human
  23. chembl_activity_ec50_human
  24. chembl_activity_ac50_human
  25. chembl_assays_binding_human
  26. chembl_assays_functional_human
  27. chembl_assays_adme_human
  28. chembl_targets_human_single_protein
  29. chembl_targets_human_protein_complex
  30. chembl_molecules_phase3plus
  31. chembl_molecules_phase4
  32. chembl_molecules_black_box_warning
  33. chembl_mechanism_phase2plus
  34. chembl_drug_indications_phase2plus
  35. chembl_drug_indications_phase3plus
  36. uniprot_human_reviewed
  37. uniprot_human_receptors
  38. uniprot_human_membrane
  39. uniprot_human_nucleus
  40. uniprot_human_kinases
  41. uniprot_human_gpcr
  42. uniprot_human_ion_channels
  43. uniprot_human_transporters
  44. uniprot_human_secreted
  45. uniprot_human_transcription_factors
  46. uniprot_human_enzymes

Many of the file-based datasets are distributed through MoleculeNet/DeepChem mirrors and retain their upstream licensing terms. The ChEMBL, UniProt, and openFDA presets are fetched through the providers' public REST APIs and cached locally as JSONL. Each ZINC tranche preset aggregates multiple tranche files (drug-like MW bins B-K, logP bins A-K, reactivity A/B/C/E) into one cached tabular source during fetch.
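
To give a feel for how many tranche files one ZINC preset can cover, the bin ranges above can be enumerated as below. This is only a sketch: the helper names and the tuple representation are hypothetical, and how refua-data turns a bin combination into a file name or URL is not described here.

```python
from itertools import product
from string import ascii_uppercase


def letter_range(first: str, last: str) -> list[str]:
    """Return the inclusive run of uppercase letters from first to last."""
    start = ascii_uppercase.index(first)
    stop = ascii_uppercase.index(last)
    return list(ascii_uppercase[start : stop + 1])


# Bin ranges taken from the paragraph above.
MW_BINS = letter_range("B", "K")    # molecular-weight bins B..K
LOGP_BINS = letter_range("A", "K")  # logP bins A..K
REACTIVITY = ["A", "B", "C", "E"]   # reactivity classes


def tranche_selectors() -> list[tuple[str, str, str]]:
    """Enumerate every (MW, logP, reactivity) combination (hypothetical helper)."""
    return list(product(MW_BINS, LOGP_BINS, REACTIVITY))


selectors = tranche_selectors()
print(len(selectors))  # 10 MW bins * 11 logP bins * 4 reactivity classes = 440
```

Each combination would correspond to at most one upstream tranche file, so a single preset may touch a few hundred files before they are merged into one cached table.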

Install

cd refua-data
pip install -e .

CLI quickstart

List datasets:

refua-data list

Validate all dataset sources:

refua-data validate-sources

Validate a subset and fail CI on probe failures:

refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error

JSON output for automation:

refua-data validate-sources --json --fail-on-error

For datasets with multiple mirrors, source validation succeeds when at least one configured source is reachable. Failed fallback attempts are included in the result details.
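
The mirror rule can be pictured with a small sketch. It assumes sources are tried in order and that probing stops at the first reachable one; the function name, the result keys, and the example URLs are hypothetical and are not the package's actual output format.

```python
from typing import Callable


def validate_dataset(sources: list[str], probe: Callable[[str], None]) -> dict:
    """Probe sources in order; succeed as soon as one is reachable.

    `probe` must raise an exception when a source is unreachable.
    """
    failures = []
    for url in sources:
        try:
            probe(url)
        except Exception as exc:
            # Record the failed attempt and fall back to the next mirror.
            failures.append({"source": url, "error": str(exc)})
            continue
        return {"ok": True, "source": url, "failed_attempts": failures}
    return {"ok": False, "source": None, "failed_attempts": failures}


def fake_probe(url: str) -> None:
    """Stand-in for a real HTTP probe: the primary mirror is down."""
    if "primary" in url:
        raise ConnectionError("connection refused")


result = validate_dataset(
    ["https://primary.example/data.csv", "https://mirror.example/data.csv"],
    fake_probe,
)
print(result["ok"], len(result["failed_attempts"]))
```

In this example the dataset still validates because the second mirror answers, and the failed attempt against the first mirror is kept in the result.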

Fetch raw data, reusing the local cache when possible:

refua-data fetch zinc15_250k

Fetch API-based presets:

refua-data fetch chembl_activity_ki_human
refua-data fetch uniprot_human_kinases
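
Conceptually, fetching an API preset means walking a paginated JSON endpoint and appending each record to a JSONL file. The sketch below illustrates that loop with an injected page fetcher so it runs without network access; the function names and the cursor convention are assumptions, not the package's internals.

```python
import io
import json
from typing import Callable, Optional, TextIO

# A page is (records, next_cursor); next_cursor is None on the last page.
Page = tuple[list[dict], Optional[str]]


def ingest_to_jsonl(fetch_page: Callable[[Optional[str]], Page], out: TextIO) -> dict:
    """Write every record from every page as one JSON object per line."""
    cursor: Optional[str] = None
    rows = pages = 0
    while True:
        records, cursor = fetch_page(cursor)
        pages += 1
        for record in records:
            out.write(json.dumps(record, sort_keys=True) + "\n")
            rows += 1
        if cursor is None:
            break
    return {"rows": rows, "pages": pages}


# Fake two-page endpoint with made-up records, used instead of a real API.
FAKE_PAGES = {
    None: ([{"id": 1}, {"id": 2}], "page-2"),
    "page-2": ([{"id": 3}], None),
}

buffer = io.StringIO()
stats = ingest_to_jsonl(lambda cursor: FAKE_PAGES[cursor], buffer)
print(stats)
```

The returned row and page counts are the kind of values that could be stored alongside the raw file as metadata.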

Materialize parquet:

refua-data materialize zinc15_250k
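
Incremental materialization can be thought of as cutting the row stream into fixed-size chunks, each of which becomes one parquet part. The sketch below shows only the chunking and part naming with the standard library; the zero-padded naming and the chunk size are assumptions, and the actual parquet write (for example with a library such as pyarrow) is left out.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_parts(rows: Iterable[dict], chunk_rows: int) -> Iterator[tuple[str, list[dict]]]:
    """Yield (part_file_name, chunk) pairs without loading all rows at once."""
    if chunk_rows < 1:
        raise ValueError("chunk_rows must be at least 1")
    iterator = iter(rows)
    index = 0
    while True:
        chunk = list(islice(iterator, chunk_rows))
        if not chunk:
            return
        # Assumed naming scheme that matches the part-*.parquet pattern.
        yield f"part-{index:05d}.parquet", chunk
        index += 1


rows = ({"row": i} for i in range(5))
parts = list(iter_parts(rows, chunk_rows=2))
for name, chunk in parts:
    print(name, len(chunk))
```

Because only one chunk is held in memory at a time, the same loop works for sources that are much larger than available RAM.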

Query materialized parquet rows:

refua-data query zinc15_250k --columns smiles,logP --filters '{"logP":{"lt":2.5}}' --limit 50
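
The filter argument is a JSON object keyed by column, then by operator. The following sketch shows one way such a spec could be evaluated over rows in plain Python. Only the lt operator appears in the example above; the other operator names, the function names, and the sample rows (with made-up logP values) are assumptions for illustration.

```python
import json
import operator

# Only "lt" is taken from the CLI example; the rest are assumed names.
OPS = {
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    "eq": operator.eq,
}


def row_matches(row: dict, filters: dict) -> bool:
    """Return True when the row satisfies every condition on every column."""
    for column, conditions in filters.items():
        value = row.get(column)
        if value is None:
            return False
        for op_name, operand in conditions.items():
            if not OPS[op_name](value, operand):
                return False
    return True


def query(rows: list[dict], columns: list[str], filters: dict, limit: int) -> list[dict]:
    """Select columns from matching rows, stopping once limit rows are found."""
    selected: list[dict] = []
    for row in rows:
        if len(selected) >= limit:
            break
        if row_matches(row, filters):
            selected.append({name: row[name] for name in columns})
    return selected


sample_rows = [
    {"smiles": "CCO", "logP": -0.31, "qed": 0.41},
    {"smiles": "c1ccccc1", "logP": 2.13, "qed": 0.44},
    {"smiles": "CCCCCCCC", "logP": 5.18, "qed": 0.46},
]
filters = json.loads('{"logP":{"lt":2.5}}')
hits = query(sample_rows, ["smiles", "logP"], filters, limit=50)
print(hits)
```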

Refresh against remote metadata:

refua-data fetch zinc15_250k --refresh

For API datasets, --refresh re-runs the API query, sending conditional headers on the first page when they are available.
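
A conditional refresh relies on standard HTTP validators: the client sends the cached ETag and Last-Modified values back, and a 304 Not Modified response means the cached copy is still current. The sketch below shows that logic in isolation; the metadata key last_modified and both function names are assumptions rather than the package's actual schema.

```python
def conditional_headers(meta: dict) -> dict:
    """Build conditional request headers from cached metadata (assumed keys)."""
    headers = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers


def needs_download(status_code: int) -> bool:
    """HTTP 304 means the cached copy is still valid; otherwise re-download."""
    return status_code != 304


cached_meta = {
    "etag": '"abc123"',
    "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT",
}
print(conditional_headers(cached_meta))
print(needs_download(304), needs_download(200))
```

When no validators were stored for a source, the header dictionary is empty and the request degrades to a normal full download.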

Cache layout

By default, the cache root is:

  • ~/.cache/refua-data

Override with:

  • REFUA_DATA_HOME=/custom/path
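
The resolution order described above can be sketched as follows. The function name is hypothetical, and the package may normalize the path differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_root(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return REFUA_DATA_HOME when set, else the default ~/.cache/refua-data."""
    env = os.environ if environ is None else environ
    override = env.get("REFUA_DATA_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "refua-data"


print(cache_root({"REFUA_DATA_HOME": "/custom/path"}))
print(cache_root({}))
```

Passing the environment as an argument keeps the helper easy to test without mutating os.environ.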

Layout:

  • raw/<dataset>/<version>/...: downloaded source files
  • _meta/raw/<dataset>/<version>/...json: raw metadata (etag, sha256, API request signature, rows/pages, dataset description/usage metadata)
  • parquet/<dataset>/<version>/part-*.parquet: materialized parquet parts
  • _meta/parquet/<dataset>/<version>/manifest.json: parquet manifest metadata with dataset snapshot
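
For scripts that need to locate cached artifacts directly, the layout can be expressed as a small path helper. This is a sketch: the function name, the dictionary keys, and the version string "v1" are hypothetical, and the raw metadata file name is left out because it is not spelled out above.

```python
from pathlib import PurePosixPath


def cache_layout(root: PurePosixPath, dataset: str, version: str) -> dict[str, PurePosixPath]:
    """Map a dataset and version to the directories and manifest listed above."""
    return {
        "raw_dir": root / "raw" / dataset / version,
        "raw_meta_dir": root / "_meta" / "raw" / dataset / version,
        "parquet_dir": root / "parquet" / dataset / version,
        "parquet_manifest": root / "_meta" / "parquet" / dataset / version / "manifest.json",
    }


# "v1" is a placeholder version label, not a real catalog version.
paths = cache_layout(PurePosixPath("/tmp/refua"), "zinc15_250k", "v1")
for name, path in paths.items():
    print(f"{name}: {path}")
```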

Python API

from refua_data import DatasetManager

manager = DatasetManager()
manager.fetch("zinc15_250k")
manager.fetch("chembl_activity_ki_human")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)

DataCache is the default cache backend. You can pass a custom backend object that implements the same interface (ensure, raw_file, raw_meta, parquet_dir, parquet_manifest, read_json, write_json) to make storage pluggable.
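
Before wiring in a custom backend, it can help to check that it exposes every member named above. The sketch below does only that; it does not assume any signatures, since those are not documented here, and the helper and the example class are hypothetical.

```python
# Member names taken from the interface list above.
REQUIRED_BACKEND_MEMBERS = (
    "ensure",
    "raw_file",
    "raw_meta",
    "parquet_dir",
    "parquet_manifest",
    "read_json",
    "write_json",
)


def missing_backend_members(backend: object) -> list[str]:
    """Return the interface members that the backend object does not provide."""
    return [name for name in REQUIRED_BACKEND_MEMBERS if not hasattr(backend, name)]


class PartialBackend:
    """Deliberately incomplete example backend (hypothetical)."""

    def read_json(self, path):
        return {}

    def write_json(self, path, payload):
        return None


missing = missing_backend_members(PartialBackend())
print(missing)
```

An empty list means the object at least has the expected surface; whether each member behaves like the DataCache equivalent still has to be verified against the package source.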

Licensing notes

  • refua-data package code is MIT licensed.
  • Dataset content licenses are dataset-specific and controlled by upstream providers.
  • Always verify dataset licensing and allowed use before redistribution or commercial deployment.
