refua-data

Data ingestion, caching, and parquet materialization for the Refua drug discovery ecosystem.

refua-data is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.
What it provides
- A built-in catalog of useful drug-discovery datasets.
- Dataset-aware download pipeline with cache reuse and metadata tracking.
- Pluggable cache backend architecture (filesystem cache by default).
- API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
- HTTP conditional refresh support (ETag/Last-Modified) when enabled.
- Incremental parquet materialization (chunked processing + partitioned parquet parts).
- CLI for listing, fetching, and materializing datasets.
- Source health checks via `validate-sources` for CI and environment diagnostics.
- Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.
Included datasets
The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including ZINC, ChEMBL, and UniProt.
- `zinc15_250k` (ZINC)
- `zinc15_tranche_druglike_instock` (ZINC tranche)
- `zinc15_tranche_druglike_agent` (ZINC tranche)
- `zinc15_tranche_druglike_wait_ok` (ZINC tranche)
- `zinc15_tranche_druglike_boutique` (ZINC tranche)
- `zinc15_tranche_druglike_annotated` (ZINC tranche)
- `tox21`
- `bbbp`
- `bace`
- `clintox`
- `sider`
- `hiv`
- `muv`
- `esol`
- `freesolv`
- `lipophilicity`
- `pcba`
- `chembl_activity_ki_human`
- `chembl_activity_ic50_human`
- `chembl_assays_binding_human`
- `chembl_targets_human_single_protein`
- `chembl_molecules_phase3plus`
- `uniprot_human_reviewed`
- `uniprot_human_kinases`
- `uniprot_human_gpcr`
- `uniprot_human_ion_channels`
- `uniprot_human_transporters`
Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms. ChEMBL and UniProt presets are fetched through their public REST APIs and cached locally as JSONL. ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins, reactivity A/B/C/E) into one cached tabular source during fetch.
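Conceptually, tranche aggregation is a concatenation of same-schema tabular files into one cached source. A minimal sketch, assuming CSV tranche files with identical headers (the `aggregate_tranches` helper and file names are hypothetical, not the package's actual fetch code):

```python
import csv
from pathlib import Path

def aggregate_tranches(tranche_files, out_path):
    """Concatenate several tranche CSV files with identical columns into
    one table, writing the header only once. Illustrative helper only."""
    out_path = Path(out_path)
    header_written = False
    with out_path.open("w", newline="") as out:
        writer = csv.writer(out)
        for path in tranche_files:
            with Path(path).open(newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
    return out_path
```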
Install
cd refua-data
pip install -e .
CLI quickstart
List datasets:
refua-data list
Validate all dataset sources:
refua-data validate-sources
Validate a subset and fail CI on probe failures:
refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error
JSON output for automation:
refua-data validate-sources --json --fail-on-error
For datasets with multiple mirrors, source validation succeeds when at least one configured source is reachable. Failed fallback attempts are included in the result details.
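In automation, the `--json` output can be reduced to a single pass/fail decision. The sketch below assumes each probe result is a dict with `dataset` and `ok` keys; this shape is an illustration only and the real JSON schema may differ:

```python
def gate_on_validation(results):
    """Return (ok, failed_names) for a list of per-dataset probe results.
    Assumes each result is a dict with 'dataset' and 'ok' keys, where 'ok'
    already reflects whether at least one configured source was reachable."""
    failed = [r["dataset"] for r in results if not r["ok"]]
    return (not failed, failed)

results = [
    {"dataset": "chembl_activity_ki_human", "ok": True},
    {"dataset": "uniprot_human_kinases", "ok": False},
]
ok, failed = gate_on_validation(results)
# with --fail-on-error semantics, a CI job would exit nonzero when ok is False
```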
Fetch raw data with cache:
refua-data fetch zinc15_250k
Fetch API-based presets:
refua-data fetch chembl_activity_ki_human
refua-data fetch uniprot_human_kinases
Materialize parquet:
refua-data materialize zinc15_250k
Refresh against remote metadata:
refua-data fetch zinc15_250k --refresh
For API datasets, `--refresh` re-runs the API query, sending conditional headers on the first page when cached validators are available.
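Conditional refresh follows standard HTTP validators: a cached ETag is replayed as `If-None-Match` and a cached Last-Modified as `If-Modified-Since`, and a 304 Not Modified response means the cached copy is still current. A minimal sketch of building those headers (the `meta` dict shape here is illustrative, not the package's actual metadata format):

```python
def conditional_headers(meta):
    """Build HTTP revalidation headers from cached response metadata.
    `meta` is an illustrative dict like {"etag": ..., "last_modified": ...};
    a 304 response to a request carrying these headers means the cached
    copy can be reused as-is."""
    headers = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers
```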
Cache layout
By default, cache root is:
~/.cache/refua-data
Override with:
REFUA_DATA_HOME=/custom/path
Layout:
- `raw/<dataset>/<version>/...`: downloaded source files
- `_meta/raw/<dataset>/<version>/...json`: raw metadata (etag, sha256, API request signature, rows/pages, dataset description/usage metadata)
- `parquet/<dataset>/<version>/part-*.parquet`: materialized parquet parts
- `_meta/parquet/<dataset>/<version>/manifest.json`: parquet manifest metadata with dataset snapshot
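The layout above is plain path arithmetic over the cache root. A sketch with hypothetical helper names, resolving the root from `REFUA_DATA_HOME` with the default fallback:

```python
import os
from pathlib import Path

def cache_root():
    """Resolve the cache root: REFUA_DATA_HOME if set, else ~/.cache/refua-data."""
    override = os.environ.get("REFUA_DATA_HOME")
    return Path(override) if override else Path.home() / ".cache" / "refua-data"

def raw_dir(dataset, version):
    """Directory holding the downloaded source files for a dataset version."""
    return cache_root() / "raw" / dataset / version

def parquet_manifest(dataset, version):
    """Path of the parquet manifest for a dataset version."""
    return cache_root() / "_meta" / "parquet" / dataset / version / "manifest.json"
```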
Python API
from refua_data import DatasetManager
manager = DatasetManager()
manager.fetch("zinc15_250k")
manager.fetch("chembl_activity_ki_human")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)
`DataCache` is the default cache backend. You can pass a custom backend object that implements the same interface (`ensure`, `raw_file`, `raw_meta`, `parquet_dir`, `parquet_manifest`, `read_json`, `write_json`) to make storage pluggable.
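A custom backend only needs to satisfy that interface. Below is a hedged in-memory sketch; the method semantics are inferred from the names, so the real signatures in refua-data may differ:

```python
import json
import tempfile
from pathlib import Path

class InMemoryJsonCache:
    """Illustrative cache backend: keeps JSON metadata in a dict and hands
    out paths under a throwaway temp directory. Method names follow the
    interface listed above; actual signatures may differ."""

    def __init__(self):
        self._root = Path(tempfile.mkdtemp(prefix="refua-cache-"))
        self._json = {}

    def ensure(self, relative):
        # create (if needed) and return a directory under the cache root
        path = self._root / relative
        path.mkdir(parents=True, exist_ok=True)
        return path

    def raw_file(self, dataset, version, name):
        return self.ensure(f"raw/{dataset}/{version}") / name

    def raw_meta(self, dataset, version):
        # "meta.json" is a placeholder file name for this sketch
        return self.ensure(f"_meta/raw/{dataset}/{version}") / "meta.json"

    def parquet_dir(self, dataset, version):
        return self.ensure(f"parquet/{dataset}/{version}")

    def parquet_manifest(self, dataset, version):
        return self.ensure(f"_meta/parquet/{dataset}/{version}") / "manifest.json"

    def read_json(self, key):
        return self._json.get(key)

    def write_json(self, key, payload):
        # round-trip through json to guarantee the payload is serializable
        self._json[key] = json.loads(json.dumps(payload))
```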
Licensing notes
- `refua-data` package code is MIT licensed.
- Dataset content licenses are dataset-specific and controlled by upstream providers.
- Always verify dataset licensing and allowed use before redistribution or commercial deployment.
File details
Details for the file refua_data-0.6.0.tar.gz.
File metadata
- Download URL: refua_data-0.6.0.tar.gz
- Upload date:
- Size: 28.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.14.3 Darwin/25.2.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 86b4bba808a26674b921c9cbe3614399544e0957c13acba851912191a8648e45 |
| MD5 | e219f1c0ebb7f44e0264c53e89ac4d6a |
| BLAKE2b-256 | a0665cdca92fccb3d4f3dd9bc0144f1baf59fe0258b6cdf3503dd03102f388eb |
File details
Details for the file refua_data-0.6.0-py3-none-any.whl.
File metadata
- Download URL: refua_data-0.6.0-py3-none-any.whl
- Upload date:
- Size: 25.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.14.3 Darwin/25.2.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | de06d2bb91f33a2dbeedc9808891ecf27ed567c441182ddc2a0e09085ba33984 |
| MD5 | 7468f714ea14df4da3e3c95fb5c8db8b |
| BLAKE2b-256 | f5c967102935d45d65d9b8c40de63dff6317799202324a9ed267c63f8be8bf07 |