Manifest-backed real-data ingestion and OpenML materialization for tabular workflows
tab-realdata-hub
tab-realdata-hub materializes external tabular data sources into the
manifest-backed packed-shard contract consumed by tab-foundry.
tab-realdata-hub is the sole owner of that manifest contract. The parquet
manifest is the stable index layer, and richer evolving dataset/provenance
fields live in metadata.ndjson. Downstream consumers are expected to read
through this package rather than reimplementing compatibility shims.
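For quick ad-hoc debugging, the metadata.ndjson side file can be read with nothing but the standard library, since newline-delimited JSON is one JSON object per line. The sketch below is only an illustration of that file format and is not a substitute for reading through this package. The helper name iter_ndjson is hypothetical, and no particular field names are assumed.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_ndjson(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-empty line of a newline-delimited JSON file.

    Hypothetical helper for ad-hoc inspection only; it knows nothing about
    the manifest contract or which fields a record carries.
    """
    with path.open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            yield record
```

Because the function is a generator, large metadata files are streamed record by record instead of being loaded into memory at once.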
Install the published release from PyPI with:
python -m pip install tab-realdata-hub
For repo-local development:
uv sync
The v1 surface is OpenML-first:
- build pinned OpenML bundle JSON from known task pools or live discovery
- materialize bundle tasks into packed shards plus manifest parquet
- inspect manifest-backed datasets through a stable library and CLI surface
Example:
uv sync
.venv/bin/tab-realdata-hub bundle build-openml \
--out-path bundles/many_class_v1.json \
--bundle-name many_class_v1 \
--version 1 \
--task-source tabarena_v0_1 \
--min-classes 2 \
--max-features 10 \
--max-classes 10 \
--max-missing-pct 10.0
.venv/bin/tab-realdata-hub materialize openml-bundle \
--bundle-path bundles/many_class_v1.json \
--out-root outputs/openml/many_class_v1
.venv/bin/tab-realdata-hub manifest inspect \
--manifest outputs/openml/many_class_v1/manifest.parquet
The repo now tracks two hub-owned classification validation bundles for
tab-foundry under src/tab_realdata_hub/bench/:
- openml_classification_medium_v1.json
- openml_classification_large_v1.json
The current TF-RD-010 contract is:
- medium: no-missing multiclass validation with max_features=10, min_classes=3, max_classes=10, and min_minority_class_pct=2.5
- large: allow-missing multiclass validation with max_features=20, max_missing_pct=5.0, min_classes=3, max_classes=10, and min_minority_class_pct=2.5
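To make the two threshold sets concrete, here is a minimal sketch of how such a contract could be expressed as a predicate over per-task statistics. It is not the package's implementation: the BundleFilter and TaskStats types, their field names, and the passes function are all hypothetical, and treating every bound as inclusive is an assumption. The medium value max_missing_pct=0.0 is taken from the "no-missing" wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BundleFilter:
    """Thresholds for one validation bundle (hypothetical type)."""
    max_features: int
    min_classes: int
    max_classes: int
    max_missing_pct: float
    min_minority_class_pct: float


@dataclass(frozen=True)
class TaskStats:
    """Summary statistics for one candidate task (hypothetical type)."""
    n_features: int
    n_classes: int
    missing_pct: float
    minority_class_pct: float


# Thresholds as listed in the contract above.
MEDIUM = BundleFilter(max_features=10, min_classes=3, max_classes=10,
                      max_missing_pct=0.0, min_minority_class_pct=2.5)
LARGE = BundleFilter(max_features=20, min_classes=3, max_classes=10,
                     max_missing_pct=5.0, min_minority_class_pct=2.5)


def passes(stats: TaskStats, flt: BundleFilter) -> bool:
    """Return True if a task satisfies every threshold (bounds assumed inclusive)."""
    return (
        stats.n_features <= flt.max_features
        and flt.min_classes <= stats.n_classes <= flt.max_classes
        and stats.missing_pct <= flt.max_missing_pct
        and stats.minority_class_pct >= flt.min_minority_class_pct
    )
```

Under these assumptions, a task with 15 features and 3% missing values would be rejected by MEDIUM on both counts but accepted by LARGE, provided its class count and minority-class share are in range.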
Refresh the checked-in bundle definitions from the pinned tabarena_v0_1
source with:
.venv/bin/tab-realdata-hub bundle build-openml \
--out-path src/tab_realdata_hub/bench/openml_classification_medium_v1.json \
--bundle-name openml_classification_medium \
--version 1 \
--task-source tabarena_v0_1 \
--new-instances 200 \
--max-features 10 \
--min-classes 3 \
--max-classes 10 \
--max-missing-pct 0.0 \
--min-minority-class-pct 2.5
.venv/bin/tab-realdata-hub bundle build-openml \
--out-path src/tab_realdata_hub/bench/openml_classification_large_v1.json \
--bundle-name openml_classification_large \
--version 1 \
--task-source tabarena_v0_1 \
--new-instances 200 \
--max-features 20 \
--min-classes 3 \
--max-classes 10 \
--max-missing-pct 5.0 \
--min-minority-class-pct 2.5
Materialize the checked-in bundle definitions into the manifest paths consumed
downstream by tab-foundry with:
.venv/bin/tab-realdata-hub materialize openml-bundle \
--bundle-path src/tab_realdata_hub/bench/openml_classification_medium_v1.json \
--out-root data/manifests/bench/openml_classification_medium_v1
.venv/bin/tab-realdata-hub materialize openml-bundle \
--bundle-path src/tab_realdata_hub/bench/openml_classification_large_v1.json \
--out-root data/manifests/bench/openml_classification_large_v1
Inspect the resulting manifests with:
.venv/bin/tab-realdata-hub manifest inspect \
--manifest data/manifests/bench/openml_classification_medium_v1/manifest.parquet
.venv/bin/tab-realdata-hub manifest inspect \
--manifest data/manifests/bench/openml_classification_large_v1/manifest.parquet
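The materialize and inspect steps above follow the same pattern for both bundles, so they can be scripted. The sketch below only assembles the argument lists, using the subcommands, flags, and paths shown above; the function names and the plan helper are hypothetical, and nothing here calls into the package itself. Each list could be passed to subprocess.run from the repository root after uv sync.

```python
import shlex

# CLI entry point and bundle names exactly as used in the commands above.
CLI = ".venv/bin/tab-realdata-hub"
BUNDLES = ("openml_classification_medium_v1", "openml_classification_large_v1")


def materialize_command(bundle: str) -> list[str]:
    """Build the argv that materializes one checked-in bundle definition."""
    return [
        CLI, "materialize", "openml-bundle",
        "--bundle-path", f"src/tab_realdata_hub/bench/{bundle}.json",
        "--out-root", f"data/manifests/bench/{bundle}",
    ]


def inspect_command(bundle: str) -> list[str]:
    """Build the argv that inspects the manifest produced for one bundle."""
    return [
        CLI, "manifest", "inspect",
        "--manifest", f"data/manifests/bench/{bundle}/manifest.parquet",
    ]


def plan() -> list[list[str]]:
    """Materialize every bundle first, then inspect every manifest."""
    commands = [materialize_command(bundle) for bundle in BUNDLES]
    commands += [inspect_command(bundle) for bundle in BUNDLES]
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the plan can be reviewed.
    for command in plan():
        print(shlex.join(command))
```

Keeping the commands as lists rather than shell strings avoids quoting problems if a path ever contains spaces.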