Manifest-backed real-data ingestion and OpenML materialization for tabular workflows

tab-realdata-hub

tab-realdata-hub materializes external tabular data sources into the manifest-backed packed-shard contract consumed by tab-foundry.

tab-realdata-hub is the sole owner of that manifest contract. The parquet manifest is the stable index layer; richer dataset and provenance fields, which are still evolving, live in metadata.ndjson. Downstream consumers should read through this package rather than reimplement compatibility shims.
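
To illustrate why an evolving sidecar suits newline-delimited JSON, here is a minimal reader sketch using only the Python standard library. It is not the package's API: the field names `dataset_id`, `source`, and `license` are hypothetical, the real metadata.ndjson schema is not described here, and real consumers should still go through tab-realdata-hub.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-blank line of a newline-delimited JSON stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record


# Hypothetical records: the second carries a field the first lacks,
# which NDJSON tolerates because each line is parsed independently.
sample = io.StringIO(
    '{"dataset_id": "a", "source": "openml"}\n'
    "\n"
    '{"dataset_id": "b", "source": "openml", "license": "CC0"}\n'
)
records = list(iter_ndjson(sample))
```

Because each record is self-describing, new fields can appear in later rows without changing the columns of the parquet index.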

Install from the upstream git tag with:

python -m pip install "tab-realdata-hub @ git+https://github.com/bensonlee5/tab-realdata-hub.git@v0.1.1"

For repo-local development:

uv sync

The v1 surface is OpenML-first:

  • build pinned OpenML bundle JSON from known task pools or live discovery
  • materialize bundle tasks into packed shards plus manifest parquet
  • inspect manifest-backed datasets through a stable library and CLI surface

Example (build a bundle, materialize it, then inspect the resulting manifest):

uv sync

.venv/bin/tab-realdata-hub bundle build-openml \
  --out-path bundles/many_class_v1.json \
  --bundle-name many_class_v1 \
  --version 1 \
  --task-source tabarena_v0_1 \
  --min-classes 2 \
  --max-features 10 \
  --max-classes 10 \
  --max-missing-pct 10.0

.venv/bin/tab-realdata-hub materialize openml-bundle \
  --bundle-path bundles/many_class_v1.json \
  --out-root outputs/openml/many_class_v1

.venv/bin/tab-realdata-hub manifest inspect \
  --manifest outputs/openml/many_class_v1/manifest.parquet

The repo now tracks two hub-owned classification validation bundles for tab-foundry under src/tab_realdata_hub/bench/:

  • nanotabpfn_openml_classification_medium_v1.json
  • nanotabpfn_openml_classification_large_v1.json

The current TF-RD-010 contract is:

  • medium: no-missing multiclass validation with max_features=10, min_classes=3, max_classes=10, and min_minority_class_pct=2.5
  • large: allow-missing multiclass validation with max_features=20, max_missing_pct=5.0, min_classes=3, max_classes=10, and min_minority_class_pct=2.5
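
The two contracts above can be read as threshold filters over per-task statistics. The sketch below is an illustration only: `TaskStats` and `Contract` are hypothetical names, not part of tab-realdata-hub, and it assumes that every bound is inclusive and that all percentages are on a 0 to 100 scale, neither of which is confirmed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStats:
    """Hypothetical per-task summary; not the package's real schema."""

    n_features: int
    n_classes: int
    missing_pct: float  # assumed: percentage of missing values, 0-100
    minority_class_pct: float  # assumed: share of the rarest class, 0-100


@dataclass(frozen=True)
class Contract:
    """Thresholds copied from the TF-RD-010 bullets; bounds assumed inclusive."""

    max_features: int
    max_missing_pct: float
    min_classes: int = 3
    max_classes: int = 10
    min_minority_class_pct: float = 2.5

    def accepts(self, task: TaskStats) -> bool:
        return (
            task.n_features <= self.max_features
            and task.missing_pct <= self.max_missing_pct
            and self.min_classes <= task.n_classes <= self.max_classes
            and task.minority_class_pct >= self.min_minority_class_pct
        )


MEDIUM = Contract(max_features=10, max_missing_pct=0.0)  # no-missing
LARGE = Contract(max_features=20, max_missing_pct=5.0)  # allow-missing

# A made-up task with 15 features and a few missing values.
task = TaskStats(n_features=15, n_classes=4, missing_pct=1.2, minority_class_pct=8.0)
print(MEDIUM.accepts(task), LARGE.accepts(task))
```

Under these assumptions the made-up task fails the medium contract (too many features, some missing values) and passes the large one.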

Refresh the checked-in bundle definitions from the pinned tabarena_v0_1 source with:

.venv/bin/tab-realdata-hub bundle build-openml \
  --out-path src/tab_realdata_hub/bench/nanotabpfn_openml_classification_medium_v1.json \
  --bundle-name nanotabpfn_openml_classification_medium \
  --version 1 \
  --task-source tabarena_v0_1 \
  --new-instances 200 \
  --max-features 10 \
  --min-classes 3 \
  --max-classes 10 \
  --max-missing-pct 0.0 \
  --min-minority-class-pct 2.5

.venv/bin/tab-realdata-hub bundle build-openml \
  --out-path src/tab_realdata_hub/bench/nanotabpfn_openml_classification_large_v1.json \
  --bundle-name nanotabpfn_openml_classification_large \
  --version 1 \
  --task-source tabarena_v0_1 \
  --new-instances 200 \
  --max-features 20 \
  --min-classes 3 \
  --max-classes 10 \
  --max-missing-pct 5.0 \
  --min-minority-class-pct 2.5
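
The two refresh commands above differ only in the bundle size label, `--max-features`, and `--max-missing-pct`. If you script the refresh, a small helper can keep them in sync. This is a sketch: `build_openml_argv` is a hypothetical helper name, and it reuses only the flags and values shown in the commands above.

```python
import shlex


def build_openml_argv(size: str, max_features: int, max_missing_pct: float) -> list[str]:
    """Assemble the refresh command for one bundle size ("medium" or "large")."""
    name = f"nanotabpfn_openml_classification_{size}"
    return [
        ".venv/bin/tab-realdata-hub", "bundle", "build-openml",
        "--out-path", f"src/tab_realdata_hub/bench/{name}_v1.json",
        "--bundle-name", name,
        "--version", "1",
        "--task-source", "tabarena_v0_1",
        "--new-instances", "200",
        "--max-features", str(max_features),
        "--min-classes", "3",
        "--max-classes", "10",
        "--max-missing-pct", str(max_missing_pct),
        "--min-minority-class-pct", "2.5",
    ]


commands = [
    build_openml_argv("medium", max_features=10, max_missing_pct=0.0),
    build_openml_argv("large", max_features=20, max_missing_pct=5.0),
]

for argv in commands:
    # Print a copy-pasteable command line. To execute instead, pass argv to
    # subprocess.run(argv, check=True) from the repository root.
    print(shlex.join(argv))
```

Printing rather than executing keeps the sketch free of side effects; the printed lines are equivalent to the commands above.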

Materialize the checked-in bundle definitions into the manifest paths consumed downstream by tab-foundry with:

.venv/bin/tab-realdata-hub materialize openml-bundle \
  --bundle-path src/tab_realdata_hub/bench/nanotabpfn_openml_classification_medium_v1.json \
  --out-root data/manifests/bench/nanotabpfn_openml_classification_medium_v1

.venv/bin/tab-realdata-hub materialize openml-bundle \
  --bundle-path src/tab_realdata_hub/bench/nanotabpfn_openml_classification_large_v1.json \
  --out-root data/manifests/bench/nanotabpfn_openml_classification_large_v1

Inspect the resulting manifests with:

.venv/bin/tab-realdata-hub manifest inspect \
  --manifest data/manifests/bench/nanotabpfn_openml_classification_medium_v1/manifest.parquet

.venv/bin/tab-realdata-hub manifest inspect \
  --manifest data/manifests/bench/nanotabpfn_openml_classification_large_v1/manifest.parquet
