
Manifest-backed real-data ingestion and OpenML materialization for tabular workflows


tab-realdata-hub

tab-realdata-hub materializes external tabular data sources into the manifest-backed packed-shard contract consumed by tab-foundry.

tab-realdata-hub is the sole owner of that manifest contract. The parquet manifest is the stable index layer; richer dataset and provenance fields, which are still evolving, live in metadata.ndjson. Downstream consumers should read manifests through this package rather than reimplement compatibility shims.
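Because the fields in metadata.ndjson are expected to change over time, it can help to check which keys a given file actually carries before relying on any of them. The sketch below is for ad hoc inspection only and does not replace reading through the package. It assumes nothing about the schema beyond one JSON object per line; the function name `field_counts` and the file path in the usage note are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def field_counts(lines: Iterable[str]) -> Counter:
    """Count how many NDJSON records contain each top-level key.

    Blank lines are skipped. Records that are not JSON objects are ignored,
    since only objects have named fields.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

To use it on a real file, pass an open file handle, for example `with open("path/to/metadata.ndjson", encoding="utf-8") as fh: print(field_counts(fh))`, substituting wherever your materialized output places that file.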

Install from the upstream git tag with:

python -m pip install "tab-realdata-hub @ git+https://github.com/bensonlee5/tab-realdata-hub.git@v0.1.0"
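After installing, one way to confirm which version ended up in the active environment is to query the distribution metadata with the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, and it relies only on `importlib.metadata`, not on any API of tab-realdata-hub itself.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "tab-realdata-hub") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "tab-realdata-hub is not installed")
```

The reported version should correspond to the git tag you installed from.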

For repo-local development:

uv sync

The v1 surface is OpenML-first:

  • build pinned OpenML bundle JSON from known task pools or live discovery
  • materialize bundle tasks into packed shards plus manifest parquet
  • inspect manifest-backed datasets through a stable library and CLI surface

Example:

uv sync

tab-realdata-hub bundle build-openml \
  --out-path bundles/many_class_v1.json \
  --bundle-name many_class_v1 \
  --version 1 \
  --task-source tabarena_v0_1 \
  --max-features 10 \
  --max-classes 10 \
  --max-missing-pct 10.0

tab-realdata-hub materialize openml-bundle \
  --bundle-path bundles/many_class_v1.json \
  --out-root outputs/openml/many_class_v1

tab-realdata-hub manifest inspect \
  --manifest outputs/openml/many_class_v1/manifest.parquet
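The three commands above can also be driven from a Python script, which keeps the bundle path and output root consistent between steps. The sketch below only assembles and runs the same CLI invocations shown in the example; the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and the filter values are copied from the example rather than being recommended defaults.

```python
import subprocess
from pathlib import Path
from typing import List

CLI = "tab-realdata-hub"


def pipeline_commands(bundle_name: str, version: int = 1) -> List[List[str]]:
    """Build the argv lists for the build, materialize, and inspect steps."""
    bundle_path = (Path("bundles") / f"{bundle_name}.json").as_posix()
    out_root = Path("outputs") / "openml" / bundle_name
    return [
        # Step 1: build a pinned OpenML bundle JSON.
        [CLI, "bundle", "build-openml",
         "--out-path", bundle_path,
         "--bundle-name", bundle_name,
         "--version", str(version),
         "--task-source", "tabarena_v0_1",
         "--max-features", "10",
         "--max-classes", "10",
         "--max-missing-pct", "10.0"],
        # Step 2: materialize the bundle into packed shards plus a manifest.
        [CLI, "materialize", "openml-bundle",
         "--bundle-path", bundle_path,
         "--out-root", out_root.as_posix()],
        # Step 3: inspect the resulting manifest.
        [CLI, "manifest", "inspect",
         "--manifest", (out_root / "manifest.parquet").as_posix()],
    ]


def run_pipeline(bundle_name: str, version: int = 1) -> None:
    """Run each step in order, stopping at the first non-zero exit code."""
    for cmd in pipeline_commands(bundle_name, version):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("many_class_v1")` would reproduce the example, provided the `tab-realdata-hub` executable is on the PATH. Passing argument lists instead of a shell string avoids quoting problems in paths.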
