A schema and toolkit for curating tabular datasets and benchmarking tasks (the data layer behind TabArena).

Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets

📂 Examples	🧑‍🔬 Contribute a Dataset	📄 Paper (placeholder — coming soon)

Data Foundry is the data layer behind the next generation of TabArena datasets. It provides:

A small, opinionated schema for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits — aligned with OpenML where possible, extended where it had to be.
A curation toolkit (sanity checks, recommended-split helpers, dtype-preserving save/load) so that a curator can turn a raw download into a reproducible artifact in one notebook.
A collections API that pins datasets (each identified by a (unique_name, uuid) pair) to immutable curated containers and resolves them against a local warehouse or directly against the BeyondArena datasets on Hugging Face.
A git-native curation log + dashboard — the dataset backlog lives as one markdown record per candidate dataset under curation/records/, edited locally through a Sheets-like dashboard (data-foundry-curation serve) with a built-in Guidelines tab, and published as a read-only public site on GitHub Pages. It replaces the old curation spreadsheet; a new dataset is added simply by creating a markdown file.
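
The pinning idea behind the collections API can be pictured with a minimal sketch. Everything in it is hypothetical (the `PinnedDataset` and `ToyCollection` names, the example UUID); it only illustrates resolving a dataset by name or by UUID against a fixed list of pins and is not Data Foundry's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedDataset:
    """Hypothetical pin: a dataset identified by (unique_name, uuid)."""
    unique_name: str
    uuid: str


class ToyCollection:
    """Illustrative stand-in for a collection of pinned datasets."""

    def __init__(self, pins):
        self._by_name = {pin.unique_name: pin for pin in pins}
        self._by_uuid = {pin.uuid: pin for pin in pins}

    def resolve(self, key: str) -> PinnedDataset:
        # Accept either the unique name or the UUID as the lookup key.
        pin = self._by_name.get(key) or self._by_uuid.get(key)
        if pin is None:
            raise KeyError(f"{key!r} is not pinned in this collection")
        return pin


# The UUID below is made up for the example.
EXAMPLE_UUID = "00000000-0000-0000-0000-000000000001"
collection = ToyCollection([PinnedDataset("airfoil_self_noise", EXAMPLE_UUID)])
print(collection.resolve("airfoil_self_noise"))
print(collection.resolve(EXAMPLE_UUID).unique_name)
```

Because the pin is frozen, a given (unique_name, uuid) pair always refers to the same entry, which is the property that makes a pinned collection reproducible.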

⚡ Quickstart

[!TIP] Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.

pip install data-foundry
python examples/load_curated_container.py

from data_foundry.collections import BEYOND_ARENA

container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
print(container.describe())          # full identity + dtypes + task + splits
print(container.dataset.shape)       # the actual DataFrame
print(container.task_metadata.split_regime)  # "iid", "temporal_non_iid", or "grouped_non_iid"

That's the core API surface: one import, one get_dataset call, and three attributes to inspect. See examples/benchmark_on_beyond_arena.py for a Random Forest benchmark on this data.

🕹️ Use Cases

🧪 Inspect a curated container offline — no Hugging Face download required

The package ships a toy CuratedContainer so you can explore the full API (schema, dtypes, splits, describe()) without touching the network. Its interface is identical to that of a downloaded BeyondArena container.

from data_foundry.curation_container import CuratedContainer
from data_foundry.examples import get_toy_container_path

container = CuratedContainer.load(get_toy_container_path())
print(container.describe())          # full identity + dtypes + task + splits
print(container.dataset.shape)       # the actual DataFrame
print(container.task_metadata.split_regime)  # "iid", "temporal_non_iid", or "grouped_non_iid"

Full inspection script (every metadata field printed): examples/load_curated_container.py.

📦 Use one dataset — IID and non-IID variants

Download a single BeyondArena container by name (or UUID) and iterate its outer CV splits. The collection resolves the container against your local cache; subsequent runs hit disk, not the network.

from data_foundry.collections import BEYOND_ARENA

container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
df = container.dataset
target = container.task_metadata.target_column_name

for repeat_id, folds in container.experiment_metadata.splits.items():
    for fold_id, (train_idx, test_idx) in folds.items():
        X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
        X_test,  y_test  = df.iloc[test_idx].drop(columns=target),  df.iloc[test_idx][target]
        # ... fit, evaluate ...

Full worked example (Random Forest, RMSE per fold, full metadata via container.describe()): examples/benchmark_on_beyond_arena.py.
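
To show what typically happens inside the `# ... fit, evaluate ...` step, here is a minimal sketch that collects one score per (repeat, fold) and averages per repeat. It uses a toy splits mapping with the same nesting as the loop above, made-up target values, and a mean predictor in place of a real model, so it runs without Data Foundry installed.

```python
import math
from statistics import mean

# Toy target values and a toy splits mapping with the same nesting as above:
# {repeat_id: {fold_id: (train_idx, test_idx)}}. All values are made up.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
splits = {
    0: {0: ([0, 1, 2, 3], [4, 5]), 1: ([2, 3, 4, 5], [0, 1])},
    1: {0: ([0, 2, 4, 5], [1, 3]), 1: ([1, 3, 4, 5], [0, 2])},
}


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


scores = {}
for repeat_id, folds in splits.items():
    for fold_id, (train_idx, test_idx) in folds.items():
        # Baseline "model": predict the training mean for every test row.
        prediction = mean(y[i] for i in train_idx)
        y_test = [y[i] for i in test_idx]
        scores[(repeat_id, fold_id)] = rmse(y_test, [prediction] * len(y_test))

# Average the fold scores within each repeat.
per_repeat = {
    r: mean(s for (rep, _), s in scores.items() if rep == r) for r in splits
}
print(per_repeat)
```

Keying the scores by (repeat_id, fold_id) keeps every fold result addressable, so different models can later be compared on exactly the same splits.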

Split regimes. BeyondArena ships datasets from three regimes, and a dataset's regime is recorded directly on task_metadata:

Regime	Set on `PredictiveMLTaskMetadata`	Meaning
IID	neither `time_on` nor `group_on`	rows are independent; random / stratified splits
temporal non-IID	`time_on` set	rows ordered in time; future rows must not leak backwards
grouped non-IID	`group_on` set (+ `group_labels`)	all rows of a group stay together in one fold
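
The table can be read as a small decision rule. The sketch below is illustrative only: `infer_split_regime` is a hypothetical helper, not a Data Foundry function, the column names are made up, and rejecting the case where both fields are set is an assumption, since the table does not cover it. The returned labels are the split_regime values shown in the Quickstart.

```python
from typing import Optional


def infer_split_regime(time_on: Optional[str], group_on: Optional[str]) -> str:
    """Map the two fields from the table to a regime label (illustrative only)."""
    if time_on is not None and group_on is not None:
        # The table does not cover this case; rejecting it is an assumption.
        raise ValueError("both time_on and group_on are set")
    if time_on is not None:
        return "temporal_non_iid"
    if group_on is not None:
        return "grouped_non_iid"
    return "iid"


print(infer_split_regime(None, None))
print(infer_split_regime("timestamp", None))   # column name is made up
print(infer_split_regime(None, "patient_id"))  # column name is made up
```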

Side-by-side regime printout (one IID, two grouped variants — per_group vs per_sample — and one temporal): examples/data_foundry_data_regimes.py.
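
The two non-IID constraints from the table can also be written as checks on a single train/test split. This is a sketch under assumptions: the helper names are hypothetical, the data is made up, and the temporal check uses one strict reading of "future rows must not leak backwards" (no training row is later than the earliest test row).

```python
def groups_stay_together(group_labels, train_idx, test_idx) -> bool:
    """True if no group has rows on both sides of the split."""
    train_groups = {group_labels[i] for i in train_idx}
    test_groups = {group_labels[i] for i in test_idx}
    return train_groups.isdisjoint(test_groups)


def no_future_leakage(timestamps, train_idx, test_idx) -> bool:
    """True if every training row is no later than the earliest test row."""
    latest_train = max(timestamps[i] for i in train_idx)
    earliest_test = min(timestamps[i] for i in test_idx)
    return latest_train <= earliest_test


# Made-up example data: three groups of two rows, timestamps in row order.
group_labels = ["a", "a", "b", "b", "c", "c"]
timestamps = [1, 2, 3, 4, 5, 6]

print(groups_stay_together(group_labels, [0, 1, 2, 3], [4, 5]))  # groups kept whole
print(groups_stay_together(group_labels, [0, 2, 4], [1, 3, 5]))  # every group is split
print(no_future_leakage(timestamps, [0, 1, 2, 3], [4, 5]))       # train precedes test
print(no_future_leakage(timestamps, [0, 1, 4, 5], [2, 3]))       # train contains later rows
```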

🗂️ Use a collection of datasets — pre-download all of BeyondArena

BEYOND_ARENA.prefetch(...) batches every container into a single Hugging Face snapshot_download call (one network round-trip for the whole collection). On a warm cache it skips importing huggingface_hub entirely.

from data_foundry.collections import BEYOND_ARENA

paths = BEYOND_ARENA.prefetch()          # warms the cache once
for container in BEYOND_ARENA.iter_containers():  # now hits disk only
    print(container.dataset_metadata.unique_name, container.dataset.shape)
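
The warm-cache behaviour described above (skipping the huggingface_hub import entirely) follows a common lazy-import pattern. The sketch below shows that pattern in isolation. `fetch`, `fake_loader`, and the file layout are hypothetical and are not how Data Foundry is implemented; a fake downloader stands in for the real one so the example runs offline.

```python
import tempfile
from pathlib import Path


def fetch(name, cache_dir, load_downloader):
    """Return the cached path for `name`, downloading only on a cache miss."""
    target = Path(cache_dir) / name
    if target.exists():
        # Warm cache: return early, the heavy dependency is never loaded.
        return target
    # Cold cache: only now pay the import cost. In real code this callable
    # could wrap `from huggingface_hub import snapshot_download`.
    download = load_downloader()
    download(name, target)
    return target


loader_calls = []


def fake_loader():
    """Stand-in for the lazy import; records each time it is triggered."""
    loader_calls.append("imported")

    def fake_download(name, target):
        target.write_text(f"contents of {name}")

    return fake_download


with tempfile.TemporaryDirectory() as tmp:
    first = fetch("toy_dataset", tmp, fake_loader)   # cold: triggers the loader
    second = fetch("toy_dataset", tmp, fake_loader)  # warm: loader not touched
    same_path = first == second

print(len(loader_calls), same_path)
```

Deferring the import until a cache miss keeps start-up fast for the common case where everything is already on disk.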

Cache management:

BEYOND_ARENA.clear_cache()                 # nuke this collection's subdir
BEYOND_ARENA.get_dataset(name, force_download=True)  # re-fetch a single container

Full worked example with tqdm progress + checksum verification: examples/download_all_beyond_arena_datasets.py. For a single dataset round-trip with checksum verification, see examples/download_beyond_arena_dataset.py.
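
For readers unfamiliar with checksum verification, a minimal sketch of the general technique follows. Which digest Data Foundry uses for container.checksum is not stated here, so SHA-256 is an assumption, and the file name is made up.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "container.bin"  # hypothetical file name
    artifact.write_bytes(b"example payload")
    # In practice the expected digest would come from trusted metadata.
    expected = hashlib.sha256(b"example payload").hexdigest()
    checksum_ok = sha256_of_file(artifact) == expected

print(checksum_ok)
```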

🧑‍🔬 Curate a dataset — turn a raw download into a CuratedContainer

End-to-end pipeline, condensed (the full runnable version is examples/curate_a_dataset.py):

from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata

# --- Basic metadata
dataset_mold = DatasetMetadata(
    unique_name="blood_transfusion",
    dataset_year="2008",
    domain_str="medical & healthcare",
    dataset_source="UCI",
    original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
    download_description="""
We download the data from the UCI repository and unzip it to a predefined folder.

mkdir -p local-data-warehouse/blood_transfusion/ \\
  && wget -P local-data-warehouse/blood_transfusion/ \\
       https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
  && unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
       -d local-data-warehouse/blood_transfusion/
""",
    academic_reference_bibtex="""@article{yeh2009knowledge,
  title={Knowledge discovery on RFM model using Bernoulli sequence},
  author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
  journal={Expert Systems with applications},
  volume={36}, number={3}, pages={5866--5871},
  year={2009}, publisher={Elsevier},
}
""",
    academic_reference_bibtex_key="yeh2009knowledge",
    license="CC BY 4.0",
    data_tags=["IID"],
    curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
)
task_mold = PredictiveMLTaskMetadata(
    target_column_name="DonatedBloodInMarch2007",
    problem_type="binary_classification",
    objective_metric_name="roc_auc",
    stratify_on="DonatedBloodInMarch2007",
)

# --- Preprocessing
import pandas as pd
df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
df.columns = [
    "MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
    "MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
]
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# --- Sanity checks
from data_foundry import dataset_checks
df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
    data=df,
    target_feature=task_mold.target_column_name,
    problem_type=task_mold.problem_type,
)

# --- Outer CV splits
from data_foundry.curation_recommendations import (
    get_recommended_iid_splits,
    get_recommended_splits_dimensions,
)

n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
splits = get_recommended_iid_splits(
    dataset=df,
    n_repeats=n_repeats,
    n_splits=n_splits,
    test_size=test_size,
    stratify_on=task_mold.stratify_on,
)

# --- Split metadata + container
from data_foundry.schema import PredictiveMLSplitsMetadata
from data_foundry.curation_container import CuratedContainer

splits_mold = PredictiveMLSplitsMetadata(
    splits_comment="Default splits for IID data.",
    splits=splits,
)
curated_data = CuratedContainer(
    dataset=df,
    dataset_metadata=dataset_mold,
    task_metadata=task_mold,
    experiment_metadata=splits_mold,
)
curated_data.save()
print(curated_data.uuid, curated_data.checksum)
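
The curation comment in the pipeline mentions a duplicate-row rate, and the sanity-check step reports target statistics. How run_all_checks computes its tables is not shown here, so the sketch below is only a hand-rolled approximation of two such statistics on made-up rows; the helper names are hypothetical.

```python
from collections import Counter

# Made-up rows: (MonthsSinceLastDonation, NumberOfDonations, target).
rows = [
    (2, 50, "Yes"),
    (2, 50, "Yes"),  # exact copy of the previous row
    (4, 16, "No"),
    (1, 24, "No"),
    (4, 16, "No"),   # exact copy of the third row
]


def duplicate_fraction(rows):
    """Share of rows that repeat an earlier row exactly."""
    counts = Counter(rows)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(rows)


def class_balance(rows, target_position=-1):
    """Relative frequency of each target value."""
    counts = Counter(row[target_position] for row in rows)
    return {label: count / len(rows) for label, count in counts.items()}


print(duplicate_fraction(rows))
print(class_balance(rows))
```

Looking at both numbers before building splits helps decide whether duplicates should be kept (as in the example above) and whether stratifying on the target is worthwhile.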

For the contributor flow (where to put the notebook, how to open the PR, the /new-dataset Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see CONTRIBUTING_DATASETS.md.

🗂️ Triage the dataset backlog — the curation log + local dashboard

Before a dataset becomes a notebook, it lives in the curation log: one markdown record per candidate dataset under curation/records/. Each <unique_name>.md has YAML front-matter (the structured / dropdown fields) plus a free-text body (## Comments, ## Reference). Add or triage a dataset by creating/editing its file — by hand, with an agent, or through the dashboard.
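
To make the record layout concrete, here is a minimal sketch that splits such a file into front-matter fields and body. The record content is made up, and the `status` field and its value are assumptions. The parser handles only flat `key: value` lines; a real YAML parser would be needed for anything richer, and this is not the package's own loader.

```python
def split_front_matter(text):
    """Split a record into (front-matter fields, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("record does not start with a front-matter block")
    closing = [i for i, line in enumerate(lines) if i > 0 and line.strip() == "---"]
    if not closing:
        raise ValueError("front-matter block is never closed")
    end = closing[0]
    fields = {}
    for line in lines[1:end]:
        if line.strip():
            # Flat "key: value" pairs only; nested YAML is out of scope here.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


# Hypothetical record; field names other than unique_name are assumptions.
record_text = """---
unique_name: blood_transfusion
status: candidate
---

## Comments

Small binary classification task.

## Reference

(to be filled in)
"""

fields, body = split_front_matter(record_text)
print(fields)
print(body.splitlines()[0])
```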

pip install "data-foundry[curation]"   # or the editable dev install
data-foundry-curation serve            # → http://127.0.0.1:8765

The local, Sheets-like dashboard edits those records in place (filter by status, pin a working row, add dropdown options, …) and ships a built-in Guidelines tab describing the selection criteria and processing conventions. Other CLI subcommands:

data-foundry-curation validate                 # check records against the dropdown vocab
data-foundry-curation export --format xlsx out.xlsx   # flat snapshot (csv|parquet|xlsx|gsheet)
data-foundry-curation build-site site/          # read-only static copy (e.g. GitHub Pages)
data-foundry-curation import-sheet sheet.csv    # one-time migration from the old Google Sheet
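
Conceptually, checking records against a dropdown vocabulary amounts to a membership test per field. The sketch below illustrates that idea only. The vocabulary, the field name, and the choice to report missing fields are all assumptions; it does not reproduce what `data-foundry-curation validate` actually does.

```python
def validate_record(fields, vocabularies):
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    for field, allowed in vocabularies.items():
        value = fields.get(field)
        if value is None:
            # Treating a missing field as a problem is an assumption.
            problems.append(f"{field}: missing")
        elif value not in allowed:
            problems.append(f"{field}: {value!r} not in {sorted(allowed)}")
    return problems


# Hypothetical vocabulary; the real one lives in the project's own files.
vocabularies = {"status": {"candidate", "accepted", "rejected"}}

print(validate_record({"status": "candidate"}, vocabularies))
print(validate_record({"status": "maybe"}, vocabularies))
print(validate_record({}, vocabularies))
```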

Browse it online. A read-only copy of the backlog is published to GitHub Pages at tabarena.github.io/data-foundry and regenerated automatically from curation/records/ on every push to main. It needs no install and makes no network round-trip to Hugging Face; search, sort, filter, and pin all run in the browser. Note that this makes every record's comments, reviewer names, and decision notes public.

Working with Claude Code? The /curate slash command starts the dashboard and loads the curation guidelines into context so the agent can help you decide and process datasets.

🪄 Installation

[!IMPORTANT] Requires Python 3.10+.

📦 From PyPI — use Data Foundry as a library

pip install data-foundry

🌱 From source — clone and install editable

git clone https://github.com/TabArena/data-foundry.git
cd data-foundry
uv pip install -e .

🛠️ Developer setup — extras for curation, tests, and tooling

git clone https://github.com/TabArena/data-foundry.git
cd data-foundry
uv pip install -e ".[dev,tests]"
pytest                                 # run the test suite
ruff check . && ruff format --check .  # lint + format

The dev extra adds curation-time deps (openml, kaggle, seaborn, polars, etc.); tests adds pytest and scikit-learn (needed for the recommended-split helpers and examples).

🗂️ Repository Structure

data-foundry/
├── src/data_foundry/         # the package — schema, container, collections, checks, splits
│   ├── schema.py             # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
│   ├── curation_container.py # CuratedContainer (save/load + describe + checksum)
│   ├── collections/          # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
│   ├── curation_recommendations.py  # recommended split helpers (IID, grouped, temporal)
│   ├── dataset_checks.py     # run_all_checks(...) — sanity stats for the curation notebook
│   ├── curation/             # curation log toolkit — CurationRecord, store, dashboard (serve), import/export, build-site
│   └── examples/toy_container/  # tiny ready-to-load CuratedContainer shipped in-package
├── curation/                 # the curation log (git-tracked data) — records/*.md + vocabularies.yaml
├── datasets/                 # curation notebooks
│   ├── _template/            # canonical notebook skeleton
│   ├── _dev/                 # contributions land here first
│   ├── _maintenance/         # re-runs / fixes for already-released datasets
│   └── beyond_iid/           # promoted datasets — pinned by `final_uuid_list.py`
├── examples/                 # runnable demos (covers the use-cases above)
├── scripts/                  # one-off tooling (toy container builder)
│   └── beyond_arena/         # BeyondArena-specific scripts and outputs (warehouse stats, plots)
├── tests/                    # pytest test suite
└── local-data-warehouse/     # gitignored — curators write raw + saved containers here

🧑‍🔬 Contributing a Dataset

The short version:

1. Copy datasets/_template/_template.ipynb to datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb.
2. Run the notebook end-to-end so the saved cells contain populated check tables and the final uuid / checksum.
3. Open a PR. Reviewers will move the notebook into the right beyond_iid/ subfolder and append the UUID to datasets/beyond_iid/final_uuid_list.py.
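
The destination path in the first step follows a fixed pattern, which a small helper can spell out. `dev_notebook_path` is a hypothetical convenience function and the topic name is made up; only the path layout comes from the steps above.

```python
from pathlib import PurePosixPath


def dev_notebook_path(topic: str, unique_name: str) -> PurePosixPath:
    """Build datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb."""
    return PurePosixPath("datasets/_dev") / topic / unique_name / f"{unique_name}.ipynb"


# "medical" is a made-up topic used only for illustration.
print(dev_notebook_path("medical", "blood_transfusion"))
```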

The long version (field-by-field walkthrough, split-helper choice, dtype gotchas, the /new-dataset Claude Code scaffolding skill): see CONTRIBUTING_DATASETS.md.

📄 Citation

PLACEHOLDER


Release history

Version	Released	Notes
0.0.6.dev20260630071820	Jun 30, 2026	pre-release
0.0.6.dev20260629152724	Jun 29, 2026	pre-release
0.0.6.dev20260628084548	Jun 28, 2026	pre-release
0.0.6.dev20260628084422	Jun 28, 2026	pre-release (this version)
0.0.6.dev20260628083810	Jun 28, 2026	pre-release
0.0.6.dev20260626104319	Jun 26, 2026	pre-release, yanked (reason: wrong version)
0.0.5	Jun 26, 2026
0.0.5.dev20260626103926	Jun 26, 2026	pre-release
0.0.5.dev20260619140225	Jun 19, 2026	pre-release
0.0.4	Jun 19, 2026
0.0.4.dev20260528164513	May 28, 2026	pre-release
0.0.3	May 28, 2026
0.0.2	May 27, 2026
0.0.1	Apr 29, 2026

