A schema and toolkit for curating tabular datasets and benchmarking tasks (the data layer behind TabArena).
Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets
| Examples | Contribute a Dataset | Paper (placeholder, coming soon) |
|---|---|---|
Data Foundry is the data layer behind the next generation of TabArena datasets. It provides:
- A small, opinionated schema for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits, aligned with OpenML where possible, extended where it had to be.
- A curation toolkit (sanity checks, recommended-split helpers, dtype-preserving save/load) so a curator turns a raw download into a reproducible artifact in one notebook.
- A collections API that pins datasets (defined by (unique_name, uuid)) to immutable curated containers and resolves them against a local warehouse or directly against the BeyondArena Datasets.
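To make the pinning idea concrete, here is a minimal standard-library sketch of resolving a `(unique_name, uuid)` pair against a local directory. `DatasetPin`, `resolve_pin`, the placeholder UUID, and the `<warehouse>/<unique_name>/<uuid>/` layout are all hypothetical names chosen for illustration; they are not Data Foundry's API or on-disk format.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class DatasetPin:
    """Hypothetical pin: a dataset identified by (unique_name, uuid)."""

    unique_name: str
    uuid: str


def resolve_pin(pin: DatasetPin, warehouse: Path) -> Optional[Path]:
    """Return the local container directory for a pin, or None if it is absent.

    Assumed layout (illustrative only): <warehouse>/<unique_name>/<uuid>/
    """
    candidate = warehouse / pin.unique_name / pin.uuid
    return candidate if candidate.is_dir() else None


with tempfile.TemporaryDirectory() as tmp:
    warehouse = Path(tmp)
    pin = DatasetPin("airfoil_self_noise", "0000-demo-uuid")  # placeholder uuid
    missing = resolve_pin(pin, warehouse)  # nothing stored yet
    (warehouse / pin.unique_name / pin.uuid).mkdir(parents=True)
    found = resolve_pin(pin, warehouse)
    found_name = found.name if found is not None else None
```

Because the pin is a frozen dataclass, it is hashable and can serve as a dictionary key, which is convenient when a collection is just a fixed list of pins.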
Quickstart
[!TIP] Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.
pip install data-foundry
python examples/load_curated_container.py
from data_foundry.collections import BEYOND_ARENA
container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
print(container.describe()) # full identity + dtypes + task + splits
print(container.dataset.shape) # the actual DataFrame
print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
That's the core API surface in a handful of lines. See examples/benchmark_on_beyond_arena.py for a Random Forest benchmark on this data.
Use Cases
Inspect a curated container offline: no Hugging Face download required
The package ships a toy CuratedContainer so you can poke at the full API (schema, dtypes, splits, describe()) without touching the network. Identical interface to a downloaded BeyondArena container.
from data_foundry.curation_container import CuratedContainer
from data_foundry.examples import get_toy_container_path
container = CuratedContainer.load(get_toy_container_path())
print(container.describe()) # full identity + dtypes + task + splits
print(container.dataset.shape) # the actual DataFrame
print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
Full inspection script (every metadata field printed): examples/load_curated_container.py.
Use one dataset: IID and non-IID variants
Download a single BeyondArena container by name (or UUID) and iterate its outer CV splits. The collection resolves the container against your local cache; subsequent runs hit disk, not the network.
from data_foundry.collections import BEYOND_ARENA
container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
df = container.dataset
target = container.task_metadata.target_column_name
for repeat_id, folds in container.experiment_metadata.splits.items():
    for fold_id, (train_idx, test_idx) in folds.items():
        X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
        X_test, y_test = df.iloc[test_idx].drop(columns=target), df.iloc[test_idx][target]
        # ... fit, evaluate ...
Full worked example (Random Forest, RMSE per fold, full metadata via container.describe()): examples/benchmark_on_beyond_arena.py.
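The loop above implies a nested mapping of the form `{repeat_id: {fold_id: (train_idx, test_idx)}}`. As a rough sketch under that assumption, the following standard-library helper checks that no fold shares rows between train and test and that every index is in range. `check_splits` is a hypothetical name for illustration, not part of the package.

```python
from typing import Hashable, Mapping, Sequence, Tuple

Fold = Tuple[Sequence[int], Sequence[int]]


def check_splits(splits: Mapping[Hashable, Mapping[Hashable, Fold]], n_rows: int) -> int:
    """Validate every (train_idx, test_idx) pair; return the number of folds checked."""
    n_checked = 0
    for repeat_id, folds in splits.items():
        for fold_id, (train_idx, test_idx) in folds.items():
            train, test = set(train_idx), set(test_idx)
            if train & test:
                raise ValueError(f"repeat {repeat_id}, fold {fold_id}: train/test overlap")
            if any(not 0 <= i < n_rows for i in train | test):
                raise ValueError(f"repeat {repeat_id}, fold {fold_id}: index out of range")
            n_checked += 1
    return n_checked


# Toy splits over a 5-row dataset: one repeat, two folds.
toy_splits = {0: {0: ([0, 1, 2], [3, 4]), 1: ([2, 3, 4], [0, 1])}}
n_folds_ok = check_splits(toy_splits, n_rows=5)

# Row 2 appears on both sides, so the check should reject this split.
leaky_splits = {0: {0: ([0, 1, 2], [2, 3])}}
try:
    check_splits(leaky_splits, n_rows=5)
    leak_detected = False
except ValueError:
    leak_detected = True
```

A check like this is cheap enough to run before every benchmark loop, for example as `check_splits(container.experiment_metadata.splits, len(df))`, assuming the splits hold positional integer indices as the `df.iloc[...]` calls suggest.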
Split regimes. BeyondArena ships datasets from three regimes; the regime a dataset belongs to shows up directly on task_metadata:
| Regime | Set on PredictiveMLTaskMetadata | Meaning |
|---|---|---|
| IID | neither time_on nor group_on | rows are independent; random / stratified splits |
| temporal non-IID | time_on set | rows ordered in time; future rows must not leak backwards |
| grouped non-IID | group_on set (+ group_labels) | all rows of a group stay together in one fold |
Side-by-side regime printout (one IID, two grouped variants (per_group vs per_sample), and one temporal): examples/data_foundry_data_regimes.py.
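The regime table can be read as a small decision rule plus one invariant per non-IID regime. The sketch below spells that out in plain Python. The function names are hypothetical, the table does not say what happens when both `time_on` and `group_on` are set (this sketch simply rejects that case), and the temporal check is only one reasonable reading of "must not leak backwards".

```python
from typing import Hashable, Optional, Sequence


def infer_split_regime(time_on: Optional[str], group_on: Optional[str]) -> str:
    """Map the two task-metadata fields to a regime label, following the table."""
    if time_on is not None and group_on is not None:
        # Assumption: the table does not cover this combination.
        raise ValueError("both time_on and group_on are set")
    if time_on is not None:
        return "temporal_non_iid"
    if group_on is not None:
        return "grouped_non_iid"
    return "iid"


def groups_stay_together(
    group_labels: Sequence[Hashable], train_idx: Sequence[int], test_idx: Sequence[int]
) -> bool:
    """Grouped regime: no group may contribute rows to both train and test."""
    train_groups = {group_labels[i] for i in train_idx}
    test_groups = {group_labels[i] for i in test_idx}
    return train_groups.isdisjoint(test_groups)


def no_backward_leak(
    timestamps: Sequence[float], train_idx: Sequence[int], test_idx: Sequence[int]
) -> bool:
    """Temporal regime: every train row is no later than every test row (ties allowed)."""
    latest_train = max(timestamps[i] for i in train_idx)
    earliest_test = min(timestamps[i] for i in test_idx)
    return latest_train <= earliest_test
```

For example, with group labels `["a", "a", "b", "b"]`, the split `([0, 1], [2, 3])` keeps groups together, while `([0, 2], [1, 3])` puts both groups on both sides and fails the check.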
Use a collection of datasets: pre-download all of BeyondArena
BEYOND_ARENA.prefetch(...) batches every container into a single Hugging Face snapshot_download call (one network round-trip for the whole collection). On a warm cache it skips importing huggingface_hub entirely.
from data_foundry.collections import BEYOND_ARENA
paths = BEYOND_ARENA.prefetch() # warms the cache once
for container in BEYOND_ARENA.iter_containers(): # now hits disk only
    print(container.dataset_metadata.unique_name, container.dataset.shape)
Cache management:
BEYOND_ARENA.clear_cache() # nuke this collection's subdir
BEYOND_ARENA.get_dataset(name, force_download=True) # re-fetch a single container
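The cache behaviour described in this section (first call fetches, later calls hit disk, `force_download=True` re-fetches) follows a common cache-first pattern. Here is a minimal standard-library sketch of that pattern. `get_or_fetch` and `fake_fetch` are hypothetical stand-ins, and nothing here reflects how Data Foundry lays out its cache or talks to Hugging Face.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def get_or_fetch(
    name: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes],
    force_download: bool = False,
) -> Path:
    """Return the cached file for `name`, calling `fetch` only on a miss or when forced."""
    target = cache_dir / name
    if target.exists() and not force_download:
        return target  # warm cache: no fetch
    cache_dir.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(name))
    return target


call_log: List[str] = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a network download; records each call so we can count them."""
    call_log.append(name)
    return b"payload for " + name.encode()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo_collection"
    get_or_fetch("toy", cache, fake_fetch)  # cold cache: fetches
    get_or_fetch("toy", cache, fake_fetch)  # warm cache: reads from disk
    calls_after_warm_read = len(call_log)
    get_or_fetch("toy", cache, fake_fetch, force_download=True)  # forced re-fetch
    calls_after_forced = len(call_log)
```

Clearing a collection's cache in this model amounts to deleting `cache_dir`, after which the next call takes the cold-cache path again.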
Full worked example with tqdm progress + checksum verification: examples/download_all_beyond_arena_datasets.py. For a single dataset round-trip with checksum verification, see examples/download_beyond_arena_dataset.py.
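For readers unfamiliar with checksum verification, the idea is to hash the downloaded bytes and compare the digest with a published value. The sketch below uses SHA-256 from the standard library. That choice is an assumption, since this README does not state which algorithm the container checksum uses, and `sha256_of_file` and `verify_checksum` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.lower()


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "container.bin"
    payload = b"toy container bytes"
    artifact.write_bytes(payload)
    expected = hashlib.sha256(payload).hexdigest()

    intact_ok = verify_checksum(artifact, expected)

    # Simulate a corrupted download by appending one byte.
    artifact.write_bytes(payload + b"!")
    tampered_ok = verify_checksum(artifact, expected)
```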
Curate a dataset: turn a raw download into a CuratedContainer
End-to-end pipeline, condensed (the full runnable version is examples/curate_a_dataset.py):
from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata
# --- Basic metadata
dataset_mold = DatasetMetadata(
    unique_name="blood_transfusion",
    dataset_year="2008",
    domain_str="medical & healthcare",
    dataset_source="UCI",
    original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
    download_description="""
We download the data from the UCI repository and unzip it to a predefined folder.
mkdir -p local-data-warehouse/blood_transfusion/ \\
&& wget -P local-data-warehouse/blood_transfusion/ \\
https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
&& unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
-d local-data-warehouse/blood_transfusion/
""",
    academic_reference_bibtex="""@article{yeh2009knowledge,
title={Knowledge discovery on RFM model using Bernoulli sequence},
author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
journal={Expert Systems with applications},
volume={36}, number={3}, pages={5866--5871},
year={2009}, publisher={Elsevier},
}
""",
    academic_reference_bibtex_key="yeh2009knowledge",
    license="CC BY 4.0",
    data_tags=["IID"],
    curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
)
task_mold = PredictiveMLTaskMetadata(
    target_column_name="DonatedBloodInMarch2007",
    problem_type="binary_classification",
    objective_metric_name="roc_auc",
    stratify_on="DonatedBloodInMarch2007",
)
# --- Preprocessing
import pandas as pd
df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
df.columns = [
    "MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
    "MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
]
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
# --- Sanity checks
from data_foundry import dataset_checks
df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
    data=df,
    target_feature=task_mold.target_column_name,
    problem_type=task_mold.problem_type,
)
# --- Outer CV splits
from data_foundry.curation_recommendations import (
    get_recommended_iid_splits,
    get_recommended_splits_dimensions,
)
n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
splits = get_recommended_iid_splits(
    dataset=df,
    n_repeats=n_repeats,
    n_splits=n_splits,
    test_size=test_size,
    stratify_on=task_mold.stratify_on,
)
# --- Split metadata + container
from data_foundry.schema import PredictiveMLSplitsMetadata
from data_foundry.curation_container import CuratedContainer
splits_mold = PredictiveMLSplitsMetadata(
    splits_comment="Default splits for IID data.",
    splits=splits,
)
curated_data = CuratedContainer(
    dataset=df,
    dataset_metadata=dataset_mold,
    task_metadata=task_mold,
    experiment_metadata=splits_mold,
)
curated_data.save()
print(curated_data.uuid, curated_data.checksum)
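To give a feel for what repeated, stratified outer CV splits look like, here is a standard-library sketch that builds the nested `{repeat_id: {fold_id: (train_idx, test_idx)}}` shape from a list of labels. It is not the implementation of `get_recommended_iid_splits`: `repeated_stratified_kfold` is a hypothetical function, it ignores `test_size`, and it deals rows round-robin per class, which is only one simple way to stratify.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Splits = Dict[int, Dict[int, Tuple[List[int], List[int]]]]


def repeated_stratified_kfold(
    labels: Sequence[Hashable], n_repeats: int, n_splits: int, seed: int = 0
) -> Splits:
    """Build n_repeats x n_splits folds whose test sets roughly preserve class ratios."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    all_idx = set(range(len(labels)))
    splits: Splits = {}
    for repeat_id in range(n_repeats):
        test_folds: List[List[int]] = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)  # a fresh shuffle per repeat and class
            for position, idx in enumerate(shuffled):
                test_folds[position % n_splits].append(idx)  # deal round-robin
        splits[repeat_id] = {
            fold_id: (sorted(all_idx - set(test)), sorted(test))
            for fold_id, test in enumerate(test_folds)
        }
    return splits


# Toy target with a 2:1 class ratio, mirroring a binary No/Yes column.
labels = ["No"] * 6 + ["Yes"] * 3
splits = repeated_stratified_kfold(labels, n_repeats=2, n_splits=3, seed=42)
yes_per_test_fold = [
    sum(labels[i] == "Yes" for i in test_idx) for _, test_idx in splits[0].values()
]
```

With six "No" rows and three "Yes" rows dealt into three folds, each test fold receives two "No" rows and one "Yes" row, so the 2:1 ratio is preserved in every fold.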
For the contributor flow (where to put the notebook, how to open the PR, the /new-dataset Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see CONTRIBUTING_DATASETS.md.
Installation
[!IMPORTANT] Requires Python 3.10+.
From PyPI: use Data Foundry as a library
pip install data-foundry
From source: clone and install editable
git clone https://github.com/TabArena/data-foundry.git
cd data-foundry
uv pip install -e .
Developer setup: extras for curation, tests, and tooling
git clone https://github.com/TabArena/data-foundry.git
cd data-foundry
uv pip install -e ".[dev,tests]"
pytest # run the test suite
ruff check . && ruff format --check . # lint + format
The dev extra adds curation-time deps (openml, kaggle, seaborn, polars, etc.); tests adds pytest and scikit-learn (needed for the recommended-split helpers and examples).
Repository Structure
data-foundry/
├── src/data_foundry/                  # the package: schema, container, collections, checks, splits
│   ├── schema.py                      # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
│   ├── curation_container.py          # CuratedContainer (save/load + describe + checksum)
│   ├── collections/                   # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
│   ├── curation_recommendations.py    # recommended split helpers (IID, grouped, temporal)
│   ├── dataset_checks.py              # run_all_checks(...): sanity stats for the curation notebook
│   └── examples/toy_container/        # tiny ready-to-load CuratedContainer shipped in-package
├── datasets/                          # curation notebooks
│   ├── _template/                     # canonical notebook skeleton
│   ├── _dev/                          # contributions land here first
│   ├── _maintenance/                  # re-runs / fixes for already-released datasets
│   └── beyond_iid/                    # promoted datasets, pinned by `final_uuid_list.py`
├── examples/                          # runnable demos (covers the use-cases above)
├── scripts/                           # one-off tooling (toy container builder)
│   └── beyond_arena/                  # BeyondArena-specific scripts and outputs (warehouse stats, plots)
├── tests/                             # pytest test suite
└── local-data-warehouse/              # gitignored; curators write raw + saved containers here
Contributing a Dataset
The short version:
- Copy datasets/_template/_template.ipynb to datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb.
- Run the notebook end-to-end so the saved cells contain populated check tables and the final uuid/checksum.
- Open a PR. Reviewers will move the notebook into the right beyond_iid/ subfolder and append the UUID to datasets/beyond_iid/final_uuid_list.py.
The long version (field-by-field walkthrough, split-helper choice, dtype
gotchas, the /new-dataset Claude Code scaffolding skill): see
CONTRIBUTING_DATASETS.md.
Citation
PLACEHOLDER