tabular-bank
A contamination-proof tabular ML benchmark — drop-in replacement for TabArena with procedurally generated synthetic datasets.
Why tabular-bank?
TabArena is the leading benchmark for tabular ML models, but it uses real-world datasets that may have leaked into the training data of LLMs and foundation models. tabular-bank addresses this by generating datasets procedurally from a secret seed: the repository contains only the generation engine, and no dataset-specific information is ever committed.
Anti-Contamination Architecture
- Procedural structure: Feature specs, DAG topology, mechanism families, coefficients, and noise models are generated from the seed
- Cryptographic seed derivation: HMAC-SHA256 ensures datasets are unpredictable without the master secret
- Rotating benchmark rounds: Each round uses a fresh seed; past rounds' seeds are published after expiry
- Auditable fairness: All generation code is public — anyone can verify the engine is unbiased
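The seed derivation idea can be illustrated with a minimal sketch using only the Python standard library. The helper name `derive_round_seed`, the choice of the round ID as the HMAC message, and the truncation to 8 bytes are assumptions made for illustration; the package's actual derivation may differ.

```python
import hashlib
import hmac


def derive_round_seed(master_secret: str, round_id: str) -> int:
    """Derive a 64-bit integer seed for a round (hypothetical helper, not the package API)."""
    digest = hmac.new(
        key=master_secret.encode("utf-8"),
        msg=round_id.encode("utf-8"),
        digestmod=hashlib.sha256,
    ).digest()
    # Assumption: keep the first 8 bytes of the digest as an unsigned integer seed.
    return int.from_bytes(digest[:8], "big")


seed_a = derive_round_seed("your-secret", "round-001")
seed_b = derive_round_seed("your-secret", "round-002")      # same secret, new round
seed_c = derive_round_seed("another-secret", "round-001")   # same round, other secret
```

The same secret and round ID always yield the same seed, while changing either input yields an unrelated one. Without the secret, the seed for a given round ID cannot be computed from public information.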
Installation
pip install tabular-bank
# With TabArena integration for official benchmarking
pip install "tabular-bank[benchmark]"
Quick Start
Generate Datasets
# Via CLI
tabular-bank generate --round round-001 --secret "your-secret" --n-scenarios 10
# Via Python
from tabular_bank.generation.generate import generate_all
generate_all(master_secret="your-secret", round_id="round-001", n_scenarios=10)
Run a Benchmark
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from tabular_bank.context import TabularBankContext
from tabular_bank.runner import run_benchmark
from tabular_bank.leaderboard import generate_leaderboard, format_leaderboard
# Models to benchmark
models = {
    "GBM": GradientBoostingClassifier(n_estimators=100),
    "RF": RandomForestClassifier(n_estimators=100),
}
# Run benchmark
result = run_benchmark(
    models=models,
    round_id="round-001",
    master_secret="your-secret",
)
# Generate leaderboard
leaderboard = generate_leaderboard(result)
print(format_leaderboard(leaderboard))
Inspect Datasets
tabular-bank info --round round-001
You can also set TABULAR_BANK_SECRET and TABULAR_BANK_CACHE in the environment.
Legacy SYNTHETIC_TAB_SECRET / SYNTHETIC_TAB_CACHE names are still accepted.
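As a rough illustration of how the current and legacy variable names could be resolved, here is a small sketch. The helper `resolve_setting` is hypothetical, and the precedence rule (the current name wins when both are set) is an assumption rather than documented behaviour.

```python
import os
from typing import Mapping, Optional


def resolve_setting(new_name: str, legacy_name: str,
                    env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the value under the current name, else the value under the legacy name."""
    # Assumption: the current name takes precedence; empty strings count as unset.
    return env.get(new_name) or env.get(legacy_name)


secret = resolve_setting("TABULAR_BANK_SECRET", "SYNTHETIC_TAB_SECRET")
cache_dir = resolve_setting("TABULAR_BANK_CACHE", "SYNTHETIC_TAB_CACHE")
```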
Architecture
flowchart TD
A["Secret + Round ID"] --> B["HMAC-SHA256"]
B --> C["Round Seed"]
C --> D["Scenario Sampler"]
D --> E["Scenario Config\n(problem type, features, difficulty,\nmissing values, imbalance, etc.)"]
E --> FS["Feature Seed"]
E --> DS["DAG Seed"]
E --> AS["Data Seed"]
E --> SS["Split Seed"]
FS --> FG["Feature Generator"]
FG --> FO["Names · Types · Distributions"]
DS --> DB["DAG Builder"]
DB --> DO["Causal Graph · Sampled Mechanisms\n(spline, tanh, interaction, etc.)"]
AS --> SM["Sampler"]
SM --> SO["Tabular DataFrame\n+ Heteroscedastic Residuals"]
SS --> SG["Split Generator"]
SG --> SGO["Cross-Validation Folds\n(10 repeats × 3 folds)"]
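The fan-out into four seeds and the final split stage in the flowchart above can be sketched as follows. This is an illustration under stated assumptions: `derive_sub_seed` and `repeated_kfold` are hypothetical helpers, the purpose labels are invented for the example, and plain (unstratified) k-fold is used because the source does not say whether folds are stratified.

```python
import hashlib
import hmac
import random


def derive_sub_seed(scenario_seed: int, purpose: str) -> int:
    """Derive an independent seed per pipeline stage (hypothetical scheme)."""
    digest = hmac.new(
        key=scenario_seed.to_bytes(8, "big"),
        msg=purpose.encode("utf-8"),
        digestmod=hashlib.sha256,
    ).digest()
    return int.from_bytes(digest[:8], "big")


def repeated_kfold(n_rows: int, split_seed: int, n_repeats: int = 10, n_folds: int = 3):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(split_seed)
    for repeat in range(n_repeats):
        order = list(range(n_rows))
        rng.shuffle(order)  # a fresh permutation for every repeat
        for fold in range(n_folds):
            test_idx = sorted(order[fold::n_folds])
            test_set = set(test_idx)
            train_idx = [i for i in range(n_rows) if i not in test_set]
            yield repeat, fold, train_idx, test_idx


scenario_seed = 123456789  # placeholder value for illustration
seeds = {name: derive_sub_seed(scenario_seed, name)
         for name in ("feature", "dag", "data", "split")}
splits = list(repeated_kfold(n_rows=12, split_seed=seeds["split"]))
```

Deriving a separate seed per stage means that, for example, changing how splits are drawn does not perturb the generated features or the causal graph.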
Parametric Scenario Sampling
Rather than fixed hand-crafted templates, tabular-bank samples all scenario parameters from a continuous space (CausalProfiler-inspired coverage guarantee). Any valid configuration has non-zero probability of being generated, producing diverse, non-redundant benchmark tasks.
Sampled axes include:
- Problem type: binary classification, multiclass, regression
- Feature count, sample size, categorical ratio
- Difficulty: noise scale, nonlinearity probability, interaction probability, heteroscedastic noise probability, DAG edge density
- DAG complexity: confounder count and strength, max parent count
- Missing values: rate and mechanism (MCAR / MAR / MNAR)
- Class imbalance ratio (binary tasks)
- Temporal autocorrelation in root features
- Root feature correlations (multivariate Gaussian)
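To make the idea of sampling from a continuous parameter space concrete, the sketch below draws a handful of the axes listed above from a seeded generator. The `ScenarioConfig` fields, the value ranges, and the `sample_scenario` function are all illustrative assumptions and do not reflect the package's real configuration schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioConfig:
    """A small, hypothetical subset of the sampled scenario axes."""
    problem_type: str
    n_features: int
    n_samples: int
    categorical_ratio: float
    noise_scale: float
    missing_rate: float
    missing_mechanism: str


def sample_scenario(seed: int) -> ScenarioConfig:
    """Draw one scenario; every range below is a placeholder, not a package default."""
    rng = random.Random(seed)
    return ScenarioConfig(
        problem_type=rng.choice(["binary", "multiclass", "regression"]),
        n_features=rng.randint(5, 50),
        n_samples=int(round(10 ** rng.uniform(2.5, 4.5))),  # log-uniform sample size
        categorical_ratio=rng.uniform(0.0, 0.6),
        noise_scale=rng.uniform(0.05, 1.0),
        missing_rate=rng.uniform(0.0, 0.3),
        missing_mechanism=rng.choice(["MCAR", "MAR", "MNAR"]),
    )


scenarios = [sample_scenario(seed) for seed in range(3)]
```

Because each continuous axis is drawn from a distribution with full support over its range, every configuration inside those ranges can in principle be generated, which is the property the coverage guarantee refers to.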
Edges are not restricted to a small fixed enum of functional forms. Each edge samples a structured mechanism specification, with families including linear, threshold, sigmoid, tanh, piecewise-linear, sinusoidal, spline, and interaction effects. Non-root nodes can also sample heteroscedastic residual noise models whose variance depends on one of their parents.
from tabular_bank.generation.engine import generate_sampled_datasets
datasets = generate_sampled_datasets(
    master_secret="your-secret",
    round_id="round-001",
    n_scenarios=20,
)
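The mechanism families and heteroscedastic residuals described in this section can be pictured with a toy single-parent example. This is only a sketch: the `MECHANISMS` table, the coefficient values, and the noise model (standard deviation growing linearly with the parent's magnitude) are assumptions chosen for illustration, and only a few of the listed families are shown.

```python
import math
import random

# Illustrative single-parent mechanisms; coefficient values are placeholders.
MECHANISMS = {
    "linear": lambda x: 0.8 * x,
    "threshold": lambda x: 1.0 if x > 0.0 else 0.0,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "sinusoidal": lambda x: math.sin(2.0 * x),
}


def sample_child(parent_value, family, rng, base_sigma=0.1, hetero_slope=0.5):
    """Apply an edge mechanism, then add noise whose spread depends on the parent."""
    signal = MECHANISMS[family](parent_value)
    # Heteroscedastic residual: the standard deviation grows with |parent_value|.
    sigma = base_sigma + hetero_slope * abs(parent_value)
    return signal + rng.gauss(0.0, sigma)


rng = random.Random(0)
parents = [rng.gauss(0.0, 1.0) for _ in range(5)]
children = [sample_child(x, "tanh", rng) for x in parents]
```

Setting `hetero_slope` to zero recovers ordinary homoscedastic noise, which makes it easy to compare the two regimes on the same parent values.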
TabArena Compatibility
tabular-bank is designed as a drop-in replacement for TabArena. Generated datasets can be converted to TabArena's UserTask format for use with TabArena's full evaluation pipeline (8-fold bagging, standardized HPO, Elo leaderboards).
from tabular_bank.context import TabularBankContext
ctx = TabularBankContext(round_id="round-001", master_secret="your-secret")
tabarena_tasks = ctx.get_tabarena_tasks()  # Requires the tabarena package
License
Apache-2.0