Synthetic datasets for ML benchmarking with controllable complexity, configurable corruptions, and full provenance.
Project description
synthbench is a small Python library for generating synthetic benchmark datasets. You control the signal complexity, layer noise or missing data on top, and get back a dataset with full provenance, so you know exactly what was generated and how. Every result is reproducible from a single integer seed.
It covers eight families of data-generating processes (DGPs), five feature corruptors plus a label-noise corruptor, metadata enrichment (Bayes error, effective rank), Parquet/CSV serialization, and sweep helpers for running ablation grids.
Installation
pip install synthbench
For Parquet support:
pip install "synthbench[io]"
For RandomNeuralDGP (needs PyTorch):
pip install "synthbench[neural]"
Basic usage
from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(complexity="medium", task_type="classification"),
    corruptors=[MissingDataCorruptor(proportion=0.1, mechanism="mar")],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)                     # (500, 10)
print(result.metadata["bayes_error"])     # empirical difficulty estimate
print(result.metadata["effective_rank"])  # feature space dimensionality
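To give a feel for what a Bayes error value means, here is a minimal standalone sketch that does not use synthbench at all. It assumes a toy problem (two equally likely 1-D Gaussian classes with a shared standard deviation), for which the Bayes error has a closed form, and checks that value by simulation. How synthbench itself estimates `bayes_error` is not described here; the function names below are hypothetical.

```python
import math
import random


def bayes_error_two_gaussians(delta: float, sigma: float = 1.0) -> float:
    """Exact Bayes error for equiprobable classes N(0, sigma^2) and N(delta, sigma^2).

    Assumes delta > 0. The optimal rule thresholds at delta / 2, so the
    error is Phi(-delta / (2 * sigma)), written here via the complementary
    error function.
    """
    z = delta / (2.0 * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def monte_carlo_error(delta: float, sigma: float = 1.0,
                      n: int = 200_000, seed: int = 42) -> float:
    """Simulate the same problem and measure the error of the optimal rule."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        label = rng.random() < 0.5                 # balanced classes
        x = rng.gauss(delta if label else 0.0, sigma)
        prediction = x > delta / 2.0               # Bayes-optimal threshold
        errors += prediction != label
    return errors / n


exact = bayes_error_two_gaussians(2.0)
approx = monte_carlo_error(2.0)
print(f"exact={exact:.4f}  monte_carlo={approx:.4f}")
```

No classifier can beat the exact value on this toy problem, which is why a Bayes error estimate is a useful floor when comparing models on a generated dataset.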
What it does
Data-generating processes — Linear, Polynomial, Tree, Friedman (variants 1/2/3), Additive, Sparse, Geometric, and RandomNeural. Each takes a complexity parameter and records ground-truth feature importances alongside the data.
Corruptors — MeasurementNoise, Outlier, MissingData, Collinearity, and Categorical corruptors for the feature matrix, plus LabelNoiseCorruptor for flipping labels or injecting regression noise. They chain together in a canonical order and track how much signal they degrade.
Metadata — every result carries bayes_error, effective_rank, corruptor parameters, and version provenance. Enough to reconstruct the generating pipeline from scratch.
Sweeps — severity_sweep and difficulty_sweep for single-axis ablations, and experiment_grid for full factorial runs across sample size, complexity, and severity. Seeds are derived hierarchically so cells are independent but deterministic.
Named suites — BenchSuite("easy-classification").run() returns a labelled dict of results for a curated collection. Good for quick sanity checks or as a shared benchmark baseline.
Serialization — to_parquet / from_parquet and to_csv / from_csv round-trip everything including metadata. BenchPipeline.from_metadata reconstructs and re-runs the pipeline for bit-identical replay.
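As an illustration of the "mar" (missing at random) mechanism mentioned for the missing-data corruptor, the following is a rough standalone sketch, not synthbench's implementation. It assumes one common construction: the chance that a target column is masked depends only on another, fully observed column. The function `apply_mar` and its parameters are hypothetical.

```python
import math
import random


def apply_mar(rows, target_col, driver_col, proportion, seed=0):
    """Mask `target_col` with a probability that depends only on `driver_col`.

    Rows whose driver value is above the median are masked at 1.5x the
    nominal rate, the rest at 0.5x, so the overall rate stays close to
    `proportion` (which must be at most 2/3 for the rates to be valid).
    """
    rng = random.Random(seed)
    driver_values = sorted(row[driver_col] for row in rows)
    median = driver_values[len(driver_values) // 2]
    corrupted = []
    for row in rows:
        rate = proportion * (1.5 if row[driver_col] > median else 0.5)
        new_row = list(row)                # leave the input untouched
        if rng.random() < rate:
            new_row[target_col] = math.nan
        corrupted.append(new_row)
    return corrupted


data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(5000)]
corrupted = apply_mar(rows, target_col=1, driver_col=0, proportion=0.1)

missing = sum(math.isnan(r[1]) for r in corrupted) / len(corrupted)
print(f"overall missing rate: {missing:.3f}")
```

Because the masking probability is a function of an observed column rather than of the masked value itself, the missingness can in principle be modelled from the data that remains, which is what separates MAR from missing-not-at-random setups.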
Ablation example
from synthbench import LinearDGP, OutlierCorruptor, experiment_grid

grid = experiment_grid(
    LinearDGP,
    OutlierCorruptor,
    n_samples_list=[200, 500, 1000],
    complexities=["low", "medium", "high"],
    severities=["low", "medium", "high"],
    n_features=10,
    random_state=0,
    task_type="classification",
)

result = grid[(500, "high", "medium")]
print(result.metadata["bayes_error"])
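One way to picture hierarchical seed derivation for a grid like this is sketched below. It is an assumption for illustration only: the actual scheme synthbench uses is not specified here, and `derive_seed` is a hypothetical helper. The idea is to hash the root seed together with the cell key, so each cell gets its own seed that does not depend on which other cells are run or in what order.

```python
import hashlib
from itertools import product


def derive_seed(root_seed: int, *key) -> int:
    """Derive a deterministic 64-bit child seed from a root seed and a cell key."""
    payload = repr((root_seed, *key)).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big")


n_samples_list = [200, 500, 1000]
levels = ["low", "medium", "high"]

# One seed per (n_samples, complexity, severity) cell, all derived from root seed 0.
seeds = {
    cell: derive_seed(0, *cell)
    for cell in product(n_samples_list, levels, levels)
}

print(len(seeds))  # 27 cells in the full factorial grid
print(seeds[(500, "high", "medium")] == derive_seed(0, 500, "high", "medium"))  # True
```

With this kind of design, re-running a single cell reproduces the same data as running the whole grid, and adding a new sample size or severity level leaves the seeds of existing cells unchanged.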
Docs
Full reference at JanTeichertKluge.github.io/synth-bench.
Download files
File details
Details for the file synthbench-0.1.0.tar.gz.
File metadata
- Download URL: synthbench-0.1.0.tar.gz
- Upload date:
- Size: 549.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 18adf34527464cf3375cb356b86ed6d7f80eef53bce5fe44733a12e8ddb12b88 |
| MD5 | cab29f83425b60e7b9d22b4032e9eb28 |
| BLAKE2b-256 | 4d52bff76ed2061bd760741830c4f09fa6d9cfbe8281bacd62782c72ba678412 |
Provenance
The following attestation bundles were made for synthbench-0.1.0.tar.gz:
Publisher: publish.yml on JanTeichertKluge/synth-bench
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: synthbench-0.1.0.tar.gz
- Subject digest: 18adf34527464cf3375cb356b86ed6d7f80eef53bce5fe44733a12e8ddb12b88
- Sigstore transparency entry: 1448904393
- Sigstore integration time:
- Permalink: JanTeichertKluge/synth-bench@3227d43f534f0371024c9982e948ba6c84fea844
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/JanTeichertKluge
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3227d43f534f0371024c9982e948ba6c84fea844
- Trigger Event: push
File details
Details for the file synthbench-0.1.0-py3-none-any.whl.
File metadata
- Download URL: synthbench-0.1.0-py3-none-any.whl
- Upload date:
- Size: 55.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 801d0aa36ef6290ca4ff7c1298a094b058cff83311cd970cf6e1eed2b4872b3d |
| MD5 | c230c374243e2f4420d4ded321790b2f |
| BLAKE2b-256 | f64fbcfb44253c7951f6a5dd2b671bcce87d4e402397f21403c25733f0e12143 |
Provenance
The following attestation bundles were made for synthbench-0.1.0-py3-none-any.whl:
Publisher: publish.yml on JanTeichertKluge/synth-bench
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: synthbench-0.1.0-py3-none-any.whl
- Subject digest: 801d0aa36ef6290ca4ff7c1298a094b058cff83311cd970cf6e1eed2b4872b3d
- Sigstore transparency entry: 1448904459
- Sigstore integration time:
- Permalink: JanTeichertKluge/synth-bench@3227d43f534f0371024c9982e948ba6c84fea844
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/JanTeichertKluge
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3227d43f534f0371024c9982e948ba6c84fea844
- Trigger Event: push