Synthetic datasets for ML benchmarking with controllable complexity, configurable corruptions, and full provenance.

synthbench

synthbench is a small Python library for generating synthetic datasets for benchmarking machine-learning models. You control the signal complexity, layer noise or missing data on top, and get back a dataset with full provenance, so you know exactly what was generated and how. Every result is reproducible from a single integer seed.

It covers eight data-generating process (DGP) families, five feature corruptors plus a label-noise corruptor, metadata enrichment (Bayes error, effective rank), Parquet/CSV serialization, and sweep helpers for running ablation grids.

Installation

pip install synthbench

For Parquet support:

pip install "synthbench[io]"

For RandomNeuralDGP (needs PyTorch):

pip install "synthbench[neural]"

Basic usage

from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(complexity="medium", task_type="classification"),
    corruptors=[MissingDataCorruptor(proportion=0.1, mechanism="mar")],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)                     # (500, 10)
print(result.metadata["bayes_error"])     # empirical difficulty estimate
print(result.metadata["effective_rank"])  # feature space dimensionality

What it does

Data-generating processes: Linear, Polynomial, Tree, Friedman (variants 1/2/3), Additive, Sparse, Geometric, and RandomNeural. Each takes a complexity parameter and records ground-truth feature importances alongside the data.

Corruptors: MeasurementNoise, Outlier, MissingData, Collinearity, and Categorical corruptors for the feature matrix, plus LabelNoiseCorruptor for flipping labels or injecting regression noise. They chain together in a canonical order and track how much signal they degrade.
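
To make the "mar" mechanism from the basic usage example concrete, here is a minimal standard-library sketch of missing-at-random masking: whether a value in one column is hidden depends only on another, fully observed column. The helper `mask_mar` and its parameters are hypothetical and exist only for illustration; this is not synthbench's implementation, which may work differently.

```python
import math
import random


def mask_mar(rows, target_col, driver_col, proportion, seed):
    """Hide values in target_col with a probability tied to driver_col.

    Hypothetical helper for illustration; not part of synthbench.
    Rows whose driver value is above the median are masked at rate
    min(1, 2 * proportion); the other rows are never masked. For
    proportion <= 0.5 the overall missing rate is roughly `proportion`.
    """
    rng = random.Random(seed)
    drivers = sorted(row[driver_col] for row in rows)
    median = drivers[len(drivers) // 2]
    rate = min(1.0, 2.0 * proportion)
    out = []
    for row in rows:
        new_row = list(row)  # copy so the input rows stay untouched
        if row[driver_col] > median and rng.random() < rate:
            new_row[target_col] = math.nan
        out.append(new_row)
    return out


# Two Gaussian columns; column 0 drives the missingness of column 1.
data_rng = random.Random(0)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
masked = mask_mar(rows, target_col=1, driver_col=0, proportion=0.1, seed=42)
n_missing = sum(math.isnan(r[1]) for r in masked)
print(n_missing / len(masked))  # close to the requested proportion
```

Every missing entry sits in a row whose first column is above the median, which is what separates MAR from masking completely at random.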

Metadata: every result carries bayes_error, effective_rank, corruptor parameters, and version provenance. Enough to reconstruct the generating pipeline from scratch.
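
As a reference point for reading bayes_error, the Bayes error has a closed form in one simple case: two equally likely classes, each Gaussian with unit variance, with means a distance d apart. The optimal rule thresholds at the midpoint and errs with probability Phi(-d/2). The sketch below uses only the standard library and checks the formula by simulation; it illustrates the quantity itself and is not how synthbench computes its empirical estimate.

```python
import math
import random


def gaussian_bayes_error(d):
    """Bayes error for two equal-prior, unit-variance Gaussians, means d apart."""
    # Phi(-d/2), written with the complementary error function.
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))


def simulated_error(d, n, seed):
    """Monte Carlo error rate of the midpoint rule on the same problem."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n):
        label = rng.random() < 0.5             # class 1 with probability 0.5
        x = rng.gauss(d if label else 0.0, 1.0)
        predicted = x > d / 2.0                # optimal threshold at the midpoint
        mistakes += predicted != label
    return mistakes / n


exact = gaussian_bayes_error(2.0)              # Phi(-1), about 0.1587
approx = simulated_error(2.0, 20000, seed=1)
print(exact, approx)
```

No classifier can beat the exact value on this problem, which is why a Bayes error estimate is a useful yardstick for how hard a generated dataset is.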

Sweeps: severity_sweep and difficulty_sweep for single-axis ablations, and experiment_grid for full factorial runs across sample size, complexity, and severity. Seeds are derived hierarchically, so cells are independent but deterministic.
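
One common way to get per-cell seeds that are independent yet fully determined by the root seed is to hash the root seed together with the cell's key. The sketch below does this with the standard library. `derive_seed` is a hypothetical helper; synthbench's actual derivation scheme is not described here and may differ (NumPy's SeedSequence is another common choice).

```python
import hashlib
import itertools


def derive_seed(root_seed, *key):
    """Map (root_seed, key...) to a stable 64-bit child seed.

    Hypothetical helper for illustration; not part of synthbench.
    """
    # repr() gives a stable text form for ints and strings, and SHA-256
    # avoids Python's per-process randomized hash().
    payload = repr((root_seed,) + key).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big")


# One seed per cell of a 3 x 3 x 3 factorial grid.
cells = list(itertools.product(
    [200, 500, 1000],
    ["low", "medium", "high"],
    ["low", "medium", "high"],
))
seeds = {cell: derive_seed(0, *cell) for cell in cells}
print(len(seeds), seeds[(500, "high", "medium")])
```

Because each seed depends only on the root seed and the cell key, adding or removing cells from the grid leaves the remaining cells' data unchanged.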

Named suites: BenchSuite("easy-classification").run() returns a labelled dict of results for a curated collection. Good for quick sanity checks or as a shared benchmark baseline.

Serialization: to_parquet / from_parquet and to_csv / from_csv round-trip everything, including metadata. BenchPipeline.from_metadata reconstructs and re-runs the pipeline for bit-identical replay.
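
To show what a lossless round trip with metadata involves, here is a small standard-library sketch that stores metadata as a JSON comment line ahead of the CSV data and reads both back unchanged. The file layout and the helpers `dump_csv` and `load_csv` are assumptions made for illustration; synthbench's own to_csv / from_csv format may be different.

```python
import csv
import io
import json


def dump_csv(rows, header, metadata):
    """Serialize numeric rows plus one JSON metadata line into a CSV string."""
    buf = io.StringIO()
    # json.dumps never emits raw newlines, so the metadata fits on one line.
    buf.write("# " + json.dumps(metadata, sort_keys=True) + "\n")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def load_csv(text):
    """Inverse of dump_csv: returns (rows, header, metadata)."""
    buf = io.StringIO(text)
    metadata = json.loads(buf.readline()[2:])  # strip the leading "# "
    reader = csv.reader(buf)
    header = next(reader)
    # Python writes floats in shortest round-trip form, so float() is exact.
    rows = [[float(value) for value in row] for row in reader]
    return rows, header, metadata


rows = [[0.1, 2.5], [1e-17, -3.0]]
header = ["x0", "x1"]
metadata = {"dgp": "linear", "random_state": 42}
text = dump_csv(rows, header, metadata)
restored = load_csv(text)
print(restored == (rows, header, metadata))
```

Keeping the seed and generator settings next to the data is what makes replay possible: the metadata alone is enough to rebuild the same rows.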

Ablation example

from synthbench import LinearDGP, OutlierCorruptor, experiment_grid

grid = experiment_grid(
    LinearDGP,
    OutlierCorruptor,
    n_samples_list=[200, 500, 1000],
    complexities=["low", "medium", "high"],
    severities=["low", "medium", "high"],
    n_features=10,
    random_state=0,
    task_type="classification",
)

result = grid[(500, "high", "medium")]
print(result.metadata["bayes_error"])

Docs

Full reference at JanTeichertKluge.github.io/synth-bench.

Download files

Source Distribution

synthbench-0.1.0.tar.gz (549.5 kB)

Uploaded Source

Built Distribution

synthbench-0.1.0-py3-none-any.whl (55.4 kB)

Uploaded Python 3

File details

Details for the file synthbench-0.1.0.tar.gz.

File metadata

  • Download URL: synthbench-0.1.0.tar.gz
  • Upload date:
  • Size: 549.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for synthbench-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       18adf34527464cf3375cb356b86ed6d7f80eef53bce5fe44733a12e8ddb12b88
MD5          cab29f83425b60e7b9d22b4032e9eb28
BLAKE2b-256  4d52bff76ed2061bd760741830c4f09fa6d9cfbe8281bacd62782c72ba678412

Provenance

The following attestation bundles were made for synthbench-0.1.0.tar.gz:

Publisher: publish.yml on JanTeichertKluge/synth-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file synthbench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: synthbench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 55.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for synthbench-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       801d0aa36ef6290ca4ff7c1298a094b058cff83311cd970cf6e1eed2b4872b3d
MD5          c230c374243e2f4420d4ded321790b2f
BLAKE2b-256  f64fbcfb44253c7951f6a5dd2b671bcce87d4e402397f21403c25733f0e12143

Provenance

The following attestation bundles were made for synthbench-0.1.0-py3-none-any.whl:

Publisher: publish.yml on JanTeichertKluge/synth-bench
