Spec-driven tabular data synthesis with reusable fitted bundles.

TabDat-Synth

TabDat-Synth is a Python package for spec-driven synthetic tabular data generation. It is intended for research, education, benchmarking, and data-science prototyping where real tabular data is sensitive, unavailable, or inconvenient to share.

The package learns reusable synthesis artifacts from a source table, then generates new rows from an explicit directed acyclic graph (DAG) specification. It supports empirical sampling, conditioned categorical models, numeric summaries, coefficient-based outcomes, fitted bundle reuse, and evaluation reports.

Motivation

Real tabular data often carries privacy, governance, licensing, or access constraints. These constraints can slow method development, teaching, and reproducible examples. TabDat-Synth provides a small, inspectable synthesis engine that can generate plausible tabular datasets while keeping assumptions visible in configuration.

Intended users

| User group | Typical use | Relevant docs |
| --- | --- | --- |
| Research scientists | Simulate data for methods work, sensitivity analyses, and reproducible studies. | Use cases, concepts |
| Data scientists and ML engineers | Prototype pipelines, benchmark models, and create shareable fixtures. | Getting started, user manual |
| Privacy and governance reviewers | Inspect whether generated data is too close to source records. | Evaluation metrics, concepts |
| Educators and students | Build realistic classroom or workshop datasets without distributing restricted data. | Getting started, examples |
| Healthcare data collaborators | Work with public-style examples inspired by claims-data workflows. | Data-file notes |

Features

  • Spec-driven generation from YAML configuration.
  • Directed synthesis workflows with stable topological execution.
  • Empirical, truncated-normal, categorical-model, coefficient-sum, sigmoid-probability, and Bernoulli-response steps.
  • Built-in categorical backends for LightGBM and logistic regression.
  • Reusable fitted bundles for generating later without the original source file.
  • Evaluation reports for marginal quality and row-level disclosure-risk heuristics.
  • Small public API designed for programmatic Python workflows.
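
To make the coefficient-sum, sigmoid-probability, and Bernoulli-response step names more concrete, the following is a minimal standard-library sketch of the general technique those names suggest: a weighted sum of parent columns, squashed to a probability, then sampled as a 0/1 outcome. It is not TabDat-Synth's implementation; the function names, column names (`age`, `smoker`), and coefficient values are all hypothetical and chosen only for illustration.

```python
import math
import random


def coefficient_sum(row, coefficients, intercept=0.0):
    """Linear predictor: intercept plus the weighted sum of parent columns."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    # Branching on the sign keeps math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bernoulli(p, rng):
    """Draw 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


rng = random.Random(0)  # seeded so repeated runs give the same rows

# Hypothetical parent columns and coefficients (assumptions, not package defaults).
rows = [{"age": rng.uniform(20, 80), "smoker": rng.randint(0, 1)} for _ in range(5)]
coefficients = {"age": 0.03, "smoker": 0.8}

for row in rows:
    score = coefficient_sum(row, coefficients, intercept=-2.5)
    row["p_outcome"] = sigmoid(score)
    row["outcome"] = bernoulli(row["p_outcome"], rng)
```

In the package itself these steps are declared in the YAML specification rather than written by hand; the user manual's step reference describes the actual configuration keys.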

How it works

A synthesis run has five stages:

  1. Load a generation specification and source table.
  2. Prepare source columns according to the declared schema.
  3. Fit step-level artifacts in DAG order.
  4. Sample synthetic rows from the fitted artifacts.
  5. Evaluate the synthetic table against source data when appropriate.

The DAG controls which generated columns are available to later steps. For categorical-model steps, declared parents define graph edges, and the fitted model receives the full incoming ancestor context in stable order.
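
The ordering idea described above can be sketched with Python's standard `graphlib` module. This is an illustration under stated assumptions, not the package's internal code: the column names are hypothetical, ties are broken alphabetically as one possible meaning of "stable", and "ancestor context" is read here as every column reachable through parent edges.

```python
from graphlib import TopologicalSorter

# Hypothetical columns: each key maps to the parents it declares.
parents = {
    "age": [],
    "sex": [],
    "region": ["age"],
    "diagnosis": ["region", "sex"],
}


def stable_topological_order(parents):
    """Return a parents-first order, breaking ties alphabetically."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sort each batch for determinism
        order.extend(ready)
        sorter.done(*ready)
    return order


def ancestors(column, parents):
    """Collect every column reachable by following parent edges upward."""
    seen = set()
    stack = list(parents[column])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


order = stable_topological_order(parents)

# For each column, list its ancestors in the same stable order.
context = {
    column: [c for c in order if c in ancestors(column, parents)]
    for column in order
}
```

With this graph, `diagnosis` is generated last, and its context includes `age` even though `age` is only an indirect ancestor through `region`.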

Installation

TabDat-Synth requires Python 3.12 or newer and is distributed on PyPI.

uv add tabdat-synth

For one-off installation into an active environment:

uv pip install tabdat-synth

For workflows that load local .env files through package helpers:

uv add "tabdat-synth[env]"

Because the package is alpha software, pin downstream projects to an exact release version when reproducibility matters.

Quick start

Run the coefficient-based example from the repository root:

from tabdat_synth import generate_from_spec, load_generation_spec

spec = load_generation_spec("docs/examples/tiny_coefficient_outcomes.yaml")
df = generate_from_spec(spec)

print(df.shape)
print(df.head())

Fit once, save a reusable bundle, and generate later:

from tabdat_synth import (
    fit_synthesizer,
    generate_from_fitted,
    load_fitted_synthesizer,
    load_generation_spec,
    save_fitted_synthesizer,
)

spec = load_generation_spec("docs/examples/de_synpuf_beneficiary_phase3.yaml")
fitted = fit_synthesizer(spec)
save_fitted_synthesizer(fitted, "tmp/de_synpuf_bundle")

reloaded = load_fitted_synthesizer("tmp/de_synpuf_bundle")
synthetic = generate_from_fitted(reloaded, n_samples=20, random_seed=999)
print(synthetic.shape)

Documentation

  • Getting started: installation, first run, and tests.
  • Use cases: user groups, workflows, and appropriate boundaries.
  • Concepts: motivation, synthesis mechanism, fitted bundles, and limitations.
  • User manual: public API, step reference, backend configuration, and bundle details.
  • Evaluation metrics: fidelity metrics and disclosure-risk heuristics.
  • Synthpop comparison: conceptual relationship to R's synthpop package.

Acknowledgement

TabDat-Synth is conceptually inspired by R's synthpop package and the broader statistical tradition of synthetic data generation.

Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26.

Note that this project is a clean-room Python implementation: it was not built by reverse engineering synthpop, and it does not reuse synthpop source code. See docs/synthpop-comparison.md for a full comparison.

Public API

Common entry points:

  • load_generation_spec(path)
  • generate_from_spec(spec)
  • fit_synthesizer(spec)
  • generate_from_fitted(fitted, *, n_samples=None, random_seed=None)
  • save_fitted_synthesizer(fitted, path)
  • load_fitted_synthesizer(path)
  • evaluate_synthetic_data(source_df, synthetic_df, schema, ...)
  • evaluate_from_spec(spec, synthetic_df, ...)
  • evaluation_report_to_dict(report)
  • save_evaluation_report(report, path)
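
As a rough illustration of the two kinds of checks the evaluation entry points are described as covering (marginal quality and row-level disclosure risk), here is a small standard-library sketch. It does not reproduce the metrics computed by `evaluate_synthetic_data`; the two helper functions, the choice of total variation distance and exact-match rate, and the toy rows are all assumptions made for this example.

```python
from collections import Counter


def total_variation_distance(source_values, synthetic_values):
    """Half the summed absolute difference between two category distributions."""
    src = Counter(source_values)
    syn = Counter(synthetic_values)
    n_src, n_syn = len(source_values), len(synthetic_values)
    categories = set(src) | set(syn)
    # Counter returns 0 for categories missing from one side.
    return 0.5 * sum(abs(src[c] / n_src - syn[c] / n_syn) for c in categories)


def exact_match_rate(source_rows, synthetic_rows):
    """Fraction of synthetic rows that duplicate some source row exactly."""
    source_set = {tuple(sorted(row.items())) for row in source_rows}
    matches = sum(tuple(sorted(row.items())) in source_set for row in synthetic_rows)
    return matches / len(synthetic_rows)


# Toy data, invented for this example.
source = [
    {"sex": "F", "region": "north"},
    {"sex": "M", "region": "south"},
    {"sex": "F", "region": "south"},
    {"sex": "M", "region": "north"},
]
synthetic = [
    {"sex": "F", "region": "north"},
    {"sex": "F", "region": "east"},
    {"sex": "M", "region": "south"},
    {"sex": "F", "region": "south"},
]

tvd_sex = total_variation_distance(
    [row["sex"] for row in source],
    [row["sex"] for row in synthetic],
)
match_rate = exact_match_rate(source, synthetic)
```

A low marginal distance paired with a high exact-match rate would suggest the synthetic table copies source records rather than generalizing from them, which is the kind of trade-off a disclosure-risk heuristic is meant to surface. The evaluation metrics documentation lists what the package actually reports.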

Testing

uv run --group dev pytest tests/unit tests/regression

License

Apache-2.0. See LICENSE.
