Spec-driven tabular data synthesis with reusable fitted bundles.

TabDat-Synth

TabDat-Synth is a Python package for spec-driven synthetic tabular data generation. It is intended for research, education, benchmarking, and data-science prototyping where real tabular data is sensitive, unavailable, or inconvenient to share.

The package learns reusable synthesis artifacts from a source table, then generates new rows from an explicit directed acyclic graph (DAG) specification. It supports empirical sampling, conditioned categorical models, numeric summaries, coefficient-based outcomes, fitted bundle reuse, and evaluation reports.

Motivation

Real tabular data often carries privacy, governance, licensing, or access constraints. These constraints can slow method development, teaching, and reproducible examples. TabDat-Synth provides a small, inspectable synthesis engine that can generate plausible tabular datasets while keeping assumptions visible in configuration.

Intended users

User group	Typical use	Relevant docs
Research scientists	Simulate data for methods work, sensitivity analyses, and reproducible studies.	Use cases, concepts
Data scientists and ML engineers	Prototype pipelines, benchmark models, and create shareable fixtures.	Getting started, user manual
Privacy and governance reviewers	Inspect whether generated data is too close to source records.	Evaluation metrics, concepts
Educators and students	Build realistic classroom or workshop datasets without distributing restricted data.	Getting started, examples
Healthcare data collaborators	Work with public-style examples inspired by claims-data workflows.	Data-file notes

Features

Spec-driven generation from YAML configuration.
Directed synthesis workflows with stable topological execution.
Empirical, truncated-normal, categorical-model, coefficient-sum, sigmoid-probability, and Bernoulli-response steps.
Built-in categorical backends for LightGBM and logistic regression.
Reusable fitted bundles for generating later without the original source file.
Evaluation reports for marginal quality and row-level disclosure-risk heuristics.
Small public API designed for programmatic Python workflows.
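As a rough illustration of how the coefficient-sum, sigmoid-probability, and Bernoulli-response steps could chain together, the sketch below builds a binary outcome from two parent columns using only the standard library. The coefficients, the intercept, the column names, and the helper functions are all hypothetical; this is not TabDat-Synth's implementation or its spec format.

```python
import math
import random

# Hypothetical coefficients for illustration; not taken from any TabDat-Synth spec.
COEFFICIENTS = {"age": 0.04, "smoker": 0.9}
INTERCEPT = -3.0


def coefficient_sum(row: dict[str, float]) -> float:
    """Linear predictor: intercept plus the weighted sum of parent columns."""
    return INTERCEPT + sum(COEFFICIENTS[name] * row[name] for name in COEFFICIENTS)


def sigmoid(x: float) -> float:
    """Map a linear predictor to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli(p: float, rng: random.Random) -> int:
    """Draw a 0/1 outcome that equals 1 with probability p."""
    return 1 if rng.random() < p else 0


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
rows = [{"age": 30.0, "smoker": 0.0}, {"age": 70.0, "smoker": 1.0}]
for row in rows:
    row["score"] = coefficient_sum(row)
    row["probability"] = sigmoid(row["score"])
    row["outcome"] = bernoulli(row["probability"], rng)
    print(row)
```

Keeping each step as a separate function mirrors the idea that assumptions stay visible: the coefficients that drive the outcome are plain data rather than hidden model state.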

How it works

A synthesis run has five stages:

1. Load a generation specification and source table.
2. Prepare source columns according to the declared schema.
3. Fit step-level artifacts in DAG order.
4. Sample synthetic rows from the fitted artifacts.
5. Evaluate the synthetic table against source data when appropriate.

The DAG controls which generated columns are available to later steps. For categorical-model steps, declared parents define graph edges, and the fitted model receives the full incoming ancestor context in stable order.
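The ordering idea can be sketched with Kahn's algorithm, using declaration order as the tie-breaker so that the result is the same on every run. The step names below are hypothetical, and reading "ancestor context" as the set of all transitive ancestors is an assumption made for this example; the package's actual scheduler may differ.

```python
import heapq

# Hypothetical step declarations: column name -> declared parents.
# Declaration order doubles as the tie-breaker that keeps the ordering stable.
STEPS = {
    "region": [],
    "age": [],
    "plan": ["region"],
    "claim": ["plan", "age"],
}


def stable_topological_order(steps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm; among ready steps, the earliest declared runs first."""
    position = {name: index for index, name in enumerate(steps)}
    parents_of = {name: list(dict.fromkeys(parents)) for name, parents in steps.items()}
    pending = {name: len(parents) for name, parents in parents_of.items()}
    children: dict[str, list[str]] = {name: [] for name in steps}
    for name, parents in parents_of.items():
        for parent in parents:
            children[parent].append(name)

    ready = [(position[name], name) for name, count in pending.items() if count == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for child in children[name]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, (position[child], child))
    if len(order) != len(steps):
        raise ValueError("specification contains a cycle")
    return order


def ancestor_context(steps: dict[str, list[str]], order: list[str], target: str) -> list[str]:
    """All transitive ancestors of target, listed in execution order."""
    seen: set[str] = set()
    stack = list(steps[target])
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(steps[name])
    return [name for name in order if name in seen]


order = stable_topological_order(STEPS)
print(order)
print(ancestor_context(STEPS, order, "claim"))
```

In this sketch, "claim" declares only "plan" and "age" as parents, yet its context also includes "region" because "plan" depends on it.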

Installation

TabDat-Synth requires Python 3.12 or newer and is distributed on PyPI. To add it as a dependency of a uv-managed project:

uv add tabdat-synth

For one-off installation into an active environment:

uv pip install tabdat-synth

For workflows that load local .env files through package helpers:

uv add "tabdat-synth[env]"

Because the package is alpha software, pin downstream projects to an exact release version when reproducibility matters.

Quick start

Run the coefficient-based example from the repository root:

from tabdat_synth import generate_from_spec, load_generation_spec

spec = load_generation_spec("docs/examples/tiny_coefficient_outcomes.yaml")
df = generate_from_spec(spec)

print(df.shape)
print(df.head())

Fit once, save a reusable bundle, and generate later:

from tabdat_synth import (
  fit_synthesizer,
  generate_from_fitted,
  load_fitted_synthesizer,
  load_generation_spec,
  save_fitted_synthesizer,
)

spec = load_generation_spec("docs/examples/de_synpuf_beneficiary_phase3.yaml")
fitted = fit_synthesizer(spec)
save_fitted_synthesizer(fitted, "tmp/de_synpuf_bundle")

reloaded = load_fitted_synthesizer("tmp/de_synpuf_bundle")
synthetic = generate_from_fitted(reloaded, n_samples=20, random_seed=999)
print(synthetic.shape)
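To make the fit-once, generate-later idea concrete, the following standard-library sketch fits a single empirical artifact, writes it to disk, and samples from the reloaded copy after the source column is gone. The JSON file and the helper names are hypothetical stand-ins; TabDat-Synth's real bundle layout is not described here and should not be inferred from this example.

```python
import json
import random
import tempfile
from collections import Counter
from pathlib import Path


def fit_empirical(values: list[str]) -> dict[str, int]:
    """Summarize a source column as category counts (the fitted artifact)."""
    return dict(Counter(values))


def sample_empirical(artifact: dict[str, int], n: int, seed: int) -> list[str]:
    """Draw n values in proportion to the stored counts."""
    rng = random.Random(seed)
    categories = sorted(artifact)  # sorted so sampling does not depend on key order
    weights = [artifact[category] for category in categories]
    return rng.choices(categories, weights=weights, k=n)


source_column = ["A", "A", "B", "C", "A", "B"]
artifact = fit_empirical(source_column)

with tempfile.TemporaryDirectory() as directory:
    bundle_path = Path(directory) / "bundle.json"  # hypothetical bundle file
    bundle_path.write_text(json.dumps(artifact))
    del source_column  # the source is no longer needed once the artifact is saved
    reloaded = json.loads(bundle_path.read_text())

first = sample_empirical(reloaded, n=5, seed=999)
second = sample_empirical(reloaded, n=5, seed=999)
print(first, first == second)
```

Reusing the same seed reproduces the same sample, which is the property a `random_seed` argument is generally expected to provide.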

Documentation

Getting started: installation, first run, and tests.
Use cases: user groups, workflows, and appropriate boundaries.
Concepts: motivation, synthesis mechanism, fitted bundles, and limitations.
User manual: public API, step reference, backend configuration, and bundle details.
Evaluation metrics: fidelity metrics and disclosure-risk heuristics.
Synthpop comparison: conceptual relationship to R's synthpop package.

Acknowledgement

TabDat-Synth is conceptually inspired by R's synthpop package and the broader statistical tradition of synthetic data generation.

Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26.

Note that this project is a clean-room Python implementation: it was not built by reverse engineering synthpop and does not reuse synthpop source code. See docs/synthpop-comparison.md for a full comparison.

Public API

Common entry points:

load_generation_spec(path)
generate_from_spec(spec)
fit_synthesizer(spec)
generate_from_fitted(fitted, *, n_samples=None, random_seed=None)
save_fitted_synthesizer(fitted, path)
load_fitted_synthesizer(path)
evaluate_synthetic_data(source_df, synthetic_df, schema, ...)
evaluate_from_spec(spec, synthetic_df, ...)
evaluation_report_to_dict(report)
save_evaluation_report(report, path)
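For a sense of what a marginal-quality metric and a row-level disclosure-risk heuristic can look like, here is a small standard-library sketch. Both functions are illustrative assumptions: they are not the metrics computed by evaluate_synthetic_data, whose report contents are documented separately.

```python
from collections import Counter


def total_variation_distance(source: list[str], synthetic: list[str]) -> float:
    """Half the L1 distance between two categorical marginals (0 = identical, 1 = disjoint)."""
    source_counts, synthetic_counts = Counter(source), Counter(synthetic)
    categories = set(source_counts) | set(synthetic_counts)
    return 0.5 * sum(
        abs(source_counts[c] / len(source) - synthetic_counts[c] / len(synthetic))
        for c in categories
    )


def exact_match_rate(source_rows: list[tuple], synthetic_rows: list[tuple]) -> float:
    """Fraction of synthetic rows that duplicate some source row exactly."""
    source_set = set(source_rows)
    return sum(row in source_set for row in synthetic_rows) / len(synthetic_rows)


# Toy tables: each row is (category, value). Both inputs must be non-empty.
source_rows = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
synthetic_rows = [("A", 1), ("A", 1), ("A", 3), ("B", 3)]

fidelity = total_variation_distance(
    [row[0] for row in source_rows], [row[0] for row in synthetic_rows]
)
risk = exact_match_rate(source_rows, synthetic_rows)
print(fidelity, risk)
```

A low distance suggests the synthetic marginal tracks the source, while a high exact-match rate flags synthetic rows that may be too close to real records.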

Testing

uv run --group dev pytest tests/unit tests/regression

License

Apache-2.0. See LICENSE.

Project details

This version

0.1.0

May 26, 2026

Download files

Source Distribution

tabdat_synth-0.1.0.tar.gz (33.1 kB view details)

Uploaded May 26, 2026 Source

Built Distribution

tabdat_synth-0.1.0-py3-none-any.whl (37.3 kB view details)

Uploaded May 26, 2026 Python 3

File details

Details for the file tabdat_synth-0.1.0.tar.gz.

File metadata

Download URL: tabdat_synth-0.1.0.tar.gz
Upload date: May 26, 2026
Size: 33.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6

File hashes

Hashes for tabdat_synth-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`93d376cbd5c8a59f7147eb934fa4be9593347fdc8eace878fa86f89f9c9a67e0`
MD5	`55808b9f8e9048dd22086a1557e8fa9c`
BLAKE2b-256	`841a3e81749e580e00d19379e135285cafe3a2499dde46dc93cfc46b8c73f99a`

File details

Details for the file tabdat_synth-0.1.0-py3-none-any.whl.

File metadata

Download URL: tabdat_synth-0.1.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 37.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6

File hashes

Hashes for tabdat_synth-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`222964989711d4eb78c7d3965a1450beae3a5b9e17cd2c57ea79e44847fa6a81`
MD5	`3163b140fa6d9bb06c558583aac752fd`
BLAKE2b-256	`f416f6e0d657f53b7c5e8fb0807f20c6d231024d85fcd608150180a4136dcf70`
