Spec-driven tabular data synthesis with reusable fitted bundles.
TabDat-Synth
TabDat-Synth is a Python package for spec-driven synthetic tabular data generation. It is intended for research, education, benchmarking, and data-science prototyping where real tabular data is sensitive, unavailable, or inconvenient to share.
The package learns reusable synthesis artifacts from a source table, then generates new rows from an explicit directed acyclic graph (DAG) specification. It supports empirical sampling, conditioned categorical models, numeric summaries, coefficient-based outcomes, fitted bundle reuse, and evaluation reports.
Motivation
Real tabular data often carries privacy, governance, licensing, or access constraints. These constraints can slow method development, teaching, and reproducible examples. TabDat-Synth provides a small, inspectable synthesis engine that can generate plausible tabular datasets while keeping assumptions visible in configuration.
Intended users
| User group | Typical use | Relevant docs |
|---|---|---|
| Research scientists | Simulate data for methods work, sensitivity analyses, and reproducible studies. | Use cases, concepts |
| Data scientists and ML engineers | Prototype pipelines, benchmark models, and create shareable fixtures. | Getting started, user manual |
| Privacy and governance reviewers | Inspect whether generated data is too close to source records. | Evaluation metrics, concepts |
| Educators and students | Build realistic classroom or workshop datasets without distributing restricted data. | Getting started, examples |
| Healthcare data collaborators | Work with public-style examples inspired by claims-data workflows. | Data-file notes |
Features
- Spec-driven generation from YAML configuration.
- Directed synthesis workflows with stable topological execution.
- Empirical, truncated-normal, categorical-model, coefficient-sum, sigmoid-probability, and Bernoulli-response steps.
- Built-in categorical backends for LightGBM and logistic regression.
- Reusable fitted bundles for generating later without the original source file.
- Evaluation reports for marginal quality and row-level disclosure-risk heuristics.
- Small public API designed for programmatic Python workflows.
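Several of the step types listed above can be read as a chain: draw inputs, combine them into a linear score, squash the score into a probability, then draw a binary outcome. The sketch below illustrates that reading with the standard library only. It is not the package's implementation; the column names, coefficients, and helper functions are hypothetical, and the interpretation of each step name is an assumption.

```python
import math
import random


def truncated_normal(rng, mean, sd, low, high):
    """Rejection-sample a normal draw until it lands inside [low, high]."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x


def coefficient_sum(row, intercept, coefficients):
    """Linear predictor: intercept plus the weighted sum of named columns."""
    return intercept + sum(w * row[name] for name, w in coefficients.items())


def sigmoid(z):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def generate_rows(n, seed):
    rng = random.Random(seed)
    observed_smoker = [0, 0, 0, 1]  # hypothetical source column
    rows = []
    for _ in range(n):
        row = {}
        # Empirical step: resample directly from observed values.
        row["smoker"] = rng.choice(observed_smoker)
        # Truncated-normal step: numeric draw bounded to a plausible range.
        row["age"] = truncated_normal(rng, 50.0, 15.0, 18.0, 90.0)
        # Coefficient-sum step: hypothetical coefficients, for illustration.
        row["score"] = coefficient_sum(row, -4.0, {"age": 0.05, "smoker": 1.2})
        # Sigmoid-probability step.
        row["p_event"] = sigmoid(row["score"])
        # Bernoulli-response step: 1 with probability p_event, else 0.
        row["event"] = int(rng.random() < row["p_event"])
        rows.append(row)
    return rows


rows = generate_rows(5, seed=0)
for row in rows:
    print(row["smoker"], round(row["age"], 1), round(row["p_event"], 3), row["event"])
```

Because every assumption sits in a named coefficient or bound, a reader can see exactly how the outcome depends on its inputs, which is the point of keeping the configuration explicit.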
How it works
A synthesis run has five stages:
- Load a generation specification and source table.
- Prepare source columns according to the declared schema.
- Fit step-level artifacts in DAG order.
- Sample synthetic rows from the fitted artifacts.
- Evaluate the synthetic table against source data when appropriate.
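To make the final stage concrete, here is a minimal sketch of two checks in the spirit of the evaluation reports: a marginal comparison for one categorical column and a crude row-level closeness heuristic. Both functions are hypothetical illustrations written for this example; the package's own metrics are described in the evaluation metrics documentation and may be defined differently.

```python
from collections import Counter


def total_variation_distance(source_values, synthetic_values):
    """Half the L1 distance between two empirical category distributions."""
    src = Counter(source_values)
    syn = Counter(synthetic_values)
    n_src, n_syn = len(source_values), len(synthetic_values)
    categories = set(src) | set(syn)
    return 0.5 * sum(abs(src[c] / n_src - syn[c] / n_syn) for c in categories)


def exact_match_rate(source_rows, synthetic_rows):
    """Share of synthetic rows that reproduce some source row exactly."""
    source_set = set(source_rows)
    hits = sum(1 for row in synthetic_rows if row in source_set)
    return hits / len(synthetic_rows)


# Tiny made-up tables: (plan, flag) tuples.
source = [("A", 1), ("A", 0), ("B", 1), ("B", 1)]
synthetic = [("A", 1), ("B", 0), ("B", 1), ("B", 1)]

tvd = total_variation_distance([r[0] for r in source], [r[0] for r in synthetic])
match_rate = exact_match_rate(source, synthetic)
print(tvd)
print(match_rate)
```

A low distance suggests the marginal is well reproduced, while a high match rate on wide tables can be a warning sign that synthetic rows sit too close to source records.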
The DAG controls which generated columns are available to later steps. For categorical-model steps, declared parents define graph edges, and the fitted model receives the full incoming ancestor context in stable order.
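The ordering idea can be sketched with Kahn's algorithm and a min-heap, which breaks ties alphabetically so the same specification always yields the same execution order. This is an illustration under stated assumptions, not the package's code: the column names are made up, the tie-breaking rule is a guess at what "stable" means here, and "ancestor context" is read as all transitive ancestors.

```python
import heapq


def stable_topological_order(parents):
    """Kahn's algorithm; a min-heap makes tie-breaking deterministic.

    `parents` maps each node to its declared parents. Every parent must
    also appear as a key.
    """
    children = {node: [] for node in parents}
    indegree = {node: len(set(ps)) for node, ps in parents.items()}
    for node, ps in parents.items():
        for p in set(ps):
            children[p].append(node)
    ready = [node for node, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)
    if len(order) != len(parents):
        raise ValueError("specification contains a cycle")
    return order


def ancestors_in_order(node, parents, order):
    """All transitive ancestors of `node`, listed in execution order."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return [name for name in order if name in seen]


# Hypothetical specification: column -> declared parents.
parents = {
    "region": [],
    "age": [],
    "plan": ["region"],
    "chronic": ["age", "plan"],
}
order = stable_topological_order(parents)
context = ancestors_in_order("chronic", parents, order)
print(order)    # ['age', 'region', 'plan', 'chronic']
print(context)  # ['age', 'region', 'plan']
```

Note that "chronic" declares only "age" and "plan" as parents, yet its context also includes "region", because "region" is an ancestor through "plan".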
Installation
TabDat-Synth requires Python 3.12 or newer and is distributed on PyPI.
```shell
uv add tabdat-synth
```

For one-off installation into an active environment:

```shell
uv pip install tabdat-synth
```

For workflows that load local .env files through package helpers:

```shell
uv add "tabdat-synth[env]"
```
Because the package is alpha software, pin downstream projects to an exact release version when reproducibility matters.
Quick start
Run the coefficient-based example from the repository root:
```python
from tabdat_synth import generate_from_spec, load_generation_spec

spec = load_generation_spec("docs/examples/tiny_coefficient_outcomes.yaml")
df = generate_from_spec(spec)
print(df.shape)
print(df.head())
```
Fit once, save a reusable bundle, and generate later:
```python
from tabdat_synth import (
    fit_synthesizer,
    generate_from_fitted,
    load_fitted_synthesizer,
    load_generation_spec,
    save_fitted_synthesizer,
)

spec = load_generation_spec("docs/examples/de_synpuf_beneficiary_phase3.yaml")
fitted = fit_synthesizer(spec)
save_fitted_synthesizer(fitted, "tmp/de_synpuf_bundle")

reloaded = load_fitted_synthesizer("tmp/de_synpuf_bundle")
synthetic = generate_from_fitted(reloaded, n_samples=20, random_seed=999)
print(synthetic.shape)
```
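The fit-save-reload pattern can also be pictured without the package. The sketch below fits a summary of one column, writes it to disk, and later samples from the reloaded summary alone. It is a conceptual illustration only: the JSON layout, file name, and helper functions are hypothetical and say nothing about the real bundle format.

```python
import json
import random
import tempfile
from collections import Counter
from pathlib import Path


def fit_empirical(values):
    """Summarise a column as sorted categories and their counts."""
    counts = Counter(values)
    categories = sorted(counts)
    return {"categories": categories, "weights": [counts[c] for c in categories]}


def save_bundle(bundle, path):
    """Persist the fitted summary as JSON (hypothetical layout)."""
    Path(path).write_text(json.dumps(bundle, sort_keys=True))


def load_bundle(path):
    """Read a previously saved summary back into a dict."""
    return json.loads(Path(path).read_text())


def sample(bundle, n_samples, random_seed):
    """Draw values from the stored summary; the source column is not needed."""
    rng = random.Random(random_seed)
    return rng.choices(bundle["categories"], weights=bundle["weights"], k=n_samples)


source_column = ["HMO", "PPO", "PPO", "PPO"]  # made-up source data

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bundle.json"
    save_bundle(fit_empirical(source_column), path)
    # From here on, only the saved file is used.
    reloaded = load_bundle(path)
    first = sample(reloaded, n_samples=20, random_seed=999)
    second = sample(reloaded, n_samples=20, random_seed=999)

print(len(first))
print(first == second)  # the same seed reproduces the same draw
```

Separating fitting from sampling in this way is what allows generation to run later, on another machine, without access to the original table.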
Documentation
- Getting started: installation, first run, and tests.
- Use cases: user groups, workflows, and appropriate boundaries.
- Concepts: motivation, synthesis mechanism, fitted bundles, and limitations.
- User manual: public API, step reference, backend configuration, and bundle details.
- Evaluation metrics: fidelity metrics and disclosure-risk heuristics.
- Synthpop comparison: conceptual relationship to R's synthpop package.
Acknowledgement
TabDat-Synth is conceptually inspired by R's synthpop package and the broader statistical tradition of synthetic data generation.
Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26.
Note that this project is a clean-room Python implementation: it was not built by reverse engineering synthpop, and it does not reuse synthpop source code.
See docs/synthpop-comparison.md for a full comparison.
Public API
Common entry points:
- load_generation_spec(path)
- generate_from_spec(spec)
- fit_synthesizer(spec)
- generate_from_fitted(fitted, *, n_samples=None, random_seed=None)
- save_fitted_synthesizer(fitted, path)
- load_fitted_synthesizer(path)
- evaluate_synthetic_data(source_df, synthetic_df, schema, ...)
- evaluate_from_spec(spec, synthetic_df, ...)
- evaluation_report_to_dict(report)
- save_evaluation_report(report, path)
Testing
```shell
uv run --group dev pytest tests/unit tests/regression
```
License
Apache-2.0. See LICENSE.