Transformed-space diffusion for mixed-type tabular data.
TabDM
TabDM is a small Python library for generating synthetic mixed-type tabular data with diffusion in a transformed feature space.
It is designed for tabular datasets with numeric, categorical, boolean, ordinal, count, and positive continuous columns. The public API focuses on two workflows:
- fit a model and generate synthetic rows in one call
- fit once, then generate repeatedly with optional target or subgroup controls
TabDM also includes evaluation helpers for schema compatibility, distribution fidelity, downstream utility, validity checks, and privacy-screening metrics.
For exact function signatures, parameter semantics, and report shapes, see docs/API_REFERENCE.md.
Install
From a local checkout:
pip install -e .
For evaluation helpers:
pip install -e ".[eval]"
For local development:
pip install -e ".[dev]"
Quick Start
import pandas as pd
from tabdm import generate_synthetic_data

# Small mixed-type training frame: numeric, categorical, and boolean columns.
real = pd.DataFrame(
    {
        "age": [21.0, 35.0, 44.0, 28.0, 31.0, 39.0],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "owns_house": [True, False, True, False, True, False],
        "balance": [1000.0, 250.0, 1900.0, 750.0, 1200.0, 400.0],
    }
)

# Fit on `real` and draw 100 synthetic rows in one call.
synthetic = generate_synthetic_data(
    real,
    num_rows=100,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)
generate_synthetic_data returns a pandas.DataFrame with the same column
order as the training dataframe. Numeric outputs are clipped to the training
range, count columns are rounded, and discrete columns are decoded to values
observed during fitting.
Fit Once, Generate Many Times
Use fit_tabdm when you want to reuse a fitted model.
from tabdm import fit_tabdm

# Train once; `real` is the Quick Start dataframe.
model = fit_tabdm(
    real,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)

# Draw separate samples from the same fitted model.
synthetic_a = model.generate(100, random_state=1)
synthetic_b = model.generate(100, random_state=2)
Passing the same random_state to generate produces the same sampled rows for
the same fitted model.
Conditional Generation
TabDM can treat target, sensitive, or explicitly named columns as conditioning columns. Condition columns are not generated by the diffusion model. They are provided by the caller or sampled from the training rows, then recombined with generated feature columns.
import pandas as pd
from tabdm import generate_synthetic_data

real = pd.DataFrame(
    {
        "age": [21, 35, 44, 28, 31, 39],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "sex": ["f", "m", "f", "m", "f", "m"],
        "default": ["yes", "no", "yes", "no", "yes", "no"],
    }
)

# Fix the target to "yes"; the sensitive column "sex" is left unspecified,
# so it is sampled according to condition_strategy.
synthetic = generate_synthetic_data(
    real,
    num_rows=200,
    discrete_columns=["job", "sex", "default"],
    target_column="default",
    sensitive_columns=["sex"],
    conditions={"default": "yes"},
    condition_strategy="prior",
    epochs=50,
    random_state=42,
)
Conditioning controls:
| Argument | Meaning |
|---|---|
| target_column | Downstream label to hold fixed or sample separately. |
| sensitive_columns | Subgroup columns to preserve or control. |
| condition_on | Additional columns to use as generation conditions. |
| conditions | Fixed values or row-wise values to use at generation time. |
| condition_strategy | How unspecified condition columns are sampled: prior or balanced. |
conditions can be:
- None: sample all condition columns from the training condition rows
- a mapping of scalar values: fix those columns and sample the remaining condition columns from matching training rows
- a mapping of sequences: provide row-wise condition values for all condition columns
- a one-row dataframe: repeat the row for every generated sample
- a num_rows-row dataframe: use row-wise condition values directly
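The scalar-mapping case and the two condition_strategy values can be illustrated with a small standard-library sketch. This is not TabDM's implementation: the function name sample_condition_rows and the list-of-dicts representation are assumptions made for illustration. It only shows the idea of filtering training condition rows to those matching the fixed values, then drawing either from all matching rows (prior) or uniformly from the unique matching rows (balanced).

```python
import random
from collections import Counter


def sample_condition_rows(train_rows, num_rows, fixed=None, strategy="prior", seed=None):
    """Hypothetical helper: choose condition rows for generation.

    train_rows: list of dicts holding only the condition columns.
    fixed: optional mapping of column -> scalar every sampled row must match.
    """
    fixed = fixed or {}
    rng = random.Random(seed)
    # Keep only training rows that agree with every fixed value.
    matching = [r for r in train_rows if all(r[k] == v for k, v in fixed.items())]
    if not matching:
        raise ValueError("no training rows match the fixed conditions")
    if strategy == "prior":
        # Duplicates stay in the pool, so training frequencies are preserved.
        pool = matching
    elif strategy == "balanced":
        # Deduplicate so each distinct combination is equally likely.
        unique = {}
        for r in matching:
            unique.setdefault(tuple(sorted(r.items())), r)
        pool = list(unique.values())
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [dict(rng.choice(pool)) for _ in range(num_rows)]


train_conditions = [
    {"sex": "f", "default": "yes"},
    {"sex": "f", "default": "yes"},
    {"sex": "f", "default": "yes"},
    {"sex": "m", "default": "yes"},
    {"sex": "m", "default": "no"},
]
prior_rows = sample_condition_rows(
    train_conditions, 1000, fixed={"default": "yes"}, strategy="prior", seed=0
)
balanced_rows = sample_condition_rows(
    train_conditions, 1000, fixed={"default": "yes"}, strategy="balanced", seed=0
)
prior_counts = Counter(r["sex"] for r in prior_rows)
balanced_counts = Counter(r["sex"] for r in balanced_rows)
```

In this toy data, prior draws "f" about three times as often as "m" among the matching rows, while balanced draws the two subgroups about equally often.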
Column Metadata
TabDM can use metadata for columns whose dtype alone is not enough.
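from tabdm import generate_synthetic_data

# Assumes `real` has grade_band, incidents, tuition, and district columns;
# the Quick Start dataframe does not.
metadata = {
    "grade_band": {"type": "ordinal", "order": ["low", "mid", "high"]},
    "incidents": {"type": "count"},
    "tuition": {"type": "positive_continuous"},
}
synthetic = generate_synthetic_data(
    real,
    discrete_columns=["district"],
    column_metadata=metadata,
    random_state=42,
)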
Supported metadata types:
| Type | Transform behavior | Inverse behavior |
|---|---|---|
| ordinal | Encoded as an ordered scalar using the supplied or inferred order. | Rounded to the nearest ordinal level. |
| count | Encoded with log1p after clipping to non-negative values. | Decoded with expm1, clipped to the training range, and rounded. |
| positive_continuous | Encoded with log1p after clipping to non-negative values. | Decoded with expm1 and clipped to the training range. |
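The transforms in the table can be sketched with the standard library. This is a minimal illustration, not the code in tabdm.DataTransformer: the function names are hypothetical, and the ordinal encoding assumes the "ordered scalar" is simply the index of the level in the order list, which the table does not specify.

```python
import math


def encode_ordinal(value, order):
    # Assumption: the scalar is the position of the level in `order`.
    return float(order.index(value))


def decode_ordinal(x, order):
    # Round to the nearest level and keep the index inside the valid range.
    idx = int(round(x))
    idx = max(0, min(len(order) - 1, idx))
    return order[idx]


def encode_log1p(value):
    # Shared by count and positive_continuous: clip to >= 0, then log1p.
    return math.log1p(max(float(value), 0.0))


def decode_positive_continuous(x, train_min, train_max):
    # Invert log1p with expm1, then clip to the range seen during fitting.
    return min(max(math.expm1(x), train_min), train_max)


def decode_count(x, train_min, train_max):
    # Same as positive_continuous, followed by rounding to an integer.
    return int(round(decode_positive_continuous(x, train_min, train_max)))


order = ["low", "mid", "high"]
level = decode_ordinal(1.2, order)               # a noisy value near "mid"
clamped_level = decode_ordinal(5.0, order)       # out of range, clamps to "high"
round_trip = decode_count(encode_log1p(7), 0, 12)
clipped = decode_count(10.0, 0, 12)              # expm1(10) is far above 12
tuition = decode_positive_continuous(encode_log1p(250.0), 100.0, 900.0)
```

Clipping on decode is what keeps a noisy model output from producing values outside the training range, at the cost of never extrapolating beyond it.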
Object, string, categorical, and boolean columns are inferred as discrete when
discrete_columns is not supplied. Numeric columns are continuous unless listed
in metadata.
Generation Parameters
generate_synthetic_data exposes the fitting and sampling controls directly.
| Parameter | Default | Description |
|---|---|---|
| dataframe | required | Training dataframe. Must contain at least one row. |
| num_rows | len(dataframe) | Number of synthetic rows. Must be positive when provided. |
| discrete_columns | inferred | Categorical columns. Accepts column names. |
| column_metadata | None | Metadata for ordinal, count, or positive continuous columns. |
| target_column | None | Column to condition on rather than generate. |
| sensitive_columns | None | Additional condition columns, usually subgroup attributes. |
| condition_on | None | Other condition columns. |
| conditions | None | Fixed or row-wise generation conditions. |
| condition_strategy | "prior" | prior samples training condition rows; balanced samples unique condition rows uniformly. |
| hidden_dims | (256, 256) | MLP denoiser hidden layer sizes. |
| time_embedding_dim | 64 | Sinusoidal timestep embedding size. |
| timesteps | 96 | Number of training diffusion timesteps. |
| sample_steps | 24 | Number of deterministic reverse steps used during sampling. |
| epochs | 120 | Training epochs. |
| batch_size | 512 | Training and sampling batch size. |
| learning_rate | 1e-3 | AdamW learning rate. |
| weight_decay | 1e-6 | AdamW weight decay. |
| beta_start | 1e-4 | First value in the linear noise schedule. |
| beta_end | 0.02 | Last value in the linear noise schedule. |
| dropout | 0.0 | Dropout inside the denoiser MLP. |
| discrete_loss_weight | 2.0 | Multiplier for one-hot categorical spans in the training loss. |
| prediction_clip | 1.5 | Clamp applied to predicted transformed features. |
| grad_clip_norm | 1.0 | Gradient norm clipping threshold. Use 0 to disable. |
| device | "cpu" | "cpu" or a CUDA device string. Falls back to CPU if CUDA is unavailable. |
| random_state | None | Seeds Python, NumPy, and Torch during fitting, and seeds sampling noise when generating. |
| verbose | False | Print training loss periodically. |
| return_model | False | Return SyntheticDataResult with the fitted model and metadata. |
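To make timesteps, sample_steps, beta_start, and beta_end more concrete, the sketch below builds a linear noise schedule from the table defaults and picks a shorter set of sampling steps. It assumes the standard DDPM definitions (betas spaced linearly, cumulative product of 1 - beta) and assumes that sampling visits evenly spaced timesteps; the README does not state how TabDM selects its reverse steps, so the helper names and the spacing rule are illustrative only.

```python
def linear_beta_schedule(timesteps=96, beta_start=1e-4, beta_end=0.02):
    # Linearly spaced noise variances from beta_start to beta_end.
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    # alpha_bar_t = product of (1 - beta_s) for s <= t: the share of signal left at step t.
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def sampling_timesteps(timesteps=96, sample_steps=24):
    # Assumption: evenly spaced timestep indices, visited from most to least noisy.
    if sample_steps >= timesteps:
        return list(range(timesteps - 1, -1, -1))
    if sample_steps <= 1:
        return [timesteps - 1]
    stride = (timesteps - 1) / (sample_steps - 1)
    return sorted({round(i * stride) for i in range(sample_steps)}, reverse=True)


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
steps = sampling_timesteps()
```

Under these assumptions, lowering sample_steps trades sampling time for fidelity, since fewer reverse steps cover the same noise range.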
Lower-Level Model API
from tabdm import TabDM, TabDMConfig

# Configure the model explicitly instead of passing keyword arguments.
model = TabDM(
    TabDMConfig(
        hidden_dims=(256, 256),
        time_embedding_dim=64,
        timesteps=64,
        sample_steps=16,
        epochs=50,
        batch_size=512,
        random_state=42,
    )
)

# `real` is the Quick Start dataframe.
model.fit(real, discrete_columns=["job", "owns_house"])
synthetic = model.sample(100, random_state=42)
Use TabDM directly if you want to hold a model object, inspect
fit_history_, or call sample repeatedly.
Evaluation
Install the optional evaluation dependencies first:
pip install -e ".[eval]"
Then call evaluate_synthetic.
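from tabdm import evaluate_synthetic

# `real` and `synthetic` come from the Conditional Generation example,
# which includes the "default" column used as the target here.
report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    random_state=42,
)

# Each metric group is a separate entry in the report.
print(report["schema"])
print(report["distribution"])
print(report["validity"])
print(report["utility"])
print(report["trust"])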
evaluate_synthetic can compute:
| Group | Included by default | Description |
|---|---|---|
| schema | yes | Column presence, column order compatibility, and dtype mismatches. |
| distribution | yes | Categorical total variation distance, numeric KS distance, and numeric correlation delta. |
| validity | yes | Numeric bound violations and unseen categorical values. |
| utility | yes when target_column is provided | Train-on-synthetic, test-on-real downstream utility. |
| trust | yes | Exact row matches and nearest-neighbor privacy-screening metrics. |
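The two per-column distribution metrics named in the table have standard definitions, sketched here with the standard library. These are textbook formulas rather than TabDM's own code: the _sketch function names are hypothetical, and the library's categorical_tvd and numeric_ks helpers may differ in signature and in how they handle missing values.

```python
import bisect
from collections import Counter


def categorical_tvd_sketch(real_values, synthetic_values):
    # Total variation distance: half the L1 distance between category frequencies.
    p, q = Counter(real_values), Counter(synthetic_values)
    n_p, n_q = len(real_values), len(synthetic_values)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))


def numeric_ks_sketch(real_values, synthetic_values):
    # Two-sample KS distance: largest gap between the two empirical CDFs.
    a, b = sorted(real_values), sorted(synthetic_values)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in points
    )


real_job = ["admin", "tech", "admin", "services", "tech", "admin"]
synthetic_job = ["admin", "admin", "tech", "tech", "tech", "tech"]
tvd = categorical_tvd_sketch(real_job, synthetic_job)

ks = numeric_ks_sketch([1, 2, 3, 4], [3, 4, 5, 6])
```

Both values lie between 0 and 1, where 0 means the two samples have identical empirical distributions for that column.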
You can select metric groups explicitly:
report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    metrics=("schema", "distribution", "validity"),
    include_trust=False,
)
Task type is inferred as classification for object, string, categorical,
boolean, and low-cardinality integer targets. Floating numeric targets are
treated as regression. Override with task_type="classification" or
task_type="regression" when needed.
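The inference rule above can be approximated in a few lines. This sketch works on plain Python values rather than pandas dtypes, and the max_int_classes threshold of 20 is a made-up stand-in for "low-cardinality"; the README does not give the cutoff that infer_task_type actually uses, nor how it treats high-cardinality integers, which this sketch assumes fall back to regression.

```python
def infer_task_type_sketch(values, max_int_classes=20):
    """Hypothetical approximation of the documented task-type rule."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("target has no non-missing values")
    # Strings and booleans are always treated as class labels.
    if all(isinstance(v, (str, bool)) for v in observed):
        return "classification"
    # Integers count as labels only when there are few distinct values.
    if all(isinstance(v, int) for v in observed):
        if len(set(observed)) <= max_int_classes:
            return "classification"
        return "regression"
    # Anything containing floats is treated as a regression target.
    return "regression"


label_task = infer_task_type_sketch(["yes", "no", "yes"])
flag_task = infer_task_type_sketch([True, False, True])
small_int_task = infer_task_type_sketch([0, 1, 1, 0, 2])
large_int_task = infer_task_type_sketch(list(range(1000)))
float_task = infer_task_type_sketch([0.5, 1.25, 3.0])
```

A float column that actually stores class codes such as 0.0 and 1.0 would be read as regression under this rule, which is the kind of case the task_type override is for.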
Evaluation Helpers
The public evaluation helpers are:
- evaluate_synthetic
- evaluate_utility
- schema_report
- distribution_report
- validity_report
- trust_report
- exact_match_rate
- nearest_neighbor_privacy
- categorical_tvd
- numeric_ks
- numeric_correlation_delta
- infer_task_type
Privacy-screening metrics are diagnostics only. They do not prove anonymization, differential privacy, or legal compliance.
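As a rough picture of what such screening measures, the sketch below computes the share of synthetic rows that exactly duplicate a real row and the distance from each synthetic row to its closest real row. It is an assumption-laden illustration, not the library's exact_match_rate or nearest_neighbor_privacy: the function names are hypothetical, rows are plain tuples of numeric values, and distances are unscaled Euclidean distances, whereas a practical check would normalize columns and encode categoricals first.

```python
import math


def exact_match_rate_sketch(real_rows, synthetic_rows):
    # Fraction of synthetic rows that are verbatim copies of some real row.
    real_set = {tuple(r) for r in real_rows}
    return sum(tuple(s) in real_set for s in synthetic_rows) / len(synthetic_rows)


def nearest_real_distances(real_rows, synthetic_rows):
    # Euclidean distance from each synthetic row to its closest real row.
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


real_rows = [(21.0, 1000.0), (35.0, 250.0)]
synthetic_rows = [(21.0, 1000.0), (24.0, 1004.0)]

match_rate = exact_match_rate_sketch(real_rows, synthetic_rows)
distances = nearest_real_distances(real_rows, synthetic_rows)
```

Many zero or near-zero distances suggest the generator is copying training rows, which is worth investigating even though, as noted above, a clean result proves nothing about anonymization.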
Public API
- tabdm.TabDM
- tabdm.TabDMConfig
- tabdm.DataTransformer
- tabdm.SyntheticDataResult
- tabdm.fit_tabdm
- tabdm.generate_synthetic_data
- tabdm.infer_discrete_columns
- tabdm.evaluate_synthetic
- tabdm.evaluate_utility
- tabdm.trust_report
Testing
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 avoids unrelated third-party pytest plugin
startup issues in environments with many globally installed plugins.
License
TabDM is distributed under the Apache License 2.0.
Scope
This package ships the core generation and evaluation APIs.
TabDM is an alpha research/development package. Always evaluate generated data for the intended dataset, task, and privacy posture before use.