MAP-alignment fidelity for synthetic tabular data

UPSILON-FIDELITY

MAP-alignment fidelity and dataset distance for synthetic tabular data

This package implements the one-sided MAP-alignment fidelity statistic introduced by Chattopadhyay et al. and described in the manuscript “How Good Is Your Synthetic Data?”.

The core idea:

For a synthetic record to be realistic, each coordinate should agree with the conditional MAP prediction inferred from real data.

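Formally, for a data record x and coordinate i, let φ_i(· | x_{-i}) denote the conditional distribution of coordinate i given the remaining coordinates x_{-i}, inferred from real data. The MAP-alignment score is:

υ(x, i) = φ_i(x_i | x_{-i}) / max_y φ_i(y | x_{-i})

Averaging υ(x, i) over all records x in a dataset D and all coordinates i gives the aggregate score:

Υ(D) ∈ [0, 1]

High Υ => the synthetic data preserves the conditional structure of the real data
Low Υ => structural distortion (even if marginals and covariances match)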
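As a worked illustration of the ratio above, the sketch below scores three coordinates of a single record and averages them. The conditional distributions are made-up numbers and `map_alignment` is a hypothetical helper, not part of the package API; in the package these conditionals come from a model fitted to real data.

```python
def map_alignment(cond_probs, observed):
    """Return υ = φ(observed | context) / max_y φ(y | context).

    cond_probs: dict mapping each candidate value y to φ_i(y | x_{-i}).
    observed:   the value x_i actually present in the record.
    """
    peak = max(cond_probs.values())
    return cond_probs.get(observed, 0.0) / peak


# Hypothetical conditionals for three coordinates of one record.
record_scores = [
    # Observed value has probability 0.3, the MAP value has 0.6: υ = 0.5
    map_alignment({"never": 0.6, "sometimes": 0.3, "often": 0.1}, "sometimes"),
    # Observed value is the MAP value: υ = 1
    map_alignment({"yes": 0.8, "no": 0.2}, "yes"),
    # Ties with the MAP value also score 1
    map_alignment({"low": 0.5, "high": 0.5}, "high"),
]

# Aggregate over coordinates (and, in general, over records as well).
upsilon = sum(record_scores) / len(record_scores)
print(round(upsilon, 4))
```

A coordinate scores 1 whenever its value is a conditional mode, and the score falls toward 0 as the observed value becomes less likely relative to that mode.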


Installation

pip install upsilon-fidelity

To include the optional CTGAN generator:

pip install upsilon-fidelity[ctgan]

Quick Example

import pandas as pd
from upsilon_fidelity import compute_upsilon

df_real = pd.read_csv("gss_2018.csv").sample(200)

ups_lsm, syn_lsm = compute_upsilon(
    num=100,
    model_path="gss_2018.joblib",
    generate=True,
    gen_algorithm="LSM",
    orig_df=df_real,
    n_workers=8,
)

print("LSM mean Upsilon:", ups_lsm.mean())

Interpretation:

  • ~1.0: synthetic matches conditional structure closely
  • ~0.7: Gaussian-like distortions
  • <<0.7: strong structural mismatch

Why MAP-alignment?

Because matching means and covariances alone is insufficient.

Section VII of the manuscript gives explicit examples where:

  • Real and synthetic share identical means, variances, covariance matrices
  • Yet they differ strongly in conditional structure
  • MAP-alignment catches the discrepancy immediately
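A small toy case makes the point concrete. It is not one of the manuscript's Section VII examples, only an illustration constructed here: the "real" data obeys x3 = x1 XOR x2, while the "synthetic" data consists of three independent fair bits. The sketch uses empirical conditional frequency tables as a stand-in for the fitted model; `fit_conditionals` and `upsilon` are hypothetical helpers, not the package's functions.

```python
from collections import Counter, defaultdict
from itertools import product

import numpy as np


def fit_conditionals(records):
    """Empirical conditional counts: tables[i][context][value]."""
    n_cols = len(records[0])
    tables = [defaultdict(Counter) for _ in range(n_cols)]
    for rec in records:
        for i in range(n_cols):
            context = rec[:i] + rec[i + 1:]  # all coordinates except i
            tables[i][context][rec[i]] += 1
    return tables


def upsilon(records, tables):
    """Mean MAP-alignment of records against the fitted conditionals."""
    scores = []
    for rec in records:
        for i, table in enumerate(tables):
            counts = table.get(rec[:i] + rec[i + 1:])
            if not counts:
                scores.append(0.0)  # unseen context scores 0 in this sketch
            else:
                # A ratio of counts equals the ratio of conditional probabilities.
                scores.append(counts[rec[i]] / max(counts.values()))
    return float(np.mean(scores))


# "Real" data: every (x1, x2) pair once, with x3 = x1 XOR x2.
real = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]
# "Synthetic" data: all eight combinations, i.e. three independent fair bits.
synthetic = list(product((0, 1), repeat=3))

# First and second moments agree exactly (population covariance).
same_mean = np.allclose(np.mean(real, axis=0), np.mean(synthetic, axis=0))
same_cov = np.allclose(
    np.cov(np.array(real), rowvar=False, bias=True),
    np.cov(np.array(synthetic), rowvar=False, bias=True),
)

tables = fit_conditionals(real)
ups_real = upsilon(real, tables)        # 1.0
ups_syn = upsilon(synthetic, tables)    # 0.5
print(same_mean, same_cov, ups_real, ups_syn)
```

Half of the synthetic records violate the XOR constraint, and in each of those every coordinate contradicts the value implied by the other two, so the score drops to 0.5 even though the means and covariance matrices are identical.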

This method:

  • Detects nonlinear and higher-order structure
  • Avoids feature-embedding artifacts
  • Comes with finite-sample uncertainty control

Supported Generators

  • "LSM": uses QuasiNet as a generative model via qsample
  • "BASELINE": independent-column null model
  • "CTGAN": uses the SDV CTGAN synthesizer
  • Custom generators are also supported
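To show what an independent-column null model means in practice, here is a minimal sketch that resamples every column separately from its own empirical marginal, which preserves the marginals but destroys all dependence between columns. The function name `independent_column_sample` and the toy columns are hypothetical; whether the package's "BASELINE" generator is implemented exactly this way is an assumption.

```python
import numpy as np
import pandas as pd


def independent_column_sample(df, n, seed=0):
    """Draw n rows by resampling each column independently of the others."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            # Sampling with replacement from the observed values of one column
            # reproduces its marginal but ignores every other column.
            col: rng.choice(df[col].to_numpy(), size=n, replace=True)
            for col in df.columns
        }
    )


# Toy frame with made-up survey-style columns.
toy = pd.DataFrame(
    {
        "smoker": ["yes", "no", "no", "no"],
        "age_band": ["18-29", "30-44", "30-44", "45+"],
    }
)
null_sample = independent_column_sample(toy, n=1000)
print(null_sample.head())
```

A null model of this kind gives a reference point: a generator that captures real conditional structure should score well above it.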

Relationship to Theory

This package implements practical instantiations of:

  • Eq. (2): MAP-alignment for a coordinate
  • Eq. (3): aggregate Υ
  • Algorithm 2: one-sided fidelity score
  • Section VI: uncertainty (Hoeffding bounds)

None of these require assumptions about the internals of the synthetic generator.
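For the uncertainty item, a standard one-sided Hoeffding bound can be sketched as follows. It assumes the scores are independent and lie in [0, 1], so it is applied here to per-record averages rather than to individual coordinates of the same record. The helper name `hoeffding_lower_bound` is hypothetical, and the exact construction in Section VI of the manuscript may differ.

```python
import math


def hoeffding_lower_bound(scores, delta=0.05):
    """One-sided (1 - delta) lower confidence bound on the mean of scores in [0, 1].

    Hoeffding's inequality gives P(sample_mean - true_mean >= t) <= exp(-2 n t^2),
    so with probability at least 1 - delta the true mean is no smaller than
    sample_mean - sqrt(ln(1 / delta) / (2 n)).
    """
    n = len(scores)
    mean = sum(scores) / n
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, mean - margin)


# Hypothetical per-record Upsilon averages for 200 synthetic records.
per_record = [0.9] * 200
bound = hoeffding_lower_bound(per_record, delta=0.05)
print(round(bound, 4))
```

The margin shrinks like 1/sqrt(n), so quadrupling the number of scored records halves the width of the guarantee.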


Citation

Chattopadhyay I, et al.
“How Good Is Your Synthetic Data?”
