Synthetic Data Quality Assurance
Assess the fidelity and novelty of synthetic samples with respect to original samples:
- calculate a rich set of accuracy, similarity and distance metrics
- visualize statistics for easy comparison to training and holdout samples
- generate a standalone, easy-to-share, easy-to-read HTML summary report
...all with a few lines of Python code 💥.
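Comparing against a holdout sample presupposes that part of the original data was set aside before the generator was trained. A minimal sketch of such a split, using only the standard library, might look as follows. The function name `split_indices` and the 50/50 default are illustrative assumptions and not part of mostlyai-qa.

```python
import random


def split_indices(n_rows, holdout_frac=0.5, seed=42):
    """Return (training, holdout) row positions for a reproducible random split."""
    if not 0.0 < holdout_frac < 1.0:
        raise ValueError("holdout_frac must be strictly between 0 and 1")
    positions = list(range(n_rows))
    # Use a local RNG so the global random state is left untouched.
    random.Random(seed).shuffle(positions)
    n_holdout = int(round(n_rows * holdout_frac))
    holdout = sorted(positions[:n_holdout])
    training = sorted(positions[n_holdout:])
    return training, holdout


trn_idx, hol_idx = split_indices(10)
print(len(trn_idx), len(hol_idx))
```

With pandas, the two position lists could then be applied as `original_df.iloc[trn_idx]` and `original_df.iloc[hol_idx]`, where `original_df` is a hypothetical DataFrame holding the original data.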
https://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9
Installation
The latest release of mostlyai-qa can be installed via pip:
pip install -U mostlyai-qa
On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-qa:
pip install -U torch==2.7.0+cpu torchvision==0.22.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu
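After installing, it can be useful to confirm which versions ended up in the environment. The sketch below relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative assumption.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report the versions of the two distributions named in the install commands.
for name in ("mostlyai-qa", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```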
Quick Start
import pandas as pd
import webbrowser
from mostlyai import qa
# initialize logging to stdout
qa.init_logging()
# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")
# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)
# pretty print metrics
print(metrics.model_dump_json(indent=4))
# open up HTML report in new browser window
webbrowser.open(report_path.absolute().as_uri())
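The Quick Start prints the metrics as nested JSON. To compare several runs side by side, it may help to flatten that structure into dotted keys. The sketch below assumes that `metrics` is a Pydantic model (the call to `model_dump_json` suggests so) and that `metrics.model_dump()` returns a nested dict. The helper `flatten` and the sample keys are made up for illustration; real field names should be taken from the printed output.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dicts, carrying the key path along.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Made-up stand-in for `metrics.model_dump()`; the keys are placeholders.
example = {"group_a": {"score_x": 0.9, "score_y": None}, "score_z": 0.5}
for name, value in flatten(example).items():
    print(f"{name} = {value}")
```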
Basic Usage
from mostlyai import qa
# initialize logging to stdout
qa.init_logging()
# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
)
# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    tgt_context_key="user_id",
)
# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    syn_ctx_data=synthetic_context_df,
    trn_ctx_data=training_context_df,
    hol_ctx_data=holdout_context_df,  # optional
    ctx_primary_key="id",
    tgt_context_key="user_id",
)
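When context data is passed, `tgt_context_key` in the target table refers to `ctx_primary_key` in the context table. Whether mostlyai-qa validates this link itself is not stated here, so a quick check before calling `qa.report` may save a confusing error. The helper `orphan_keys` below is a hypothetical sketch that works on any iterable of key values.

```python
def orphan_keys(target_keys, context_keys):
    """Return target key values that have no matching context record."""
    known = set(context_keys)
    return sorted({key for key in target_keys if key not in known})


# Toy values standing in for training_df["user_id"] and training_context_df["id"].
target_user_ids = [1, 1, 2, 3, 3, 4]
context_ids = [1, 2, 3]
print(orphan_keys(target_user_ids, context_ids))  # [4]
```

With the DataFrames from the examples above, the call would be `orphan_keys(training_df["user_id"], training_context_df["id"])`; an empty list means every target row has a context record.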
Sample Reports
- Baseball Players (Flat Data)
- Baseball Seasons (Sequential Data)
Citation
Please consider citing our project if you find it useful:
@misc{mostlyai-qa,
  title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
  author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
  year={2025},
  eprint={2504.01908},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.01908},
}