Synthetic Data Quality Assurance

Documentation | Sample Reports | Technical White Paper

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a few lines of Python code 💥.
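To make "novelty" concrete, here is a minimal, standard-library sketch of one common distance-based check: for each synthetic record, compare its distance to the nearest training record with its distance to the nearest holdout record. This is an illustration written from scratch, not the computation mostlyai-qa performs; the function names and toy points are hypothetical.

```python
import math


def nearest_distance(point, reference):
    """Euclidean distance from `point` to its closest record in `reference`."""
    return min(math.dist(point, ref) for ref in reference)


def share_closer_to_training(syn, trn, hol):
    """Fraction of synthetic records strictly closer to training than to holdout."""
    closer = sum(nearest_distance(s, trn) < nearest_distance(s, hol) for s in syn)
    return closer / len(syn)


# toy numeric records (hypothetical data)
trn = [(0.0, 0.0), (1.0, 1.0)]
hol = [(0.0, 1.0), (1.0, 0.0)]
syn_copy = [(0.0, 0.0), (1.0, 1.0)]   # exact copies of training records
syn_novel = [(0.1, 0.9), (0.9, 0.9)]  # one record near holdout, one near training

print(share_closer_to_training(syn_copy, trn, hol))
print(share_closer_to_training(syn_novel, trn, hol))
```

A share near 50% suggests the synthetic records are about as close to unseen holdout data as to the training data, while a share well above 50% hints that training records were reproduced too closely.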

https://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9

Installation

The latest release of mostlyai-qa can be installed via pip:

pip install -U mostlyai-qa

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-qa:

pip install -U torch==2.7.0+cpu torchvision==0.22.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu
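
To confirm which release ended up in the active environment, the installed version can be read from the package metadata. A minimal sketch using only the standard library; the helper name `installed_version` is an illustrative assumption, not part of mostlyai-qa:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# prints the version string, or None if mostlyai-qa is not installed
print(installed_version("mostlyai-qa"))
```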

Quick Start

import pandas as pd
import webbrowser
from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")

# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)

# pretty print metrics
print(metrics.model_dump_json(indent=4))

# open up HTML report in new browser window
webbrowser.open(report_path.absolute().as_uri())
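
The Quick Start loads ready-made training and holdout files. For your own data, one way to obtain a holdout is to split the original table at random before the generator is trained, so that the holdout rows are never seen during training. A minimal sketch, assuming a 50/50 split and a unique DataFrame index; the helper name `split_trn_hol` and the toy columns are hypothetical and not part of mostlyai-qa:

```python
import pandas as pd


def split_trn_hol(df: pd.DataFrame, hol_frac: float = 0.5, seed: int = 42):
    """Randomly split `df` into disjoint training and holdout frames.

    Assumes the index of `df` is unique, since rows are dropped by index label.
    """
    hol = df.sample(frac=hol_frac, random_state=seed)
    trn = df.drop(index=hol.index)
    return trn.reset_index(drop=True), hol.reset_index(drop=True)


# toy original data (hypothetical)
original = pd.DataFrame({"age": range(10), "income": range(100, 110)})
trn, hol = split_trn_hol(original)
print(len(trn), len(hol))
```

Only `trn` would then be used to train the generator, while `hol` is reserved for the comparison in the report.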

Basic Usage

from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# synthetic_df, training_df, holdout_df and the *_context_df frames below are
# placeholders for pandas DataFrames that you supply

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    tgt_context_key="user_id",  # key column that groups rows into sequences
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    syn_ctx_data=synthetic_context_df,
    trn_ctx_data=training_context_df,
    hol_ctx_data=holdout_context_df,  # optional
    ctx_primary_key="id",  # primary key column of the context tables
    tgt_context_key="user_id",  # column in the target tables referencing that key
)
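
For the sequential case with context, it can help to picture the two tables: the context table has one row per subject, and the target table has many rows per subject that refer back to it. The sketch below builds toy frames using the key names from the example above and runs a simple sanity check before the data is handed to the report. The helper `orphaned_keys` and the toy columns are hypothetical; the check is an illustration and not something mostlyai-qa is documented to require.

```python
import pandas as pd


def orphaned_keys(ctx: pd.DataFrame, tgt: pd.DataFrame,
                  ctx_primary_key: str, tgt_context_key: str) -> set:
    """Return target key values that have no matching row in the context table."""
    return set(tgt[tgt_context_key].tolist()) - set(ctx[ctx_primary_key].tolist())


# context: one row per user (toy data)
ctx = pd.DataFrame({"id": [1, 2, 3], "country": ["AT", "DE", "US"]})
# target: several events per user, linked through user_id (toy data)
tgt = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "amount": [9.5, 3.0, 12.0, 1.0, 4.5, 7.25],
})

# the context key should be unique, and every target row should point at a context row
print(ctx["id"].is_unique)
print(orphaned_keys(ctx, tgt, ctx_primary_key="id", tgt_context_key="user_id"))
```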

Citation

Please consider citing our project if you find it useful:

@misc{mostlyai-qa,
      title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
      author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
      year={2025},
      eprint={2504.01908},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.01908},
}
