Synthetic Data Quality Assurance 🔎

Documentation | Sample Reports | Technical White Paper

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a few lines of Python code 💥.

https://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9

Installation

The latest release of mostlyai-qa can be installed via pip:

pip install -U mostlyai-qa

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-qa:

pip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu
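After installing, one way to confirm which release ended up in the environment is to query the package metadata. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of mostlyai-qa.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "mostlyai-qa" is the distribution name used in the pip commands above;
# this prints None when the package is not installed in the current environment
print(installed_version("mostlyai-qa"))
```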

Quick Start

import pandas as pd
import webbrowser
from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")

# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)

# pretty print metrics
print(metrics.model_dump_json(indent=4))

# open the HTML report in a new browser window
# (as_uri() builds a valid file:// URL on all platforms, including Windows)
webbrowser.open(report_path.absolute().as_uri())
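Since the Quick Start prints the metrics via `model_dump_json`, the metrics object appears to be a Pydantic model, so `metrics.model_dump()` should return a nested dictionary. A small helper can flatten such a dictionary into dotted names for logging or comparison across runs. This is a sketch under that assumption: `flatten` is a hypothetical helper, and the example dictionary is made up rather than taken from the library's actual metric names.

```python
from typing import Any


def flatten(nested: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into a single level with dotted key paths."""
    flat: dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # descend into sub-dictionaries, extending the dotted path
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Made-up structure for illustration only; with the Quick Start objects one
# would call flatten(metrics.model_dump()) instead.
example = {"group_a": {"score": 0.9, "detail": {"x": 1}}, "group_b": None}
for name, value in flatten(example).items():
    print(f"{name}: {value}")
```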

Basic Usage

from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# synthetic_df, training_df, holdout_df and the *_context_df variables below are
# placeholders for pandas DataFrames that you load yourself

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    tgt_context_key="user_id",  # column that groups rows into sequences
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    syn_ctx_data=synthetic_context_df,
    trn_ctx_data=training_context_df,
    hol_ctx_data=holdout_context_df,  # optional
    ctx_primary_key="id",  # key column of the context tables
    tgt_context_key="user_id",  # column in the target tables that refers to that key
)
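When passing context and target tables together, it can help to confirm beforehand that every value in the target's context-key column also exists in the context table's primary-key column. The sketch below is an optional pre-check, not something the project description states the library requires; `orphan_keys` is a hypothetical helper, and the toy lists stand in for columns such as `training_context_df["id"]` and `training_df["user_id"]`.

```python
from typing import Hashable, Iterable


def orphan_keys(ctx_ids: Iterable[Hashable], tgt_keys: Iterable[Hashable]) -> set:
    """Return target context-key values that have no matching context primary key."""
    return set(tgt_keys) - set(ctx_ids)


# Toy values standing in for the "id" column of a context table and the
# "user_id" column of a target table
ctx_ids = [1, 2, 3]
tgt_keys = [1, 1, 2, 4]

# user 4 has target rows but no context row
print(orphan_keys(ctx_ids, tgt_keys))  # {4}
```

With pandas DataFrames, the same check could be run as `orphan_keys(training_context_df["id"], training_df["user_id"])`, since a Series is iterable over its values.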

Sample Reports

Citation

Please consider citing our project if you find it useful:

@misc{mostlyai-qa,
      title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
      author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
      year={2025},
      eprint={2504.01908},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.01908},
}
