Synthetic Data Quality Assurance

Documentation | Sample Reports | Technical White Paper

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a few lines of Python code 💥.

https://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9

Installation

The latest release of mostlyai-qa can be installed via pip:

pip install -U mostlyai-qa

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-qa:

pip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu
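After installing, it can be useful to confirm which release actually ended up in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of mostlyai-qa.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "mostlyai-qa"):
    """Return the installed version string of a distribution, or None if absent.

    `installed_version` is an illustrative helper, not part of mostlyai-qa.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"mostlyai-qa: {found or 'not installed'}")
```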

Quick Start

import pandas as pd
import webbrowser
from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")

# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)

# pretty print metrics
print(metrics.model_dump_json(indent=4))

# open up HTML report in new browser window
webbrowser.open(f"file://{report_path.absolute()}")

Basic Usage

from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
    tgt_context_key = "user_id",
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
    syn_ctx_data = synthetic_context_df,
    trn_ctx_data = training_context_df,
    hol_ctx_data = holdout_context_df,  # optional
    ctx_primary_key = "id",
    tgt_context_key = "user_id",
)
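The calls above take a training and an optional holdout sample, but do not show how such a split might be produced. For sequential data, one plausible approach (an assumption, not something this project prescribes) is to split by the context key, so that all rows of one entity land in the same sample. The function name `split_by_context_key`, the 50/50 default and the seed are all hypothetical choices for illustration.

```python
import pandas as pd


def split_by_context_key(df: pd.DataFrame, key: str, holdout_frac: float = 0.5, seed: int = 42):
    """Split rows into (training, holdout) so that no key value spans both parts.

    Illustrative helper only; not part of mostlyai-qa.
    """
    # sample a fraction of the distinct key values for the holdout part
    keys = pd.Series(df[key].unique())
    holdout_keys = keys.sample(frac=holdout_frac, random_state=seed)
    # assign every row according to the part its key value was drawn into
    is_holdout = df[key].isin(holdout_keys)
    training_df = df[~is_holdout].reset_index(drop=True)
    holdout_df = df[is_holdout].reset_index(drop=True)
    return training_df, holdout_df


if __name__ == "__main__":
    events = pd.DataFrame({"user_id": [1, 1, 2, 2, 3, 3, 4, 4], "amount": range(8)})
    training_df, holdout_df = split_by_context_key(events, key="user_id")
    print(len(training_df), len(holdout_df))
```

The resulting frames could then be passed as `trn_tgt_data` and `hol_tgt_data`, with the same column given as `tgt_context_key`.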

Citation

Please consider citing our project if you find it useful:

@misc{mostlyai-qa,
      title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
      author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
      year={2025},
      eprint={2504.01908},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.01908},
}

Download files

Source Distribution

mostlyai_qa-1.10.7.tar.gz (30.6 MB)

Uploaded Source

Built Distribution

mostlyai_qa-1.10.7-py3-none-any.whl (30.7 MB)

Uploaded Python 3

File details

Details for the file mostlyai_qa-1.10.7.tar.gz.

File metadata

  • Download URL: mostlyai_qa-1.10.7.tar.gz
  • Upload date:
  • Size: 30.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_qa-1.10.7.tar.gz
Algorithm    Hash digest
SHA256       f9d3e0f306aa6ed7a98e2dba921366db3fa5490a698bc86dee482be01c83dcad
MD5          1cf1b9769cb89b49f375347a5ff7784b
BLAKE2b-256  c89cab1c785ec845c4005a2e6d44c293de0d212ccfec1709c47bab0f99d61c12
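A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only the standard library; the function names and the local file path are hypothetical, and the file is assumed to have been downloaded to the current directory beforehand.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


if __name__ == "__main__":
    # digest copied from the listing above; the local path is an assumption
    expected = "f9d3e0f306aa6ed7a98e2dba921366db3fa5490a698bc86dee482be01c83dcad"
    artifact = Path("mostlyai_qa-1.10.7.tar.gz")
    if artifact.exists():
        print("digest matches:", matches_published_digest(artifact, expected))
    else:
        print(f"{artifact} not found; download it first")
```

`pip` can also enforce digests itself via its hash-checking mode, which may be preferable for pinned requirements files.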

Provenance

The following attestation bundles were made for mostlyai_qa-1.10.7.tar.gz:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mostlyai_qa-1.10.7-py3-none-any.whl.

File metadata

  • Download URL: mostlyai_qa-1.10.7-py3-none-any.whl
  • Upload date:
  • Size: 30.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_qa-1.10.7-py3-none-any.whl
Algorithm    Hash digest
SHA256       da76f7db04ca471b03b92f645bbec80ea94877390b23bf087f274bf021d00b27
MD5          e22b55ce62332edb7921dedc3ea5d559
BLAKE2b-256  084f0237325b0162f58d3269f2a5cf0305d6fcbc90ab2bee17ef0fb831d0a8a2

Provenance

The following attestation bundles were made for mostlyai_qa-1.10.7-py3-none-any.whl:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
