Skip to main content

Synthetic Data Quality Assurance

Project description

Synthetic Data Quality Assurance 🔎

Documentation stats license GitHub Release PyPI - Python Version

Documentation | Sample Reports | Technical White Paper

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a few lines of Python code 💥.

https://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9

Installation

The latest release of mostlyai-qa can be installed via pip:

pip install -U mostlyai-qa

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-qa:

pip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu

Quick Start

import pandas as pd
import webbrowser
from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")

# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)

# pretty print metrics
print(metrics.model_dump_json(indent=4))

# open up HTML report in new browser window
webbrowser.open(f"file://{report_path.absolute()}")

Basic Usage

from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
    tgt_context_key = "user_id",
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
    syn_ctx_data = synthetic_context_df,
    trn_ctx_data = training_context_df,
    hol_ctx_data = holdout_context_df,  # optional
    ctx_primary_key = "id",
    tgt_context_key = "user_id",
)

Sample Reports

Citation

Please consider citing our project if you find it useful:

@misc{mostlyai-qa,
      title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
      author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
      year={2025},
      eprint={2504.01908},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.01908},
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mostlyai_qa-1.10.5.tar.gz (30.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mostlyai_qa-1.10.5-py3-none-any.whl (30.7 MB view details)

Uploaded Python 3

File details

Details for the file mostlyai_qa-1.10.5.tar.gz.

File metadata

  • Download URL: mostlyai_qa-1.10.5.tar.gz
  • Upload date:
  • Size: 30.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_qa-1.10.5.tar.gz
Algorithm Hash digest
SHA256 9db9c68262f572ad776212b72015ecb2c7618a98e3ce1a26d2625b88a7409677
MD5 9fcba8cbcf40de3e44a1ec12e8eeda26
BLAKE2b-256 3312ea8d9f231a9fd7e35f3171f5369e8f12111e791be34d97f712b822ec3d9d

See more details on using hashes here.

Provenance

The following attestation bundles were made for mostlyai_qa-1.10.5.tar.gz:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mostlyai_qa-1.10.5-py3-none-any.whl.

File metadata

  • Download URL: mostlyai_qa-1.10.5-py3-none-any.whl
  • Upload date:
  • Size: 30.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_qa-1.10.5-py3-none-any.whl
Algorithm Hash digest
SHA256 60b48ece68f4713cb6ab677a1576807136172d8fe0a71679191b07024321c817
MD5 bc7391f2c15913be874fe5a26449d839
BLAKE2b-256 72b06df33fc1f82209f35b08e954454dd2714e0721898e3262ed5969ec970127

See more details on using hashes here.

Provenance

The following attestation bundles were made for mostlyai_qa-1.10.5-py3-none-any.whl:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page