Synthetic Data Quality Assurance 🔎

Documentation | Sample Reports | Technical White Paper

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a few lines of Python code 💥.
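
As a rough intuition for what "novelty" can mean, one can ask how often synthetic records coincide exactly with training records, compared with how often they coincide with holdout records that the generator never saw. The sketch below is a toy illustration in plain Python and is not one of the package's metrics; the function name `exact_match_share` and the tiny datasets are invented for this example.

```python
def exact_match_share(synthetic_rows, reference_rows):
    """Fraction of synthetic rows that appear verbatim among the reference rows."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    reference = {tuple(row) for row in reference_rows}
    matches = sum(1 for row in synthetic_rows if tuple(row) in reference)
    return matches / len(synthetic_rows)


# Invented (age, sex) records, purely for illustration.
training = [(25, "F"), (40, "M"), (31, "F")]
holdout = [(52, "M"), (19, "F"), (25, "F")]
synthetic = [(25, "F"), (33, "M"), (40, "M"), (60, "F")]

print(exact_match_share(synthetic, training))  # 0.5
print(exact_match_share(synthetic, holdout))   # 0.25
```

A share against training that is much higher than the share against holdout would hint that records are being copied rather than generated afresh.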

https://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9

Installation

The latest release of mostlyai-qa can be installed via pip:

pip install -U mostlyai-qa

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-qa:

pip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu
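
To confirm which release ended up in the environment, a check along these lines may help. It is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and the distribution name `mostlyai-qa` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("mostlyai-qa")
    print(found if found is not None else "mostlyai-qa is not installed")
```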

Quick Start

import pandas as pd
import webbrowser
from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")

# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)

# pretty print metrics
print(metrics.model_dump_json(indent=4))

# open the HTML report in a new browser window
# (as_uri() builds a well-formed file:// URL on every platform)
webbrowser.open(report_path.absolute().as_uri())
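
The Quick Start loads ready-made training and holdout files. For your own data, one possible way to obtain such a split is to shuffle the row positions of the original table and set a share aside before the synthetic data is generated. The sketch below assumes a simple random split is appropriate; the helper name `split_row_indices`, the seed, and the 30% share are illustrative choices, not recommendations from the package.

```python
import random


def split_row_indices(n_rows, holdout_fraction, seed=42):
    """Shuffle row positions and split them into (training, holdout) lists."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is reproducible
    n_holdout = round(n_rows * holdout_fraction)
    return sorted(positions[n_holdout:]), sorted(positions[:n_holdout])


# Example: 10 rows, 30% of them held out.
trn_idx, hol_idx = split_row_indices(n_rows=10, holdout_fraction=0.3)
```

With pandas, the positions could then be applied as `original_df.iloc[trn_idx]` and `original_df.iloc[hol_idx]`, where `original_df` is a hypothetical DataFrame holding the original data.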

Basic Usage

from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# synthetic_df, training_df, holdout_df and the *_context_df objects
# are placeholders for your own pandas DataFrames

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    tgt_context_key="user_id",
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    syn_ctx_data=synthetic_context_df,
    trn_ctx_data=training_context_df,
    hol_ctx_data=holdout_context_df,  # optional
    ctx_primary_key="id",
    tgt_context_key="user_id",
)
)
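
In the last call, the parameter names suggest that `tgt_context_key` ("user_id") refers to rows of the context table identified by `ctx_primary_key` ("id"). This reading is an assumption drawn from the names, not something stated here. Under that assumption, a quick sanity check before calling `qa.report` is to look for target keys without a matching context row. The helper `find_orphan_keys` and the sample values are hypothetical.

```python
def find_orphan_keys(ctx_primary_keys, tgt_context_keys):
    """Return target context keys that have no matching context primary key."""
    known = set(ctx_primary_keys)
    return sorted({key for key in tgt_context_keys if key not in known})


ctx_ids = [1, 2, 3]             # e.g. values of the context "id" column
tgt_user_ids = [1, 1, 2, 4, 4]  # e.g. values of the target "user_id" column
orphans = find_orphan_keys(ctx_ids, tgt_user_ids)
print(orphans)  # [4]
```

With DataFrames, the same check could be run as `find_orphan_keys(training_context_df["id"], training_df["user_id"])`.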

Sample Reports

Citation

Please consider citing our project if you find it useful:

@misc{mostlyai-qa,
      title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
      author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
      year={2025},
      eprint={2504.01908},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.01908},
}

Download files

Source Distribution

mostlyai_qa-1.10.4.tar.gz (30.6 MB)

Uploaded Source

Built Distribution

mostlyai_qa-1.10.4-py3-none-any.whl (30.7 MB)

Uploaded Python 3

File details

Details for the file mostlyai_qa-1.10.4.tar.gz.

File metadata

  • Download URL: mostlyai_qa-1.10.4.tar.gz
  • Upload date:
  • Size: 30.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_qa-1.10.4.tar.gz
Algorithm Hash digest
SHA256 d50cdc581af5504cefbc0c08db497ad061fec6420f4736f515bd170bda04ed58
MD5 ca9dc79df9e189daafa6fcda9acee98b
BLAKE2b-256 9513ab5c06464b3b4ca4ad074ff65da167e9fbc90ffab80a6da58e1d01505541

Provenance

The following attestation bundles were made for mostlyai_qa-1.10.4.tar.gz:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mostlyai_qa-1.10.4-py3-none-any.whl.

File metadata

  • Download URL: mostlyai_qa-1.10.4-py3-none-any.whl
  • Upload date:
  • Size: 30.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_qa-1.10.4-py3-none-any.whl
Algorithm Hash digest
SHA256 1a2d182205319522b9182c473014445967f64a3324d78181879df1df8f8cb6c5
MD5 8d926b367633e67fc8a88ca6ed908f07
BLAKE2b-256 dc2e8f006872e0dc45916eaa2efafee8ee9c2448b01bea4592a72952c0a4d25e

Provenance

The following attestation bundles were made for mostlyai_qa-1.10.4-py3-none-any.whl:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
