Quality assurance for synthetic data

Synthetic Data - Quality Assurance

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a single line of Python code 💥.

Getting Started

Installation

pip install -U mostlyai-qa

Basic Usage

from mostlyai import qa

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
    tgt_context_key = "user_id",
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data = synthetic_df,
    trn_tgt_data = training_df,
    hol_tgt_data = holdout_df,  # optional
    syn_ctx_data = synthetic_context_df,
    trn_ctx_data = training_context_df,
    hol_ctx_data = holdout_context_df,  # optional
    ctx_primary_key = "id",
    tgt_context_key = "user_id",
)

Syntax

def report(
    *,
    syn_tgt_data: pd.DataFrame,
    trn_tgt_data: pd.DataFrame,
    hol_tgt_data: pd.DataFrame | None = None,
    syn_ctx_data: pd.DataFrame | None = None,
    trn_ctx_data: pd.DataFrame | None = None,
    hol_ctx_data: pd.DataFrame | None = None,
    ctx_primary_key: str | None = None,
    tgt_context_key: str | None = None,
    report_path: str | Path | None = "model-report.html",
    report_title: str = "Model Report",
    report_subtitle: str = "",
    report_credits: str = REPORT_CREDITS,
    report_extra_info: str = "",
    max_sample_size_accuracy: int = MAX_SAMPLE_SIZE_ACCURACY,
    max_sample_size_embeddings: int = MAX_SAMPLE_SIZE_EMBEDDINGS,
    statistics_path: str | Path | None = None,
    on_progress: ProgressCallback | None = None,
) -> tuple[Path, dict | None]:
    """
    Generate HTML report and metrics for comparing synthetic and original data samples.

    Args:
        syn_tgt_data: Synthetic samples
        trn_tgt_data: Training samples
        hol_tgt_data: Holdout samples
        syn_ctx_data: Synthetic context samples
        trn_ctx_data: Training context samples
        hol_ctx_data: Holdout context samples
        ctx_primary_key: Column within the context data that contains the primary key
        tgt_context_key: Column within the target data that contains the key to link to the context
        report_path: Path of where to store the HTML report
        report_title: Title of the HTML report
        report_subtitle: Subtitle of the HTML report
        report_credits: Credits of the HTML report
        report_extra_info: Extra details to be included in the HTML report
        max_sample_size_accuracy: Max sample size for accuracy
        max_sample_size_embeddings: Max sample size for embeddings (similarity & distances)
        statistics_path: Path of where to store the statistics to be used by `report_from_statistics`
        on_progress: A custom progress callback
    Returns:
        1. Path to the HTML report
        2. Dictionary of calculated metrics:
        - `accuracy`:  # Accuracy is defined as (100% - Total Variation Distance), for each distribution, and then averaged across.
          - `overall`: Overall accuracy of synthetic data, i.e. average across univariate, bivariate and coherence.
          - `univariate`: Average accuracy of discretized univariate distributions.
          - `bivariate`: Average accuracy of discretized bivariate distributions.
          - `coherence`: Average accuracy of discretized coherence distributions. Only applicable for sequential data.
          - `overall_max`: Expected overall accuracy of a same-sized holdout. Serves as reference for `overall`.
          - `univariate_max`: Expected univariate accuracy of a same-sized holdout. Serves as reference for `univariate`.
          - `bivariate_max`: Expected bivariate accuracy of a same-sized holdout. Serves as reference for `bivariate`.
          - `coherence_max`: Expected coherence accuracy of a same-sized holdout. Serves as reference for `coherence`.
        - `similarity`:  # All similarity metrics are calculated within an embedding space.
            - `cosine_similarity_training_synthetic`: Cosine similarity between training and synthetic centroids.
            - `cosine_similarity_training_holdout`: Cosine similarity between training and holdout centroids. Serves as reference for `cosine_similarity_training_synthetic`.
            - `discriminator_auc_training_synthetic`: Cross-validated AUC of a discriminative model to distinguish between training and synthetic samples.
            - `discriminator_auc_training_holdout`: Cross-validated AUC of a discriminative model to distinguish between training and holdout samples. Serves as reference for `discriminator_auc_training_synthetic`.
        - `distances`:  # All distance metrics are calculated within an embedding space. An equal number of training and holdout samples is considered.
            - `ims_training`: Share of synthetic samples that are identical to a training sample.
            - `ims_holdout`: Share of synthetic samples that are identical to a holdout sample. Serves as reference for `ims_training`.
            - `dcr_training`: Average L2 nearest-neighbor distance between synthetic and training samples.
            - `dcr_holdout`: Average L2 nearest-neighbor distance between synthetic and holdout samples. Serves as reference for `dcr_training`.
            - `dcr_share`: Share of synthetic samples that are closer to a training sample than to a holdout sample. This should not be significantly larger than 50%.
    """
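
Most of the returned metrics come with a reference value (`*_max` for accuracy, `*_holdout` for similarity and distances). The following is a minimal sketch of how one might pair them up, assuming `metrics` is a plain nested dictionary shaped as in the docstring above. The helper `compare_to_reference` is hypothetical and not part of the package, and the numbers are made up for illustration.

```python
def compare_to_reference(metrics):
    """Return metric-minus-reference gaps for the pairs named in the docstring.

    Hypothetical helper; assumes `metrics` is a nested dict of floats.
    """
    pairs = [
        ("accuracy", "overall", "overall_max"),
        ("accuracy", "univariate", "univariate_max"),
        ("accuracy", "bivariate", "bivariate_max"),
        ("accuracy", "coherence", "coherence_max"),
        ("similarity", "cosine_similarity_training_synthetic",
         "cosine_similarity_training_holdout"),
        ("similarity", "discriminator_auc_training_synthetic",
         "discriminator_auc_training_holdout"),
        ("distances", "ims_training", "ims_holdout"),
        ("distances", "dcr_training", "dcr_holdout"),
    ]
    gaps = {}
    for group, name, reference_name in pairs:
        section = metrics.get(group) or {}
        value = section.get(name)
        reference = section.get(reference_name)
        # Skip pairs that are missing, e.g. coherence for non-sequential data
        # or holdout-based references when no holdout was provided.
        if value is not None and reference is not None:
            gaps[f"{group}.{name}"] = value - reference
    return gaps


# Made-up values, only to show the shape of the input and output.
example_metrics = {
    "accuracy": {"overall": 0.96, "overall_max": 0.98},
    "distances": {"dcr_training": 0.21, "dcr_holdout": 0.22},
}
gaps = compare_to_reference(example_metrics)
for key, gap in gaps.items():
    print(f"{key}: {gap:+.3f}")
```

A gap near zero means the synthetic data behaves about like a same-sized holdout would for that metric.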

Metrics

We calculate three sets of metrics to compare synthetic data with the original data.

Accuracy

We calculate discretized marginal distributions for all columns and then compute the L1 distance between the synthetic and the original training distributions. The reported accuracy is 100% minus the total variation distance (TVD), which is half the L1 distance between the two distributions. We then average these accuracies to get a single accuracy score. The higher the score, the better the synthetic data.

  1. Univariate Accuracy: We measure the accuracy for the univariate distributions for all target columns.
  2. Bivariate Accuracy: We measure the accuracy for all pair-wise distributions for target columns, as well as for target columns with respect to the context columns.
  3. Coherence Accuracy: We measure the accuracy for the auto-correlation for all target columns. Only applicable for sequential data.

An overall accuracy score is then calculated as the average of these aggregate-level scores.
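
The accuracy definition above can be sketched in a few lines of plain Python. This is an illustration of "100% minus TVD" on columns that are already discretized, not the package's implementation. The function names and the toy data are assumptions made for this example.

```python
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(c, 0) / total for c in categories]


def tvd_accuracy(trn_values, syn_values):
    """Accuracy = 1 - TVD, where TVD is half the L1 distance."""
    categories = sorted(set(trn_values) | set(syn_values))
    p = distribution(trn_values, categories)
    q = distribution(syn_values, categories)
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    return 1.0 - 0.5 * l1


# Made-up, already-discretized columns (illustrative only).
trn = {"color": ["a", "a", "b", "b"], "size": ["s", "s", "s", "l"]}
syn = {"color": ["a", "b", "b", "b"], "size": ["s", "s", "l", "l"]}

# One accuracy per column, then averaged into a univariate score.
per_column = {col: tvd_accuracy(trn[col], syn[col]) for col in trn}
univariate = sum(per_column.values()) / len(per_column)
print(per_column, univariate)
```

The same function applies to bivariate distributions if each value is a pair of discretized column values, for example `list(zip(col_a, col_b))`.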

Similarity

We embed all records into an embedding space, to calculate two metrics:

  1. Cosine Similarity: We calculate the cosine similarity between the centroids of the synthetic and the original training data. This is then compared to the cosine similarity between the centroids of the original training and holdout data. The higher the score, the better the synthetic data.
  2. Discriminator AUC: We train a binary classifier to check whether one can distinguish between synthetic and original training data based on their embeddings. This is again compared to the same metric for the original training and holdout data. A score close to 50% indicates that synthetic samples are indistinguishable from original samples.
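
As a rough illustration of the first metric, the centroid comparison can be written as follows. The sketch assumes the records have already been embedded into fixed-length vectors. The embedding model itself is not shown, and the tiny 2-dimensional vectors are made up. It does not cover the discriminator AUC.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings (illustrative only).
trn_embeds = [[1.0, 0.0], [0.0, 1.0]]
syn_embeds = [[1.0, 1.0], [3.0, 3.0]]
hol_embeds = [[2.0, 0.0], [0.0, 0.0]]

trn_centroid = centroid(trn_embeds)
sim_trn_syn = cosine_similarity(trn_centroid, centroid(syn_embeds))
sim_trn_hol = cosine_similarity(trn_centroid, centroid(hol_embeds))
print(sim_trn_syn, sim_trn_hol)
```

The training-versus-holdout value serves as the reference: a training-versus-synthetic similarity at or near that level suggests the synthetic centroid is about as close to the training data as real unseen data would be.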

Distances

We again embed all records into an embedding space and then measure individual-level L2 distances between samples. For each synthetic sample, we calculate the distance to the closest record (DCR) among the original samples. We do this once with respect to the original training records and once with respect to the holdout records, and then compare these DCRs to each other. For privacy-safe synthetic data, we expect the synthetic data to be just as close to the original training data as it is to the original holdout data.
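
The DCR comparison can be sketched with a brute-force nearest-neighbor search. This is only an illustration on made-up 2-dimensional embeddings. The function names are hypothetical, and counting ties as "not closer to training" is an assumption of this sketch, not a documented behavior of the package.

```python
import math


def nearest_distance(sample, references):
    """L2 distance from `sample` to its nearest neighbor in `references`."""
    return min(math.dist(sample, ref) for ref in references)


def dcr_metrics(syn, trn, hol):
    """Average DCRs and the share of synthetic samples closer to training."""
    dcr_trn = [nearest_distance(s, trn) for s in syn]
    dcr_hol = [nearest_distance(s, hol) for s in syn]
    n = len(syn)
    return {
        "dcr_training": sum(dcr_trn) / n,
        "dcr_holdout": sum(dcr_hol) / n,
        # Assumption: a tie does not count as "closer to training".
        "dcr_share": sum(t < h for t, h in zip(dcr_trn, dcr_hol)) / n,
    }


# Made-up embeddings with equally many training and holdout records.
trn = [[0.0, 0.0], [4.0, 0.0]]
hol = [[0.0, 3.0], [4.0, 3.0]]
syn = [[0.0, 1.0], [4.0, 2.0], [2.0, 0.0]]

result = dcr_metrics(syn, trn, hol)
print(result)
```

In this toy case two of the three synthetic samples lie closer to a training record, so `dcr_share` is about 67%. On real data, a share well above 50% would hint that the synthetic samples sit closer to the training data than unseen data does.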

Sample HTML Report

[Screenshots of the report sections: Metrics, Accuracy Univariates, Accuracy Bivariates, Accuracy Coherence, Similarity, Distances]

See the examples directory for further examples.
