A package for evaluating synthetic data

SynthGauge

SynthGauge is a Python library providing a framework in which to evaluate synthetically generated data.

The library provides a range of metrics and visualisations for assessing and comparing feature distributions between real and synthetic data. At its core is the Evaluator class, which provides a consistent interface for comparing a real dataset with a synthetic one. By creating several Evaluator instances, you can evaluate synthetic data generated by a range of methods in a consistent and comparable manner.

Privacy vs. Utility

When generating synthetic data, there is generally a trade-off between privacy (i.e. keeping sensitive information private) and utility (i.e. ensuring the dataset is still fit for purpose).

Each metric included in SynthGauge falls into one of these two categories, and work is continuing to add more.
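To make the trade-off concrete, here is a minimal sketch using only the standard library. The two helper functions, `copied_fraction` and `mean_age_gap`, are hypothetical proxies invented for this illustration and are not SynthGauge metrics; the records are toy values.

```python
def copied_fraction(real_rows, synth_rows):
    """Privacy-risk proxy: share of synthetic rows that duplicate a real row."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synth_rows) / len(synth_rows)


def mean_age_gap(real_rows, synth_rows):
    """Utility-loss proxy: absolute difference in mean age (first column)."""
    real_mean = sum(row[0] for row in real_rows) / len(real_rows)
    synth_mean = sum(row[0] for row in synth_rows) / len(synth_rows)
    return abs(real_mean - synth_mean)


# Toy (age, blood_type) records, invented for illustration only.
real = [(22, "O"), (37, "A"), (41, "O"), (48, "B")]
copy_of_real = list(real)                              # releases the real rows as-is
perturbed = [(age + 5, blood) for age, blood in real]  # shifts every age by 5

for name, synth in [("copy", copy_of_real), ("perturbed", perturbed)]:
    print(name, copied_fraction(real, synth), mean_age_gap(real, synth))
# prints:
# copy 1.0 0.0
# perturbed 0.0 5.0
```

Releasing a copy of the real data loses no utility but exposes every record, while the perturbed version exposes none at the cost of a shifted mean. Real privacy and utility metrics are more nuanced than these proxies, but they pull in the same opposing directions.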

Mission Statement

SynthGauge is a toolkit providing metrics and visualisations that aid the user in the assessment of their synthetic data.

SynthGauge will not make any decisions on behalf of the user. It won’t say whether one synthetic dataset is better than another. That decision depends on the dataset and its intended purpose, so it can vary widely from user to user.

Put simply, SynthGauge is a decision-support tool, not a decision-making tool.

Getting Started

Install

The synthgauge package is available on PyPI and can be installed via pip in the standard way:

$ python -m pip install synthgauge

If you'd rather install the package from source, first clone this repository from GitHub. The synthgauge package is configured using setup.cfg, which requires recent versions of pip, setuptools and wheel. Be sure to update these if you have not done so recently.

$ cd /path/to/synthgauge
$ python -m pip install --upgrade pip setuptools wheel
$ python -m pip install .
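One way to confirm that the installation succeeded is to ask Python for the installed version. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical and not part of synthgauge.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if synthgauge is installed, otherwise None.
print(installed_version("synthgauge"))
```

If this prints `None`, the package is not visible to the interpreter you ran it with, which usually means pip installed it into a different environment.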

Now you're ready to start using the package!

Usage

To help users get acquainted with the package, an example Jupyter Notebook is included in the examples directory. This notebook is also available in the package documentation.

The following shows an example workflow for evaluating a single real-synthetic dataset pair.

>>> import synthgauge as sg
>>>
>>> # 1. Create or read in some data as a `pandas.DataFrame`
>>> real = sg.datasets.make_blood_types_df(noise=0, nan_prop=0, seed=0)
>>> synth = sg.datasets.make_blood_types_df(noise=1, nan_prop=0, seed=0)
>>>
>>> # 2. Instantiate an Evaluator object
>>> ev = sg.Evaluator(real, synth)
>>>
>>> # 3. Explore the data
>>> ev.describe_numeric()
               count     mean        std    min    25%    50%    75%    max
age_real      1000.0   41.745   7.073472   22.0   37.0   41.0   48.0   62.0
age_synth     1000.0   41.536   9.195829   18.0   35.0   41.0   48.0   68.0
height_real   1000.0  174.976   7.771346  153.0  169.0  176.0  181.0  194.0
height_synth  1000.0  175.266   9.633070  147.0  168.0  176.0  182.0  205.0
weight_real   1000.0   80.014   9.455115   56.0   74.0   80.0   86.0  114.0
weight_synth  1000.0   80.117  11.113452   50.0   72.0   80.0   88.0  118.0
>>> ev.describe_categorical()
                  count unique most_frequent freq
blood_type_real    1000      4             O  384
blood_type_synth   1000      4             A  535
eye_colour_real    1000      3         Brown  577
eye_colour_synth   1000      3         Brown  664
hair_colour_real   1000      4         Brown  435
hair_colour_synth  1000      4         Brown  480
>>> ev.plot_histograms(figsize=(12,12))
<Figure size 1200x1200 with 6 Axes>
>>>
>>> # 4. Add metrics to compute
>>> ev.add_metric('wasserstein', 'wass-age', feature='age')
>>>
>>> # 5. Evaluate the metrics and review the results
>>> results = ev.evaluate()
>>> print(results)
{'wass-age': 1.7610000000000001}
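To build intuition for the number reported above, the following sketch computes a one-dimensional Wasserstein distance by hand. It assumes the `wasserstein` metric is the usual first Wasserstein distance and relies on the fact that, for two samples of equal size, this distance equals the mean absolute difference between the sorted values. The function `wasserstein_1d` is a hypothetical helper written for illustration, not SynthGauge's implementation, and the ages are toy values.

```python
def wasserstein_1d(real, synth):
    """First Wasserstein distance between two equal-sized 1-D samples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    # Sorting pairs the i-th smallest real value with the i-th smallest
    # synthetic value, which is the optimal matching in one dimension.
    pairs = zip(sorted(real), sorted(synth))
    return sum(abs(r - s) for r, s in pairs) / len(real)


# Toy ages, invented for illustration only.
real_ages = [22, 37, 41, 48, 62]
synth_ages = [18, 35, 41, 48, 68]
print(wasserstein_1d(real_ages, synth_ages))  # prints 2.4
```

Read this way, the distance is in the same units as the feature: a value of 2.4 means that, on average, matched real and synthetic ages differ by 2.4 years.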

Further Help

The API is described in the reference documentation. Please direct any questions to datacampus@ons.gov.uk.

Contributing

If you encounter any bugs while using synthgauge, please file an issue with as much detail as possible, including a minimal reproducible example. If you wish to contribute code to the project, please refer to the contribution guidelines.

