Package for creating synthetic datasets while preserving privacy.

MetaSynth

MetaSynth is a Python package for generating synthetic data, geared mostly towards code testing and reproducibility. Under the ONS methodology, MetaSynth falls in the "augmented plausible" category. To generate synthetic data, MetaSynth converts a polars DataFrame into a data structure that follows the GMF standard file format. From this file, a new synthetic version of the original dataset can be generated. A GMF file is human-readable JSON, so privacy experts can sanitize it for public use.
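The workflow described above (fit metadata, write it to JSON, generate new rows from the JSON alone) can be illustrated with a toy sketch that uses only the standard library. This is not MetaSynth's implementation and the JSON layout is not the GMF format; the function names `fit_metadata` and `synthesize`, the per-column summaries, and the sample rows are all assumptions made for illustration.

```python
import json
import random
import statistics


def fit_metadata(rows):
    """Summarize each column (hypothetical helper, not part of MetaSynth).

    Numeric columns are described by mean and standard deviation,
    all other columns by the frequency of each distinct value.
    """
    meta = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            meta[name] = {
                "type": "normal",
                "mean": statistics.fmean(values),
                "stdev": statistics.pstdev(values),
            }
        else:
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            meta[name] = {"type": "categorical", "counts": counts}
    return meta


def synthesize(meta, n, seed=None):
    """Draw n synthetic rows using only the metadata, never the original rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in meta.items():
            if spec["type"] == "normal":
                row[name] = rng.gauss(spec["mean"], spec["stdev"])
            else:
                labels = list(spec["counts"])
                weights = list(spec["counts"].values())
                row[name] = rng.choices(labels, weights=weights)[0]
        rows.append(row)
    return rows


# Made-up sample rows, for illustration only.
original = [
    {"age": 22.0, "sex": "male"},
    {"age": 38.0, "sex": "female"},
    {"age": 26.0, "sex": "female"},
]

# The JSON string is the only thing that would need to be reviewed and shared.
meta_json = json.dumps(fit_metadata(original), indent=2)
synthetic = synthesize(json.loads(meta_json), n=5, seed=0)
```

The point of the intermediate JSON step is that the synthetic rows depend only on the reviewed metadata, so anything removed from the file cannot leak into the generated data.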

Features

  • Automatic and manual distribution fitting
  • Generates polars DataFrames with synthetic data that resembles the original data.
  • Distributions for the most commonly used datatypes: categorical, string, integer, float, date, time and datetime.
  • Integrates with the faker package.
  • Structured string detection.
  • Variables that have unique values/keys.

Installation

You can install MetaSynth directly from PyPI by running the following command in the terminal (not in Python):

pip install metasynth

Example

To process a dataset, first create a polars DataFrame. As an example, we will use the Titanic dataset:

import polars as pl

dtypes = {
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical,
    "Survived": pl.Categorical,
    "Pclass": pl.Categorical,
    "SibSp": pl.Categorical,
    "Parch": pl.Categorical
}
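# The keyword for column types depends on the polars version:
# `dtypes` in older releases, `schema_overrides` in newer ones.
df = pl.read_csv("titanic.csv", dtypes=dtypes)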

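From the polars DataFrame, we create a metadataset and store it in a JSON file that follows the GMF standard:

from metasynth import MetaDataset

# Fit the metadata on the DataFrame and write it to a GMF JSON file.
dataset = MetaDataset.from_dataframe(df)
dataset.to_json("test.json")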
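Because the resulting file is plain JSON, it can be inspected with standard tooling before it is shared. The sketch below uses only the Python standard library and assumes nothing about the GMF schema beyond the file holding a JSON object at the top level; the helper name `summarize_json` is hypothetical and not part of MetaSynth.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a mapping of each top-level key to the type name of its value."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}
```

Calling `summarize_json("test.json")` on the file written above gives a quick overview of its top-level entries, which is a reasonable starting point for a manual privacy review.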

Note on pandas

Internally, MetaSynth uses polars (instead of pandas), mainly because its typing and its handling of missing data are more consistent. It is possible to pass a pandas DataFrame instead of a polars DataFrame to MetaDataset.from_dataframe. However, this relies on the automatic polars conversion, which can cause problems in some edge cases. We therefore advise users to create polars DataFrames. The resulting synthetic dataset is always a polars DataFrame, but it can easily be converted back to a pandas DataFrame with df_pandas = df_polars.to_pandas().

Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

To contribute:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

MetaSynth is a project by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.
