Package for creating synthetic datasets while preserving privacy.


Metasyn

Metasyn is a Python package designed to generate tabular synthetic data for rigorous code testing and reproducibility. Researchers and data owners can use metasyn to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Metasyn also facilitates transparency and reproducibility by allowing the underlying MetaFrames to be exported and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring access to the sensitive data.

The package has three main functionalities:

Estimation: Metasyn can create a MetaFrame from a dataset. A MetaFrame is essentially a fitted model that characterizes the structure of the original dataset without storing actual values. It captures the distribution and features of each individual variable and can be seen as (statistical) metadata from which synthetic data is generated.
Serialization: Metasyn can export a MetaFrame to an easy-to-read JSON file, allowing users to audit, understand, and modify their data generation model.
Generation: Metasyn can generate synthetic data based on a MetaFrame. The synthetic data depends solely on the MetaFrame, which maintains a critical separation between the original sensitive data and the generated synthetic data. The synthetic data emulates the original data's format and plausibility at the individual record level and attempts to reproduce marginal (univariate) distributions where possible. Generated values are based on the observed distributions, with a degree of variance and smoothing added. The generated data does not aim to preserve the relationships between variables. The frequency of missing values and their codes are maintained in the synthetic dataset.
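
The separation between these three steps can be sketched in plain Python for a single categorical column. This is a conceptual illustration only, not metasyn's implementation: fit_column and synthesize_column are hypothetical helpers, the summary keys simply borrow the prop_missing, labels, and probs names used in GMF files, and the input column is assumed to be non-empty.

```python
import json
import random
from collections import Counter


def fit_column(values):
    """Estimation: summarize a categorical column (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    labels = sorted(counts)
    return {
        "prop_missing": (len(values) - len(observed)) / len(values),
        "labels": labels,
        "probs": [counts[label] / len(observed) for label in labels],
    }


def synthesize_column(summary, n, seed=None):
    """Generation: draw n values from the summary alone (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        if rng.random() < summary["prop_missing"]:
            synthetic.append(None)  # reproduce the share of missing values
        else:
            label = rng.choices(summary["labels"], weights=summary["probs"])[0]
            synthetic.append(label)
    return synthetic


original = ["apple", "banana", "banana", None, "banana", "apple"]

summary = fit_column(original)   # estimation
payload = json.dumps(summary)    # serialization: human-readable JSON text

# Generation only ever sees the JSON payload, never the original column.
synthetic = synthesize_column(json.loads(payload), n=10, seed=1)
print(payload)
print(synthetic)
```

Because the last step receives only the JSON text, the summary can be audited or edited before any synthetic values are drawn, which is the point of keeping serialization as a separate step.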

Metasyn Pipeline

Key features

MetaFrame Generation: Metasyn allows the creation of a MetaFrame from a dataset provided as a Polars or Pandas DataFrame. A MetaFrame includes key characteristics such as variable names, data types, the percentage of missing values, and distribution parameters.

Exporting MetaFrames: Metasyn can export and import MetaFrames to GMF files. These are JSON files that follow the easy to read and understand Generative Metadata Format (GMF).

A simple example of an exported MetaFrame (following the GMF standard):

{
    "n_rows": 5,
    "n_columns": 5,
    "provenance": {
        "created by": {
            "name": "Metasyn",
            "version": "0.4.0"
        },
        "creation time": "2023-08-07T12:04:40.669740"
    },
    "vars": [
        {
            "name": "ID",
            "type": "discrete",
            "dtype": "Int64",
            "prop_missing": 0.0,
            "distribution": {
                "implements": "core.unique_key",
                "provenance": "builtin",
                "class_name": "UniqueKeyDistribution",
                "parameters": {
                    "low": 1,
                    "consecutive": 1
                }
            }
        },
        {
            "name": "fruits",
            "type": "categorical",
            "dtype": "Categorical",
            "prop_missing": 0.0,
            "distribution": {
                "implements": "core.multinoulli",
                "provenance": "builtin",
                "class_name": "MultinoulliDistribution",
                "parameters": {
                    "labels": [
                        "apple",
                        "banana"
                    ],
                    "probs": [
                        0.4,
                        0.6
                    ]
                }
            }
        },
        {
            "name": "B",
            "type": "discrete",
            "dtype": "Int64",
            "prop_missing": 0.0,
            "distribution": {
                "implements": "core.poisson",
                "provenance": "builtin",
                "class_name": "PoissonDistribution",
                "parameters": {
                    "mu": 3.0
                }
            }
        },
        {
            "name": "cars",
            "type": "categorical",
            "dtype": "Categorical",
            "prop_missing": 0.0,
            "distribution": {
                "implements": "core.multinoulli",
                "provenance": "builtin",
                "class_name": "MultinoulliDistribution",
                "parameters": {
                    "labels": [
                        "audi",
                        "beetle"
                    ],
                    "probs": [
                        0.2,
                        0.8
                    ]
                }
            }
        },
        {
            "name": "optional",
            "type": "discrete",
            "dtype": "Int64",
            "prop_missing": 0.2,
            "distribution": {
                "implements": "core.discrete_uniform",
                "provenance": "builtin",
                "class_name": "DiscreteUniformDistribution",
                "parameters": {
                    "low": -30,
                    "high": 301
                }
            }
        }
    ]
}
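
Because a GMF file is ordinary JSON, it can be audited with nothing more than Python's standard json module. The sketch below is only an illustration: it embeds an abbreviated copy of the example above as a string (in practice the text would be read from an exported file), and the tuple layout of overview is our own choice rather than anything metasyn provides.

```python
import json

# Abbreviated copy of the GMF example above (two of the five variables).
gmf_text = """
{
    "n_rows": 5,
    "n_columns": 2,
    "vars": [
        {
            "name": "fruits",
            "type": "categorical",
            "dtype": "Categorical",
            "prop_missing": 0.0,
            "distribution": {
                "implements": "core.multinoulli",
                "parameters": {"labels": ["apple", "banana"], "probs": [0.4, 0.6]}
            }
        },
        {
            "name": "optional",
            "type": "discrete",
            "dtype": "Int64",
            "prop_missing": 0.2,
            "distribution": {
                "implements": "core.discrete_uniform",
                "parameters": {"low": -30, "high": 301}
            }
        }
    ]
}
"""

metaframe = json.loads(gmf_text)

# One (name, type, distribution, share of missing values) tuple per variable.
overview = [
    (var["name"], var["type"], var["distribution"]["implements"], var["prop_missing"])
    for var in metaframe["vars"]
]

for name, var_type, dist, prop_missing in overview:
    print(f"{name}: {var_type}, {dist}, {prop_missing:.0%} missing")
```

A listing like this makes it straightforward to check what a MetaFrame discloses about each variable before the file is shared.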

A more advanced example GMF, based on the Titanic dataset, can be found here

Synthetic Data Generation: Metasyn allows for the generation of a polars DataFrame with synthetic data that resembles the original data.
Distribution Fitting: Metasyn allows for manual and automatic distribution fitting.
Data Type Support: Metasyn supports generating synthetic data for a variety of common data types including categorical, string, integer, float, date, time, and datetime.
Integration with Faker: Metasyn integrates with the faker package, a Python library for generating fake data such as names and emails. This allows for synthetic data that is formatted realistically while retaining privacy.
Structured String Detection: Metasyn identifies structured strings within your dataset, which can include formatted text, codes, identifiers, or any string that follows a specific pattern.
Handling Unique Values: Metasyn can identify and process variables with unique values or keys in the data, preserving their uniqueness in the synthetic dataset, which is crucial for generating synthetic data that maintains the characteristics of the original dataset.
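
To illustrate why unique columns need special handling, here is a minimal sketch of a key generator that never repeats a value. It is not metasyn's implementation: unique_key_values is a hypothetical function, and the way it interprets low and consecutive (parameter names borrowed from the unique key entry in the GMF example) is an assumption made for illustration.

```python
import random


def unique_key_values(n, low=1, consecutive=True, seed=None):
    """Return n distinct, increasing integer keys starting at low (hypothetical)."""
    if consecutive:
        # Assumed meaning of "consecutive": low, low + 1, low + 2, ...
        return list(range(low, low + n))
    # Otherwise use random positive gaps, so values stay unique and increasing.
    rng = random.Random(seed)
    values = []
    current = low
    for _ in range(n):
        values.append(current)
        current += rng.randint(1, 5)
    return values


ids = unique_key_values(5, low=1, consecutive=True)
sparse_ids = unique_key_values(100, low=1, consecutive=False, seed=0)
print(ids)
print(len(set(sparse_ids)) == len(sparse_ids))
```

Sampling such a column independently from a fitted distribution, as is done for ordinary variables, could produce repeated keys, which is why identifying unique columns up front matters.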

Curious and want to learn more? Check out our documentation!

Getting Started

Try it out online

If you're new to Python or simply want to quickly explore the basic features of metasyn, you can try it out using the online Google Colab tutorial. Click here to access the tutorial. It provides a step-by-step walkthrough and example dataset to help you get started. However, please exercise caution when using sensitive data, as it will be handled through Google servers.

Local Installation

For more advanced users and researchers who prefer working on their local machines, you can install metasyn directly from PyPI using the following command in the terminal (not Python):

pip install metasyn

Usage

To learn how to use metasyn effectively, refer to the comprehensive documentation. The documentation covers all the necessary information and provides detailed explanations, examples, and usage guidelines.

Additionally, the documentation offers a series of tutorials that delve into specific features and use cases. These tutorials can further assist you in understanding and leveraging the capabilities of metasyn.

Quick start

Get started quickly with metasyn using the following example. This concise demonstration covers the basic functionality of metasyn by generating synthetic data from the Titanic dataset.

It is important to start by importing the appropriate libraries:

# import libraries
import polars as pl
from metasyn import MetaFrame, demo_file

Estimation: Generating a MetaFrame

1. Begin by creating a polars dataframe:

# get the demo csv
dataset_csv = demo_file()  # provides the Titanic demo dataset, which is read into a DataFrame below

# create dataframe
data_types = {
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical,
    "Survived": pl.Categorical,
    "Pclass": pl.Categorical,
    "SibSp": pl.Categorical,
    "Parch": pl.Categorical
}

df = pl.read_csv(dataset_csv, dtypes=data_types)

Note on using Pandas

Internally, metasyn uses Polars (instead of Pandas), mainly because its typing and handling of missing data are more consistent. It is possible to supply a Pandas DataFrame instead of a Polars DataFrame to MetaFrame.fit_dataframe. However, this relies on the automatic Polars conversion functionality, which can cause problems in some edge cases. Therefore, we advise users to create Polars DataFrames. The resulting synthetic dataset is always a Polars DataFrame, but it can easily be converted back to a Pandas DataFrame using df_pandas = df_polars.to_pandas().

2. Next, we can generate a MetaFrame from the polars DataFrame.

# create a MetaFrame (mf) from the DataFrame (df)
mf = MetaFrame.fit_dataframe(df)

Note: At this point you will encounter a warning about PassengerId not being set as unique. You can safely ignore it and proceed. The warning occurs because PassengerId appears to contain unique values but is not explicitly marked as a unique column. To remove the warning, you can set PassengerId to be a unique column. Our documentation explains how to do this when generating MetaFrames: Set Columns as Unique.

(De)serialization: Exporting and importing a MetaFrame

Note that exporting and importing is optional. You can generate synthetic data from any loaded MetaFrame, whether that be through importing a GMF file or generating a MetaFrame from an original DataFrame.

3. We can export this MetaFrame to a GMF file using:

# export MetaFrame
mf.export("exported_metaframe.json")

4. Similarly, we can import a MetaFrame from a GMF file using:

# load MetaFrame
mf = MetaFrame.from_json("exported_metaframe.json")

Generation: Generating synthetic data

5. Finally, we can generate a DataFrame with synthetic data based on a MetaFrame using:

# synthesize a DataFrame with 5 rows of data based on a MetaFrame
synthetic_data = mf.synthesize(5)

Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create.

Any contributions you make are greatly appreciated.

To contribute:

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

Contact

Metasyn is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.


Release history

1.0.0

May 13, 2024

0.8.0

Mar 25, 2024

0.7.1

Feb 28, 2024

This version

0.7.0

Feb 15, 2024

0.6.0

Sep 25, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metasyn-0.7.0.tar.gz (1.9 MB)

Uploaded Feb 15, 2024 Source

Built Distribution

metasyn-0.7.0-py3-none-any.whl (84.4 kB)

Uploaded Feb 15, 2024 Python 3

Hashes for metasyn-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`1b07c721876a15e8cd39ece5b8490e831d89f8faa9ccd70c43ece41f88d0f165`
MD5	`3c886998e81bead5e4eeb8a75c31b94d`
BLAKE2b-256	`f3b92ac3d35a3d176939dbe701aa4f20719c36f23fc904d7167eba98b3b77de0`

Hashes for metasyn-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6758e1f023f12d733b76728ca165c5cb33d1c590cce6bccb7edc790cd58d4879`
MD5	`e6e870d6b527b31123e99120f522223a`
BLAKE2b-256	`a24cfc41d155571659881bd5df549c0e1fcf8cb848b71c30a5de42481c1d5310`