Package for creating synthetic datasets while preserving privacy.

Transparent and privacy-friendly synthetic data generation

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.


Metasyn is a Python package that generates synthetic data and lets you share the data generation model itself. This facilitates collaboration and testing on sensitive data without exposing the original data.

It has three main functionalities:

  1. Estimation: Metasyn can analyze a dataset and create a MetaFrame for it. This is essentially a blueprint (or data generation model) that captures the structure and distributions of the columns without storing any entries.
  2. Generation: From a MetaFrame, metasyn can generate new synthetic data that resembles the original, on a column-by-column basis.
  3. Serialization: Metasyn can export MetaFrames to, and import them from, an easy-to-read format. This allows for easy modification and sharing of the model.
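
The three steps above can be illustrated with a toy sketch that uses only the Python standard library. This is not metasyn's implementation or API: the functions fit_column, fit_table and synthesize, the choice of a normal distribution for numbers and frequencies for labels, and the data itself are all assumptions invented for this illustration.

```python
import json
import random
import statistics
from collections import Counter


def fit_column(values):
    """Reduce one column to summary parameters instead of keeping its rows.

    Numeric columns keep a mean and standard deviation; other columns keep
    the set of labels and their relative frequencies.
    """
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "type": "normal",
            "mean": statistics.fmean(values),
            "sd": statistics.pstdev(values),
        }
    counts = Counter(values)
    labels = sorted(counts)
    return {
        "type": "categorical",
        "labels": labels,
        "probs": [counts[label] / len(values) for label in labels],
    }


def fit_table(columns):
    """Estimation: build a blueprint with one summary per column."""
    return {name: fit_column(values) for name, values in columns.items()}


def synthesize(blueprint, n_rows, seed=None):
    """Generation: draw every column independently from its summary."""
    rng = random.Random(seed)
    out = {}
    for name, spec in blueprint.items():
        if spec["type"] == "normal":
            out[name] = [rng.gauss(spec["mean"], spec["sd"]) for _ in range(n_rows)]
        else:
            out[name] = rng.choices(spec["labels"], weights=spec["probs"], k=n_rows)
    return out


# Hypothetical toy data, invented for this sketch.
data = {
    "age": [23, 35, 41, 29, 52],
    "fruit": ["apple", "banana", "apple", "cherry", "banana"],
}

blueprint = fit_table(data)

# Serialization: the blueprint round-trips through human-readable JSON.
text = json.dumps(blueprint, indent=2)
restored = json.loads(text)

synthetic = synthesize(restored, n_rows=3, seed=0)
print(text)
print(synthetic)
```

Because each column is sampled on its own, relationships between columns in the original data are not reproduced, which matches the column-by-column behaviour described above.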

Metasyn Pipeline

Why Metasyn?

  • Privacy: With metasyn you can share not only synthetic data, but also the model used to create it. This increases transparency and facilitates collaboration and testing on sensitive data without exposing the original data.
  • Extensible: Metasyn is designed to be easy to extend and customize, and supports plugins for custom distributions and privacy control.
  • Faker: Metasyn integrates with the Faker plugin to generate real-sounding entries for names, emails, phone numbers, etc.
  • DataFrame-based: Metasyn is built on top of Polars, and supports both Polars and Pandas DataFrames as input.
  • Flexibility: Metasyn supports a variety of distributions and data types, and can automatically select and fit them. It also detects and supports columns with unique values or structured strings.
  • Ease of use: Metasyn is designed to be easy to use and understand.

Example

The following diagram shows how metasyn can generate synthetic data from an input dataset:

Example input and output

This can be reproduced using the following code:

import polars as pl
from metasyn import MetaFrame

# Create a Polars DataFrame. In this case we load it from a csv file.
# It is important to specify which columns are categorical, as Polars does not infer this automatically.
# (Recent Polars releases deprecate `dtypes` in favor of `schema_overrides`.)
df = pl.read_csv("example.csv", dtypes={"fruits": pl.Categorical, "cars": pl.Categorical})

# Create a MetaFrame from the DataFrame.
mf = MetaFrame.fit_dataframe(df)

# Generate a new DataFrame with 5 rows of data from the MetaFrame.
output_df = mf.synthesize(5)

# This DataFrame can be exported to csv, parquet, excel and more. E.g., to csv:
output_df.write_csv("output.csv")

This example is the most basic use case. As a next step, we recommend checking out the User Guide for more detailed examples, or following along with our interactive tutorial.

For more information on how to use Polars DataFrames, refer to the Polars documentation.
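
To try the example without your own data, a stand-in example.csv can be written with the standard library. This is a minimal sketch: only the column names fruits and cars come from the example above, while the amount column, the helper name write_example_csv, and every row value are invented for illustration.

```python
import csv
from pathlib import Path


def write_example_csv(path):
    """Write a tiny stand-in dataset; every value here is invented."""
    rows = [
        {"fruits": "banana", "cars": "beetle", "amount": 5},
        {"fruits": "banana", "cars": "audi", "amount": 4},
        {"fruits": "apple", "cars": "beetle", "amount": 3},
        {"fruits": "apple", "cars": "beetle", "amount": 2},
        {"fruits": "banana", "cars": "audi", "amount": 1},
    ]
    target = Path(path)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["fruits", "cars", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return target


# Creates example.csv in the current working directory.
csv_path = write_example_csv("example.csv")
print(f"wrote {csv_path}")
```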

Installing metasyn

Metasyn can be installed directly from PyPI using the following command in the terminal:

pip install metasyn

After installation, metasyn can be used in your Python scripts and notebooks, as well as through its command-line interface (CLI). The CLI can also be run through a Docker container available on Docker Hub.

For more information on installing metasyn, refer to the installation guide.
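
To confirm from Python that the installation succeeded, one option is to query the package metadata with the standard library. This is a generic sketch rather than a metasyn feature; the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("metasyn")
if found is None:
    print("metasyn is not installed in this environment")
else:
    print(f"metasyn {found} is installed")
```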

Documentation and help

  • Documentation: For a detailed overview of metasyn, refer to the documentation.
  • Quick-start: Our quick start guide acts as a crash-course on the functionality and workflow of metasyn.
  • Interactive tutorial: Our interactive tutorial (Jupyter Notebook) follows and expands on the quick start guide, providing a step-by-step walkthrough and example to get you started. It can be followed without installing metasyn locally by running it in Google Colab or Binder.

Contributing

Metasyn is an open-source project, and we welcome contributions from the community.

To contribute to the codebase, follow these steps:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

More information on contributing can be found in the contributing section of the documentation.

Contact

Metasyn is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.

Download files

  • Source distribution: metasyn-0.7.1.tar.gz (2.6 MB)
  • Built distribution: metasyn-0.7.1-py3-none-any.whl (282.1 kB, Python 3)