Package for creating synthetic datasets while preserving privacy.
Transparent and privacy-friendly synthetic data generation
Metasyn is a Python package that generates synthetic data, and allows sharing of the data generation model, to facilitate collaboration and testing on sensitive data without exposing the original data.
It has three main functionalities:
- Estimation: Metasyn can analyze a dataset and create a MetaFrame for it. This is essentially a blueprint (or data generation model) that captures the structure and distributions of the columns without storing any entries.
- Generation: From a MetaFrame, metasyn can generate new synthetic data that resembles the original, on a column-by-column basis.
- Serialization: Metasyn can export and import MetaFrames to an easy-to-read format. This allows for easy modification and sharing of the model.
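To make the three steps concrete, here is a minimal sketch of the idea using only the Python standard library. It is not metasyn's implementation or API: the helper names (`fit_blueprint`, `synthesize`), the JSON layout, and the toy data are all hypothetical. It assumes numeric columns can be modeled by a normal distribution and all other columns by their observed label frequencies.

```python
import json
import random
import statistics


def fit_blueprint(columns):
    """Estimation: summarize each column independently (hypothetical helper)."""
    blueprint = {}
    for name, values in columns.items():
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep only two summary statistics.
            blueprint[name] = {
                "type": "normal",
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values),
            }
        else:
            # Other columns: keep the labels and their relative frequencies.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            blueprint[name] = {
                "type": "categorical",
                "labels": list(counts),
                "weights": [c / len(values) for c in counts.values()],
            }
    return blueprint


def synthesize(blueprint, n_rows, seed=None):
    """Generation: draw every column independently from its fitted model."""
    rng = random.Random(seed)
    out = {}
    for name, model in blueprint.items():
        if model["type"] == "normal":
            out[name] = [rng.gauss(model["mean"], model["stdev"]) for _ in range(n_rows)]
        else:
            out[name] = rng.choices(model["labels"], weights=model["weights"], k=n_rows)
    return out


# Toy input data, made up for this illustration.
data = {
    "age": [23, 35, 41, 29, 52],
    "fruit": ["apple", "pear", "apple", "kiwi", "apple"],
}

blueprint = fit_blueprint(data)

# Serialization: the blueprint is plain JSON, so it can be read, edited, and shared.
restored = json.loads(json.dumps(blueprint, indent=2))

synthetic = synthesize(restored, n_rows=5, seed=1)
```

Because each column is drawn independently, a sketch like this preserves per-column distributions but not relationships between columns, which matches the column-by-column generation described above.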
Why Metasyn?
- Privacy: With metasyn you can share not only synthetic data, but also the model used to create it. This increases transparency and facilitates collaboration and testing on sensitive data without exposing the original data.
- Extensible: Metasyn is designed to be easily extendable and customizable and supports plugins for custom distributions and privacy control.
- Faker: Metasyn integrates with the Faker plugin to generate real-sounding entries for names, emails, phone numbers, etc.
- DataFrame-based: Metasyn is built on top of Polars, and supports both Polars and Pandas DataFrames as input.
- Flexibility: Metasyn supports a variety of distribution and data types and can automatically select and fit to them. It also supports and detects columns with unique values or structured strings.
- Ease of use: Metasyn is designed to be easy to use and understand.
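As a rough illustration of what detecting unique-valued or structured-string columns can mean, the sketch below uses simple heuristics from the standard library. These are assumptions made for illustration only: the function names are hypothetical, and metasyn's actual detection logic may work quite differently.

```python
def is_unique(values):
    """Return True when no value occurs more than once."""
    return len(set(values)) == len(values)


def string_template(value):
    """Map each character to a class: 'A' upper, 'a' lower, '9' digit, else itself."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def shared_template(values):
    """Return the common template if all strings share one shape, else None."""
    templates = {string_template(v) for v in values}
    return templates.pop() if len(templates) == 1 else None


# Made-up identifier column for this illustration.
ids = ["AB-1203", "CD-9941", "ZX-0007"]

print(is_unique(ids))        # True
print(shared_template(ids))  # AA-9999
```

A column that passes both checks could then be synthesized by generating new, non-repeating strings that follow the same template instead of sampling from the observed values.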
Example
The following diagram shows how metasyn can generate synthetic data from an input dataset:
This can be reproduced using the following code:
import polars as pl
from metasyn import MetaFrame

# Create a Polars DataFrame. In this case we load it from a csv file.
# It is important to specify which columns are categorical, as Polars does not infer this automatically.
# Note: newer Polars releases name the `dtypes` parameter `schema_overrides`.
df = pl.read_csv("example.csv", dtypes={"fruits": pl.Categorical, "cars": pl.Categorical})

# Create a MetaFrame from the DataFrame.
mf = MetaFrame.fit_dataframe(df)

# Generate a new DataFrame with 5 rows of data from the MetaFrame.
output_df = mf.synthesize(5)

# This DataFrame can be exported to csv, parquet, excel and more. E.g., to csv:
output_df.write_csv("output.csv")
This example covers the most basic use case. As a next step, we recommend checking out the User Guide for more detailed examples, or following along with our interactive tutorial.
For more information on how to use Polars DataFrames, refer to the Polars documentation.
Installing metasyn
Metasyn can be installed directly from PyPI using the following command in the terminal:
pip install metasyn
After that, metasyn is available to use in your Python scripts and notebooks, and through its command-line interface (CLI). The CLI can also be run through a Docker container available on Docker Hub.
For more information on installing metasyn, refer to the installation guide.
Documentation and help
- Documentation: For a detailed overview of metasyn, refer to the documentation.
- Quick-start: Our quick start guide acts as a crash-course on the functionality and workflow of metasyn.
- Interactive tutorial: Our interactive tutorial (Jupyter Notebook) follows and expands on the quick start guide, providing a step-by-step walkthrough and example to get you started. It can be followed without installing metasyn locally by running it in Google Colab or Binder.
Contributing
Metasyn is an open-source project, and we welcome contributions from the community.
To contribute to the codebase, follow these steps:
- Fork the project
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a pull request
More information on contributing can be found in the contributing section of the documentation.
Contact
Metasyn is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.