
dpmm: a library for synthetic tabular data generation with rich functionality and end-to-end Differential Privacy guarantees


dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation

Overview

dpmm is a Python library that implements state-of-the-art Differentially Private Marginal Models for generating synthetic tabular data. Marginal Models have consistently been shown to capture key statistical properties like marginal distributions from the original data and reproduce them in the synthetic data, while Differential Privacy (DP) ensures that individual privacy is rigorously protected.

Summary of main features:

  • end-to-end DP pipelines including data preprocessing, generative models, and mechanisms:
    • DP data preprocessing -- 1) the data domain is either provided as input or extracted with DP [paper], and 2) continuous data is discretized with DP (Uniform and PrivTree [paper])
    • state-of-the-art DP generative models relying on the select-measure-generate paradigm [paper1, paper2] and Private-PGM [paper] -- PrivBayes [paper], MST [paper], and AIM [paper]
    • floating-point precision of DP mechanisms [paper]
  • superior utility and performance
  • rich functionality across all models/pipelines
  • DP auditing of underlying mechanisms and models/pipelines [paper1, paper2]

NB: Intended Use -- dpmm is designed for research and exploratory use in privacy-preserving synthetic data generation (particularly in simple scenarios such as preserving high-quality 1/2-way marginals in datasets with up to 32 features [paper1, paper2]) and is not intended for production use in complex, real-world applications.
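
To make the "measure" step of the select-measure-generate paradigm more concrete, here is a minimal sketch of releasing a 1-way marginal with Gaussian noise. It is illustrative only and is not dpmm's implementation: the helper names `noisy_marginal` and `to_distribution` and the noise scale `sigma` are assumptions, calibrating `sigma` to a given (epsilon, delta) is left out, and plain NumPy sampling does not provide the floating-point protections listed above.

```python
import numpy as np

def noisy_marginal(codes, n_bins, sigma, rng):
    """Histogram of integer-coded values with Gaussian noise added to each cell.

    Hypothetical helper: `sigma` must be calibrated to the privacy budget
    elsewhere, which this sketch does not do.
    """
    counts = np.bincount(codes, minlength=n_bins).astype(float)
    return counts + rng.normal(loc=0.0, scale=sigma, size=n_bins)

def to_distribution(noisy_counts):
    """Clip negative cells and normalise so the result sums to 1."""
    clipped = np.clip(noisy_counts, 0.0, None)
    total = clipped.sum()
    if total == 0:
        # Fall back to a uniform distribution if all mass was clipped away.
        return np.full(len(clipped), 1.0 / len(clipped))
    return clipped / total

rng = np.random.default_rng(0)
codes = np.array([0, 0, 1, 2, 2, 2])  # a column already discretised into 3 bins
noisy = noisy_marginal(codes, n_bins=3, sigma=1.0, rng=rng)
probs = to_distribution(noisy)
print(probs)
```

Clipping and renormalising is post-processing of an already noisy measurement, so it does not consume additional privacy budget.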

Installation

Prerequisites

  • Python 3.10 or 3.11

PyPI install

You can install dpmm from PyPI by running:

pip install dpmm

Local Install

To install from a local clone of the GitHub repository, run the following commands:

git clone git@github.com:sassoftware/dpmm.git
cd dpmm
poetry install

Tests

To run the unit tests, go to the root of the repository (if installed locally), and use the following command:

pytest tests/

Functionality

We provide numerous examples demonstrating the features of dpmm across data preprocessing as well as the training of generative models and generation from them. The examples cover all models and model settings, and are accessible from the repository (if installed locally).

Preprocessing

The provided generative pipelines combine automatic DP discretisation preprocessing with a generative model and allow for the following features:

| Feature | Description | Example |
|---|---|---|
| dtype support | The following pandas data types are supported natively: datetime, timedelta, float, int, category, bool. | Dtypes example |
| null-value support | Missing values are supported and will be reproduced accordingly if present in any column within the real data. | |
| automatic discretisation | The default discretisation strategy used by dpmm is priv-tree; a more typical uniform strategy is also available. Both can be combined with an 'auto' mode, which attempts to identify the optimal number of bins for each numerical column. | |
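
As a rough illustration of what a uniform strategy does, the sketch below splits a known range into equal-width bins and maps each value to a bin code. It is not dpmm's discretiser: the helper names are hypothetical, the bounds are assumed to be provided as input, and no privacy mechanism is involved.

```python
import numpy as np

def uniform_discretise(values, lower, upper, n_bins):
    """Map values to integer bin codes 0..n_bins-1 using equal-width bins."""
    edges = np.linspace(lower, upper, n_bins + 1)
    # Values outside the declared domain are clipped to its bounds.
    clipped = np.clip(values, lower, upper)
    # Only the interior edges are used, so the upper bound lands in the last bin.
    return np.digitize(clipped, edges[1:-1])

def bin_midpoints(lower, upper, n_bins):
    """Representative value of each bin, usable to map codes back to numbers."""
    edges = np.linspace(lower, upper, n_bins + 1)
    return (edges[:-1] + edges[1:]) / 2

values = np.array([0.0, 2.4, 2.5, 9.9, 10.0, 12.0])
codes = uniform_discretise(values, lower=0.0, upper=10.0, n_bins=4)
print(codes)                           # [0 0 1 3 3 3]
print(bin_midpoints(0.0, 10.0, 4))     # [1.25 3.75 6.25 8.75]
```

Clipping to the declared bounds is what makes a provided or privately extracted domain matter: values outside it are folded into the edge bins.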

Model Features

| Feature | Description | Example |
|---|---|---|
| domain compression | A compress flag can be set to True to ensure the discretised domain is compressed, improving the privacy budget / data quality trade-off. | |
| model size control | A max_model_size parameter ensures the memory footprint of the selected marginals remains lower than the specified upper threshold. | Max Memory example |
| model serialisation | Pipelines can be serialised to / deserialised from disk by providing a valid folder to store the model in. | Serialisation example |
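
The reason a size cap is useful is that a dense marginal has one cell per combination of values, so its footprint grows with the product of the domain sizes of its columns. The following back-of-the-envelope sketch shows that arithmetic. It assumes 8 bytes per cell and megabytes as the unit; how dpmm actually accounts for `max_model_size` may differ, and the helper names are hypothetical.

```python
from math import prod

def marginal_size_mb(domain_sizes, columns, bytes_per_cell=8):
    """Estimate the size in MB of a dense marginal over `columns`."""
    n_cells = prod(domain_sizes[c] for c in columns)
    return n_cells * bytes_per_cell / 2**20

def within_budget(domain_sizes, marginals, max_size_mb):
    """Check whether a set of marginals fits under a total size threshold."""
    total = sum(marginal_size_mb(domain_sizes, m) for m in marginals)
    return total <= max_size_mb

# Number of bins / categories per column after discretisation (made-up values).
domain_sizes = {"type": 2, "alcohol": 32, "pH": 32, "quality": 2}

pair_mb = marginal_size_mb(domain_sizes, ("alcohol", "pH"))
print(pair_mb)  # 0.0078125, i.e. 32 * 32 cells * 8 bytes
print(within_budget(domain_sizes, [("alcohol", "pH"), ("type", "quality")], 1.0))
```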

Generation Features

| Feature | Description | Example |
|---|---|---|
| conditional generation | At generation time, it is also possible to provide a partial dataframe containing only some of the columns; in that case the generative pipeline will conditionally generate the remaining columns. | Conditional Generation example |
| deterministic generation | When a random_state value is provided at generation time, the generative process becomes deterministic, assuming the same input parameters are provided. | Random State example |
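
Both ideas can be illustrated on a single 2-way marginal: given known values of one column, the other column is sampled from the matching conditional distribution, and fixing the seed makes the draw repeatable. This is a conceptual sketch, not dpmm's generation code; `sample_conditional` is a hypothetical helper and the weights are made up.

```python
import numpy as np

def sample_conditional(joint, given_codes, random_state=None):
    """Sample a code for column B for each known code of column A.

    `joint[a, b]` is the non-negative weight of the pair (a, b); every row
    referenced in `given_codes` is assumed to have positive total weight.
    """
    rng = np.random.default_rng(random_state)
    joint = np.asarray(joint, dtype=float)
    sampled = []
    for a in given_codes:
        row = joint[a]
        probs = row / row.sum()  # P(B | A = a)
        sampled.append(int(rng.choice(len(probs), p=probs)))
    return sampled

joint = [[8.0, 2.0, 0.0],
         [0.0, 1.0, 9.0]]

first = sample_conditional(joint, [0, 1, 1, 0], random_state=42)
second = sample_conditional(joint, [0, 1, 1, 0], random_state=42)
print(first == second)  # True: same seed and inputs give the same sample
```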

Models

The implemented models include:

| Method | Description | Reference | Example |
|---|---|---|---|
| PrivBayes+PGM | Differentially Private Bayesian Network. | PrivBayes: Private Data Release via Bayesian Networks | PrivBayes example |
| MST | Maximum Spanning Tree. | Winning the NIST Contest: A scalable and general approach to differentially private synthetic data | MST example |
| AIM | Adaptive and Iterative Mechanism. | AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data | AIM example |

NB: All models rely on the select-measure-generate paradigm [paper1, paper2] and Private-PGM [paper].
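
For intuition about the "select" step in a spanning-tree model, the sketch below keeps the highest-scoring column pairs that form a tree, using Kruskal's algorithm. It is deliberately non-private and is not dpmm's or MST's implementation: the scores are made up, the function name is hypothetical, and a DP model would have to choose pairs with a private selection mechanism rather than a plain sort.

```python
def maximum_spanning_tree(columns, pair_scores):
    """Kruskal's algorithm on descending scores; returns the chosen pairs."""
    parent = {c: c for c in columns}

    def find(c):
        # Follow parent links to the root of c's component (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    chosen = []
    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _score in ranked:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this pair does not create a cycle
            parent[root_a] = root_b
            chosen.append((a, b))
    return chosen

columns = ["type", "alcohol", "density", "quality"]
pair_scores = {  # made-up dependence scores between column pairs
    ("alcohol", "density"): 0.9,
    ("alcohol", "quality"): 0.6,
    ("density", "quality"): 0.5,
    ("type", "density"): 0.4,
    ("type", "alcohol"): 0.1,
    ("type", "quality"): 0.05,
}
tree = maximum_spanning_tree(columns, pair_scores)
print(tree)
# [('alcohol', 'density'), ('alcohol', 'quality'), ('type', 'density')]
```

The selected pairs are the 2-way marginals that would then be measured with noise and passed to the generation step.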

Getting Started

To get started with dpmm, follow the steps below:

  1. Import the necessary modules and load your data:

    import json
    from pathlib import Path
    
    import pandas as pd
    from dpmm.pipelines import MSTPipeline
    
    # Folder holding the example wine dataset and its domain bounds
    wine_dir = Path().parent / "wine"
    
    df = pd.read_pickle(wine_dir / "wine.pkl.gz")
    with (wine_dir / "wine_bounds.json").open("r") as f:
        domain = json.load(f)
    
  2. Initialize and fit a model:

    model = MSTPipeline(
       # Generator Parameters
       epsilon=1.0, 
       delta=1e-5,
       # Discretiser Parameters
       proc_epsilon=0.1,
    )
    model.fit(df, domain)
    
  3. Generate synthetic data:

    synth_df = model.generate(n_records=100)
    print(synth_df)
    """
          type  fixed acidity  volatile acidity  citric acid  residual sugar   chlorides free sulfur dioxide  total sulfur dioxide   density        pH   sulphates    alcohol quality  
       0  white       5.288142          0.190330     0.212473        1.402665    0.032305            37.097305             60.585301  0.990234  2.998241    0.658841  12.467682       1  
       1  white       5.956364          0.225099     0.210124       15.968057    0.043620            70.073909            202.689578  0.995807  3.198247    0.318414  10.290390       0  
       2  white       5.315535          0.341091     0.247268        0.628240    0.024938            52.468176            104.892353  0.990975  3.161218    0.971699  11.181373       1  
       3  white       7.879125          0.234170     0.275704        3.711610    0.039565            68.977194            163.380550  1.005989  3.068622    0.798520   8.075999       0  
       4  white       6.981342          0.358461     0.337705        3.600390    0.050450            51.567452            134.896467  0.996149  3.272745    0.599021  10.200400       0  
    
    """
    

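Once synthetic data has been generated, a simple sanity check is to compare the 1-way marginal of a discrete column in the real and synthetic data. The sketch below computes the total variation distance between the two value distributions with pandas. `marginal_tvd` is a hypothetical helper that is not part of dpmm, and the two small series stand in for real and synthetic columns.

```python
import pandas as pd

def marginal_tvd(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between the value frequencies of two columns."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    # Align on the union of categories; a category absent on one side gets 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())

real = pd.Series(["white", "white", "white", "red"])
synth = pd.Series(["white", "white", "red", "red"])

tvd = marginal_tvd(real, synth)
print(tvd)  # 0.25
```

With the objects from the steps above, the same helper could be called as `marginal_tvd(df["type"], synth_df["type"])`. Continuous columns would need to be binned first, and statistics computed on the real data are not themselves differentially private.
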
Troubleshooting

If you encounter any issues, please check the following:

  • Ensure that all required packages are installed.
  • Verify that your data does not contain missing values or non-integer columns if using certain models.
  • Check the model parameters and ensure they are set correctly.

Contributing

Maintainers are accepting patches and contributions to this project. Please read CONTRIBUTING.md for details about submitting contributions to this project.

License

This project is licensed under the Apache 2.0 License. This project also uses code snippets from the following projects:

Additional Resources

Citing

If you use this code, please cite the associated paper:

@inproceedings{mahiou2025dpmm,
  title={{dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation}},
  author={Mahiou, Sofiane and Dizche, Amir and Nazari, Reza and Wu, Xinmin and Abbey, Ralph and Silva, Jorge and Ganev, Georgi},
  booktitle={TPDP},
  year={2025}
}
