Graphical-model based estimation and inference for differential privacy.

These details have not been verified by PyPI

Project description

Graphical-model based estimation and inference for differential privacy

FORKED FROM: https://github.com/ryan112358/private-pgm/tree/master/src/mbi

Published articles:

McKenna, Ryan, Daniel Sheldon, and Gerome Miklau. 2021. "Winning the NIST Contest: A scalable and general approach to differentially private synthetic data." Journal of Privacy and Confidentiality 11 (3).
McKenna, Ryan, Daniel Sheldon, and Gerome Miklau. "Graphical-model based estimation and inference for differential privacy." In Proceedings of the 36th International Conference on Machine Learning. 2019. https://arxiv.org/abs/1901.09136

Important Update

For anyone interested in using this code-base, the "examples" folder has been reorganized. Specifically, the files in the "examples" folder are meant to demonstrate how to use Private-PGM for different problems. These files are typically not designed to run on arbitrary datasets and/or workloads. They are more useful if you are trying to learn how to use this code base and would like to build your own mechanisms on top of it.

If you would simply like to compare against mechanisms that are built on top of Private-PGM, please refer to the "mechanisms" folder. These will contain implementations of several mechanisms that are designed to work on a broad class of datasets and workloads. The mechanisms currently available here are:

AIM - An adaptive and iterative mechanism
MST - Winning solution to the 2018 NIST synthetic data challenge.
Adaptive Grid - Second place solution to the 2020 NIST synthetic data challenge.
MWEM+PGM - A scalable instantiation of the MWEM algorithm for marginal query workloads.
HDMM+APPGM - A scalable instantiation of the HDMM algorithm for marginal query workloads.

For the methods above, MST and Adaptive Grid are workload agnostic. Nevertheless, we expect them to do well on general workloads like all 3-way marginals. MWEM+PGM, HDMM+APPGM, and AIM are all workload aware, and are designed to offer low error on the marginals specified in the workload.

NOTE: The first three mechanisms produce synthetic data, but HDMM+APPGM only produces query answers, not synthetic data.

NOTE: As this is research (non-production) code, all of the mechanisms in this repository use np.random.normal or np.random.laplace for noise generation. These floating-point approximations of the Gaussian/Laplace mechanism are susceptible to floating point attacks like those described by Mironov. If you would like to use these mechanisms in a production system that is robust to such attacks, we recommend you use an implementation of these mechanisms that was specifically designed with floating point artifacts in mind, such as the Discrete Gaussian Mechanism, OpenDP, or Tumult Analytics. For the purposes of evaluating the utility of our mechanisms in research code, our implementations are fine to use, i.e., the floating point vulnerability does not meaningfully change the utility of the mechansim in any way.

Toy example

Suppose we have a unknown data distribution P(A,B,C) defined over three variables and we invoke a noise-addition mechanism to get a noisy answer to the two-way marginals P(A,B) and P(B,C). We want to use this information to recover a representation of the data distribution that approximates the true data with respect to the measured marginals. We can do this as follows:

First, load synthetic data with 1000 rows over a domain of size 2 x 3 x 4

>>> from tmlt.private_pgm import Dataset, Domain, FactoredInference
>>> import numpy as np
>>> 
>>> domain = Domain(['A','B','C'], [2,3,4])
>>> data = Dataset.synthetic(domain, 1000)

Then measure the AB marginal and BC marginal using the Laplace mechanism

>>> epsilon = 1.0
>>> sigma = 1.0 / epsilon
>>> ab = data.project(['A','B']).datavector()
>>> bc = data.project(['B','C']).datavector()
>>> yab = ab + np.random.laplace(loc=0, scale=sigma, size=ab.size)
>>> ybc = bc + np.random.laplace(loc=0, scale=sigma, size=bc.size)

Now feed these noisy measurements into the inference engine using the Mirror Descent (MD) algorithm

>>> Iab = np.eye(ab.size)
>>> Ibc = np.eye(bc.size)
>>> measurements = [(Iab, yab, sigma, ('A','B')), (Ibc, ybc, sigma, ('B','C'))]
>>> engine = FactoredInference(domain, log=True)
>>> model = engine.estimate(measurements, engine='MD')

Now model can be used as if it were the true data to answer any new queries

>>> ab2 = model.project(['A','B']).datavector()
>>> bc2 = model.project(['B','C']).datavector()
>>> ac2 = model.project(['A','C']).datavector()

Automatic Setup

$ pip install git+https://github.com/ryan112358/private-pgm.git

Manual Setup

We officially support python3, and have the following dependencies: numpy, scipy, pandas, matplotlib, and networkx. These can be installed with pip as follows:

$ pip install -r requirements.txt

Additionally, you have to add the src folder to the PYTHONPATH. If you are using Ubuntu, add the following line to your .bashrc file:

PYTHONPATH=$PYTHONPATH:/path/to/private-pgm/src

This allows you to import modules from this package like from tmlt.private_pgm import FactoredInference no matter what directory you are working in. Once this is done, check to make sure the tests are passing

$ cd /path/to/private-pgm/test
$ nosetests
........................................
----------------------------------------------------------------------
Ran 40 tests in 5.009s

OK

PyTorch

In addition to the above setup, if you have access to a GPU machine you can use PyTorch to accelerate the computations for the Inference engine. This requires changing only one line of code:

>>> engine = FactoredInference(domain, backend='torch', log=True)

See https://pytorch.org/ for instructions to install PyTorch in your environment.

Documentation

This package contains the following public-facing classes: Domain, Dataset, Factor, GraphicalModel, and FactoredInference.

Domain: contains information about the attributes in the dataset and the number of possible values for each attribute. NOTE: It is implicitly assumed that the set of possible values for an attribute is { 0, ..., n-1 }.
Dataset: a class for storing tabular data. Can convert to the vector representation of the data by calling datavector() and can project the data onto a subset of attributes by calling project(). NOTE: This class requires the underlying data to conform to the domain (i.e., the set of possible values for an attribute should be { 0, ..., n-1 }).
Factor: A representation of a multi-dimensional array that also stores domain information. Is used by GraphicalModel.
GraphicalModel: A factored representation of a probability distribution that allows for efficient calculation of marginals and other things. The interface for GraphicalModel is similar to Dataset, so they can be used in the same way (i.e., you can project() onto a subset of attributes and obtain the datavector()).
FactoredInference: A class for performing efficient estimation/inference. This class contains methods for estimating the data distribution (with a GraphicalModel) that approximates some true underlying distribution with respect to noisy measurements over the marginals. Measurements must be represented as a list of 4-tuples: (Q, y, noise, proj), where
- proj (tuple): is a subset of attributes corresponding the marginal the measurements were taken over.
- Q (matrix): is the measurement matrix (can be a numpy array, scipy sparse matrix, or any subclass of scipy.sparse.LinearOperator).
- y (vector) is the noisy answers to the measurement queries (should be a numpy array).
- noise (scalar): is the standard deviation of the noise added to y.

The estimate() method allows you to estimate the data distribution from noisy measurements in this form. Optional arguments are total (if using bounded differential privacy) and engine (to specify which estimation algorithm should be used). This class also has a number of other optional arguments, including structural_zeros (if some combinations of attribute settings are impossible), metric to specify the marginal loss function (L1, L2, or custom), log (to display progress of estimation), iters (number of iterations), and some others.

In addition to these public-facing classes, there are utilities such as mechanism (for running end-to-end experiments) and callbacks (for customized monitoring of the estimation procedure).

Examples

Additional examples can be found in the examples folder. In addition to the toy example above, there are a number of more realistic examples using the adult dataset. The file adult_example is similar to toy_example, but is bigger and slightly more complicated. The file torch_example shows how the torch backend can be used. The file convergence compares the rate of convergence for the different estimation algorithms.

The files hdmm, privbayes, mwem, and dualquery show how to use our technique to improve these existing algorithms. These files have two additional dependencies: Ektelo and autograd.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.1a2 pre-release

Jan 30, 2025

This version

0.1.0a4 pre-release

Oct 23, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tmlt_private_pgm-0.1.0a4.tar.gz (35.8 kB view details)

Uploaded Oct 23, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

tmlt_private_pgm-0.1.0a4-py3-none-any.whl (41.3 kB view details)

Uploaded Oct 23, 2024 Python 3

File details

Details for the file tmlt_private_pgm-0.1.0a4.tar.gz.

File metadata

Download URL: tmlt_private_pgm-0.1.0a4.tar.gz
Upload date: Oct 23, 2024
Size: 35.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: poetry/1.8.3 CPython/3.8.20 Linux/5.11.0-1022-aws

File hashes

Hashes for tmlt_private_pgm-0.1.0a4.tar.gz
Algorithm	Hash digest
SHA256	`ba89d6a545b20e1f8e1cfc098d0c6d2cb6f480d65e668d732b7ea6e553e53e6c`
MD5	`79f783f87df826f203de6d9fc531baaa`
BLAKE2b-256	`3178f2127afab95df5aa8a1f3b965cccd38378092da13a5ebd54c2a1650177e9`

See more details on using hashes here.

File details

Details for the file tmlt_private_pgm-0.1.0a4-py3-none-any.whl.

File metadata

Download URL: tmlt_private_pgm-0.1.0a4-py3-none-any.whl
Upload date: Oct 23, 2024
Size: 41.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: poetry/1.8.3 CPython/3.8.20 Linux/5.11.0-1022-aws

File hashes

Hashes for tmlt_private_pgm-0.1.0a4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ebbb3180b6c8134a64f16272ea9e3231c1dea3fe6c3f3c46525d27794482e0b8`
MD5	`1ea86dc17a005a3b64d0dffa512cf3a2`
BLAKE2b-256	`41412b36a6f7c7d61c1b83c2e42a5ace852212bc2ac6814a44889de63671b9e3`

See more details on using hashes here.

tmlt.private_pgm 0.1.0a4

Navigation

Verified details

Project links

GitLab Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Graphical-model based estimation and inference for differential privacy

Important Update

Toy example

Automatic Setup

Manual Setup

PyTorch

Documentation

Examples

Project details

Verified details

Project links

GitLab Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes