DPLab: Benchmarking Differential Privacy Aggregation Operations
This repo aims to provide a unified interface for accessing and evaluating the same aggregation functionalities across different open-source differential privacy (DP) libraries. With a simple CLI, one can choose the library, the aggregation function, and many other experimental parameters, and apply the specified DP measurement to data stored in a .csv file. The repo also provides both synthetic and real-world example datasets for evaluation purposes. Evaluation results are stored in a .json file, and metrics are reported for repeated experiments. A companion CLI tool generates configuration groups for larger-scale comparison experiments.
Currently supported aggregation operations:
- COUNT
- SUM
- MEAN
- VAR
- MEDIAN
- QUANTILE
Currently supported libraries:
- diffprivlib 0.5.2 [Homepage] [Example Usage]
- python-dp 1.1.1 [Homepage] [Example Usage]
- opendp 0.5.0 [Homepage] [Example Usage]
- tmlt.analytics 0.4.1 [Homepage] [Example Usage]
- chorus [Homepage] [Example Usage]
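To illustrate the kind of operation being benchmarked, here is a minimal sketch of a DP mean computed directly with diffprivlib (one of the supported libraries). The epsilon, bounds, and toy data below are arbitrary choices for this example, not defaults taken from DPLab.

import numpy as np
from diffprivlib.tools import mean

data = np.random.uniform(0, 100, size=1000)              # toy data standing in for a .csv column
dp_estimate = mean(data, epsilon=1.0, bounds=(0, 100))   # DP mean under epsilon = 1.0
print(dp_estimate)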
Installation
Clone the repo, switch to its directory, and install the dependencies:
git clone git@github.com:camelop/dp_lab.git
cd dp_lab
pip install -e .
To use tmlt.analytics:
export PYSPARK_PYTHON=/usr/bin/python3
sudo apt install openjdk-8-jre-headless
pip3 install -i https://d3p0voevd56kj6.cloudfront.net python-flint
pip3 install tmlt.analytics
To use chorus, please make sure you have a Java runtime installed.
How to run experiments in the benchmark
Generate the experiment commands. This creates an ./exp.db.json file under the working directory (you can also use --location to specify a different place).
dplab_exp plan --repeat 100 --group_num 100
Queue the experiments for execution
dplab_exp launch --debug
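If you want a quick look at the planned experiments, the generated ./exp.db.json is a plain JSON file and can be loaded with the standard library. The structure of its entries is specific to DPLab and not documented here, so the sketch below only counts them.

import json

with open("exp.db.json") as f:   # path produced by `dplab_exp plan` above
    experiments = json.load(f)
print(len(experiments), "planned entries")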
How to run DP libraries in the benchmark
Run a specific library with the CLI:
dplab_run <library> <operation> <input_file> <output_file> <other options>
For example:
dplab_run pydp sum data/1.csv data/1.json -f -r 1000
Other options include:
- mode: Evaluation mode; choose from "plain" (no timing/memory measurement), "internal" (internal measurement), or "external" (external tracking).
- epsilon: DP privacy parameter.
- quant: Quantile value for the QUANTILE operation, a float between 0 and 1.
- repeat: How many times the evaluation should be repeated.
- force: Overwrite the output file if it already exists.
- debug: Include debugging information in the output file.
- python_command: Python command used to run the script in external mode.
- external_sample_interval: Timing/memory consumption sampling interval in external mode.
For more information, please check the main entry file.
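As a quick end-to-end illustration, the sketch below prepares a tiny input file, runs one measurement, and reads the result back. It assumes the input .csv holds one numeric value per row and that the output is standard JSON; both are assumptions made for this example rather than a documented schema.

import csv, json, subprocess

# Write a toy single-column .csv (assumed input format for this example).
with open("toy.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for v in range(100):
        writer.writerow([v])

# Run one measurement (same CLI pattern as the example above).
subprocess.run(["dplab_run", "pydp", "sum", "toy.csv", "toy_out.json", "-f"], check=True)

# The output file is assumed to be plain JSON; print whatever it contains.
with open("toy_out.json") as f:
    print(json.load(f))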
Generating synthetic data
# Make sure you are in the root directory of the repo
# Data will be generated in the ./data/ directory
# The procedure will generate about 28GB of data
# To avoid running out of disk space, you can comment out the performance-test lines (lines 26-27) in SYN_TARGETS defined in the script
python3 scripts/gen_data.py
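If you only need a small file for a quick local test rather than the full 28GB, a toy dataset can be generated directly. The file name and distribution below are arbitrary choices for this example, not those used by scripts/gen_data.py.

import numpy as np

# Draw 10,000 values from a lognormal distribution and save one value per row.
values = np.random.lognormal(mean=3.0, sigma=1.0, size=10_000)
np.savetxt("data/toy_syn.csv", values, delimiter=",")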