PipelineDP
PipelineDP is a framework for applying differential privacy to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
To make differential privacy accessible to non-experts, PipelineDP:
- provides a convenient API familiar to Spark or Beam developers;
- encapsulates the complexities of differential privacy, such as protection of outliers and rare categories, generation of safe noise, and privacy budget accounting;
- supports many standard computations, such as count, sum, and average, and is easily extensible to support other aggregation types.
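The "safe noise" mentioned above is typically Laplace or Gaussian noise calibrated to the query's sensitivity and the privacy budget. Below is a minimal, illustrative sketch of the classic Laplace mechanism for a count query; it is not PipelineDP's internal implementation, just the underlying idea:

```python
import numpy as np

def dp_count(n, epsilon, rng):
    # Laplace mechanism: a count query has sensitivity 1 (adding or removing
    # one user changes the count by at most 1), so noise scale is 1 / epsilon.
    return n + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
noisy = dp_count(1000, epsilon=1.0, rng=rng)
```

Smaller epsilon means a larger noise scale and therefore stronger privacy at the cost of accuracy.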
The project is in the early development stage. More description will be added later.
Getting started
Take a look at the examples of how to run PipelineDP on Apache Spark or Beam:
Here's a code sample showing what private processing code on Spark looks like. For a more complete guide, please take a look at the examples above or at the API guide.
import pipeline_dp
from pipeline_dp.aggregate_params import SumParams
from pipeline_dp.private_spark import make_private

# Define the privacy budget available for our computation.
budget_accountant = pipeline_dp.NaiveBudgetAccountant(total_epsilon=1,
                                                      total_delta=1e-6)

# Wrap Spark's RDD into its private version. You will use this private wrapper
# for all further processing instead of Spark's RDD. Using the wrapper ensures
# that only private statistics can be released.
private_movie_views = \
    make_private(movie_views, budget_accountant, lambda mv: mv.user_id)

# Calculate the private sum of ratings per movie.
dp_result = private_movie_views.sum(
    SumParams(max_partitions_contributed=2,
              max_contributions_per_partition=2,
              min_value=1,
              max_value=5,
              # Specifies the aggregation key.
              partition_extractor=lambda mv: mv.movie_id,
              # Specifies the value we're aggregating.
              value_extractor=lambda mv: mv.rating)
)
budget_accountant.compute_budgets()

# Save the results.
dp_result.saveAsTextFile(FLAGS.output_file)
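The bounds passed to SumParams above drive contribution bounding: values are clamped to [min_value, max_value], and each user's contributions are capped per partition and across partitions. A rough, illustrative sketch of that idea in plain Python (this is not PipelineDP's API; the library performs the equivalent work in a distributed fashion and then adds noise):

```python
from collections import defaultdict

def bound_contributions(records, max_partitions_contributed,
                        max_contributions_per_partition, min_value, max_value):
    # records: iterable of (user_id, partition_key, value) tuples.
    per_user = defaultdict(lambda: defaultdict(list))
    for user, partition, value in records:
        # Clamp each value into [min_value, max_value].
        clamped = min(max(value, min_value), max_value)
        per_user[user][partition].append(clamped)

    bounded = []
    for user, partitions in per_user.items():
        # Keep at most max_partitions_contributed partitions per user.
        for partition in list(partitions)[:max_partitions_contributed]:
            # Keep at most max_contributions_per_partition values per partition.
            for value in partitions[partition][:max_contributions_per_partition]:
                bounded.append((user, partition, value))
    return bounded

records = [("u1", "m1", 7), ("u1", "m1", 4), ("u1", "m1", 3),
           ("u1", "m2", 0), ("u1", "m3", 5)]
out = bound_contributions(records, max_partitions_contributed=2,
                          max_contributions_per_partition=2,
                          min_value=1, max_value=5)
```

Bounding each user's influence this way is what keeps the sensitivity of the aggregation finite, so that a bounded amount of noise suffices for the privacy guarantee.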
Development
To install the requirements for local development, run make dev.
Please run make precommit to auto-format, lint check, and run tests.
Individual targets are format, lint, test, clean, dev.
Style guide
We follow the Google Python Style Guide: https://google.github.io/styleguide/pyguide.html
Installation
This project depends on numpy, apache-beam, pyspark, absl-py, and dataclasses (Python 3.6 only).
To install with poetry, please run:
- git clone https://github.com/OpenMined/PipelineDP.git
- cd PipelineDP/
- poetry install
To install with pip, please run:
- pip install numpy apache-beam pyspark absl-py
- pip install dataclasses (for Python 3.6 only)
Running an end-to-end example
During development it is convenient to run an end-to-end example. To do this:
1. Download the Netflix prize dataset from https://www.kaggle.com/netflix-inc/netflix-prize-data and unpack it.
2. The dataset itself is pretty big; to speed up the run, it is better to use only a part of it. You can generate a subset by running in bash:
   head -10000 combined_data_1.txt > data.txt
   or obtain a subset of lines from the dataset in some other way.
3. Run python movie_view_ratings.py --input_file=<path to data.txt from step 2> --output_file=<...>
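The Netflix prize files store ratings in blocks: a header line of the form "movie_id:" followed by "user_id,rating,date" rows. A minimal sketch of parsing such lines into records usable by the Spark sample above (parse_netflix_lines and MovieView are hypothetical helpers, not part of the example script):

```python
from collections import namedtuple

MovieView = namedtuple("MovieView", ["user_id", "movie_id", "rating"])

def parse_netflix_lines(lines):
    # Each block starts with "<movie_id>:", followed by "user_id,rating,date" rows.
    movie_id = None
    views = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            movie_id = int(line[:-1])
        else:
            user_id, rating, _date = line.split(",")
            views.append(MovieView(int(user_id), movie_id, int(rating)))
    return views
```

In a Spark pipeline the same logic would typically run inside a map over the input RDD rather than building a list in memory.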