Skip to main content

Differentially Private SQL Queries

Project description

License: MIT Python

SmartNoise SQL

Differentially private SQL queries. Tested with:

  • PostgreSQL
  • SQL Server
  • Spark
  • Pandas (SQLite)
  • PrestoDB
  • BigQuery

SmartNoise is intended for scenarios where the analyst is trusted by the data owner. SmartNoise uses the OpenDP library of differential privacy algorithms.

Installation

pip install smartnoise-sql

Querying a Pandas DataFrame

Use the from_df method to create a private reader that can issue queries against a pandas dataframe. Example below uses datasets PUMS.csv and PUMS.yaml can be found in the datasets folder in the root directory.

import snsql
from snsql import Privacy
import pandas as pd
privacy = Privacy(epsilon=1.0, delta=0.01)

csv_path = 'PUMS.csv'
meta_path = 'PUMS.yaml'

pums = pd.read_csv(csv_path)
reader = snsql.from_df(pums, privacy=privacy, metadata=meta_path)

result = reader.execute('SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex')

Querying a SQL Database

Use from_connection to wrap an existing database connection.

The connection must be to a database that supports the SQL standard, in this example the database must be configured with the name PUMS, have a schema called PUMS and a table called PUMS, and the data from PUMS.csv needs to be in that table.

import snsql
from snsql import Privacy
import psycopg2

privacy = Privacy(epsilon=1.0, delta=0.01)
meta_path = 'PUMS.yaml'

pumsdb = psycopg2.connect(user='postgres', host='localhost', database='PUMS')
reader = snsql.from_connection(pumsdb, privacy=privacy, metadata=meta_path)

result = reader.execute('SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex')

Querying a Spark DataFrame

Use from_connection to wrap a spark session.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from snsql import *

pums = spark.read.load(...)  # load a Spark DataFrame
pums.createOrReplaceTempView("PUMS_large")

metadata = 'PUMS_large.yaml'

private_reader = from_connection(
    spark, 
    metadata=metadata, 
    privacy=Privacy(epsilon=3.0, delta=1/1_000_000)
)
private_reader.reader.compare.search_path = ["PUMS"]


res = private_reader.execute('SELECT COUNT(*) FROM PUMS_large')
res.show()

Privacy Cost

The privacy parameters epsilon and delta are passed in to the private connection at instantiation time, and apply to each computed column during the life of the session. Privacy cost accrues indefinitely as new queries are executed, with the total accumulated privacy cost being available via the spent property of the connection's odometer:

privacy = Privacy(epsilon=0.1, delta=10e-7)

reader = from_connection(conn, metadata=metadata, privacy=privacy)
print(reader.odometer.spent)  # (0.0, 0.0)

result = reader.execute('SELECT COUNT(*) FROM PUMS.PUMS')
print(reader.odometer.spent)  # approximately (0.1, 10e-7)

The privacy cost increases with the number of columns:

reader = from_connection(conn, metadata=metadata, privacy=privacy)
print(reader.odometer.spent)  # (0.0, 0.0)

result = reader.execute('SELECT AVG(age), AVG(income) FROM PUMS.PUMS')
print(reader.odometer.spent)  # approximately (0.4, 10e-6)

The odometer is advanced immediately before the differentially private query result is returned to the caller. If the caller wishes to estimate the privacy cost of a query without running it, get_privacy_cost can be used:

reader = from_connection(conn, metadata=metadata, privacy=privacy)
print(reader.odometer.spent)  # (0.0, 0.0)

cost = reader.get_privacy_cost('SELECT AVG(age), AVG(income) FROM PUMS.PUMS')
print(cost)  # approximately (0.4, 10e-6)

print(reader.odometer.spent)  # (0.0, 0.0)

Note that the total privacy cost of a session accrues at a slower rate than the sum of the individual query costs obtained by get_privacy_cost. The odometer accrues all invocations of mechanisms for the life of a session, and uses them to compute total spend.

reader = from_connection(conn, metadata=metadata, privacy=privacy)
query = 'SELECT COUNT(*) FROM PUMS.PUMS'
epsilon_single, _ = reader.get_privacy_cost(query)
print(epsilon_single)  # 0.1

# no queries executed yet
print(reader.odometer.spent)  # (0.0, 0.0)

for _ in range(100):
    reader.execute(query)

epsilon_many, _ = reader.odometer.spent
print(f'{epsilon_many} < {epsilon_single * 100}')

Histograms

SQL group by queries represent histograms binned by grouping key. Queries over a grouping key with unbounded or non-public dimensions expose privacy risk. For example:

SELECT last_name, COUNT(*) FROM Sales GROUP BY last_name

In the above query, if someone with a distinctive last name is included in the database, that person's record might accidentally be revealed, even if the noisy count returns 0 or negative. To prevent this from happening, the system will automatically censor dimensions which would violate differential privacy.

Private Synopsis

A private synopsis is a pre-computed set of differentially private aggregates that can be filtered and aggregated in various ways to produce new reports. Because the private synopsis is differentially private, reports generated from the synopsis do not need to have additional privacy applied, and the synopsis can be distributed without risk of additional privacy loss. Reports over the synopsis can be generated with non-private SQL, within an Excel Pivot Table, or through other common reporting tools.

You can see a sample notebook for creating private synopsis suitable for consumption in Excel or SQL.

Limitations

You can think of the data access layer as simple middleware that allows composition of opendp computations using the SQL language. The SQL language provides a limited subset of what can be expressed through the full opendp library. For example, the SQL language does not provide a way to set per-field privacy budget.

Because we delegate the computation of exact aggregates to the underlying database engines, execution through the SQL layer can be considerably faster, particularly with database engines optimized for precomputed aggregates. However, this design choice means that analysis graphs composed with SQL language do not access data in the engine on a per-row basis. Therefore, SQL queries do not currently support algorithms that require per-row access, such as quantile algorithms that use underlying values. This is a limitation that future releases will relax for database engines that support row-based access, such as Spark.

The SQL processing layer has limited support for bounding contributions when individuals can appear more than once in the data. This includes ability to perform reservoir sampling to bound contributions of an individual, and to scale the sensitivity parameter. These parameters are important when querying reporting tables that might be produced from subqueries and joins, but require caution to use safely.

For this release, we recommend using the SQL functionality while bounding user contribution to 1 row. The platform defaults to this option by setting max_contrib to 1, and should only be overridden if you know what you are doing. Future releases will focus on making these options easier for non-experts to use safely.

Communication

Releases and Contributing

Please let us know if you encounter a bug by creating an issue.

We appreciate all contributions. Please review the contributors guide. We welcome pull requests with bug-fixes without prior discussion.

If you plan to contribute new features, utility functions or extensions, please first open an issue and discuss the feature with us.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

smartnoise_sql-1.0.6.tar.gz (123.1 kB view details)

Uploaded Source

Built Distribution

smartnoise_sql-1.0.6-py3-none-any.whl (144.5 kB view details)

Uploaded Python 3

File details

Details for the file smartnoise_sql-1.0.6.tar.gz.

File metadata

  • Download URL: smartnoise_sql-1.0.6.tar.gz
  • Upload date:
  • Size: 123.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.1 CPython/3.13.1 Darwin/24.1.0

File hashes

Hashes for smartnoise_sql-1.0.6.tar.gz
Algorithm Hash digest
SHA256 680d909fefd67453ed4d33d63be4cae0ceecc15987a3ea9c7520eff66d059d8e
MD5 2ac79ed6f395622f4ea37f9c8b93eb16
BLAKE2b-256 a70a4035012a854b6014a36aafb3a5c6af068ed13aef494db7c65c44dc0bbcca

See more details on using hashes here.

File details

Details for the file smartnoise_sql-1.0.6-py3-none-any.whl.

File metadata

  • Download URL: smartnoise_sql-1.0.6-py3-none-any.whl
  • Upload date:
  • Size: 144.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.1 CPython/3.13.1 Darwin/24.1.0

File hashes

Hashes for smartnoise_sql-1.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 95f29f3eef7527d99d9f0ddbc89e4a6ce069c2748c3e729b9efb5288dcba3b1c
MD5 ae6ae142cc01177b777aa84076cec520
BLAKE2b-256 8d674e944d14994da56319f1b8e947b8f2253685b6cd797f2f33b5eca0b1de8f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page