Two-sample testing (A/B testing) for multinomial and multivariate continuous data under local differential privacy

package `privateAB`: two-sample testing under local differential privacy

The privateAB package and the code in this repository implement the private testing method introduced in the paper Minimax Optimal Two-Sample Testing under Local Differential Privacy, by Jongmin Mun, Seungwoo Kwak, and Ilmun Kim.

The full paper can be accessed at: https://arxiv.org/abs/2411.09064.

The code is written and tested in the following environment:

Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-1127.19.1.el7.x86_64
Architecture: x86-64
Python Version: 3.7.12

The code is guaranteed to work with the following package versions:

numpy==1.21.6
pandas==1.3.5
torch==1.7.1

Data Requirements

The input data must be 2D Torch tensors, except for the chi statistic, which requires 1D integer tensors. We recommend using a GPU when the problem is large: multinomial data with many categories, continuous data whose dimension (d) and number of bins (κ) make κ^d large, or a very large sample size (e.g., k = κ^d > 1000 or n > 100,000).
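The rule of thumb above can be checked before choosing a device. The following is a minimal sketch in plain Python; the helper names `alphabet_size` and `gpu_recommended` are hypothetical and not part of privateAB, and the thresholds are simply the example values quoted above.

```python
def alphabet_size(n_bins: int, dim: int) -> int:
    """Number of cells k = kappa^d when each of d coordinates is cut into kappa bins."""
    return n_bins ** dim


def gpu_recommended(sample_size: int, n_categories: int,
                    max_categories: int = 1000,
                    max_samples: int = 100_000) -> bool:
    """Apply the rule of thumb: k > 1000 or n > 100,000 (example thresholds)."""
    return n_categories > max_categories or sample_size > max_samples


k_small = alphabet_size(n_bins=4, dim=3)   # 4^3 = 64 categories
k_large = alphabet_size(n_bins=4, dim=6)   # 4^6 = 4096 categories

print(k_small, gpu_recommended(1000, k_small))  # 64 False
print(k_large, gpu_recommended(1000, k_large))  # 4096 True
```

Because k grows exponentially in d, moving from 3 to 6 dimensions with 4 bins per coordinate is already enough to cross the threshold.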

Conda Environment Setup

We recommend importing the conda environment from the following files:

For Linux: LDPUtsEnvK40.yaml
For Windows: LDPUtsEnvK40_windows.yaml

Basic usage

Two main objects are utilized in this package: client, which implements the privacy mechanism, and server, which conducts the test.

Installation

pip install privateAB

Privatization of multinomial data

client takes raw data in the form of a PyTorch tensor and releases its locally differentially private representation.

In this example, we use the data_generator function from our paper, which internally utilizes the torch.multinomial function. Therefore, when using your own data, ensure it follows the same format as the output of torch.multinomial.

To get started, first import the necessary packages:

from privateAB.client import client
from privateAB.data_generator import data_generator

Now, using our data_generator function, we generate two independent datasets of multinomial samples.

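import torch

# set the probability vectors
sample_size = 1000
d = 4  # number of categories of the multinomial data
param_dist = 0.04  # size of the perturbation added to the uniform distribution
p = torch.ones(d).div(d)  # uniform distribution over d categories
# p2 subtracts param_dist at even indices and adds it at odd indices
p2 = p.add(
        torch.remainder(
        torch.tensor(range(d)),
        2
        ).add(-1/2).mul(2).mul(param_dist)
    )
# p1 is p2 with its entries cyclically shifted by one position
p1_idx = torch.cat( ( torch.arange(1, d), torch.tensor([0])), 0)
p1 = p2[p1_idx]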

#create the data_generator instance
data_gen = data_generator() 

# generate raw data 
raw_data_1 = data_gen.generate_multinomial_data(p1, sample_size)
raw_data_2 = data_gen.generate_multinomial_data(p2, sample_size)

Next, we create an instance of the client class and use its release_private method to privatize the raw data.

The release_private method requires the following five inputs:

Privacy mechanism: A string specifying the mechanism to use ('bitflip', 'genrr', 'lapu', or 'disclapu').
Raw data: A torch.tensor object representing the input data.
Number of categories: The number of categories in the multinomial data.
Privacy parameter: The parameter controlling the level of local differential privacy.
Device: The torch device on which to run the computation (for example torch.device('cpu') or torch.device('cuda')).

LDPclient = client() #create the client, which privatizes the data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') #specify gpu or cpu

priv_mech  = 'bitflip' #choose among 'bitflip', 'genrr', 'lapu', 'disclapu'. bitflip corresponds to rappor in the paper.

private_data_1 = LDPclient.release_private(
            priv_mech,
            raw_data_1,
            d,
            0.9,
            device
        )
private_data_2 = LDPclient.release_private(
            priv_mech,
            raw_data_2,
            d,
            0.9,
            device
        )
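For intuition about what a mechanism such as 'bitflip' does to each observation, here is a minimal sketch of the standard RAPPOR-style bit-flipping mechanism in plain Python. It is not the package's implementation: the function name `bitflip_privatize` is hypothetical, the privacy parameter is written `alpha`, and the flip probability used here is the textbook choice, which may differ in detail from what privateAB does internally.

```python
import math
import random


def bitflip_privatize(category: int, n_categories: int, alpha: float,
                      rng: random.Random) -> list:
    """One-hot encode `category`, then flip each bit independently.

    Each bit is kept with probability e^(alpha/2) / (1 + e^(alpha/2)).
    Two one-hot vectors differ in exactly two bits, so the likelihood
    ratio of any output under two different inputs is at most e^alpha.
    """
    keep_prob = math.exp(alpha / 2) / (1 + math.exp(alpha / 2))
    one_hot = [1 if j == category else 0 for j in range(n_categories)]
    return [bit if rng.random() < keep_prob else 1 - bit for bit in one_hot]


rng = random.Random(0)
raw = [rng.randrange(4) for _ in range(1000)]          # 1000 raw categories in {0,1,2,3}
private = [bitflip_privatize(x, 4, 0.9, rng) for x in raw]
print(len(private), len(private[0]))                   # 1000 4
```

Each privatized observation is a binary vector of length d, so a sample of size n becomes an n-by-d array, in line with the 2D tensors described under Data Requirements.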

Testing of multinomial data

The test is conducted using one of the following server instances: server_multinomial_bitflip, server_ell2, or server_multinomial_genrr. These correspond to the ProjChi, ell2, and Chi statistics discussed in the paper.

The first two servers (server_multinomial_bitflip and server_ell2) can process privatized views generated using the 'bitflip', 'lapu', or 'disclapu' mechanisms.
The server_multinomial_genrr instance, however, exclusively supports privatized views generated by the 'genrr' mechanism.
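One way to see why 'genrr' views need their own server is that generalized randomized response reports a single category per observation rather than a perturbed vector, matching the 1D integer tensors that the chi statistic expects. The sketch below shows the standard k-ary randomized response in plain Python, assuming that 'genrr' refers to this mechanism. The function names are hypothetical and the code is an illustration, not the package's implementation.

```python
import math
import random


def genrr_probabilities(n_categories: int, alpha: float):
    """Return (p_keep, p_other): the probability of reporting the true
    category and the probability of reporting each other category."""
    denom = math.exp(alpha) + n_categories - 1
    return math.exp(alpha) / denom, 1.0 / denom


def genrr_privatize(category: int, n_categories: int, alpha: float,
                    rng: random.Random) -> int:
    """Report the true category with probability p_keep, otherwise one of
    the remaining categories chosen uniformly at random."""
    p_keep, _ = genrr_probabilities(n_categories, alpha)
    if rng.random() < p_keep:
        return category
    other = rng.randrange(n_categories - 1)   # index among the k - 1 other categories
    return other if other < category else other + 1


rng = random.Random(0)
reports = [genrr_privatize(2, 4, 0.9, rng) for _ in range(10)]
print(reports)  # ten integers in {0, 1, 2, 3}
```

The ratio p_keep / p_other equals e^alpha, which is exactly the bound that alpha-local differential privacy places on the output distribution.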

To proceed, we first create a server instance, which requires the privacy parameter as input. Next, we load the privatized data using the load_private_data_multinomial method. This method takes the following five inputs:

First private data object: The first dataset's privatized representation.
Second private data object: The second dataset's privatized representation (for A/B testing).
Number of categories: The number of categories in the multinomial data.
Device for the first private data: The torch device (CPU or GPU) used to process the first dataset.
Device for the second private data: The torch device used to process the second dataset.

We allow two separate devices to accommodate large-scale datasets where GPU memory might be limited, so that the calculations can be performed separately for each of the two datasets. You can use the same device for both datasets if memory is not a concern.

from privateAB.server import server_multinomial_bitflip

server_private = server_multinomial_bitflip(0.9) #create an instance with the privacy parameter
server_private.load_private_data_multinomial(
    private_data_1, private_data_2,
    d,
    device,
    device
    )

Now we run the test. Any of the server instances (server_ell2, server_multinomial_bitflip, or server_multinomial_genrr) can calculate the permutation p-value using the release_p_value_permutation method.

This method takes a single input:

Number of permutations: The number of permutations to perform.

It returns two outputs:

p-value: The permutation p-value of the test.
Test statistic value: The calculated value of the test statistic.

n_permutation = 999 #number of permutations (example value)
p_value, statistic = server_private.release_p_value_permutation(n_permutation)
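To make the role of the number of permutations concrete, the following plain-Python sketch shows how a permutation p-value is commonly formed. It is an illustration under stated assumptions, not the package's code: the function names are hypothetical, and the squared Euclidean distance between sample means is used only as a stand-in statistic, not as one of the statistics from the paper.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_l2_statistic(sample_1, sample_2):
    """Squared Euclidean distance between the two sample means."""
    m1, m2 = mean_vector(sample_1), mean_vector(sample_2)
    return sum((a - b) ** 2 for a, b in zip(m1, m2))


def permutation_p_value(sample_1, sample_2, n_permutation, statistic, rng):
    """Return (p_value, observed statistic) from a permutation test."""
    observed = statistic(sample_1, sample_2)
    pooled = list(sample_1) + list(sample_2)
    n1 = len(sample_1)
    exceed = 0
    for _ in range(n_permutation):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (1 + exceed) / (1 + n_permutation), observed


rng = random.Random(0)
sample_a = [[1, 0]] * 20
sample_b = [[0, 1]] * 20
p_value, stat = permutation_p_value(sample_a, sample_b, 199,
                                    squared_l2_statistic, rng)
print(p_value, stat)
```

With B permutations the smallest attainable p-value is 1 / (B + 1), so the number of permutations should be large enough for the significance level you intend to use.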

server_multinomial_bitflip and server_multinomial_genrr can also compute the p-value based on the asymptotic chi-square null distribution using the release_p_value method.

This method does not require any input arguments. It directly outputs:

p-value: The p-value computed from the asymptotic chi-square null distribution.

p_value, statistic = server_private.release_p_value()

Privatization of continuous data

As discussed in our paper, the privatization of continuous data uses a binning method. We support data in the form of a $d$-dimensional PyTorch tensor, where each dimension falls within the interval $[0,1]$. If your data lies outside this range, you should apply an appropriate transformation, such as the CDF transformation mentioned in our paper.
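As a rough illustration of these two steps, the sketch below maps a point to $[0,1]^d$ with a CDF and then assigns it to one of κ^d cells. It is written in plain Python under explicit assumptions: the reference distribution is taken to be standard normal, the function names `normal_cdf` and `bin_index` are hypothetical, and the ordering of the cells is an arbitrary choice that need not match the one used inside privateAB.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, used here as an example transformation to [0, 1]."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bin_index(point, n_bins: int) -> int:
    """Map a point in [0,1]^d to a single cell index in {0, ..., n_bins^d - 1}."""
    index = 0
    for coord in point:
        if not 0.0 <= coord <= 1.0:
            raise ValueError("coordinates must lie in [0, 1]")
        cell = min(int(coord * n_bins), n_bins - 1)  # coord == 1.0 goes to the last bin
        index = index * n_bins + cell
    return index


raw_point = [-1.2, 0.0, 2.5]                       # a point outside [0, 1]^3
unit_point = [normal_cdf(x) for x in raw_point]    # now inside [0, 1]^3
print(bin_index(unit_point, 4))                    # 11
```

After this step each observation is a single category out of κ^d, which is why the binned continuous data can be treated like multinomial data.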

For convenience, we use our data_generator function to create two sets of multivariate continuous data. This function ensures the generated data adheres to the required format and simplifies the process of preparing data for privatization.

import torch
from privateAB.data_generator import data_generator
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') #specify gpu or cpu

sample_size = 1000
d = 3 #dimension of the continuous data
copula_mean_1 = -0.5 * torch.ones(d).to(device)
copula_mean_2 = -copula_mean_1
copula_sigma = (0.5 * torch.ones(d,d) + 0.5 * torch.eye(d)).to(device)
data_gen = data_generator()
raw_data_1 = data_gen.generate_copula_gaussian_data(sample_size, copula_mean_1, copula_sigma)
raw_data_2 = data_gen.generate_copula_gaussian_data(sample_size, copula_mean_2, copula_sigma)

Now we privatize the multivariate continuous data using the release_private_conti method. This method is similar to release_private but automatically detects the data's dimensionality. Instead of specifying the number of categories, you provide the number of bins for discretizing the data.

The release_private_conti method requires the following five inputs:

Privacy mechanism: A string specifying the mechanism to use ('bitflip', 'genrr', 'lapu', or 'disclapu').
Raw data: A torch.tensor object representing the input multivariate continuous data.
Number of bins: The number of bins to discretize each dimension of the data.
Privacy parameter: The parameter controlling the level of local differential privacy.
Device: The torch device on which to run the computation (for example torch.device('cpu') or torch.device('cuda')).

privacy_level = 0.9
n_bin = 4
# privatize the two raw datasets generated above
data_y_priv = LDPclient.release_private_conti(
            priv_mech,
            raw_data_1,
            privacy_level,
            n_bin,
            device
        )

data_z_priv = LDPclient.release_private_conti(
            priv_mech,
            raw_data_2,
            privacy_level,
            n_bin,
            device
        )

Testing of continuous data

After privatization, the data format aligns with that of multinomial data, allowing the same testing procedures to be applied.

One important note is that the number of categories should equal the bin number raised to the power of the data dimension. You don’t need to calculate this manually, as it is automatically stored in LDPclient.alphabet_size_binned. This ensures consistency and simplifies the setup for testing.

Reproducing Simulation Results

To replicate the simulation results in the paper, run the following Python files. Adjust the sample size, data dimension, and privacy parameters as specified in each file:

Figure 2: Figure2_type_I.py or Figure2_type_I.ipynb
Figure 3: Figure3_multinomial.py or Figure3_multinomial.ipynb
Figure 4: Figure4_density_location.py or Figure4_density_location.ipynb
Figure 5: Figure5_rappor_elltwo_vs_projchi.py or Figure5_rappor_elltwo_vs_projchi.ipynb
Figure 6: Figure6_density_scale.py or Figure6_density_scale.ipynb

Release history

0.0.2: Nov 20, 2024 (this version)
0.0.1: Nov 20, 2024
