Word-level Metric Local Differential Privacy Mechanisms
Project description
This repository contains the official implementation for the paper: A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking The Privacy-Utility Trade-off (LREC-COLING 2024). It provides production-ready, highly optimized implementations for six word-level Metric Local Differential Privacy (MLDP) mechanisms.
Included Mechanisms
The package implements the following MLDP text privatization strategies:
- MultivariateCalibrated: paper
- TruncatedGumbel: paper
- VickreyMechanism: paper
- TEM: paper
- Mahalanobis: paper
- SynTF: paper
Note that the code for SanText is not included as it is already publicly available here.
Installation
Getting started is as simple as installing the package:
pip install mldp-text
Basic Usage
The package exposes a unified factory function called get_mechanism() to seamlessly switch between different MLDP algorithms using string IDs.
Embedding Perturbation Mechanisms
For any of these mechanisms, initialization is straightforward. By default, mechanisms look for an optimized faiss index to accelerate nearest-neighbor lookups:
import mldp_text
# Initialize your chosen strategy
mechanism = mldp_text.get_mechanism("multivariate_calibrated", epsilon=1, use_faiss=True)
# Privatize individual words
perturbed_word = mechanism.replace_word("pizza")
print(perturbed_word)
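To give a feel for what an embedding perturbation mechanism does internally, here is a minimal toy sketch: add noise to the word's vector, then return the nearest vocabulary word. It is not the package's implementation. The 2-dimensional vectors, `TOY_EMBEDDINGS`, `sample_noise`, and `toy_replace_word` are hypothetical names made up for illustration, and the noise distribution (density proportional to exp(-epsilon * ||z||)) is an assumption in the style of multivariate-noise mechanisms.

```python
import math
import random

# Hypothetical toy vocabulary with 2-dimensional vectors (illustration only;
# the package itself works on 300-dimensional GloVe embeddings).
TOY_EMBEDDINGS = {
    "pizza": (0.9, 0.1),
    "pasta": (0.8, 0.2),
    "burger": (0.7, 0.0),
    "car": (-0.8, 0.5),
    "train": (-0.9, 0.4),
}


def sample_noise(dim, epsilon, rng=random):
    """Draw a noise vector with density proportional to exp(-epsilon * ||z||)."""
    # Direction: a normalized Gaussian vector is uniform on the unit sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Magnitude: Gamma distribution with shape=dim and scale=1/epsilon.
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * x / norm for x in direction]


def toy_replace_word(word, epsilon, rng=random):
    """Perturb the word's vector and return the nearest vocabulary word."""
    vector = TOY_EMBEDDINGS[word]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    # Brute-force nearest neighbor; an index such as faiss would replace
    # this linear scan for a large vocabulary.
    return min(TOY_EMBEDDINGS, key=lambda w: math.dist(TOY_EMBEDDINGS[w], noisy))


print(toy_replace_word("pizza", epsilon=1.0))
```

In this sketch a smaller epsilon yields larger noise, so the returned word is more likely to differ from the input; a very large epsilon almost always returns the input word unchanged.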
SynTF Mechanism
The SynTF mechanism is frequency-driven and requires a document corpus to pre-calculate and cache its reference TF-IDF matrix:
import mldp_text
corpus = ["your list of reference dataset documents here", "another document sample"]
# Initialize SynTF with document data
mechanism = mldp_text.get_mechanism("syntf", epsilon=1.0, data=corpus)
perturbed_word = mechanism.replace_word("pizza")
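Both kinds of mechanism are used through replace_word(), so a whole sentence can be privatized one word at a time. The helper below is a minimal sketch, not part of the package: `privatize_text` and `UppercaseStub` are hypothetical names, and the stub only stands in for a real mechanism so the example runs without the package or its embeddings. Whitespace tokenization is also an assumption; how a real mechanism handles out-of-vocabulary tokens is not covered here.

```python
def privatize_text(mechanism, text):
    """Replace each whitespace-separated token via mechanism.replace_word()."""
    return " ".join(mechanism.replace_word(token) for token in text.split())


class UppercaseStub:
    """Hypothetical stand-in exposing the same replace_word() method."""

    def replace_word(self, word):
        # A real mechanism would return a privatized replacement here.
        return word.upper()


print(privatize_text(UppercaseStub(), "i love pizza"))  # prints "I LOVE PIZZA"
```

With the package installed, the stub would be swapped for an object returned by get_mechanism().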
Supported Mechanisms
When using get_mechanism(name), you can pass any of the following string variants for the name parameter (case-insensitive, hyphens/underscores are normalized automatically):
| MLDP Mechanism | Allowed String IDs (name=) |
|---|---|
| MultivariateCalibrated | multivariate_calibrated |
| TruncatedGumbel | truncated_gumbel |
| VickreyMechanism | vickrey |
| TEM | tem |
| Mahalanobis | mahalanobis |
| SynTF | syntf |
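As an illustration of what "case-insensitive, hyphens/underscores are normalized" could mean in practice, the sketch below lowercases a name and maps hyphens to underscores before checking it against the IDs in the table. This is one plausible reading, not the package's actual code; `normalize_mechanism_name` and `KNOWN_IDS` are hypothetical names.

```python
# IDs copied from the table above.
KNOWN_IDS = {
    "multivariate_calibrated",
    "truncated_gumbel",
    "vickrey",
    "tem",
    "mahalanobis",
    "syntf",
}


def normalize_mechanism_name(name):
    """Lowercase the name and treat hyphens as underscores (assumed rule)."""
    normalized = name.strip().lower().replace("-", "_")
    if normalized not in KNOWN_IDS:
        raise ValueError(f"Unknown mechanism name: {name!r}")
    return normalized


print(normalize_mechanism_name("Truncated-Gumbel"))  # prints "truncated_gumbel"
```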
Embedding Models
By default, the package looks for the glove.840B.300d embedding model pre-filtered to a fixed companion vocabulary (data/vocab.txt). Both assets are derived from the official Stanford GloVe project.
Loading Custom Embeddings
You can pass your own word embedding model to any mechanism. Before loading, the package inspects the first line of your file to confirm it follows the gensim header format: [VOCAB SIZE] [EMBEDDING DIMENSION] (e.g., 400000 300).
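If you want to check a file yourself before handing it to the package, a header test along these lines may help. This is a rough sketch under the assumption that the header is exactly two whitespace-separated integers on the first line; `parse_header` and `read_header` are hypothetical helpers, not functions of the package.

```python
import re

# Two non-negative integers separated by whitespace, e.g. "400000 300".
HEADER_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s*$")


def parse_header(line):
    """Return (vocab_size, dimension) if the line is a valid header, else None."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_header(path):
    """Read only the first line of an embedding file and parse it."""
    with open(path, "r", encoding="utf-8") as handle:
        return parse_header(handle.readline())


print(parse_header("400000 300"))        # prints (400000, 300)
print(parse_header("the 0.04 0.21 -0.3"))  # prints None
```

Raw GloVe text files as distributed by Stanford start directly with a word and its vector, so a check like this returns None for them until a header line has been added.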
You can feed custom paths into the package in two ways:
Option 1: Session-Wide Override
Change the underlying fallback path before instantiating any mechanisms:
import mldp_text
mldp_text.utils.EMBED = "/path/to/your/custom_gensim_embeddings.txt"
engine = mldp_text.get_mechanism("mahalanobis", epsilon=1.2)
Option 2: Mechanism Parameter
Pass the file path directly to the instantiation call:
import mldp_text
engine = mldp_text.get_mechanism(
"vickrey",
epsilon=1,
embed="/path/to/custom_vectors.txt"
)
Get Privatizing!
With these methods, you can now explore word-level Metric Local Differential Privacy text privatization. In case of any questions or suggestions, feel free to reach out to the authors.
Citation
If you find this work useful, please consider citing the original LREC-COLING work, which implemented and evaluated these MLDP mechanisms:
@inproceedings{meisenbacher-etal-2024-comparative,
title = "A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking the Privacy-Utility Trade-off",
author = "Meisenbacher, Stephen and
Nandakumar, Nihildev and
Klymenko, Alexandra and
Matthes, Florian",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.16/",
pages = "174--185"
}