
Word-level Metric Local Differential Privacy Mechanisms


MLDP


This repository contains the official implementation for the paper: A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking The Privacy-Utility Trade-off (LREC-COLING 2024). It provides production-ready, highly optimized implementations for six word-level Metric Local Differential Privacy (MLDP) mechanisms.

Included Mechanisms

The package implements the following MLDP text privatization strategies:

  • MultivariateCalibrated
  • TruncatedGumbel
  • VickreyMechanism
  • TEM
  • Mahalanobis
  • SynTF

Note that the code for SanText is not included as it is already publicly available here.

Installation

Getting started is as simple as installing the package:

pip install mldp_text

Basic Usage

The package exposes a unified factory function called get_mechanism() to seamlessly switch between different MLDP algorithms using string IDs.

Embedding Perturbation Mechanisms

For any of these mechanisms, initialization is straightforward. By default, each mechanism looks for an optimized faiss index to accelerate nearest-neighbor lookups:

import mldp_text

# Initialize your chosen strategy
mechanism = mldp_text.get_mechanism("multivariate_calibrated", epsilon=1, use_faiss=True)

# Privatize individual words
perturbed_word = mechanism.replace_word("pizza")
print(perturbed_word)
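
The general pattern behind embedding perturbation can be sketched in a few lines: add noise to a word's vector, then return the vocabulary word closest to the noisy point. The sketch below is illustrative only and is not the package's code. The toy vectors and the names TOY_VECTORS, sample_noise, and perturb_word are assumptions made for this example, and the noise is the standard metric-DP construction (uniform direction, Gamma-distributed magnitude) rather than any specific mechanism listed here.

```python
import math
import random

# Toy 2-D "embeddings" (hypothetical values); real models use hundreds of dimensions.
TOY_VECTORS = {
    "pizza": (1.0, 0.9),
    "pasta": (0.9, 1.0),
    "burger": (1.2, 0.6),
    "laptop": (-1.0, 0.2),
}


def sample_noise(dim, epsilon, rng):
    """Sample noise with density proportional to exp(-epsilon * ||z||)."""
    # Uniform direction: normalize a standard Gaussian vector.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Magnitude: Gamma distribution with shape dim and scale 1/epsilon.
    magnitude = rng.gammavariate(dim, 1.0 / epsilon)
    return [magnitude * x / norm for x in direction]


def perturb_word(word, epsilon, rng=None):
    """Return the vocabulary word nearest to the noisy embedding of `word`."""
    if rng is None:
        rng = random.Random()
    vector = TOY_VECTORS[word]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    # Brute-force nearest neighbor; fine for a toy vocabulary.
    return min(TOY_VECTORS, key=lambda w: math.dist(noisy, TOY_VECTORS[w]))


if __name__ == "__main__":
    rng = random.Random(0)
    for eps in (0.5, 5.0, 50.0):
        print(eps, [perturb_word("pizza", eps, rng) for _ in range(5)])
```

Smaller epsilon values produce larger noise, so the output drifts further from the input word. The brute-force search in the last step is the part that an approximate nearest-neighbor index such as faiss accelerates on large vocabularies.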

SynTF Mechanism

The SynTF mechanism is frequency-driven and requires a document corpus to pre-calculate and cache its reference TF-IDF matrix:

import mldp_text

corpus = ["your list of reference dataset documents here", "another document sample"]

# Initialize SynTF with document data
mechanism = mldp_text.get_mechanism("syntf", epsilon=1.0, data=corpus)

perturbed_word = mechanism.replace_word("pizza")
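
To give a feel for what a reference TF-IDF matrix over a corpus looks like, here is a minimal standard-library sketch. It is an assumption-laden illustration, not the package's implementation: the function name tfidf_matrix, the whitespace tokenization, and the plain unsmoothed weighting (term frequency times log(N / document frequency)) are all choices made for this example, and SynTF's actual weighting may differ.

```python
import math
from collections import Counter


def tfidf_matrix(corpus):
    """Return (vocabulary, rows), where rows[i][j] is the TF-IDF of term j in document i."""
    # Hypothetical tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        rows.append([counts[tok] / total * idf[tok] for tok in vocabulary])
    return vocabulary, rows


if __name__ == "__main__":
    vocab, rows = tfidf_matrix(["pizza is good", "pasta is good too"])
    for term, weight in zip(vocab, rows[0]):
        print(f"{term}: {weight:.3f}")
```

Terms that occur in every document receive a weight of zero under this weighting, which is why a representative reference corpus matters for a frequency-driven mechanism.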

Supported Mechanisms

When using get_mechanism(name), you can pass any of the following string variants for the name parameter (case-insensitive, hyphens/underscores are normalized automatically):

MLDP Mechanism            Allowed String ID (name=)
MultivariateCalibrated    multivariate_calibrated
TruncatedGumbel           truncated_gumbel
VickreyMechanism          vickrey
TEM                       tem
Mahalanobis               mahalanobis
SynTF                     syntf
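
One plausible reading of the normalization rule is sketched below: lowercase the name and map hyphens to underscores before looking it up. The helper normalize_name and the KNOWN_IDS set are hypothetical names for this example, and the package's own normalization may behave differently (for instance, by stripping separators entirely).

```python
# String IDs taken from the table above.
KNOWN_IDS = {
    "multivariate_calibrated",
    "truncated_gumbel",
    "vickrey",
    "tem",
    "mahalanobis",
    "syntf",
}


def normalize_name(name):
    """Lowercase a mechanism name and map hyphens to underscores (assumed rule)."""
    key = name.strip().lower().replace("-", "_")
    if key not in KNOWN_IDS:
        raise ValueError(f"unknown mechanism: {name!r}")
    return key


if __name__ == "__main__":
    for raw in ("Truncated-Gumbel", "SYNTF", "multivariate_calibrated"):
        print(raw, "->", normalize_name(raw))
```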

Embedding Models

By default, the package looks for the glove.840B.300d embedding model pre-filtered to a fixed companion vocabulary (data/vocab.txt). Both assets are derived from the official Stanford GloVe project.

Loading Custom Embeddings

You can pass your own word embedding model to any mechanism. Before loading, the package inspects the file header to confirm it follows the word2vec text format that gensim reads, in which the first line is [VOCAB SIZE] [EMBEDDING DIMENSION] (e.g., 400000 300).
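
If you want to check a file yourself before handing it to the package, a header test along these lines may help. This is a minimal sketch under the assumption that the header is two whitespace-separated positive integers on the first line; read_embedding_header is a hypothetical helper, not a function exported by mldp_text.

```python
def read_embedding_header(path):
    """Return (vocab_size, dimension) from the first line of a word2vec-style text file."""
    with open(path, encoding="utf-8") as handle:
        fields = handle.readline().split()
    if len(fields) != 2:
        raise ValueError(f"expected '<vocab size> <dimension>', got {fields!r}")
    try:
        vocab_size, dimension = int(fields[0]), int(fields[1])
    except ValueError:
        raise ValueError(f"header fields are not integers: {fields!r}") from None
    if vocab_size <= 0 or dimension <= 0:
        raise ValueError(f"header values must be positive: {fields!r}")
    return vocab_size, dimension


if __name__ == "__main__":
    import sys

    # Usage: python check_header.py /path/to/embeddings.txt
    print(read_embedding_header(sys.argv[1]))
```

A raw GloVe text file has no such header line (its first line is already a word followed by its vector), so it would fail this check until it is converted.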

You can feed custom paths into the package in two ways:

Option 1: Session-Wide Override

Change the underlying fallback path before instantiating any mechanisms:

import mldp_text

# Use the imported module name (mldp_text); a bare `mldp` is undefined here
mldp_text.utils.EMBED = "/path/to/your/custom_gensim_embeddings.txt"

engine = mldp_text.get_mechanism("mahalanobis", epsilon=1.2)

Option 2: Mechanism Parameter

Pass the file path directly to the instantiation call:

import mldp_text

engine = mldp_text.get_mechanism(
    "vickrey", 
    epsilon=1, 
    embed="/path/to/custom_vectors.txt"
)

Get Privatizing!

With these methods, you can now explore word-level Metric Local Differential Privacy text privatization. In case of any questions or suggestions, feel free to reach out to the authors.

Citation

If you find this work useful, please consider citing the original LREC-COLING work, which implemented and evaluated these MLDP mechanisms:

@inproceedings{meisenbacher-etal-2024-comparative,
    title = "A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking the Privacy-Utility Trade-off",
    author = "Meisenbacher, Stephen  and
      Nandakumar, Nihildev  and
      Klymenko, Alexandra  and
      Matthes, Florian",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.16/",
    pages = "174--185"
}
