Word-level Metric Local Differential Privacy Mechanisms

MLDP

This repository contains the official implementation for the paper "A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking the Privacy-Utility Trade-off" (LREC-COLING 2024). It provides production-ready, highly optimized implementations of six word-level Metric Local Differential Privacy (MLDP) mechanisms.

Included Mechanisms

The package implements the following MLDP text privatization strategies: MultivariateCalibrated, TruncatedGumbel, VickreyMechanism, TEM, Mahalanobis, and SynTF.

Note that the code for SanText is not included, as it is already publicly available.

Installation

Getting started is as simple as installing the package:

pip install mldp-text

Basic Usage

The package exposes a unified factory function called get_mechanism() to seamlessly switch between different MLDP algorithms using string IDs.

Embedding Perturbation Mechanisms

For any of these mechanisms, initialization is straightforward. By default, mechanisms look for an optimized faiss index to accelerate nearest-neighbor lookups:

import mldp_text

# Initialize your chosen strategy
mechanism = mldp_text.get_mechanism("multivariate_calibrated", epsilon=1, use_faiss=True)

# Privatize individual words
perturbed_word = mechanism.replace_word("pizza")
print(perturbed_word)
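
The example above privatizes one word at a time. A minimal sketch of applying the same call to a whole sentence follows. The helper privatize_sentence and the whitespace tokenization are assumptions for illustration and are not part of the package; a stand-in object replaces the real mechanism so the snippet runs without the embedding files.

```python
from typing import Protocol


class WordMechanism(Protocol):
    """Anything exposing replace_word(word) -> str, as the mechanisms above do."""

    def replace_word(self, word: str) -> str: ...


def privatize_sentence(mechanism: WordMechanism, text: str) -> str:
    """Replace every whitespace-separated token independently (hypothetical helper)."""
    return " ".join(mechanism.replace_word(token) for token in text.split())


class UppercaseStub:
    """Stand-in used only to make this sketch runnable; it provides no privacy."""

    def replace_word(self, word: str) -> str:
        return word.upper()


result = privatize_sentence(UppercaseStub(), "i ordered a pizza")
print(result)
```

With the real package, the object returned by get_mechanism() would take the place of the stub. How out-of-vocabulary tokens are handled depends on the package and is not shown here.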

SynTF Mechanism

The SynTF mechanism is frequency-driven and requires a document corpus to pre-calculate and cache its reference TF-IDF matrix:

import mldp_text

corpus = ["your list of reference dataset documents here", "another document sample"]

# Initialize SynTF with document data
mechanism = mldp_text.get_mechanism("syntf", epsilon=1.0, data=corpus)

perturbed_word = mechanism.replace_word("pizza")
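
For intuition about what a reference TF-IDF matrix contains, here is a minimal pure-Python sketch. It uses one common TF-IDF variant (raw term count times log(N / df)) and whitespace tokenization. Both are assumptions, and the package's own weighting, tokenization, and caching may differ.

```python
import math
from collections import Counter


def tfidf_matrix(corpus):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in document i."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append(
            [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        )
    return vocabulary, rows


corpus = ["pizza with cheese", "pasta with cheese", "pizza pizza"]
vocabulary, rows = tfidf_matrix(corpus)
print(vocabulary)
```

Terms that appear in every document receive a weight of zero under this variant, which is why a larger and more varied reference corpus gives more informative weights.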

Supported Mechanisms

When using get_mechanism(name), you can pass any of the following string variants for the name parameter (case-insensitive, hyphens/underscores are normalized automatically):

MLDP Mechanism            Allowed String IDs (name=)
MultivariateCalibrated    multivariate_calibrated
TruncatedGumbel           truncated_gumbel
VickreyMechanism          vickrey
TEM                       tem
Mahalanobis               mahalanobis
SynTF                     syntf
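
The normalization described above can be pictured with a small sketch. The function normalize_name and the exact rule (lowercase, then treat hyphens as underscores) are assumptions for illustration; the package's own normalization may differ in detail.

```python
KNOWN_IDS = {
    "multivariate_calibrated",
    "truncated_gumbel",
    "vickrey",
    "tem",
    "mahalanobis",
    "syntf",
}


def normalize_name(name: str) -> str:
    """Lowercase the ID and map hyphens to underscores (assumed rule)."""
    normalized = name.strip().lower().replace("-", "_")
    if normalized not in KNOWN_IDS:
        raise ValueError(f"unknown mechanism ID: {name!r}")
    return normalized


print(normalize_name("Truncated-Gumbel"))
```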

Embedding Models

By default, the package looks for the glove.840B.300d embedding model pre-filtered to a fixed companion vocabulary (data/vocab.txt). Both assets are derived from the official Stanford GloVe project.

Loading Custom Embeddings

You can pass your own word embedding model to any mechanism. Before loading, the package inspects the file header to confirm that it follows the gensim (word2vec text) format, whose first line is [VOCAB SIZE] [EMBEDDING DIMENSION] (e.g., 400000 300).
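
A minimal sketch of such a header check, assuming a plain-text file whose first line holds two positive integers. The function read_header is hypothetical and is not the package's own validator.

```python
import os
import tempfile


def read_header(path):
    """Return (vocab_size, dimension) from the first line, or raise ValueError."""
    with open(path, encoding="utf-8") as handle:
        fields = handle.readline().split()
    if len(fields) != 2 or not all(field.isdigit() for field in fields):
        raise ValueError(f"expected '<vocab size> <dimension>' header, got: {fields}")
    vocab_size, dimension = int(fields[0]), int(fields[1])
    if vocab_size == 0 or dimension == 0:
        raise ValueError("vocabulary size and dimension must be positive")
    return vocab_size, dimension


# Write a tiny two-word, three-dimensional file to try the check on.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("2 3\npizza 0.1 0.2 0.3\npasta 0.4 0.5 0.6\n")
header = read_header(tmp.name)
os.remove(tmp.name)
print(header)
```

Running a check like this on your own file before passing it to a mechanism can surface a missing header line early.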

You can feed custom paths into the package in two ways:

Option 1: Session-Wide Override

Change the underlying fallback path before instantiating any mechanisms:

import mldp_text

mldp_text.utils.EMBED = "/path/to/your/custom_gensim_embeddings.txt"

engine = mldp_text.get_mechanism("mahalanobis", epsilon=1.2)

Option 2: Mechanism Parameter

Pass the file path directly to the instantiation call:

import mldp_text

engine = mldp_text.get_mechanism(
    "vickrey", 
    epsilon=1, 
    embed="/path/to/custom_vectors.txt"
)

Get Privatizing!

With these methods, you can now explore word-level Metric Local Differential Privacy text privatization. In case of any questions or suggestions, feel free to reach out to the authors.

Citation

If you find this work useful, please consider citing the original LREC-COLING work, which implemented and evaluated these MLDP mechanisms:

@inproceedings{meisenbacher-etal-2024-comparative,
    title = "A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking the Privacy-Utility Trade-off",
    author = "Meisenbacher, Stephen  and
      Nandakumar, Nihildev  and
      Klymenko, Alexandra  and
      Matthes, Florian",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.16/",
    pages = "174--185"
}
