Sanitize text containing PII attributes
preempt
This is a modular version of Prϵϵmpt, meant to be used as part of other projects.
For the experiments and results found in Prϵϵmpt: Sanitizing Sensitive Prompts for LLMs, please refer to this repo.
Setup
- Clone this repo and navigate to the root directory (preempt).
- Install uv following the instructions here.
- Create a virtual environment with Python 3.11, activate it, and install the dependencies:
uv venv --python 3.11
. ./.venv/bin/activate
uv sync
Usage
Additional usage examples can be found in demo.ipynb.
We will add support for generalized NER and sanitization in the near future.
Complete Usage Example
This complete example sanitizes names and currency values, then restores them. Make sure you have either Universal NER or Llama-3 8B Instruct available.
- Import all utilities:
# Import utils
from preempt.utils import *
- Initialize a NER and Sanitizer object:
# Load NER object
# ner_model = NER("/path/to/uniner-7b-pii-v3", device="cuda:1")
ner_model = NER("/path/to/Meta-Llama-3-8B-Instruct/", device="cuda:1")
# Load Sanitizer object
sanitizer_name = Sanitizer(ner_model, key = "EF4359D8D580AA4F7F036D6F04FC6A94", tweak = "D8E7920AFA330A73")
sanitizer_money = Sanitizer(ner_model, key = "FF4359D8D580AA4F7F036D6F04FC6A94", tweak = "E8E7920AFA330A73")
# Sentences
sentences = ["Ben Parker and John Doe went to the bank and withdrew $200.", "Adam won $20 in the lottery."]
- Sanitize names in sentences:
# Sanitizing names
sanitized_sentences, _ = sanitizer_name.encrypt(sentences, entity='Name', epsilon=1)
print("Sanitized sentences:")
print(sanitized_sentences)
"""
Prints:
Sanitized sentences:
['Jay Francois and Lamine Franklin went to the bank and withdrew $200.', 'Elie Vinod won $20 in the lottery.']
"""
- Sanitize currency values in sanitized_sentences:
# Sanitizing currency values
sanitized_sentences, _ = sanitizer_money.encrypt(sanitized_sentences, entity='Money', epsilon=1)
print("Sanitized sentences:")
print(sanitized_sentences)
"""
Prints:
Sanitized sentences:
['Jay Francois and Lamine Franklin went to the bank and withdrew $769451698.', 'Elie Vinod won $37083668 in the lottery.']
"""
- Desanitize encrypted names in sanitized_sentences:
# Desanitizing names
desanitized_sentences = sanitizer_name.decrypt(sanitized_sentences, entity='Name')
print("Desanitized sentences:")
print(desanitized_sentences)
"""
Prints:
Desanitized sentences:
['Ben Parker and John Doe went to the bank and withdrew $769451698.', 'Adam won $37083668 in the lottery.']
"""
- Desanitize encrypted currency values in desanitized_sentences:
# Desanitizing currency values
desanitized_sentences = sanitizer_money.decrypt(desanitized_sentences, entity='Money')
print("Desanitized sentences:")
print(desanitized_sentences)
"""
Prints:
Desanitized sentences:
['Ben Parker and John Doe went to the bank and withdrew $200.', 'Adam won $20 in the lottery.']
"""
Extraction
We currently support Universal NER and Llama-3 8B Instruct for NER. We will add support for including your own NER models in the near future.
Initialize an NER object by passing the path to one of the supported NER models mentioned above:
ner_model = NER("/path/to/Meta-Llama-3-8B-Instruct/", device="cuda:0")
Extract PII values found in a list of target strings using ner_model.extract():
sentences = ["Ben Parker and John Doe went to the bank.", "Who was late today? Adam."]
# entity_type is one of 'Name', 'Money' or 'Age'
extracted = ner_model.extract(sentences, entity_type='Name')
Sanitization
We currently support sanitization only for names, currency values and age, using either format-preserving encryption (FPE) or metric local differential privacy (m-LDP).
Initialize a Sanitizer object by passing the previously initialized ner_model along with a key and a tweak (both required by the FF3 cipher used for FPE):
sanitizer = Sanitizer(ner_model, key = "EF4359D8D580AA4F7F036D6F04FC6A94", tweak = "D8E7920AFA330A73")
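The key and tweak in the examples above are hex strings of 32 and 16 characters. If you need fresh values rather than the sample ones, a minimal sketch using Python's standard secrets module could look like the following. The helper name make_ff3_params is hypothetical and not part of preempt, and the default lengths are an assumption taken from the sample values shown above.

```python
import secrets


def make_ff3_params(key_bytes: int = 16, tweak_bytes: int = 8) -> tuple[str, str]:
    """Return (key, tweak) as uppercase hex strings.

    Hypothetical helper, not part of preempt. The defaults mirror the
    lengths of the sample values above: 32 hex characters for the key
    and 16 hex characters for the tweak.
    """
    key = secrets.token_hex(key_bytes).upper()
    tweak = secrets.token_hex(tweak_bytes).upper()
    return key, tweak


key, tweak = make_ff3_params()
print(len(key), len(tweak))
```

Keep the generated key and tweak somewhere safe: FPE decryption needs the same pair that was used for encryption.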
Sanitize a list of target strings using sanitizer.encrypt():
sanitized_sentences, _ = sanitizer.encrypt(sentences, entity='Name', epsilon=1, use_fpe=True, use_mdp=False)
PII values found during NER are stored under sanitizer.new_entities as a nested list.
The mappings between plain text and cipher text PII values are stored under sanitizer.entity_mapping. FPE will typically extract PII values from the sanitized sentences before decryption.
Sanitized sentences can be desanitized using sanitizer.decrypt():
desanitized_sentences = sanitizer.decrypt(sanitized_sentences, entity='Name')
Sanitizing multiple PII attributes
If you want to sanitize multiple sensitive attributes, create a sanitizer for each category separately.
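One way to organize this is to keep one sanitizer per entity type in a dictionary and apply them in sequence, as in the complete usage example. The sketch below assumes only the encrypt() and decrypt() call shapes shown earlier. The helpers sanitize_all and desanitize_all are hypothetical, and StubSanitizer is a stand-in that swaps fixed strings so the sketch runs without a model; it is not preempt's Sanitizer.

```python
class StubSanitizer:
    """Stand-in with the same encrypt/decrypt call shape as the examples.

    It replaces fixed strings from a lookup table so this sketch runs
    without loading a model. It is NOT preempt's Sanitizer.
    """

    def __init__(self, mapping):
        self.mapping = mapping  # plain text -> substitute text

    @staticmethod
    def _swap(text, table):
        for old, new in table.items():
            text = text.replace(old, new)
        return text

    def encrypt(self, sentences, entity, epsilon=1):
        # Second return value is a placeholder, mirroring the `_` in the examples.
        return [self._swap(s, self.mapping) for s in sentences], None

    def decrypt(self, sentences, entity):
        inverse = {v: k for k, v in self.mapping.items()}
        return [self._swap(s, inverse) for s in sentences]


def sanitize_all(sentences, sanitizers, epsilon=1):
    """Apply each entity's sanitizer in turn (hypothetical helper)."""
    for entity, sanitizer in sanitizers.items():
        sentences, _ = sanitizer.encrypt(sentences, entity=entity, epsilon=epsilon)
    return sentences


def desanitize_all(sentences, sanitizers):
    """Undo each entity's sanitization in turn (hypothetical helper)."""
    for entity, sanitizer in sanitizers.items():
        sentences = sanitizer.decrypt(sentences, entity=entity)
    return sentences


sanitizers = {
    "Name": StubSanitizer({"Adam": "Elie Vinod"}),
    "Money": StubSanitizer({"$20": "$37083668"}),
}
hidden = sanitize_all(["Adam won $20 in the lottery."], sanitizers)
print(hidden)
print(desanitize_all(hidden, sanitizers))
```

With real models, the dictionary values would be the per-entity Sanitizer objects, each created with its own key and tweak.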
For more examples, check out demo.ipynb.
Usage tips
NER typically works better when the inputs are smaller. Consider breaking a large chunk of text into smaller sentences when using the sanitizer.
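A minimal sketch of such a split, using only the standard library, is shown below. The helper split_into_sentences is hypothetical and not part of preempt, and the rule it uses (break after '.', '!' or '?' followed by whitespace) is deliberately naive: abbreviations such as "Dr." will also trigger a split.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Split text at whitespace that follows '.', '!' or '?'.

    Hypothetical helper, not part of preempt. The punctuation stays
    attached to the sentence it ends.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


text = "Ben Parker and John Doe went to the bank. Who was late today? Adam."
print(split_into_sentences(text))
# ['Ben Parker and John Doe went to the bank.', 'Who was late today?', 'Adam.']
```

The resulting list can be passed to sanitizer.encrypt() in place of a single long string.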