A tool for redacting PII information from text using LLMs
Project description
PII Redaction
A Python package for redacting Personally Identifiable Information (PII) from text using Large Language Models.
Installation
pip install pii-redaction
Or install from source:
git clone https://github.com/yourusername/pii-redaction.git
cd pii-redaction
pip install -e .
Usage
Command Line Interface
The package provides a command-line tool pii-redact with the following commands:
Process a JSONL dataset
For handling PII in JSONL files that contain messages (like conversation history):
pii-redact process-jsonl input.jsonl output.jsonl
Options:
--device: Device to use for processing (e.g., cuda, cpu)- PII handling modes (mutually exclusive):
--tag: Keep PII content between XML tags (default)<PII:type>content</PII:type>--redact: Replace PII with just an empty tag<PII:type/>--replace: Replace PII with fake datafake_data
--locale: Locale for generating fake data (default: en_US, only used with --replace)
Process text files
For handling PII in plain text files (one document per line):
pii-redact process-text input.txt output.txt
Options:
--device: Device to use for processing (e.g., cuda, cpu)- PII handling modes (mutually exclusive):
--tag: Keep PII content between XML tags (default)<PII:type>content</PII:type>--redact: Replace PII with just an empty tag<PII:type/>--replace: Replace PII with fake datafake_data
--locale: Locale for generating fake data (default: en_US, only used with --replace)
Examples
Tag PII in text documents (default mode):
pii-redact process-text emails.txt tagged_emails.txt
Redact PII completely:
pii-redact process-text emails.txt redacted_emails.txt --redact
Replace PII with fake data:
pii-redact process-text emails.txt anonymized_emails.txt --replace
Use a specific locale for fake data:
pii-redact process-text emails.txt anonymized_emails.txt --replace --locale=fr_FR
Process a JSONL dataset and redact PII:
pii-redact process-jsonl conversations.jsonl redacted_conversations.jsonl --redact
Python API
from pii_redaction import tag_pii_in_documents, clean_dataset, PIIHandlingMode
# Process text documents
documents = [
"My name is John Doe and my email is john.doe@example.com",
"Call me at 555-123-4567 and ask for my SSN: 123-45-6789"
]
# Tag PII (default mode)
tagged_documents = tag_pii_in_documents(documents, mode=PIIHandlingMode.TAG)
# Redact PII completely
redacted_documents = tag_pii_in_documents(documents, mode=PIIHandlingMode.REDACT)
# Replace PII with fake data
anonymized_documents = tag_pii_in_documents(
documents,
mode=PIIHandlingMode.REPLACE,
locale="en_US"
)
# Process a JSONL dataset
# Tag PII (default mode)
clean_dataset('input.jsonl', 'output.jsonl', mode=PIIHandlingMode.TAG)
# Redact PII in a JSONL dataset
clean_dataset('input.jsonl', 'redacted.jsonl', mode=PIIHandlingMode.REDACT)
# Replace PII with fake data in a JSONL dataset
clean_dataset(
'input.jsonl',
'anonymized.jsonl',
mode=PIIHandlingMode.REPLACE,
locale="en_US"
)
Key Features
Multiple PII handling options:
- Tag PII: Identify and keep PII with XML tags like
<PII:email_address>john.doe@example.com</PII:email_address> - Redact PII: Replace PII with just an empty tag like
<PII:email_address/> - Replace PII: Replace identified PII with realistic fake data like
<PII:email_address>jane.smith@example.org</PII:email_address>
Customizable: Choose from different locales for generating culturally appropriate fake data Consistent replacement: When replacing PII with fake data, maintains consistency (same PII values are replaced with the same fake values)
Supported PII Categories
The model can identify and tag the following PII categories:
- age: a person's age
- credit_card_info: a credit card number, expiration date, CCV, etc.
- nationality: a country when used to reference place of birth, residence, or citizenship
- date: a specific calendar date
- date_of_birth: a specific calendar date representing birth
- domain_name: a domain on the internet
- email_address: an email ID
- demographic_group: Anything that identifies race or ethnicity
- gender: a gender identifier
- personal_id: Any ID string like a national ID, subscriber number, etc.
- other_id: Any ID not associated with a person like an organization ID, database ID, etc.
- banking_number: a number associated with a bank account
- medical_condition: A diagnosis, treatment code or other information identifying a medical condition
- organization_name: name of an organization
- person_name: name of a person
- phone_number: a telephone number
- street_address: a physical address
- password: a secure string used for authentication
- secure_credential: any secure credential like an API key, private key, 2FA token
- religious_affiliation: anything that identifies religious affiliation
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pii_redact-0.1.0.tar.gz.
File metadata
- Download URL: pii_redact-0.1.0.tar.gz
- Upload date:
- Size: 10.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2c676ec1679af4cd13351e9c8eb295fe65eefe10e21a03d99602d4cb67cde1af
|
|
| MD5 |
a8defd60c5bb67e88eb50f5596f869ef
|
|
| BLAKE2b-256 |
007fe020eb52f4f362ee8a802df3f762dd7527cf8eb6dbd606e2d2c2359cb3f2
|
File details
Details for the file pii_redact-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pii_redact-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
823a419839b9d6510e039bf6ee60409390895c7ec48dd02721ca194db94bd749
|
|
| MD5 |
422ea9dd0a91382079dba0af34f4fb83
|
|
| BLAKE2b-256 |
32ba4043a3bfab5df5f1d733557954f941e0d2055a3fefb6a5500de9dd21ee91
|