A tool for redacting PII from text using LLMs

PII Redaction

A Python package for redacting Personally Identifiable Information (PII) from text using Large Language Models.

Installation

pip install pii-redaction

Or install from source:

git clone https://github.com/yourusername/pii-redaction.git
cd pii-redaction
pip install -e .

Usage

Command Line Interface

The package provides a command-line tool pii-redact with the following commands:

Process a JSONL dataset

For handling PII in JSONL files that contain messages (like conversation history):

pii-redact process-jsonl input.jsonl output.jsonl

Options:

  • --device: Device to use for processing (e.g., cuda, cpu)
  • PII handling modes (mutually exclusive):
    • --tag: Wrap PII in XML-style tags and keep the content (default): <PII:type>content</PII:type>
    • --redact: Replace PII with an empty tag: <PII:type/>
    • --replace: Replace PII with fake data: <PII:type>fake_data</PII:type>
  • --locale: Locale for generating fake data (default: en_US, only used with --replace)

Process text files

For handling PII in plain text files (one document per line):

pii-redact process-text input.txt output.txt

Options:

  • --device: Device to use for processing (e.g., cuda, cpu)
  • PII handling modes (mutually exclusive):
    • --tag: Wrap PII in XML-style tags and keep the content (default): <PII:type>content</PII:type>
    • --redact: Replace PII with an empty tag: <PII:type/>
    • --replace: Replace PII with fake data: <PII:type>fake_data</PII:type>
  • --locale: Locale for generating fake data (default: en_US, only used with --replace)
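
Because process-text expects one document per line, documents that contain line breaks need to be flattened before they are written to the input file. The following is a minimal sketch of one way to do that; the helper names flatten_document and write_documents are hypothetical and not part of the package, and collapsing all whitespace to single spaces is an assumption that may not suit every dataset.

```python
from pathlib import Path


def flatten_document(document: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(document.split())


def write_documents(documents, path):
    """Write the documents to `path`, one flattened document per line."""
    lines = [flatten_document(doc) for doc in documents]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

The resulting file can then be passed as the input argument of pii-redact process-text.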

Examples

Tag PII in text documents (default mode):

pii-redact process-text emails.txt tagged_emails.txt

Redact PII completely:

pii-redact process-text emails.txt redacted_emails.txt --redact

Replace PII with fake data:

pii-redact process-text emails.txt anonymized_emails.txt --replace

Use a specific locale for fake data:

pii-redact process-text emails.txt anonymized_emails.txt --replace --locale=fr_FR

Process a JSONL dataset and redact PII:

pii-redact process-jsonl conversations.jsonl redacted_conversations.jsonl --redact
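
When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below only builds the argument list from the flags documented above; build_command is a hypothetical helper, not part of the package, and passing the device as a separate argument after --device is an assumption about how the CLI parses it.

```python
MODES = ("tag", "redact", "replace")
SUBCOMMANDS = ("process-text", "process-jsonl")


def build_command(subcommand, input_path, output_path,
                  mode="tag", locale=None, device=None):
    """Return an argv list for the pii-redact CLI (hypothetical helper)."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    # The three handling modes are mutually exclusive, so exactly one flag is emitted.
    cmd = ["pii-redact", subcommand, input_path, output_path, f"--{mode}"]
    # --locale is only used with --replace, so it is dropped for other modes.
    if mode == "replace" and locale:
        cmd.append(f"--locale={locale}")
    if device:
        cmd += ["--device", device]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True) once the package is installed.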

Python API

from pii_redaction import tag_pii_in_documents, clean_dataset, PIIHandlingMode

# Process text documents
documents = [
    "My name is John Doe and my email is john.doe@example.com",
    "Call me at 555-123-4567 and ask for my SSN: 123-45-6789"
]

# Tag PII (default mode)
tagged_documents = tag_pii_in_documents(documents, mode=PIIHandlingMode.TAG)

# Redact PII completely
redacted_documents = tag_pii_in_documents(documents, mode=PIIHandlingMode.REDACT)

# Replace PII with fake data
anonymized_documents = tag_pii_in_documents(
    documents, 
    mode=PIIHandlingMode.REPLACE,
    locale="en_US"
)

# Process a JSONL dataset
# Tag PII (default mode)
clean_dataset('input.jsonl', 'output.jsonl', mode=PIIHandlingMode.TAG)

# Redact PII in a JSONL dataset
clean_dataset('input.jsonl', 'redacted.jsonl', mode=PIIHandlingMode.REDACT)

# Replace PII with fake data in a JSONL dataset
clean_dataset(
    'input.jsonl', 
    'anonymized.jsonl', 
    mode=PIIHandlingMode.REPLACE,
    locale="en_US"
)
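
Tagged output can be post-processed with a regular expression to pull out the detected spans. This is a minimal sketch that assumes the output follows the <PII:type>content</PII:type> format exactly and that tags are not nested; extract_pii is a hypothetical helper, and the sample string is hand-written for illustration rather than actual model output.

```python
import re

# Matches <PII:type>content</PII:type>; the closing tag must repeat the same type.
PII_TAG = re.compile(
    r"<PII:(?P<type>\w+)>(?P<content>.*?)</PII:(?P=type)>", re.DOTALL
)


def extract_pii(tagged_text):
    """Return (type, content) pairs for every tagged span, in order of appearance."""
    return [(m.group("type"), m.group("content"))
            for m in PII_TAG.finditer(tagged_text)]


# Hand-written example of tagged text (not produced by the model).
sample = (
    "My name is <PII:person_name>John Doe</PII:person_name> and my email is "
    "<PII:email_address>john.doe@example.com</PII:email_address>"
)
spans = extract_pii(sample)
```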

Key Features

Multiple PII handling options:

  • Tag PII: Identify and keep PII with XML tags like <PII:email_address>john.doe@example.com</PII:email_address>
  • Redact PII: Replace PII with just an empty tag like <PII:email_address/>
  • Replace PII: Replace identified PII with realistic fake data like <PII:email_address>jane.smith@example.org</PII:email_address>

Customizable: Choose from different locales for generating culturally appropriate fake data.

Consistent replacement: When replacing PII with fake data, the same PII value is always replaced with the same fake value.
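
The idea behind consistent replacement can be illustrated with a small cache keyed on the PII type and the original value. This is a sketch of the general technique, not the package's implementation: ConsistentReplacer is a hypothetical class, and it emits numbered placeholders instead of realistic, locale-aware fake data.

```python
class ConsistentReplacer:
    """Map each (type, value) pair to one stable replacement (illustrative only)."""

    def __init__(self):
        self._mapping = {}   # (pii_type, value) -> replacement
        self._counters = {}  # pii_type -> number of distinct values seen so far

    def replace(self, pii_type, value):
        key = (pii_type, value)
        if key not in self._mapping:
            # First time this value is seen: mint a new placeholder for its type.
            count = self._counters.get(pii_type, 0) + 1
            self._counters[pii_type] = count
            self._mapping[key] = f"{pii_type}_{count}"
        # Repeated values return the cached replacement.
        return self._mapping[key]


replacer = ConsistentReplacer()
first = replacer.replace("person_name", "John Doe")
again = replacer.replace("person_name", "John Doe")
other = replacer.replace("person_name", "Jane Roe")
```

Here first and again are identical, while other receives a different placeholder, so references to the same person stay linked across a document.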

Supported PII Categories

The model can identify and tag the following PII categories:

  • age: a person's age
  • credit_card_info: a credit card number, expiration date, CCV, etc.
  • nationality: a country when used to reference place of birth, residence, or citizenship
  • date: a specific calendar date
  • date_of_birth: a specific calendar date representing birth
  • domain_name: a domain on the internet
  • email_address: an email ID
  • demographic_group: anything that identifies race or ethnicity
  • gender: a gender identifier
  • personal_id: any ID string like a national ID, subscriber number, etc.
  • other_id: any ID not associated with a person like an organization ID, database ID, etc.
  • banking_number: a number associated with a bank account
  • medical_condition: a diagnosis, treatment code or other information identifying a medical condition
  • organization_name: name of an organization
  • person_name: name of a person
  • phone_number: a telephone number
  • street_address: a physical address
  • password: a secure string used for authentication
  • secure_credential: any secure credential like an API key, private key, 2FA token
  • religious_affiliation: anything that identifies religious affiliation
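
The category list above can be used to sanity-check processed output. The following sketch counts tags per category and reports any category name that is not in the list; count_categories is a hypothetical helper, and it assumes the output uses only the <PII:type>...</PII:type> and <PII:type/> forms shown earlier.

```python
import re
from collections import Counter

SUPPORTED_CATEGORIES = frozenset({
    "age", "credit_card_info", "nationality", "date", "date_of_birth",
    "domain_name", "email_address", "demographic_group", "gender",
    "personal_id", "other_id", "banking_number", "medical_condition",
    "organization_name", "person_name", "phone_number", "street_address",
    "password", "secure_credential", "religious_affiliation",
})

# Matches opening tags (<PII:type>) and empty tags (<PII:type/>), not closing tags.
PII_OPEN_OR_EMPTY = re.compile(r"<PII:(\w+)/?>")


def count_categories(text):
    """Return (counts per category, sorted list of unrecognized categories)."""
    counts = Counter(PII_OPEN_OR_EMPTY.findall(text))
    unknown = sorted(set(counts) - SUPPORTED_CATEGORIES)
    return counts, unknown
```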

License

MIT
