Skip to main content

A tool for redacting PII information from text using LLMs

Project description

PII Redaction

A Python package for redacting Personally Identifiable Information (PII) from text using Large Language Models.

Installation

pip install pii-redaction

Or install from source:

git clone https://github.com/yourusername/pii-redaction.git
cd pii-redaction
pip install -e .

Usage

Command Line Interface

The package provides a command-line tool pii-redact with the following commands:

Process a JSONL dataset

For handling PII in JSONL files that contain messages (like conversation history):

pii-redact process-jsonl input.jsonl output.jsonl

Options:

  • --device: Device to use for processing (e.g., cuda, cpu)
  • PII handling modes (mutually exclusive):
    • --tag: Keep PII content between XML tags (default) <PII:type>content</PII:type>
    • --redact: Replace PII with just an empty tag <PII:type/>
    • --replace: Replace PII with fake data fake_data
  • --locale: Locale for generating fake data (default: en_US, only used with --replace)

Process text files

For handling PII in plain text files (one document per line):

pii-redact process-text input.txt output.txt

Options:

  • --device: Device to use for processing (e.g., cuda, cpu)
  • PII handling modes (mutually exclusive):
    • --tag: Keep PII content between XML tags (default) <PII:type>content</PII:type>
    • --redact: Replace PII with just an empty tag <PII:type/>
    • --replace: Replace PII with fake data fake_data
  • --locale: Locale for generating fake data (default: en_US, only used with --replace)

Examples

Tag PII in text documents (default mode):

pii-redact process-text emails.txt tagged_emails.txt

Redact PII completely:

pii-redact process-text emails.txt redacted_emails.txt --redact

Replace PII with fake data:

pii-redact process-text emails.txt anonymized_emails.txt --replace

Use a specific locale for fake data:

pii-redact process-text emails.txt anonymized_emails.txt --replace --locale=fr_FR

Process a JSONL dataset and redact PII:

pii-redact process-jsonl conversations.jsonl redacted_conversations.jsonl --redact

Python API

from pii_redaction import tag_pii_in_documents, clean_dataset, PIIHandlingMode

# Process text documents
documents = [
    "My name is John Doe and my email is john.doe@example.com",
    "Call me at 555-123-4567 and ask for my SSN: 123-45-6789"
]

# Tag PII (default mode)
tagged_documents = tag_pii_in_documents(documents, mode=PIIHandlingMode.TAG)

# Redact PII completely
redacted_documents = tag_pii_in_documents(documents, mode=PIIHandlingMode.REDACT)

# Replace PII with fake data
anonymized_documents = tag_pii_in_documents(
    documents, 
    mode=PIIHandlingMode.REPLACE,
    locale="en_US"
)

# Process a JSONL dataset
# Tag PII (default mode)
clean_dataset('input.jsonl', 'output.jsonl', mode=PIIHandlingMode.TAG)

# Redact PII in a JSONL dataset
clean_dataset('input.jsonl', 'redacted.jsonl', mode=PIIHandlingMode.REDACT)

# Replace PII with fake data in a JSONL dataset
clean_dataset(
    'input.jsonl', 
    'anonymized.jsonl', 
    mode=PIIHandlingMode.REPLACE,
    locale="en_US"
)

Key Features

Multiple PII handling options:

  • Tag PII: Identify and keep PII with XML tags like <PII:email_address>john.doe@example.com</PII:email_address>
  • Redact PII: Replace PII with just an empty tag like <PII:email_address/>
  • Replace PII: Replace identified PII with realistic fake data like <PII:email_address>jane.smith@example.org</PII:email_address>

Customizable: Choose from different locales for generating culturally appropriate fake data Consistent replacement: When replacing PII with fake data, maintains consistency (same PII values are replaced with the same fake values)

Supported PII Categories

The model can identify and tag the following PII categories:

  • age: a person's age
  • credit_card_info: a credit card number, expiration date, CCV, etc.
  • nationality: a country when used to reference place of birth, residence, or citizenship
  • date: a specific calendar date
  • date_of_birth: a specific calendar date representing birth
  • domain_name: a domain on the internet
  • email_address: an email ID
  • demographic_group: Anything that identifies race or ethnicity
  • gender: a gender identifier
  • personal_id: Any ID string like a national ID, subscriber number, etc.
  • other_id: Any ID not associated with a person like an organization ID, database ID, etc.
  • banking_number: a number associated with a bank account
  • medical_condition: A diagnosis, treatment code or other information identifying a medical condition
  • organization_name: name of an organization
  • person_name: name of a person
  • phone_number: a telephone number
  • street_address: a physical address
  • password: a secure string used for authentication
  • secure_credential: any secure credential like an API key, private key, 2FA token
  • religious_affiliation: anything that identifies religious affiliation

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pii_redact-0.1.0.tar.gz (10.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pii_redact-0.1.0-py3-none-any.whl (11.7 kB view details)

Uploaded Python 3

File details

Details for the file pii_redact-0.1.0.tar.gz.

File metadata

  • Download URL: pii_redact-0.1.0.tar.gz
  • Upload date:
  • Size: 10.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.8

File hashes

Hashes for pii_redact-0.1.0.tar.gz
Algorithm Hash digest
SHA256 2c676ec1679af4cd13351e9c8eb295fe65eefe10e21a03d99602d4cb67cde1af
MD5 a8defd60c5bb67e88eb50f5596f869ef
BLAKE2b-256 007fe020eb52f4f362ee8a802df3f762dd7527cf8eb6dbd606e2d2c2359cb3f2

See more details on using hashes here.

File details

Details for the file pii_redact-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pii_redact-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 11.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.8

File hashes

Hashes for pii_redact-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 823a419839b9d6510e039bf6ee60409390895c7ec48dd02721ca194db94bd749
MD5 422ea9dd0a91382079dba0af34f4fb83
BLAKE2b-256 32ba4043a3bfab5df5f1d733557954f941e0d2055a3fefb6a5500de9dd21ee91

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page