A hybrid PII detection, masking, and unmasking library using spaCy and regex.

pii-masker-gpt

A production-ready Python package for detecting, masking, and unmasking PII in text using a hybrid spaCy + regex approach.

Features

  • Detects PERSON, ADDRESS, EMAIL, PHONE
  • Uses spaCy en_core_web_sm for PERSON detection and location heuristics
  • Uses regex rules for EMAIL, PHONE, ZIP, and rich ADDRESS extraction
  • Produces consistent placeholder mapping across repeated values
  • Supports reversing masked text back to original content
  • Includes a convenient CLI for local usage
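
As a rough illustration of the regex half of the hybrid approach, the sketch below finds EMAIL and PHONE spans with Python's re module. The patterns and the find_spans helper are assumptions made for this example; the package's actual rules are not shown in this README and may differ.

```python
import re

# Illustrative patterns only (assumed, not taken from the package source).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style numbers such as (415) 555-4321, 415-555-4321, or 415.555.4321.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")


def find_spans(text):
    """Return (label, start, end, value) tuples sorted by start position."""
    spans = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("PHONE", PHONE_RE)):
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda span: span[1])


sample = "His email is john.doe@example.com and his phone is (415) 555-4321."
for label, start, end, value in find_spans(sample):
    print(label, value)
```

Returning character offsets along with the label makes it possible to merge these spans with entity spans from spaCy before any replacement happens.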

Installation

  1. Create or activate your Python environment.
  2. Install the package locally:
python -m pip install -e .
  3. Install the spaCy model:
python -m spacy download en_core_web_sm

Quick Start

from pii_masker import mask, unmask

text = (
    "John Doe lives at 123 Main St, Boston, MA 02118. "
    "His email is john.doe@example.com and his phone is (415) 555-4321."
)
result = mask(text)
print(result["masked_text"])
print(result["mapping"])

unmasked = unmask(result["masked_text"], result["mapping"])
print(unmasked)

Expected result

  • masked_text will contain placeholders like [PERSON1], [ADDRESS1], [EMAIL1], [PHONE1]
  • mapping preserves the original values for unmasking

LLM Prompt Privacy Use Case

This package is designed for privacy-conscious applications that need to send user-generated prompts to an LLM without exposing sensitive personal data to third-party services.

Typical workflow:

  1. Detect and mask PII in the user prompt before sending it to the LLM.
  2. Send only masked text to the LLM to avoid exposing sensitive personal data.
  3. Unmask the LLM response to restore the original context locally.
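
The three steps can be wrapped in one helper. This is a minimal sketch: private_completion is a hypothetical name, and the masking, unmasking, and LLM functions are passed in as arguments so the example runs without the package or a real model. The fake_* functions are stand-ins written for this example only; in real use you would pass the package's mask and unmask plus your own LLM client call.

```python
def private_completion(prompt, mask_fn, unmask_fn, llm_fn):
    """Mask the prompt, call the model on masked text, then unmask the reply locally."""
    result = mask_fn(prompt)                     # step 1: detect and mask
    reply = llm_fn(result["masked_text"])        # step 2: only masked text is sent
    return unmask_fn(reply, result["mapping"])   # step 3: restore locally


def fake_mask(text):
    # Stand-in for pii_masker.mask that handles one hard-coded email.
    return {
        "masked_text": text.replace("john.doe@example.com", "[EMAIL1]"),
        "mapping": {"EMAIL": {"john.doe@example.com": "EMAIL1"}},
    }


def fake_unmask(masked_text, mapping):
    # Stand-in for pii_masker.unmask.
    for values in mapping.values():
        for original, placeholder in values.items():
            masked_text = masked_text.replace(f"[{placeholder}]", original)
    return masked_text


seen_by_llm = []  # records what the stand-in model actually received


def fake_llm(masked_prompt):
    # Stand-in for a real LLM client call.
    seen_by_llm.append(masked_prompt)
    return f"Reply to: {masked_prompt}"


answer = private_completion(
    "Write to john.doe@example.com", fake_mask, fake_unmask, fake_llm
)
print(answer)
print(seen_by_llm)
```

Keeping the mapping inside the helper means it never has to be serialized or sent alongside the prompt.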

Complete Advanced Example Workflow (Multiple PII of Same Type)

Step 1: Original User Prompt (with Multiple PII Types)

Please help me coordinate between my two offices. I work at 123 Main Street, Boston, MA 02118 and also at 456 Oak Avenue, New York, NY 10001.
For the Boston office, contact me at john.doe@example.com or (415) 555-4321.
For the New York office, use jane.smith@example.com or (212) 555-6789.
I need to discuss relocation of our primary office from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001.
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is john.doe@example.com if primary line is unavailable.
Ensure all responses reference the correct office before implementation.

Step 2: Mask the Prompt

from pii_masker import mask, unmask

original_prompt = """Please help me coordinate between my two offices. I work at 123 Main Street, Boston, MA 02118 and also at 456 Oak Avenue, New York, NY 10001.
For the Boston office, contact me at john.doe@example.com or (415) 555-4321.
For the New York office, use jane.smith@example.com or (212) 555-6789.
I need to discuss relocation of our primary office from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001.
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is john.doe@example.com if primary line is unavailable.
Ensure all responses reference the correct office before implementation."""

masked_result = mask(original_prompt)
masked_prompt = masked_result["masked_text"]
mapping = masked_result["mapping"]

Step 3: Masked Prompt Sent to LLM (Multiple PII Replaced Consistently)

Please help me coordinate between my two offices. I work at [ADDRESS1] and also at [ADDRESS2].
For the Boston office, contact me at [EMAIL1] or [PHONE1].
For the New York office, use [EMAIL2] or [PHONE2].
I need to discuss relocation of our primary office from [ADDRESS1] to [ADDRESS2].
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is [EMAIL1] if primary line is unavailable.
Ensure all responses reference the correct office before implementation.

Notice how consistent placeholders are used:

  • 123 Main Street, Boston, MA 02118 → [ADDRESS1] (appears 2x)
  • 456 Oak Avenue, New York, NY 10001 → [ADDRESS2] (appears 2x)
  • john.doe@example.com → [EMAIL1] (appears 2x)
  • jane.smith@example.com → [EMAIL2]
  • (415) 555-4321 → [PHONE1]
  • (212) 555-6789 → [PHONE2]
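
One way to get this kind of consistency is to number values in first-seen order and reuse the number on every repeat. The sketch below does that for emails only. The mask_emails helper and its regex are assumptions for illustration, and the mapping shape simply mirrors the sample CLI output in this README.

```python
import re

# Illustrative email pattern (assumed, not taken from the package source).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text):
    """Replace each distinct email with a numbered placeholder, reusing it on repeats."""
    seen = {}  # original value -> placeholder name, in first-seen order

    def replace(match):
        value = match.group()
        if value not in seen:
            seen[value] = f"EMAIL{len(seen) + 1}"
        return f"[{seen[value]}]"

    return EMAIL_RE.sub(replace, text), {"EMAIL": seen}


masked, mapping = mask_emails(
    "Use john.doe@example.com or jane.smith@example.com; "
    "backup is john.doe@example.com."
)
print(masked)
print(mapping)
```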

Step 4: LLM Response (with masked placeholders)

Certainly! Here's a coordinated transition plan:

For [ADDRESS1] (Boston):
- Notify all staff at [ADDRESS1] of the transition
- Primary contact: [EMAIL1] or [PHONE1]
- Backup contact: [EMAIL1]

For [ADDRESS2] (New York):
- Establish new operations at [ADDRESS2]
- Primary contact: [EMAIL2] or [PHONE2]
- Ensure infrastructure at [ADDRESS2] is ready

Transition Timeline:
- Week 1: Communicate with staff at [ADDRESS1] and [ADDRESS2]
- Week 2-3: Begin relocation from [ADDRESS1] to [ADDRESS2]
- Week 4: Finalize all operations at [ADDRESS2]

All confirmations should be sent to [EMAIL1] and [EMAIL2].

Step 5: Unmask the Response Locally

# llm_response holds the text returned by the LLM in Step 4.
unmasked_response = unmask(llm_response, mapping)
print(unmasked_response)
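
For intuition, unmasking can be done in a single regex pass over the response. The sketch below is not the package's implementation. It assumes the mapping has the nested shape shown in the sample CLI output (label, then original value, then placeholder name), and unmask_sketch is a hypothetical name.

```python
import re

# Matches bracketed placeholders such as [EMAIL1] or [ADDRESS2].
PLACEHOLDER_RE = re.compile(r"\[([A-Z]+\d+)\]")


def unmask_sketch(masked_text, mapping):
    """Swap each known [PLACEHOLDER] back to its original value."""
    # Invert {"EMAIL": {"a@b.com": "EMAIL1"}} into {"EMAIL1": "a@b.com"}.
    originals = {
        placeholder: original
        for values in mapping.values()
        for original, placeholder in values.items()
    }
    # Unknown placeholders are left untouched rather than raising an error.
    return PLACEHOLDER_RE.sub(
        lambda m: originals.get(m.group(1), m.group(0)), masked_text
    )


demo_mapping = {
    "EMAIL": {"john.doe@example.com": "EMAIL1"},
    "PHONE": {"(415) 555-4321": "PHONE1"},
}
print(unmask_sketch("Primary contact: [EMAIL1] or [PHONE1]", demo_mapping))
```

A single pass avoids re-scanning restored text, so an original value that happens to contain bracketed text is not substituted a second time. If the model rewrites a placeholder (for example by dropping the brackets), this sketch would leave it as is.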

Step 6: Final Unmasked Response (All Original PII Restored)

Certainly! Here's a coordinated transition plan:

For 123 Main Street, Boston, MA 02118 (Boston):
- Notify all staff at 123 Main Street, Boston, MA 02118 of the transition
- Primary contact: john.doe@example.com or (415) 555-4321
- Backup contact: john.doe@example.com

For 456 Oak Avenue, New York, NY 10001 (New York):
- Establish new operations at 456 Oak Avenue, New York, NY 10001
- Primary contact: jane.smith@example.com or (212) 555-6789
- Ensure infrastructure at 456 Oak Avenue, New York, NY 10001 is ready

Transition Timeline:
- Week 1: Communicate with staff at 123 Main Street, Boston, MA 02118 and 456 Oak Avenue, New York, NY 10001
- Week 2-3: Begin relocation from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001
- Week 4: Finalize all operations at 456 Oak Avenue, New York, NY 10001

All confirmations should be sent to john.doe@example.com and jane.smith@example.com.

Key Benefits:

  • PII that the detector catches never leaves your system
  • LLM only processes masked tokens
  • Responses are unmasked locally after receiving them
  • Complete control over sensitive data

API

  • mask(text: str) -> dict
  • unmask(masked_text: str, mapping: dict) -> str
  • batch_mask(texts: list[str]) -> list[dict]
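
The dict returned by mask can be described with a TypedDict for type checking. This is a sketch: MaskResult and Mapping are hypothetical names, the two keys come from the Quick Start, and the nested mapping layout is assumed to match the sample CLI output rather than confirmed from the package source.

```python
from typing import Dict, TypedDict

# Entity label -> {original value -> placeholder name} (assumed layout).
Mapping = Dict[str, Dict[str, str]]


class MaskResult(TypedDict):
    masked_text: str
    mapping: Mapping


example: MaskResult = {
    "masked_text": "[PERSON1] lives at [ADDRESS1] and contacts [EMAIL1]",
    "mapping": {
        "PERSON": {"John": "PERSON1"},
        "ADDRESS": {"123 Main St, Boston, MA 02118": "ADDRESS1"},
        "EMAIL": {"john@example.com": "EMAIL1"},
    },
}
print(sorted(example))
```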

CLI Usage

After installing the package, use the command-line interface:

pii-masker "John lives at 123 Main St, Boston, MA 02118 and contacts john@example.com"

If you want the mapping output in formatted JSON, run:

pii-masker "Jane Doe, 500 Elm Rd, Springfield, IL 62704" --show-mapping

Sample CLI output (for the first command above, run with --show-mapping)

[PERSON1] lives at [ADDRESS1] and contacts [EMAIL1]
{
  "PERSON": {
    "John": "PERSON1"
  },
  "ADDRESS": {
    "123 Main St, Boston, MA 02118": "ADDRESS1"
  },
  "EMAIL": {
    "john@example.com": "EMAIL1"
  }
}
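
If you capture this output from a script, the masked text and the mapping can be separated again. The sketch below assumes the output has exactly the layout shown above: the masked text on the first line, followed by the JSON mapping. That layout is inferred from the sample, input text containing newlines would break it, and split_cli_output is a hypothetical helper.

```python
import json


def split_cli_output(stdout):
    """Split captured CLI output into (masked_text, mapping)."""
    # Everything up to the first newline is treated as the masked text.
    masked_text, _, rest = stdout.partition("\n")
    # The remainder, if any, is parsed as the JSON mapping.
    mapping = json.loads(rest) if rest.strip() else {}
    return masked_text, mapping


captured = (
    "[PERSON1] lives at [ADDRESS1] and contacts [EMAIL1]\n"
    '{\n  "PERSON": {\n    "John": "PERSON1"\n  }\n}\n'
)
text, mapping = split_cli_output(captured)
print(text)
print(mapping)
```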

Testing

Run the test suite with pytest (py -3 is the Windows launcher; on other platforms use python -m pytest):

py -3 -m pytest

Notes

  • The package loads spaCy only once for best performance.
  • Address detection merges street, city, state, and ZIP heuristics to avoid address fragmentation.
  • Mapping is case-insensitive, so John and john reuse the same placeholder.
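
Case-insensitive reuse can be sketched by keying the lookup on a case-folded copy of the value. PlaceholderRegistry is a hypothetical class written for this example and is not the package's internal structure.

```python
class PlaceholderRegistry:
    """Assign numbered placeholders per label, ignoring case when matching values."""

    def __init__(self):
        self._by_key = {}   # (label, casefolded value) -> placeholder name
        self._counts = {}   # label -> number of placeholders issued so far

    def placeholder_for(self, label, value):
        key = (label, value.casefold())
        if key not in self._by_key:
            self._counts[label] = self._counts.get(label, 0) + 1
            self._by_key[key] = f"{label}{self._counts[label]}"
        return self._by_key[key]


registry = PlaceholderRegistry()
print(registry.placeholder_for("PERSON", "John"))
print(registry.placeholder_for("PERSON", "john"))
print(registry.placeholder_for("PERSON", "Jane"))
```

One consequence of sharing a placeholder across casings is that unmasking can restore only one spelling per placeholder. Which variant the package restores is not stated in this README.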

Publishing to GitHub and PyPI

GitHub

The project directory can be initialized as a Git repository and pushed to GitHub:

cd path/to/pii-masker-gpt
git init
git add .
git commit -m "Initial package commit"
git remote add origin https://github.com/<your-username>/<your-repo>.git
git branch -M main
git push -u origin main

The included GitHub Actions workflow at .github/workflows/python-publish.yml can publish a release automatically when a tag is pushed.

PyPI

To publish to PyPI manually:

python -m pip install --upgrade build twine
python -m build
python -m twine upload dist/*

To publish automatically from GitHub Actions, add a repository secret named PYPI_API_TOKEN with your PyPI API token.

Then create and push a tag:

git tag v0.1.0
git push origin v0.1.0
