A hybrid PII detection, masking, and unmasking library using spaCy and regex.
pii-masker-gpt
A production-ready Python package for detecting, masking, and unmasking PII in text using a hybrid spaCy + regex approach.
Features
- Detects PERSON, ADDRESS, EMAIL, PHONE
- Uses spaCy en_core_web_sm for PERSON detection and location heuristics
- Uses regex rules for EMAIL, PHONE, ZIP, and rich ADDRESS extraction
- Produces consistent placeholder mapping across repeated values
- Supports reversing masked text back to original content
- Includes a convenient CLI for local usage
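The regex side of the hybrid approach can be sketched roughly as follows. The patterns and the find_pii helper are illustrative assumptions; this README does not show the package's actual rules, which are likely stricter.

```python
import re

# Illustrative patterns only (assumptions, not the package's real rules).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def find_pii(text: str) -> list[tuple[str, str, int, int]]:
    """Return (label, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Email john.doe@example.com or call (415) 555-4321. ZIP 02118."
for hit in find_pii(sample):
    print(hit)
```

Returning character offsets along with the label makes it possible to replace spans from right to left without invalidating earlier positions.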
Installation
- Create or activate your Python environment.
- Install the package locally:
python -m pip install -e .
- Install the spaCy model:
python -m spacy download en_core_web_sm
Quick Start
from pii_masker import mask, unmask
text = (
"John Doe lives at 123 Main St, Boston, MA 02118. "
"His email is john.doe@example.com and his phone is (415) 555-4321."
)
result = mask(text)
print(result["masked_text"])
print(result["mapping"])
unmasked = unmask(result["masked_text"], result["mapping"])
print(unmasked)
Expected result
- masked_text will contain placeholders like [PERSON1], [ADDRESS1], [EMAIL1], [PHONE1]
- mapping preserves the original values for unmasking
LLM Prompt Privacy Use Case
This package is designed for privacy-conscious applications that need to pass user-generated prompts through an LLM without exposing sensitive personal data to third-party services.
Typical workflow:
- Detect and mask PII in the user prompt before sending it to the LLM.
- Send only masked text to the LLM to avoid exposing sensitive personal data.
- Unmask the LLM response to restore the original context locally.
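The three steps above can be wrapped in one helper. This is a minimal sketch: ask_llm_privately and the fake_* stand-ins are hypothetical names, and the mapping shape (type, then original value, then placeholder) is assumed from the sample CLI output in this README.

```python
from typing import Callable


def ask_llm_privately(
    prompt: str,
    mask_fn: Callable[[str], dict],
    unmask_fn: Callable[[str, dict], str],
    llm_fn: Callable[[str], str],
) -> str:
    """Mask the prompt, send only masked text to the LLM, unmask the reply."""
    result = mask_fn(prompt)
    masked_reply = llm_fn(result["masked_text"])
    return unmask_fn(masked_reply, result["mapping"])


# Stand-ins so the sketch runs without the package or a network call.
def fake_mask(text: str) -> dict:
    email = "john.doe@example.com"
    return {
        "masked_text": text.replace(email, "[EMAIL1]"),
        "mapping": {"EMAIL": {email: "EMAIL1"}},
    }


def fake_unmask(masked_text: str, mapping: dict) -> str:
    for values in mapping.values():
        for original, placeholder in values.items():
            masked_text = masked_text.replace(f"[{placeholder}]", original)
    return masked_text


seen_by_llm = []  # records exactly what the "LLM" received


def fake_llm(masked_prompt: str) -> str:
    seen_by_llm.append(masked_prompt)
    return "Noted: " + masked_prompt


answer = ask_llm_privately(
    "Send the report to john.doe@example.com", fake_mask, fake_unmask, fake_llm
)
print(seen_by_llm[0])
print(answer)
```

With the package installed, mask and unmask from pii_masker would take the place of the two stand-ins, and llm_fn would wrap whichever LLM client the application uses.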
Complete Advanced Example Workflow (Multiple PII of Same Type)
Step 1: Original User Prompt (with Multiple PII Types)
Please help me coordinate between my two offices. I work at 123 Main Street, Boston, MA 02118 and also at 456 Oak Avenue, New York, NY 10001.
For the Boston office, contact me at john.doe@example.com or (415) 555-4321.
For the New York office, use jane.smith@example.com or (212) 555-6789.
I need to discuss relocation of our primary office from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001.
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is john.doe@example.com if primary line is unavailable.
Ensure all responses reference the correct office before implementation.
Step 2: Mask the Prompt
from pii_masker import mask, unmask
original_prompt = """Please help me coordinate between my two offices. I work at 123 Main Street, Boston, MA 02118 and also at 456 Oak Avenue, New York, NY 10001.
For the Boston office, contact me at john.doe@example.com or (415) 555-4321.
For the New York office, use jane.smith@example.com or (212) 555-6789.
I need to discuss relocation of our primary office from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001.
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is john.doe@example.com if primary line is unavailable.
Ensure all responses reference the correct office before implementation."""
masked_result = mask(original_prompt)
masked_prompt = masked_result["masked_text"]
mapping = masked_result["mapping"]
Step 3: Masked Prompt Sent to LLM (Multiple PII Replaced Consistently)
Please help me coordinate between my two offices. I work at [ADDRESS1] and also at [ADDRESS2].
For the Boston office, contact me at [EMAIL1] or [PHONE1].
For the New York office, use [EMAIL2] or [PHONE2].
I need to discuss relocation of our primary office from [ADDRESS1] to [ADDRESS2].
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is [EMAIL1] if primary line is unavailable.
Ensure all responses reference the correct office before implementation.
Notice how consistent placeholders are used:
- 123 Main Street, Boston, MA 02118 → [ADDRESS1] (appears 2x)
- 456 Oak Avenue, New York, NY 10001 → [ADDRESS2] (appears 2x)
- john.doe@example.com → [EMAIL1] (appears 2x)
- jane.smith@example.com → [EMAIL2]
- (415) 555-4321 → [PHONE1]
- (212) 555-6789 → [PHONE2]
Step 4: LLM Response (with masked placeholders)
Certainly! Here's a coordinated transition plan:
For [ADDRESS1] (Boston):
- Notify all staff at [ADDRESS1] of the transition
- Primary contact: [EMAIL1] or [PHONE1]
- Backup contact: [EMAIL1]
For [ADDRESS2] (New York):
- Establish new operations at [ADDRESS2]
- Primary contact: [EMAIL2] or [PHONE2]
- Ensure infrastructure at [ADDRESS2] is ready
Transition Timeline:
- Week 1: Communicate with staff at [ADDRESS1] and [ADDRESS2]
- Week 2-3: Begin relocation from [ADDRESS1] to [ADDRESS2]
- Week 4: Finalize all operations at [ADDRESS2]
All confirmations should be sent to [EMAIL1] and [EMAIL2].
Step 5: Unmask the Response Locally
# llm_response is the masked text returned by the LLM in Step 4
unmasked_response = unmask(llm_response, mapping)
print(unmasked_response)
Step 6: Final Unmasked Response (All Original PII Restored)
Certainly! Here's a coordinated transition plan:
For 123 Main Street, Boston, MA 02118 (Boston):
- Notify all staff at 123 Main Street, Boston, MA 02118 of the transition
- Primary contact: john.doe@example.com or (415) 555-4321
- Backup contact: john.doe@example.com
For 456 Oak Avenue, New York, NY 10001 (New York):
- Establish new operations at 456 Oak Avenue, New York, NY 10001
- Primary contact: jane.smith@example.com or (212) 555-6789
- Ensure infrastructure at 456 Oak Avenue, New York, NY 10001 is ready
Transition Timeline:
- Week 1: Communicate with staff at 123 Main Street, Boston, MA 02118 and 456 Oak Avenue, New York, NY 10001
- Week 2-3: Begin relocation from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001
- Week 4: Finalize all operations at 456 Oak Avenue, New York, NY 10001
All confirmations should be sent to john.doe@example.com and jane.smith@example.com.
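An LLM may occasionally emit a placeholder that was never in the prompt, such as [EMAIL3]. The sketch below shows one way to restore known placeholders in a single pass while reporting unknown ones. The restore helper and the placeholder pattern are assumptions for illustration, not the package's unmask implementation.

```python
import re

# Assumed placeholder shape: an uppercase label followed by a number, in brackets.
PLACEHOLDER = re.compile(r"\[([A-Z]+\d+)\]")


def restore(text: str, mapping: dict) -> tuple[str, list[str]]:
    """Replace known placeholders; report those missing from the mapping."""
    # Invert {"EMAIL": {"a@b.c": "EMAIL1"}} into {"EMAIL1": "a@b.c"}.
    lookup = {
        placeholder: original
        for values in mapping.values()
        for original, placeholder in values.items()
    }
    unknown = []

    def swap(match):
        key = match.group(1)
        if key in lookup:
            return lookup[key]
        unknown.append(key)
        return match.group(0)  # leave unrecognised placeholders untouched

    return PLACEHOLDER.sub(swap, text), unknown


mapping = {
    "EMAIL": {"john.doe@example.com": "EMAIL1"},
    "PHONE": {"(415) 555-4321": "PHONE1"},
}
reply = "Primary contact: [EMAIL1] or [PHONE1]. Backup: [EMAIL2]."
restored, unknown = restore(reply, mapping)
print(restored)
print(unknown)
```

Doing the replacement in one regex pass means restored values are never rescanned, so an original value that happens to contain bracketed text cannot be substituted a second time.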
Key Benefits:
- Original PII never leaves your system
- LLM only processes masked tokens
- Responses are unmasked locally after receiving them
- Complete control over sensitive data
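A quick pre-send check can support the first two points: confirm that no value recorded in the mapping still appears in the masked text. find_leaks is a hypothetical helper, and the mapping shape is assumed from the sample CLI output. It only catches values the detector found, so it cannot prove that undetected PII is absent.

```python
def find_leaks(masked_text: str, mapping: dict) -> list[str]:
    """Return original values from the mapping still present in masked_text."""
    lowered = masked_text.lower()
    return [
        original
        for values in mapping.values()
        for original in values
        if original.lower() in lowered
    ]


mapping = {
    "EMAIL": {"jane.smith@example.com": "EMAIL2"},
    "PHONE": {"(212) 555-6789": "PHONE2"},
}
safe = "For the New York office, use [EMAIL2] or [PHONE2]."
leaky = "For the New York office, use [EMAIL2] or (212) 555-6789."

print(find_leaks(safe, mapping))
print(find_leaks(leaky, mapping))
```

An application could refuse to call the LLM whenever the returned list is non-empty.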
API
- mask(text: str) -> dict
- unmask(masked_text: str, mapping: dict) -> str
- batch_mask(texts: list[str]) -> list[dict]
CLI Usage
After installing the package, use the command-line interface:
pii-masker "John lives at 123 Main St, Boston, MA 02118 and contacts john@example.com"
If you want the mapping output in formatted JSON, run:
pii-masker "Jane Doe, 500 Elm Rd, Springfield, IL 62704" --show-mapping
Sample CLI output (for the first command above, run with --show-mapping)
[PERSON1] lives at [ADDRESS1] and contacts [EMAIL1]
{
"PERSON": {
"John": "PERSON1"
},
"ADDRESS": {
"123 Main St, Boston, MA 02118": "ADDRESS1"
},
"EMAIL": {
"john@example.com": "EMAIL1"
}
}
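For readers curious how such a command could be wired up, here is a minimal argparse sketch. The positional text argument and the --show-mapping flag mirror the commands above; build_parser and render are hypothetical names, and the package's real CLI may be structured differently.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pii-masker")
    parser.add_argument("text", help="text to mask")
    parser.add_argument(
        "--show-mapping",
        action="store_true",
        help="also print the placeholder mapping as formatted JSON",
    )
    return parser


def render(result: dict, show_mapping: bool) -> str:
    """Format a mask() style result for terminal output."""
    lines = [result["masked_text"]]
    if show_mapping:
        lines.append(json.dumps(result["mapping"], indent=2))
    return "\n".join(lines)


# Canned result standing in for a real mask() call.
canned = {
    "masked_text": "[PERSON1] contacts [EMAIL1]",
    "mapping": {
        "PERSON": {"John": "PERSON1"},
        "EMAIL": {"john@example.com": "EMAIL1"},
    },
}
args = build_parser().parse_args(["John contacts john@example.com", "--show-mapping"])
print(render(canned, args.show_mapping))
```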
Testing
Run the test suite with pytest (on Windows, the py -3 launcher can be used in place of python):
python -m pytest
Notes
- The package loads spaCy only once for best performance.
- Address detection merges street, city, state, and ZIP heuristics to avoid address fragmentation.
- Mapping is case-insensitive, so John and john reuse the same placeholder.
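The case-insensitive reuse described in the last note could be implemented along these lines. PlaceholderRegistry is a hypothetical class written for illustration; the package's internal bookkeeping is not shown in this README.

```python
class PlaceholderRegistry:
    """Assign PERSON1, PERSON2, ... per label, reusing ids for repeated values."""

    def __init__(self) -> None:
        self._ids = {}     # (label, casefolded value) -> placeholder
        self._counts = {}  # label -> number of placeholders issued so far

    def placeholder_for(self, label: str, value: str) -> str:
        key = (label, value.casefold())
        if key not in self._ids:
            self._counts[label] = self._counts.get(label, 0) + 1
            self._ids[key] = f"{label}{self._counts[label]}"
        return self._ids[key]


registry = PlaceholderRegistry()
print(registry.placeholder_for("PERSON", "John"))
print(registry.placeholder_for("PERSON", "john"))
print(registry.placeholder_for("PERSON", "Jane"))
print(registry.placeholder_for("EMAIL", "john@example.com"))
```

Keying on the casefolded value keeps John and john on the same placeholder, while separate counters per label let each PII type start its own numbering at 1.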
Publishing to GitHub and PyPI
GitHub
This repository is ready to be initialized as a Git repository and pushed to GitHub.
cd path/to/pii-masker-gpt
git init
git add .
git commit -m "Initial package commit"
git remote add origin https://github.com/<your-username>/<your-repo>.git
git branch -M main
git push -u origin main
The included GitHub Actions workflow at .github/workflows/python-publish.yml can publish a release automatically when a tag is pushed.
PyPI
To publish to PyPI manually:
python -m pip install --upgrade build twine
python -m build
python -m twine upload dist/*
To publish automatically from GitHub Actions, add a repository secret named PYPI_API_TOKEN with your PyPI API token.
Then create and push a tag:
git tag v0.1.0
git push origin v0.1.0
File details
Details for the file pii_masker_gpt-0.1.0.tar.gz.
File metadata
- Download URL: pii_masker_gpt-0.1.0.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3854353d6223418d16b7d8af3d13cbfa540f327156dd92c7f0ca842c9e2ca90c |
| MD5 | cee5c8af20c8a3434fb986dfa5ffd192 |
| BLAKE2b-256 | 0f1077674054c8f19d586d14933e8b1586532bc795dd5352edd54993da22b6ad |
File details
Details for the file pii_masker_gpt-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pii_masker_gpt-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ead834fabbb26445718036f2080c618089a1f04026b49989ff8d5fccdf20539 |
| MD5 | 71181cb06fddca19b4f0e36397047dba |
| BLAKE2b-256 | 2c8de3d2b4732a93fa488d5a9e877ebbca793c68450ff38c4e5ce630af26c3f2 |