Anonymize personal information in chat data for LLM training
Parfum
Strip the sensitive stuff from your chat data before you train on it.
I built Parfum because I got tired of manually cleaning up PII from datasets before fine-tuning. The name's a play on how perfume covers up smells—this library covers up personal info while keeping your data useful.
What's this for?
You've got chat logs, customer support transcripts, or conversational data you want to train a model on. Problem is, it's full of emails, phone numbers, credit cards, and who knows what else. You need that gone, but you still want the conversations to make sense.
That's what Parfum does. It finds the sensitive bits and replaces them however you want—placeholders, masked versions, fake data, or just nukes them entirely.
Getting started
pip install parfum
Want to catch people's names and locations too? You'll need spaCy:
pip install parfum[ner]
python -m spacy download en_core_web_sm
The NER stuff is optional. Without it you still get emails, phones, credit cards, SSNs, IPs, URLs, dates, and IBANs. Just not names, organizations, or locations.
Basic usage
from parfum import Anonymizer
anon = Anonymizer()
text = "Hey, I'm John. Reach me at john@gmail.com or 555-123-4567"
result = anon.anonymize(text)
print(result.text)
# Hey, I'm [PERSON]. Reach me at [EMAIL] or [PHONE]
The result object gives you more than just the cleaned text:
result.text # the anonymized version
result.original_text # what you passed in
result.pii_found # True if anything was detected
result.pii_count # how many entities were found
result.matches # list of PIIMatch objects with positions
result.replacements # dict mapping original values to replacements
The five strategies
You can process PII in different ways depending on what you need:
replace (default) — Swaps PII with type labels
anon = Anonymizer(strategy="replace")
anon.anonymize("john@example.com").text
# → [EMAIL]
mask — Keeps structure but hides most characters
anon = Anonymizer(strategy="mask")
anon.anonymize("john@example.com").text
# → j***@e******.com
hash — Deterministic SHA-256 (first 16 chars)
anon = Anonymizer(strategy="hash")
anon.anonymize("john@example.com").text
# → a1b2c3d4e5f67890
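To give a feel for what a deterministic, truncated SHA-256 replacement looks like, here is a minimal standalone sketch. It is not Parfum's code: the helper name `short_hash` and the UTF-8 encoding are assumptions, and the library may salt or format values differently.

```python
import hashlib


def short_hash(value: str, length: int = 16) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `value`."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:length]


token = short_hash("john@example.com")
print(len(token))                               # 16
print(token == short_hash("john@example.com"))  # True: same input, same token
```

Because the mapping is deterministic, the same value always turns into the same token, so references stay linkable across a dataset. Keep in mind that unsalted hashes of low-entropy values such as phone numbers can be brute-forced, so they behave more like pseudonyms than like strong anonymization.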
fake — Generates realistic-looking replacements
anon = Anonymizer(strategy="fake", seed=42) # seed for reproducibility
anon.anonymize("john@example.com").text
# → michael.smith@company.org
redact — Just removes it entirely
anon = Anonymizer(strategy="redact")
anon.anonymize("Email: john@example.com today").text
# → Email: today
What it detects
Out of the box, the regex patterns catch:
- Email addresses — standard RFC-ish patterns
- Phone numbers — US/Canada formats, with or without country codes
- Credit cards — Visa, Mastercard, Amex, plus generic 16-digit patterns
- SSNs — US Social Security numbers in various formats
- IP addresses — both IPv4 and IPv6
- URLs — with or without protocol prefix
- Dates — ISO format, US format, written out like "January 15, 2024"
- IBANs — international bank account numbers
If you install the NER extra, you also get:
- Person names — via spaCy's named entity recognition
- Organizations — company names and such
- Locations — cities, countries, addresses
Working with chat data
The library is built for conversations. Use anonymize_chat() to process message arrays while keeping the structure intact:
from parfum import Anonymizer
anon = Anonymizer(strategy="fake")
chat = [
    {"role": "user", "content": "I'm Sarah, call me at 555-0123"},
    {"role": "assistant", "content": "Got it Sarah! I'll call that number."}
]
clean = anon.anonymize_chat(chat)
The fake strategy keeps replacements consistent—if "Sarah" becomes "Emily" in the first message, it stays "Emily" throughout.
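The idea of walking a message list while keeping its structure and reusing replacements can be sketched without the library. Everything here is hypothetical and only for illustration: `make_consistent_replacer`, `scrub_chat`, the deliberately simple email regex, and the `[EMAIL_n]` placeholder format are assumptions, not Parfum internals.

```python
import re
from typing import Callable, Dict, List

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # deliberately simple


def make_consistent_replacer() -> Callable[[str], str]:
    """Return a function that maps each distinct email to one stable placeholder."""
    seen: Dict[str, str] = {}

    def replace(text: str) -> str:
        def swap(m: "re.Match[str]") -> str:
            # setdefault keeps the first placeholder assigned to a given value
            return seen.setdefault(m.group(0), f"[EMAIL_{len(seen) + 1}]")

        return EMAIL_RE.sub(swap, text)

    return replace


def scrub_chat(messages: List[dict], scrub: Callable[[str], str]) -> List[dict]:
    """Return new message dicts with scrubbed "content"; other keys are untouched."""
    return [{**msg, "content": scrub(msg["content"])} for msg in messages]


chat = [
    {"role": "user", "content": "Mail me at sarah@example.com"},
    {"role": "assistant", "content": "Sent to sarah@example.com, cc bob@example.com"},
]
clean = scrub_chat(chat, make_consistent_replacer())
for msg in clean:
    print(msg["role"], "|", msg["content"])
# user | Mail me at [EMAIL_1]
# assistant | Sent to [EMAIL_1], cc [EMAIL_2]
```

The sketch builds new dicts instead of mutating the input, so the original chat stays available for comparison.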
Processing files
Got a bunch of data files? There's support for that:
from parfum import Anonymizer, process_file, process_directory
anon = Anonymizer(strategy="fake")
# single file
process_file("input.jsonl", "output.jsonl", anon)
# whole directory
process_directory(
    "raw_data/",
    "clean_data/",
    anon,
    pattern="*.jsonl",
    recursive=True
)
Supported formats:
- JSONL (.jsonl) — one JSON object per line, looks for "content" or "messages" keys
- JSON (.json) — arrays of objects or single conversation objects
- CSV (.csv) — you can specify which columns to process
- Plain text (.txt or anything else) — line by line
For JSON/JSONL, it automatically handles the OpenAI-style messages format.
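For readers curious what handling a "content" key or an OpenAI-style "messages" list in JSONL might involve, here is a rough standalone sketch. The function names `scrub_record` and `scrub_jsonl` are hypothetical, the `scrub` callable stands in for an anonymizer, and the real `process_file` may well behave differently (for example around nested structures or invalid lines).

```python
import io
import json
from typing import Callable, TextIO


def scrub_record(record: dict, scrub: Callable[[str], str]) -> dict:
    """Scrub a top-level "content" string and/or an OpenAI-style "messages" list."""
    out = dict(record)
    if isinstance(out.get("content"), str):
        out["content"] = scrub(out["content"])
    if isinstance(out.get("messages"), list):
        out["messages"] = [
            {**m, "content": scrub(m["content"])}
            if isinstance(m, dict) and isinstance(m.get("content"), str)
            else m
            for m in out["messages"]
        ]
    return out


def scrub_jsonl(src: TextIO, dst: TextIO, scrub: Callable[[str], str]) -> int:
    """Copy JSONL from src to dst, scrubbing each record; return records written."""
    count = 0
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        dst.write(json.dumps(scrub_record(json.loads(line), scrub)) + "\n")
        count += 1
    return count


src = io.StringIO(
    '{"content": "mail a@b.com"}\n'
    '{"messages": [{"role": "user", "content": "mail a@b.com"}]}\n'
)
dst = io.StringIO()
n = scrub_jsonl(src, dst, lambda s: s.replace("a@b.com", "[EMAIL]"))
print(n)  # 2
print(dst.getvalue())
```

Working on file-like objects rather than paths keeps the sketch easy to test; swapping in `open(...)` handles gives the on-disk version.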
Command line
There's a CLI too:
# anonymize a file
parfum anonymize data.json -o clean.json --strategy fake
# process a directory
parfum anonymize ./chats/ -o ./output/ --recursive --pattern "*.jsonl"
# quick one-liner
parfum quick "Email me at john@test.com"
# → Email me at [EMAIL]
# just detect, don't change anything
parfum detect "My SSN is 123-45-6789"
# Found 1 PII entities:
# [SSN] "123-45-6789" (pos 10-21)
Options:
- -s, --strategy: replace, mask, hash, fake, or redact
- -o, --output: where to write (required for anonymize)
- --no-ner: skip the NER model, regex only
- -r, --recursive: for directories
- -p, --pattern: glob pattern like "*.txt"
- --content-key: if your JSON uses something other than "content"
- --locale: for fake data generation (default: en_US)
- --seed: make fake data reproducible
Custom patterns
Need to catch something specific to your data? Add your own regex:
from parfum import Anonymizer, PIIType
anon = Anonymizer()
# catch employee IDs like "EMP-123456"
anon.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    pii_type=PIIType.CUSTOM
)
result = anon.anonymize("Contact EMP-123456")
# → Contact [CUSTOM]
You can also assign custom patterns to existing types if you want them handled the same way:
anon.add_pattern(
    name="company_email",
    pattern=r"\w+@mycompany\.com",
    pii_type=PIIType.EMAIL  # treated as email for masking, faking, etc.
)
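Before registering a custom regex, it can help to try it on a few sample strings with Python's standard `re` module. The sketch below is independent of Parfum and compares the employee ID pattern from above with a hypothetical stricter variant that adds `\b` word boundaries; whether you want the stricter form depends on your data.

```python
import re

loose = re.compile(r"EMP-\d{6}")       # pattern as written above
strict = re.compile(r"\bEMP-\d{6}\b")  # assumed variant with word boundaries

samples = [
    "Contact EMP-123456",
    "Ticket EMP-1234567",       # seven digits
    "TEMP-123456 is a sensor",  # EMP- preceded by another letter
]
for s in samples:
    print(f"{s!r}: loose={loose.findall(s)} strict={strict.findall(s)}")
```

The loose pattern matches the first six digits of a longer number and also fires inside `TEMP-123456`, while the bounded one matches only the standalone ID.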
Different strategies per type
Maybe you want names faked but emails just masked:
from parfum import Anonymizer, PIIType, Strategy
anon = Anonymizer(strategy="replace") # default
anon.set_strategy(Strategy.FAKE, pii_type=PIIType.PERSON)
anon.set_strategy(Strategy.MASK, pii_type=PIIType.EMAIL)
text = "John at john@test.com"
result = anon.anonymize(text)
# → Michael at j***@t***.com
Or if you want total control:
def my_email_handler(match):
    # keep the domain, hide the local part
    _, domain = match.text.rsplit("@", 1)
    return f"[HIDDEN]@{domain}"

anon.set_custom_anonymizer(PIIType.EMAIL, my_email_handler)
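Conceptually, per-type strategies and custom handlers amount to a lookup table with a default. The following standalone sketch shows that idea only; `HandlerTable`, `set_handler`, and the string type names are hypothetical and say nothing about how Parfum is implemented.

```python
from typing import Callable, Dict

Handler = Callable[[str, str], str]  # (pii_type, value) -> replacement


def label(pii_type: str, value: str) -> str:
    """Default: swap the value for its type label."""
    return f"[{pii_type}]"


def redact(pii_type: str, value: str) -> str:
    """Drop the value entirely."""
    return ""


def hide_local_part(pii_type: str, value: str) -> str:
    """Custom handler: keep only the email domain."""
    return "[HIDDEN]@" + value.rsplit("@", 1)[-1]


class HandlerTable:
    def __init__(self, default: Handler) -> None:
        self.default = default
        self.overrides: Dict[str, Handler] = {}

    def set_handler(self, pii_type: str, handler: Handler) -> None:
        self.overrides[pii_type] = handler

    def apply(self, pii_type: str, value: str) -> str:
        # fall back to the default when no override is registered for this type
        return self.overrides.get(pii_type, self.default)(pii_type, value)


table = HandlerTable(default=label)
table.set_handler("PHONE", redact)
table.set_handler("EMAIL", hide_local_part)

print(table.apply("SSN", "123-45-6789"))      # [SSN]
print(table.apply("EMAIL", "john@test.com"))  # [HIDDEN]@test.com
```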
Without spaCy (lightweight mode)
If you don't need name detection and want to keep dependencies minimal:
from parfum import Anonymizer
anon = Anonymizer(use_ner=False)
You still get all the regex-based detection. Just no names, organizations, or locations.
Batch processing
texts = [
    "Email: a@b.com",
    "Phone: 555-1234",
    "Just some text with no PII"
]
results = anon.anonymize_many(texts)
for r in results:
    print(f"Found {r.pii_count} entities")
Detection only
If you just want to find PII without changing anything:
matches = anon.detect("Contact john@test.com or call 555-1234")
for m in matches:
    print(f"{m.pii_type.value}: {m.text} (position {m.start}-{m.end})")
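The `pos 10-21` line in the CLI example suggests that positions are character offsets with an exclusive end, like a Python slice. Under that assumption, a regex-only detector can be sketched as follows; `Span` is a hypothetical stand-in for `PIIMatch`, and the SSN regex is a simplification rather than the library's pattern.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    pii_type: str
    text: str
    start: int
    end: int  # exclusive, like a Python slice


SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect_ssns(text: str) -> List[Span]:
    """Return one Span per SSN-looking substring, in order of appearance."""
    return [Span("SSN", m.group(0), m.start(), m.end()) for m in SSN_RE.finditer(text)]


text = "My SSN is 123-45-6789"
for s in detect_ssns(text):
    print(s.pii_type, s.text, s.start, s.end)  # SSN 123-45-6789 10 21
    print(text[s.start:s.end] == s.text)       # True
```

With end-exclusive offsets, slicing the original text with `start` and `end` gives back exactly the matched value, which is handy for highlighting or auditing detections.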
Reproducibility
For the fake strategy, you can set a seed to get consistent results:
anon = Anonymizer(strategy="fake", seed=42)
Note that the cache lives on the Anonymizer instance, not on disk: the same original value gets the same fake replacement for as long as you keep using that instance. Call anon.clear_cache() if you want to reset that.
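How a seed and a per-instance cache interact can be shown with a small standalone sketch. It uses the standard library's `random` module instead of Faker, and `FakeCache` and `fake_for` are hypothetical names, so treat it as an illustration of the idea and not as Parfum's implementation.

```python
import random
import string
from typing import Dict, Optional


class FakeCache:
    """Seeded pseudonym generator with a per-instance cache (illustrative only)."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)  # private RNG, global state is untouched
        self._cache: Dict[str, str] = {}

    def fake_for(self, original: str) -> str:
        # generate a fake only the first time a value is seen
        if original not in self._cache:
            name = "".join(self._rng.choices(string.ascii_lowercase, k=8))
            self._cache[original] = f"{name}@example.org"
        return self._cache[original]

    def clear_cache(self) -> None:
        self._cache.clear()


a, b = FakeCache(seed=42), FakeCache(seed=42)
first = a.fake_for("john@gmail.com")
print(first == a.fake_for("john@gmail.com"))  # True: cached within the instance
print(first == b.fake_for("john@gmail.com"))  # True: same seed, same call order
```

In this sketch, reproducibility depends on call order: two instances with the same seed agree only if they see values in the same sequence. If the library works similarly, that is worth keeping in mind when files are processed in a different order between runs.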
Locales
The fake data generator supports different locales:
anon = Anonymizer(strategy="fake", locale="de_DE")
Check Faker's documentation for available locales.
How masking works
Different PII types get masked differently:
- Emails: john.doe@example.com → j***.d**@e******.com
- Phones: 555-123-4567 → 555-***-**67 (keeps first 3, last 2)
- Credit cards: 4111-1111-1111-1234 → ****-****-****-1234 (keeps last 4)
- SSNs: 123-45-6789 → ***-**-6789 (keeps last 4)
- IPs: 192.168.1.100 → 192.168.*.* (keeps first 2 octets)
- Everything else: secretdata → s********a (first and last char)
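The masking rules above can be reproduced with a few small functions. This is a sketch reverse-engineered from the listed examples, not the library's code; all function names are hypothetical, and edge cases (unusual email shapes, IPv6, very short strings) may be handled differently by Parfum.

```python
def mask_segment(s: str) -> str:
    """Keep the first character and star out the rest."""
    return s[:1] + "*" * (len(s) - 1)


def mask_email(email: str) -> str:
    """Mask each dot-separated part of the local name and the domain name; keep the TLD."""
    local, _, domain = email.rpartition("@")
    name, dot, tld = domain.rpartition(".")
    masked_local = ".".join(mask_segment(p) for p in local.split("."))
    return f"{masked_local}@{mask_segment(name)}{dot}{tld}"


def mask_digits(value: str, keep_first: int = 0, keep_last: int = 0) -> str:
    """Star out digits except the first/last few; separators are left alone."""
    total = sum(c.isdigit() for c in value)
    seen = 0
    out = []
    for c in value:
        if c.isdigit():
            seen += 1
            keep = seen <= keep_first or seen > total - keep_last
            out.append(c if keep else "*")
        else:
            out.append(c)
    return "".join(out)


def mask_ipv4(ip: str) -> str:
    """Keep the first two octets, star out the rest."""
    parts = ip.split(".")
    return ".".join(parts[:2] + ["*"] * (len(parts) - 2))


def mask_generic(s: str) -> str:
    """Keep the first and last character of anything else."""
    if len(s) <= 2:
        return "*" * len(s)
    return s[0] + "*" * (len(s) - 2) + s[-1]


print(mask_email("john.doe@example.com"))                 # j***.d**@e******.com
print(mask_digits("555-123-4567", 3, 2))                  # 555-***-**67
print(mask_digits("4111-1111-1111-1234", keep_last=4))    # ****-****-****-1234
print(mask_digits("123-45-6789", keep_last=4))            # ***-**-6789
print(mask_ipv4("192.168.1.100"))                         # 192.168.*.*
print(mask_generic("secretdata"))                         # s********a
```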
Installation notes
The base install is just:
- regex: for pattern matching
- faker: for generating fake data
The [ner] extra adds:
- spacy: for named entity recognition
If spaCy isn't installed or the model isn't downloaded, it'll just skip NER silently and use regex only.
License
MIT. Do whatever you want with it.
Bugs?
Open an issue. PRs welcome too.
File details
Details for the file parfum-0.1.0.tar.gz.
File metadata
- Download URL: parfum-0.1.0.tar.gz
- Upload date:
- Size: 38.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f0faab4ead286399cd3a8df716bd52b21e05a242290ad17699a794c808caf81 |
| MD5 | 6305d752783f526ca7376b0cdbdafb00 |
| BLAKE2b-256 | ffa191d7e97290e7caa5aee5cdf600320b9c4184a5fb954139c29d014637be6f |
File details
Details for the file parfum-0.1.0-py3-none-any.whl.
File metadata
- Download URL: parfum-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b37daa37ad3f1b63a24d71b3bfda5a08a17201d2dc4d903aaca998c6f0543a5d |
| MD5 | b11505c6b5341b10f4c14861346bcd49 |
| BLAKE2b-256 | 67c18a909cad69421fa5bce29272a4110fd5f37ce2b89f0eb65a17181dbf569e |