Anonymize personal information in chat data for LLM training

These details have not been verified by PyPI

Project links

Project description

Parfum

Strip the sensitive stuff from your chat data before you train on it.

I built Parfum because I got tired of manually cleaning up PII from datasets before fine-tuning. The name's a play on how perfume covers up smells—this library covers up personal info while keeping your data useful.

What's this for?

You've got chat logs, customer support transcripts, or conversational data you want to train a model on. Problem is, it's full of emails, phone numbers, credit cards, and who knows what else. You need that gone, but you still want the conversations to make sense.

That's what Parfum does. It finds the sensitive bits and replaces them however you want—placeholders, masked versions, fake data, or just nukes them entirely.

Getting started

pip install parfum

Want to catch people's names and locations too? You'll need spaCy:

pip install parfum[ner]
python -m spacy download en_core_web_sm

The NER stuff is optional. Without it you still get emails, phones, credit cards, SSNs, IPs, URLs, and dates. Just not names.

Basic usage

from parfum import Anonymizer

anon = Anonymizer()

text = "Hey, I'm John. Reach me at john@gmail.com or 555-123-4567"
result = anon.anonymize(text)

print(result.text)
# Hey, I'm [PERSON]. Reach me at [EMAIL] or [PHONE]

The result object gives you more than just the cleaned text:

result.text           # the anonymized version
result.original_text  # what you passed in
result.pii_found      # True if anything was detected
result.pii_count      # how many entities were found
result.matches        # list of PIIMatch objects with positions
result.replacements   # dict mapping original values to replacements

The five strategies

You can process PII in different ways depending on what you need:

replace (default) — Swaps PII with type labels

anon = Anonymizer(strategy="replace")
anon.anonymize("john@example.com").text
# → [EMAIL]

mask — Keeps structure but hides most characters

anon = Anonymizer(strategy="mask")
anon.anonymize("john@example.com").text
# → j***@e******.com

hash — Deterministic SHA-256 (first 16 chars)

anon = Anonymizer(strategy="hash")
anon.anonymize("john@example.com").text
# → a1b2c3d4e5f67890

fake — Generates realistic-looking replacements

anon = Anonymizer(strategy="fake", seed=42)  # seed for reproducibility
anon.anonymize("john@example.com").text
# → michael.smith@company.org

redact — Just removes it entirely

anon = Anonymizer(strategy="redact")
anon.anonymize("Email: john@example.com today").text
# → Email:  today

What it detects

Out of the box, the regex patterns catch:

Email addresses — standard RFC-ish patterns
Phone numbers — US/Canada formats, with or without country codes
Credit cards — Visa, Mastercard, Amex, plus generic 16-digit patterns
SSNs — US Social Security numbers in various formats
IP addresses — both IPv4 and IPv6
URLs — with or without protocol prefix
Dates — ISO format, US format, written out like "January 15, 2024"
IBANs — international bank account numbers

If you install the NER extra, you also get:

Person names — via spaCy's named entity recognition
Organizations — company names and such
Locations — cities, countries, addresses

Working with chat data

The library is built for conversations. Use anonymize_chat() to process message arrays while keeping the structure intact:

from parfum import Anonymizer

anon = Anonymizer(strategy="fake")

chat = [
    {"role": "user", "content": "I'm Sarah, call me at 555-0123"},
    {"role": "assistant", "content": "Got it Sarah! I'll call that number."}
]

clean = anon.anonymize_chat(chat)

The fake strategy keeps replacements consistent—if "Sarah" becomes "Emily" in the first message, it stays "Emily" throughout.

Processing files

Got a bunch of data files? There's support for that:

from parfum import Anonymizer, process_file, process_directory

anon = Anonymizer(strategy="fake")

# single file
process_file("input.jsonl", "output.jsonl", anon)

# whole directory
process_directory(
    "raw_data/",
    "clean_data/",
    anon,
    pattern="*.jsonl",
    recursive=True
)

Supported formats:

JSONL (.jsonl) — one JSON object per line, looks for "content" or "messages" keys
JSON (.json) — arrays of objects or single conversation objects
CSV (.csv) — you can specify which columns to process
Plain text (.txt or anything else) — line by line

For JSON/JSONL, it automatically handles the OpenAI-style messages format.

Command line

There's a CLI too:

# anonymize a file
parfum anonymize data.json -o clean.json --strategy fake

# process a directory
parfum anonymize ./chats/ -o ./output/ --recursive --pattern "*.jsonl"

# quick one-liner
parfum quick "Email me at john@test.com"
# → Email me at [EMAIL]

# just detect, don't change anything
parfum detect "My SSN is 123-45-6789"
# Found 1 PII entities:
#   [SSN] "123-45-6789" (pos 10-21)

Options:

-s, --strategy — replace, mask, hash, fake, or redact
-o, --output — where to write (required for anonymize)
--no-ner — skip the NER model, regex only
-r, --recursive — for directories
-p, --pattern — glob pattern like "*.txt"
--content-key — if your JSON uses something other than "content"
--locale — for fake data generation (default: en_US)
--seed — make fake data reproducible

Custom patterns

Need to catch something specific to your data? Add your own regex:

from parfum import Anonymizer, PIIType

anon = Anonymizer()

# catch employee IDs like "EMP-123456"
anon.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    pii_type=PIIType.CUSTOM
)

result = anon.anonymize("Contact EMP-123456")
# → Contact [CUSTOM]

You can also assign custom patterns to existing types if you want them handled the same way:

anon.add_pattern(
    name="company_email",
    pattern=r"\w+@mycompany\.com",
    pii_type=PIIType.EMAIL  # treated as email for masking, faking, etc.
)

Different strategies per type

Maybe you want names faked but emails just masked:

from parfum import Anonymizer, PIIType, Strategy

anon = Anonymizer(strategy="replace")  # default

anon.set_strategy(Strategy.FAKE, pii_type=PIIType.PERSON)
anon.set_strategy(Strategy.MASK, pii_type=PIIType.EMAIL)

text = "John at john@test.com"
result = anon.anonymize(text)
# → Michael at j***@t***.com

Or if you want total control:

def my_email_handler(match):
    local, domain = match.text.split("@")
    return f"[HIDDEN]@{domain}"

anon.set_custom_anonymizer(PIIType.EMAIL, my_email_handler)

Without spaCy (lightweight mode)

If you don't need name detection and want to keep dependencies minimal:

from parfum import Anonymizer

anon = Anonymizer(use_ner=False)

You still get all the regex-based detection. Just no names, organizations, or locations.

Batch processing

texts = [
    "Email: a@b.com",
    "Phone: 555-1234",
    "Just some text with no PII"
]

results = anon.anonymize_many(texts)

for r in results:
    print(f"Found {r.pii_count} entities")

Detection only

If you just want to find PII without changing anything:

matches = anon.detect("Contact john@test.com or call 555-1234")

for m in matches:
    print(f"{m.pii_type.value}: {m.text} (position {m.start}-{m.end})")

Reproducibility

For the fake strategy, you can set a seed to get consistent results:

anon = Anonymizer(strategy="fake", seed=42)

Note that the caching is per-session. The same original value will get the same fake replacement within one Anonymizer instance. Call anon.clear_cache() if you want to reset that.

Locales

The fake data generator supports different locales:

anon = Anonymizer(strategy="fake", locale="de_DE")

Check Faker's documentation for available locales.

How masking works

Different PII types get masked differently:

Emails: john.doe@example.com → j***.d**@e******.com
Phones: 555-123-4567 → 555-***-**67 (keeps first 3, last 2)
Credit cards: 4111-1111-1111-1234 → ****-****-****-1234 (keeps last 4)
SSNs: 123-45-6789 → ***-**-6789 (keeps last 4)
IPs: 192.168.1.100 → 192.168.*.* (keeps first 2 octets)
Everything else: secretdata → s********a (first and last char)

Installation notes

The base install is just:

regex — for pattern matching
faker — for generating fake data

The [ner] extra adds:

spacy — for named entity recognition

If spaCy isn't installed or the model isn't downloaded, it'll just skip NER silently and use regex only.

License

MIT. Do whatever you want with it.

Bugs?

Open an issue. PRs welcome too.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jan 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parfum-0.1.0.tar.gz (38.7 kB view details)

Uploaded Jan 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

parfum-0.1.0-py3-none-any.whl (19.7 kB view details)

Uploaded Jan 24, 2026 Python 3

File details

Details for the file parfum-0.1.0.tar.gz.

File metadata

Download URL: parfum-0.1.0.tar.gz
Upload date: Jan 24, 2026
Size: 38.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for parfum-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3f0faab4ead286399cd3a8df716bd52b21e05a242290ad17699a794c808caf81`
MD5	`6305d752783f526ca7376b0cdbdafb00`
BLAKE2b-256	`ffa191d7e97290e7caa5aee5cdf600320b9c4184a5fb954139c29d014637be6f`

See more details on using hashes here.

File details

Details for the file parfum-0.1.0-py3-none-any.whl.

File metadata

Download URL: parfum-0.1.0-py3-none-any.whl
Upload date: Jan 24, 2026
Size: 19.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for parfum-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b37daa37ad3f1b63a24d71b3bfda5a08a17201d2dc4d903aaca998c6f0543a5d`
MD5	`b11505c6b5341b10f4c14861346bcd49`
BLAKE2b-256	`67c18a909cad69421fa5bce29272a4110fd5f37ce2b89f0eb65a17181dbf569e`

See more details on using hashes here.

parfum 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Parfum

What's this for?

Getting started

Basic usage

The five strategies

What it detects

Working with chat data

Processing files

Command line

Custom patterns

Different strategies per type

Without spaCy (lightweight mode)

Batch processing

Detection only

Reproducibility

Locales

How masking works

Installation notes

License

Bugs?

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes