Lightweight, tiered, bidirectional PII sanitizer for LLM pipelines
prompt-sanitizer
PII and secret sanitization for Python LLM pipelines.
prompt-sanitizer provides a typed API for detecting, redacting, anonymizing, and restoring sensitive values before they reach a model, tool, middleware layer, log sink, or SDK wrapper. FAST mode has zero required dependencies. SMART and FULL add optional NLP, synthetic replacement, and audit logging.
Install
Python 3.10+.
```bash
pip install prompt-sanitizer
pip install "prompt-sanitizer[nlp]"
pip install "prompt-sanitizer[synthetic]"
pip install "prompt-sanitizer[integrations]"
pip install "prompt-sanitizer[all]"
```
Optional extras
| Extra | Adds | Typical use |
|---|---|---|
| `nlp` | transformers + torch | NER in SMART/FULL mode |
| `synthetic` | faker | realistic fake replacements |
| `integrations` | framework / SDK adapters | LangChain, LlamaIndex, OpenAI, FastAPI, Django |
| `all` | all extras | full feature set |
Quick start
```python
from prompt_sanitizer import Sanitizer, Mode

s = Sanitizer(mode=Mode.FAST)
result = s.sanitize("Contact Jane Doe at jane@example.com or 415-555-0112.")

print(result.text)        # sanitized text
print(result.has_pii)     # whether sensitive data was found
print(result.risk_score)  # composite score from 0.0 to 1.0
print(result.tokens)      # {original_value: replacement} map

for entity in result.entities:
    print(entity.entity_type, entity.value, entity.replacement)
```
Modes
| Mode | Pipeline | Dependencies | Notes |
|---|---|---|---|
| `Mode.FAST` | regex + secret detectors | none | sub-ms, stdlib only |
| `Mode.SMART` | FAST + Piiranha NER | `prompt-sanitizer[nlp]` | lazy-loads on first call |
| `Mode.FULL` | SMART + synthetic replacement + audit log | usually `nlp` + `synthetic` | best for compliance-oriented flows |
FAST mode
```python
from prompt_sanitizer import Sanitizer, Mode

s = Sanitizer(mode=Mode.FAST)
text = "SSN 078-05-1120, card 4111 1111 1111 1111, token sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
result = s.sanitize(text)

print(result.text)
print(result.entities)
print(result.tokens)
```
Use FAST for prompt pre-processing, log scrubbing, middleware guards, CI checks, and zero-dependency CLI tooling.
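To make the "regex + secret detectors" idea concrete, here is a stdlib-only sketch of how such a pass could work. It is not prompt-sanitizer's implementation: the `PATTERNS` table, the `redact` helper, and the `[TYPE_n]` placeholder format are assumptions made for illustration, and the patterns are deliberately loose.

```python
import re

# Hypothetical, deliberately loose patterns; production detectors need to be stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
}


def redact(text):
    """Replace every match with a numbered placeholder.

    Returns the rewritten text and an {original_value: replacement} map.
    """
    tokens = {}
    counters = {}
    for label, pattern in PATTERNS.items():
        def swap(match, label=label):
            value = match.group(0)
            if value not in tokens:
                counters[label] = counters.get(label, 0) + 1
                tokens[value] = f"[{label}_{counters[label]}]"
            return tokens[value]

        text = pattern.sub(swap, text)
    return text, tokens


clean, mapping = redact(
    "SSN 078-05-1120, card 4111 1111 1111 1111, mail jane@example.com"
)
print(clean)    # SSN [SSN_1], card [CREDIT_CARD_1], mail [EMAIL_1]
print(mapping)
```

Reusing the same placeholder for a repeated value keeps the rewritten text consistent, which matters if the placeholders are later mapped back to the originals.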
SMART mode
```python
from prompt_sanitizer import Sanitizer, Mode

s = Sanitizer(mode=Mode.SMART)
result = s.sanitize(
    "Alice from Acme Corp met us in Berlin on 2025-02-14. Email alice@acme.example."
)

print(result.text)
for entity in result.entities:
    print(entity.entity_type, entity.value, entity.confidence)
```
Use SMART when prompts contain free-form prose with names, organizations, dates, or locations that regexes alone may miss.
FULL mode
```python
from prompt_sanitizer import Sanitizer, Mode, SQLiteAuditLog

audit = SQLiteAuditLog("prompt_sanitizer_audit.db")
s = Sanitizer(mode=Mode.FULL, locale="en_US", on_detect="redact", audit_log=audit)
result = s.sanitize("Customer Jane Doe uses jane@example.com and 415-555-0112.")

print(result.text)
print(result.tokens)
print(s.audit.export(format="json"))
```
Use FULL when you want synthetic replacement plus an audit trail.
Public API
Sanitizer
```python
Sanitizer(
    mode: Mode = Mode.FAST,
    locale: str = "en_US",
    entities: list[EntityType] | None = None,
    on_detect: str = "redact",
    audit_log: BaseAuditLog | None = None,
)
```
| Parameter | Type | Description |
|---|---|---|
| `mode` | `Mode` | detection pipeline |
| `locale` | `str` | locale for synthetic replacement generation |
| `entities` | `list[EntityType] \| None` | optional allowlist of entity types |
| `on_detect` | `str` | `"redact"`, `"warn"`, or `"block"` |
| `audit_log` | `BaseAuditLog \| None` | optional audit backend |
| Method | Signature | Description |
|---|---|---|
| `sanitize` | `sanitize(text: str, session_id: str \| None = None) -> SanitizeResult` | sanitize a single input |
| `sanitize_batch` | `sanitize_batch(texts: list[str]) -> list[SanitizeResult]` | sanitize multiple inputs |
| `session` | `session(session_id: str \| None = None) -> Session` | open a session for reversible multi-turn workflows |
| `add_entity` | `add_entity(name: str, pattern: str, confidence: float = 0.85) -> None` | register a custom entity |
| `stream` | `stream(source: AsyncIterable, session: Session \| None) -> AsyncGenerator[str, None]` | sanitize text from an async source as it streams |
| `guard` | `guard(on_detect: str) -> decorator` | decorate a function with sanitization logic |
| `audit` | `.audit -> BaseAuditLog \| None` | the configured audit backend, if any |
Detection policy
| `on_detect` value | Behavior |
|---|---|
| `"redact"` | rewrite the returned text |
| `"warn"` | return original text, but populate entities and scores |
| `"block"` | raise instead of returning sanitized text |
```python
results = s.sanitize_batch(["Email a@example.com", "No sensitive data here"])


@s.guard(on_detect="redact")
def call_model(prompt: str) -> str:
    return prompt
```
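The three policies and the `guard` decorator can be pictured with a small stdlib sketch. Everything here is illustrative rather than the library's code: `apply_policy`, `BlockedPromptError`, the single email detector, and the `[EMAIL]` placeholder are assumptions, and the exception that prompt-sanitizer actually raises for `"block"` is not documented above.

```python
import re
from functools import wraps

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # stand-in for real detection
POLICIES = {"redact", "warn", "block"}


class BlockedPromptError(ValueError):
    """Hypothetical exception type used only by this sketch."""


def apply_policy(text, on_detect):
    """Return (text_to_forward, detected_values) for the given policy."""
    if on_detect not in POLICIES:
        raise ValueError(f"unknown policy: {on_detect!r}")
    found = EMAIL.findall(text)
    if found and on_detect == "block":
        raise BlockedPromptError(f"{len(found)} sensitive value(s) detected")
    if found and on_detect == "redact":
        return EMAIL.sub("[EMAIL]", text), found
    return text, found  # "warn", or nothing detected: text passes through unchanged


def guard(on_detect):
    """Decorator factory: apply the policy to the prompt before the wrapped call."""
    def decorator(func):
        @wraps(func)
        def wrapper(prompt):
            safe_prompt, _ = apply_policy(prompt, on_detect)
            return func(safe_prompt)
        return wrapper
    return decorator


@guard(on_detect="redact")
def call_model(prompt):
    return prompt  # echo, standing in for a real model call


print(call_model("Email a@example.com"))  # Email [EMAIL]
```

With `"warn"` the caller still receives the detections and decides what to do with them; with `"block"` the wrapped function is never called.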
Mode, SanitizeResult, and DetectedEntity
| `Mode` value | Meaning |
|---|---|
| `Mode.FAST` | regex + secrets, zero deps, sub-ms |
| `Mode.SMART` | FAST + Piiranha NER, lazy loads on first call |
| `Mode.FULL` | SMART + synthetic replacement + audit log |

| `SanitizeResult` attribute | Type | Description |
|---|---|---|
| `text` | `str` | sanitized text |
| `entities` | `list[DetectedEntity]` | detected spans |
| `tokens` | `dict[str, str]` | `{original_value: replacement}` map |
| `risk_score` | `float` | composite score from 0.0 to 1.0 |
| `has_pii` | `bool` | whether sensitive data was found |

| `DetectedEntity` attribute | Type | Description |
|---|---|---|
| `entity_type` | `EntityType` | entity classification |
| `value` | `str` | original matched value |
| `start` | `int` | inclusive start offset |
| `end` | `int` | exclusive end offset |
| `confidence` | `float` | detection confidence |
| `replacement` | `str \| None` | replacement value, if generated |
```python
result = s.sanitize("Contact me at sam@example.com")

assert result.has_pii is True
assert 0.0 <= result.risk_score <= 1.0

for entity in result.entities:
    print(entity.entity_type, entity.value, entity.replacement)
```
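Because `start` is inclusive and `end` is exclusive, the offsets behave like a Python slice. The sketch below illustrates that convention with a stand-in `Span` dataclass and a hypothetical `apply_spans` helper; neither is part of prompt-sanitizer, and the email regex is only a placeholder detector.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for DetectedEntity, reduced to the fields used here."""
    value: str
    start: int        # inclusive
    end: int          # exclusive
    replacement: str


def find_emails(text):
    pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
    return [
        Span(m.group(0), m.start(), m.end(), f"[EMAIL_{i}]")
        for i, m in enumerate(pattern.finditer(text), start=1)
    ]


def apply_spans(text, spans):
    """Substitute replacements right to left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + span.replacement + text[span.end:]
    return text


text = "Contact sam@example.com or kim@example.org"
spans = find_emails(text)

# Half-open offsets slice straight back to the matched value.
assert all(text[s.start:s.end] == s.value for s in spans)
print(apply_spans(text, spans))  # Contact [EMAIL_1] or [EMAIL_2]
```

Working from the last span backwards avoids recomputing offsets after each substitution changes the length of the string.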
Sessions and vaults
Use sessions when the model should never see raw values, but the final response should restore them.
```python
from prompt_sanitizer import Sanitizer

s = Sanitizer()
session = s.session(session_id="support-chat-001")

clean_prompt = session.anonymize("My name is Elena Ruiz and my email is elena@company.com")
llm_reply = "Confirmed. I will email [EMAIL_1] shortly."
final_reply = session.deanonymize(llm_reply)

print(clean_prompt)
print(final_reply)
```
| Session API | Description |
|---|---|
| `session.anonymize(text: str) -> str` | replace PII with vault tokens |
| `session.deanonymize(text: str) -> str` | restore originals from the vault |
| `session.vault: Vault` | access the underlying vault |

| Vault API | Description |
|---|---|
| `vault.store(value: str, replacement: str) -> None` | store a mapping |
| `vault.lookup(replacement: str) -> str \| None` | resolve token to original |
| `vault.reverse(value: str) -> str \| None` | resolve original to replacement |
| `vault.clear() -> None` | clear all mappings |
```python
vault = session.vault
vault.store("alice@example.com", "[EMAIL_1]")

print(vault.lookup("[EMAIL_1]"))
print(vault.reverse("alice@example.com"))

vault.clear()
```
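One way to picture the vault is as a pair of dictionaries kept in sync, so lookups work in both directions. The `SketchVault` class below mirrors the method names in the table above, but it is a guess at the behavior rather than the library's implementation; the `restore` method is a hypothetical addition that shows how deanonymization could use the token map.

```python
from typing import Optional


class SketchVault:
    """Hypothetical two-way mapping between original values and tokens."""

    def __init__(self) -> None:
        self._by_token = {}  # token -> original value
        self._by_value = {}  # original value -> token

    def store(self, value: str, replacement: str) -> None:
        self._by_token[replacement] = value
        self._by_value[value] = replacement

    def lookup(self, replacement: str) -> Optional[str]:
        return self._by_token.get(replacement)

    def reverse(self, value: str) -> Optional[str]:
        return self._by_value.get(value)

    def clear(self) -> None:
        self._by_token.clear()
        self._by_value.clear()

    def restore(self, text: str) -> str:
        """Swap every known token in the text back to its original value."""
        for token, value in self._by_token.items():
            text = text.replace(token, value)
        return text


vault = SketchVault()
vault.store("elena@company.com", "[EMAIL_1]")

restored = vault.restore("Confirmed. I will email [EMAIL_1] shortly.")
print(restored)  # Confirmed. I will email elena@company.com shortly.
```

Tokens the vault does not know are left untouched, so a reply that invents a placeholder is passed through unchanged instead of being resolved to the wrong value.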
Custom entities
Use add_entity() for internal identifiers, tenant-specific secrets, or domain-specific formats.
```python
from prompt_sanitizer import Sanitizer

s = Sanitizer()
s.add_entity(name="customer_id", pattern=r"\bCUS-\d{8}\b", confidence=0.90)
s.add_entity(name="invoice_no", pattern=r"\bINV-\d{6}-[A-Z]{2}\b", confidence=0.88)

result = s.sanitize("Customer CUS-12345678 opened invoice INV-882211-US")
print(result.text)
print(result.entities)
```
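Before registering a pattern, it can help to check it against known matches and near-misses using the standard `re` module. This sketch assumes that `add_entity()` interprets patterns with Python `re` semantics, which the text above does not state explicitly; the near-miss strings are made up for the check.

```python
import re

CUSTOMER_ID = re.compile(r"\bCUS-\d{8}\b")
INVOICE_NO = re.compile(r"\bINV-\d{6}-[A-Z]{2}\b")

sample = "Customer CUS-12345678 opened invoice INV-882211-US"

# Expected matches from the sample sentence.
print(CUSTOMER_ID.findall(sample))  # ['CUS-12345678']
print(INVOICE_NO.findall(sample))   # ['INV-882211-US']

# Near-misses that the word boundaries and quantifiers should reject.
near_misses = {
    "CUS-1234567": CUSTOMER_ID,      # 7 digits instead of 8
    "CUS-123456789": CUSTOMER_ID,    # 9 digits: \b fails after the 8th
    "XCUS-12345678": CUSTOMER_ID,    # no word boundary before "CUS"
    "INV-882211-us": INVOICE_NO,     # lowercase country suffix
}
for candidate, pattern in near_misses.items():
    assert pattern.search(candidate) is None, candidate
```

The `\b` anchors are what stop the patterns from matching inside longer identifiers, so they are worth keeping when adapting these expressions.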
Filtering by entity type
```python
from prompt_sanitizer import Sanitizer, EntityType

# Only EMAIL and API_KEY are in the allowlist.
s = Sanitizer(entities=[EntityType.EMAIL, EntityType.API_KEY])
result = s.sanitize("Email a@b.com and SSN 123-45-6789")
print(result.text)
```
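The allowlist idea can be sketched without the library: keep a detector per entity kind and run only the ones that were requested. The `Kind` enum, the `DETECTORS` table, and the `redact` helper are stand-ins invented for this example, not prompt-sanitizer's internals.

```python
import re
from enum import Enum


class Kind(Enum):
    """Stand-in for EntityType with two members."""
    EMAIL = "EMAIL"
    SSN = "SSN"


DETECTORS = {
    Kind.EMAIL: re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    Kind.SSN: re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text, entities=None):
    """Redact only the listed kinds; None means every known kind."""
    allowed = set(DETECTORS) if entities is None else set(entities)
    for kind, pattern in DETECTORS.items():
        if kind in allowed:
            text = pattern.sub(f"[{kind.value}]", text)
    return text


sample = "Email a@b.com and SSN 123-45-6789"
print(redact(sample, entities=[Kind.EMAIL]))  # Email [EMAIL] and SSN 123-45-6789
print(redact(sample))                         # Email [EMAIL] and SSN [SSN]
```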
Audit logging
Audit backends are optional. Use them when you want structured records of detections.
MemoryAuditLog
```python
from prompt_sanitizer import MemoryAuditLog, Mode, Sanitizer

audit = MemoryAuditLog()
s = Sanitizer(mode=Mode.FULL, audit_log=audit)
s.sanitize("Email finance@example.com")

print(audit.events())
print(audit.export(format="json"))
```
SQLiteAuditLog
```python
from prompt_sanitizer import SQLiteAuditLog, Mode, Sanitizer

audit = SQLiteAuditLog("audit.db")
s = Sanitizer(mode=Mode.FULL, audit_log=audit)
s.sanitize("Call +1 415 555 0112", session_id="request-17")

print(audit.events())
print(audit.export(format="csv"))
```
Audit API
| API | Description |
|---|---|
| `MemoryAuditLog()` | in-memory list of `AuditEvent` |
| `SQLiteAuditLog(path: str)` | SQLite-backed persisted log |
| `.events() -> list[AuditEvent]` | return recorded events |
| `.export(format: "json" \| "csv") -> str` | export audit records |
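For a sense of what a JSON or CSV export might involve, here is a stdlib-only sketch. The fields of the real `AuditEvent` are not listed above, so `SketchEvent` and its three fields are invented for illustration, as is the standalone `export` function.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class SketchEvent:
    """Hypothetical event record; the real AuditEvent may differ."""
    session_id: str
    entity_type: str
    replacement: str


def export(events, format):
    """Serialize events as a JSON array or as CSV with a header row."""
    rows = [asdict(event) for event in events]
    if format == "json":
        return json.dumps(rows, indent=2)
    if format == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=[f.name for f in fields(SketchEvent)]
        )
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {format!r}")


events = [SketchEvent("request-17", "PHONE", "[PHONE_1]")]
print(export(events, "json"))
print(export(events, "csv"))
```

The sketch records the entity type and replacement rather than the raw value, on the assumption that an audit trail should not itself become a store of sensitive data.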
Integrations
Install integration dependencies first:
```bash
pip install "prompt-sanitizer[integrations]"
```
LangChain
```python
from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.langchain import PromptSanitizerRunnable, SanitizedLLM

s = Sanitizer()

# `llm` and `OutputParser` are placeholders for a LangChain model and
# output parser that you have already created or imported.

# As a runnable step in a chain
chain = PromptSanitizerRunnable(sanitizer=s) | llm | OutputParser()
result = chain.invoke("My email is dev@example.com")

# Or wrap the LLM directly
safe_llm = SanitizedLLM(llm, s)
reply = safe_llm.invoke("Contact alice@example.com with the summary.")
```
LlamaIndex
```python
from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.llamaindex import PromptSanitizerPostprocessor

s = Sanitizer()
postprocessor = PromptSanitizerPostprocessor(sanitizer=s)

# `index` is a placeholder for a LlamaIndex index you have already built.
query_engine = index.as_query_engine(node_postprocessors=[postprocessor])
response = query_engine.query("Summarize the contract for jane@example.com")
```
OpenAI SDK wrapper
```python
import openai

from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.openai import wrap

s = Sanitizer()
client = wrap(openai.OpenAI(), sanitizer=s)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "My card is 4111 1111 1111 1111"}],
)
```
FastAPI middleware
```python
from fastapi import FastAPI

from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.fastapi import SanitizerMiddleware

s = Sanitizer()
app = FastAPI()
app.add_middleware(SanitizerMiddleware, sanitizer=s, fields=["prompt", "message"])
```
Django middleware
```python
# Django settings module
from prompt_sanitizer import Sanitizer

# Add this entry alongside your existing middleware.
MIDDLEWARE = ["prompt_sanitizer.integrations.django.SanitizerMiddleware"]

PROMPT_SANITIZER = {
    "sanitizer": Sanitizer(),
    "fields": ["prompt", "message"],
}
```
Entity types
| Group | Values |
|---|---|
| core PII | EMAIL, PHONE, SSN, CREDIT_CARD, IBAN, IP_ADDRESS, URL, DATE |
| identity / org | PERSON_NAME, ORGANIZATION, LOCATION |
| secrets | API_KEY, JWT_TOKEN, SECRET_KEY, AWS_KEY, GITHUB_TOKEN, OPENAI_KEY, ANTHROPIC_KEY |
| extension | CUSTOM |
Operational notes
- FAST mode is stdlib-only.
- SMART lazy-loads NER on first use.
- FULL is the best fit for synthetic replacement plus audit.
- `sanitize()` is for one-shot calls.
- `session()` is for reversible multi-turn workflows.
- `sanitize_batch()` treats each input independently.
License
MIT.