Hostile-text normalization, inspection, and cleaning for LLM-adjacent systems.
Project description
textguard
Hostile-text normalization, inspection, and cleaning for LLM-adjacent systems.
textguard extracts the reusable text-defense work from shisad into a standalone Python package that can scan and clean untrusted text inputs — prompts, Markdown, skill files, and other content — without dragging in daemon or framework dependencies.
Protection Tiers
textguard is organized into three tiers. Each adds detection capability on top of the previous.
| Tier | Install | What it detects | Footprint |
|---|---|---|---|
| Core | pip install textguard |
Invisible chars, bidi abuse, tag chars, soft hyphens, variation selectors, zalgo, homoglyphs/mixed-script, encoding layer abuse (URL, HTML entity, ROT13, base64, Unicode escapes, hex escapes, Punycode) | stdlib-only, small |
| YARA | pip install 'textguard[yara]' |
Pattern-based detection: prompt injection phrases, tool spoofing tags, custom signatures. Runs against both raw and decoded text. | +~6 MB (yara-python) |
| PromptGuard | pip install 'textguard[promptguard]' |
Semantic prompt injection / jailbreak classification via ONNX model. | +~27 MB wheels (onnxruntime, transformers) + ~295 MB model on first fetch |
Core has zero runtime dependencies. Heavy backends are always optional extras.
Design Constraints
- Legitimate multilingual Unicode text is a first-class use case.
- Lossy transforms are always explicit opt-in.
- "Convert everything to ASCII" is not an acceptable default.
- All decode paths have bounded depth and expansion limits.
- Findings carry severity levels (
info,warn,error) — there is no opaque aggregate risk score. - The package does not silently download models or make network requests.
Python API
from textguard import scan, clean, TextGuard
# Quick functional API — uses defaults, zero config
result = scan(text) # ScanResult
cleaned = clean(text) # CleanResult
# Configured instance — reusable, carries backend state
guard = TextGuard(
preset="strict",
split_tokens=True,
yara_rules_dir="./rules/",
promptguard_model_path="~/.local/share/textguard/models/promptguard2/",
)
result = guard.scan(text) # ScanResult
cleaned = guard.clean(text) # CleanResult
semantic = guard.score_semantic(text) # SemanticResult
Top-level scan() and clean() are thin wrappers around TextGuard with default settings. For repeated calls or backend-enabled scanning, create a TextGuard instance.
Results
scan() returns a ScanResult — findings, decode metadata, and optional semantic classification:
result = scan(text)
for f in result.findings:
print(f"{f.severity}: {f.kind} at offset {f.offset} — {f.detail}")
clean() returns a CleanResult — the cleaned text plus a report of what changed:
cleaned = clean(text)
print(cleaned.text) # the safe output
for c in cleaned.changes:
print(f" {c.kind}: {c.detail}")
Presets
Presets control how aggressive cleaning is:
| Preset | Normalization | Strips | Decodes | Use case |
|---|---|---|---|---|
| default | NFC | Tag chars, soft hyphens, whitespace collapse, combining mark cap | No | Safe for all multilingual text including CJK |
| strict | NFKC | All invisibles, bidi, variation selectors, tag chars, soft hyphens | All seven layers | Skill files, prompts, contexts where hidden content is suspect |
| ascii | NFKC + ASCII transliteration | Everything non-ASCII | All seven layers | When you explicitly want ASCII-only output |
The default preset preserves legitimate multilingual text. NFKC is not the default because it destroys semantic content in Japanese and other scripts. Strict and ascii presets opt into progressively more aggressive cleaning.
Scan-time analysis is intentionally more aggressive than clean-time rewriting. scan() always strips hostile formatting and unwinds bounded encodings for analysis; presets control what clean() rewrites into the returned output.
Split-token smuggling detection is opt-in. Enable it with TextGuard(split_tokens=True), TEXTGUARD_SPLIT_TOKENS=1, config file split_tokens = true, or CLI --split-tokens.
CLI
# Scan — read-only, report findings
textguard scan SKILL.md
textguard scan SKILL.md --json
textguard scan docs/*.md --json > report.json
# Clean — output sanitized text
textguard clean SKILL.md # cleaned text to stdout
textguard clean SKILL.md -i # overwrite in place
textguard clean SKILL.md -o out.md # write to file
textguard clean SKILL.md -i --report # overwrite, human-readable report to stderr
cat untrusted.txt | textguard clean - # pipe from stdin
# Presets
textguard clean SKILL.md --preset strict
textguard clean SKILL.md --preset ascii
# Optional backends
textguard scan --yara-rules ./rules/ SKILL.md
textguard scan --no-yara-bundled SKILL.md
textguard scan --split-tokens SKILL.md
textguard scan --promptguard ~/.local/share/textguard/models/promptguard2/ SKILL.md
Exit codes from scan reflect the strongest signal in the result: structural findings map to 0 none, 1 info, 2 warn, 3 error; semantic tiers map to 0 none, 1 medium, 2 high, 3 critical. Runtime failures across subcommands return 4.
Model Management
PromptGuard requires a ~295 MB ONNX model pack (shisa-ai/promptguard2-onnx). textguard never downloads models silently.
# Fetch, verify, and install to the XDG data dir
textguard models fetch promptguard2
# Point textguard to it
export TEXTGUARD_PROMPTGUARD_MODEL=~/.local/share/textguard/models/promptguard2
# Or pass directly
textguard scan --promptguard ~/.local/share/textguard/models/promptguard2 SKILL.md
textguard models fetch promptguard2 downloads from Hugging Face via stdlib HTTP, verifies the SSH ed25519 signature against the bundled public key, and checks SHA-256 file hashes from the manifest before installing under ~/.local/share/textguard/models/promptguard2/ (or XDG_DATA_HOME if set).
Dependency Direction
- Core runtime: stdlib-only. Uses
unicodedata,argparse,hashlib,urllib.request, and vendored generated Unicode data for script ranges and confusables. - Optional YARA:
yara-python>=4.5.4 - Optional PromptGuard:
onnxruntime>=1.24.4,transformers>=5.5.3 - Pinned via floor versions in
pyproject.toml. Exact resolution through committeduv.lockwith hashes.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file textguard-0.9.0.tar.gz.
File metadata
- Download URL: textguard-0.9.0.tar.gz
- Upload date:
- Size: 105.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4fc531406b36818987420bf473917997dbd9458416feccfa6fbd65fcd424bb7f
|
|
| MD5 |
dca25dca544517275fcbcfede853297e
|
|
| BLAKE2b-256 |
339e80b3fcad865f3ec3fd3ec92ec97f7d4a47e409ba083a946ef61f12fa22ff
|
File details
Details for the file textguard-0.9.0-py3-none-any.whl.
File metadata
- Download URL: textguard-0.9.0-py3-none-any.whl
- Upload date:
- Size: 71.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
01ae63a19371a7fe5b7e281faf1fbfb3c08ee91fb2bc3cc650a7282ea8f9a5be
|
|
| MD5 |
7213941ba8c709539412b00caa4e3f4d
|
|
| BLAKE2b-256 |
34a01988e0121436edfd50e4775495dda4aaae12488137989c846d1768931832
|