Sanitext
Sanitize text from LLMs
Sanitext is a command-line tool and Python library for detecting and removing unwanted characters in text. It supports:
- ASCII-only sanitization (default)
- Custom character allowlists (--allow-chars, --allow-file)
- Interactive review of non-allowed characters (--interactive)
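The ASCII-only default and the allowlist idea can be pictured with a small standard-library sketch. This is not sanitext's implementation: `find_disallowed` and `strip_disallowed` are hypothetical helpers, printable ASCII is assumed as the default allowlist, and the sketch simply drops disallowed characters, whereas the library examples further down show sanitext mapping some of them to ASCII look-alikes (for instance × to x).

```python
import string
import unicodedata

# Assumption for this sketch: the default allowlist is printable ASCII.
DEFAULT_ALLOWED = set(string.printable)


def find_disallowed(text, allowed=DEFAULT_ALLOWED):
    """Return (character, Unicode name) pairs for characters outside the allowlist."""
    found = []
    seen = set()
    for ch in text:
        if ch not in allowed and ch not in seen:
            seen.add(ch)
            found.append((ch, unicodedata.name(ch, "UNKNOWN")))
    return found


def strip_disallowed(text, allowed=DEFAULT_ALLOWED):
    """Drop every character that is not in the allowlist."""
    return "".join(ch for ch in text if ch in allowed)


text = "H\u00e9ll\u00f8, W\u00f6rld!"  # "Héllø, Wörld!"
print(find_disallowed(text))
print(strip_disallowed(text))
# Extending the allowlist keeps the extra character.
print(strip_disallowed(text, DEFAULT_ALLOWED | {"\u00f8"}))
```

Keeping the allowlist as a plain set makes it cheap to extend with extra characters, which is roughly what --allow-chars and --allow-file expose on the command line.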
Installation
pip install sanitext
By default, sanitext reads the text in your clipboard; pass --string to process a specific string instead.
CLI usage example
# Process the clipboard content & copy back to clipboard
sanitext
# Detect characters but don't modify
sanitext --detect
# Process clipboard + show detected characters (most common command)
sanitext -v
# Process clipboard + show input, detected characters & output
sanitext -vv
# Process the provided string and print it
sanitext --string "Héllø, 𝒲𝑜𝓇𝓁𝒹!"
# Allow additional characters (for now, only characters that are a single Unicode code point)
sanitext --allow-chars "αøñç"
# Allow characters from a file
sanitext --allow-file allowed_chars.txt
# Allow single code point emoji
sanitext --allow-emoji
# Prompt user for handling disallowed characters
# y (Yes) -> keep it
# n (No) -> remove it
# r (Replace) -> provide a replacement character
sanitext --interactive
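The keep / remove / replace choices offered by --interactive can be sketched as a small decision loop. This is only an illustration under assumptions: `review` and `prompt` are hypothetical names, printable ASCII stands in for the allowlist, and asking about each distinct character only once is a design choice of this sketch, not confirmed behavior of the tool.

```python
import string

# Assumption for this sketch: the allowlist is printable ASCII.
ASCII_ALLOWED = set(string.printable)


def review(text, decide, allowed=ASCII_ALLOWED):
    """Apply a keep/remove/replace decision to each disallowed character.

    decide(ch) returns "y" (keep), "n" (remove), or ("r", replacement).
    Decisions are cached, so each distinct character is decided once.
    """
    decisions = {}
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
            continue
        if ch not in decisions:
            decisions[ch] = decide(ch)
        choice = decisions[ch]
        if choice == "y":
            out.append(ch)          # keep the character as-is
        elif choice == "n":
            continue                # drop the character
        else:
            out.append(choice[1])   # ("r", replacement)
    return "".join(out)


def prompt(ch):
    """Ask on the terminal; anything other than y or r counts as 'remove'."""
    answer = input(f"Keep {ch!r}? [y/n/r] ").strip().lower()
    if answer == "r":
        return ("r", input("Replacement: "))
    return "y" if answer == "y" else "n"


# Scripted decisions stand in for the terminal prompt here.
scripted = {"\u00e9": "y", "\u00f8": ("r", "o"), "\u2013": "n"}
print(review("H\u00e9ll\u00f8 \u2013 hi", scripted.get))
```

Passing the decision function as a parameter keeps the loop testable; calling `review(text, prompt)` would use the terminal prompt instead of the scripted answers.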
Python library usage example
from sanitext.text_sanitization import (
    sanitize_text,
    detect_suspicious_characters,
    get_allowed_characters,
)
text = "“2×3 – 4 = 5”😎󠅒󠅟󠅣󠅣"
# Detect suspicious characters
suspicious_characters = detect_suspicious_characters(text)
print(f"Suspicious characters: {suspicious_characters}")
# [('“', 'LEFT DOUBLE QUOTATION MARK'), ('×', 'MULTIPLICATION SIGN'), ('–', 'EN DASH'), ('”', 'RIGHT DOUBLE QUOTATION MARK')]
# Sanitize text to all ASCII
sanitized_text = sanitize_text(text)
print(f"Sanitized text: {sanitized_text}") # "2x3 - 4 = 5"
# Allow the multiplication sign
allowed_characters = get_allowed_characters()
allowed_characters.add("×")
sanitized_text = sanitize_text(text, allowed_characters=allowed_characters)
print(f"Sanitized text: {sanitized_text}") # "2×3 - 4 = 5"
# Allow the emoji (but strip the hidden message "boss" encoded after it)
allowed_characters = get_allowed_characters(allow_emoji=True)
sanitized_text = sanitize_text(text, allowed_characters=allowed_characters)
print(f"Sanitized text: {sanitized_text}") # "2x3 - 4 = 5"😎
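The hidden "boss" message in the example above rides on invisible characters that follow the emoji. The sketch below shows one way such a payload can be revealed and stripped, assuming the bytes are carried by Unicode variation selectors (U+FE00..U+FE0F for byte values 0..15, U+E0100..U+E01EF for 16..255). That mapping is an assumption consistent with the "boss" comment; `hidden_bytes` and `strip_variation_selectors` are hypothetical helpers, not part of sanitext.

```python
def is_variation_selector(ch):
    """True for characters in either Unicode variation selector block."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def hidden_bytes(text):
    """Collect the bytes carried by variation selectors (assumed mapping)."""
    data = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:        # VS1..VS16 -> bytes 0..15
            data.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:    # VS17..VS256 -> bytes 16..255
            data.append(cp - 0xE0100 + 16)
    return bytes(data)


def strip_variation_selectors(text):
    """Remove every variation selector, leaving the visible characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# A sunglasses emoji followed by four invisible selectors.
carrier = "\U0001F60E" + "\U000E0152\U000E015F\U000E0163\U000E0163"
print(hidden_bytes(carrier).decode("utf-8"))                   # boss
print(len(carrier), len(strip_variation_selectors(carrier)))   # 5 1
```

Note that stripping U+FE0F unconditionally can change how some emoji render, so a real sanitizer may want to treat that selector more carefully than this sketch does.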
Dev setup
# Install dependencies
poetry install
# Use it
poetry run python sanitext/cli.py --help
poetry run python sanitext/cli.py --string "your string"
# Run tests
poetry run pytest
poetry run pytest -s tests/test_cli.py
# Run tests across different Python versions (TODO: set up a GitHub Action)
poetry run tox
# Publish to PyPI
poetry build
poetry publish