Sanitext

Sanitize text from LLMs

Sanitext is a command-line tool and Python library for detecting and removing unwanted characters in text. It supports:

  • ASCII-only sanitization (default)
  • Custom character allowlists (--allow-chars, --allow-file)
  • Interactive review of non-allowed characters (--interactive)
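
To make the allowlist idea concrete, here is a minimal standard-library sketch of ASCII-only detection with an extendable allowlist. It is not sanitext's implementation: the names `DEFAULT_ALLOWED`, `find_disallowed`, and `drop_disallowed` are hypothetical, the default set is assumed to be printable ASCII, and this sketch simply drops disallowed characters rather than mapping them to ASCII look-alikes.

```python
import string
import unicodedata

# Assumption: the default allowlist is printable ASCII.
# sanitext's actual default set may differ.
DEFAULT_ALLOWED = frozenset(string.printable)


def find_disallowed(text, allowed=DEFAULT_ALLOWED):
    """Return (character, Unicode name) pairs for each distinct
    disallowed character, in order of first appearance."""
    seen = []
    for ch in text:
        if ch not in allowed and ch not in seen:
            seen.append(ch)
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in seen]


def drop_disallowed(text, allowed=DEFAULT_ALLOWED):
    """Remove every character that is not in the allowlist."""
    return "".join(ch for ch in text if ch in allowed)


sample = "2\u00d73 \u2013 4"  # "2×3 – 4"
print(find_disallowed(sample))
# Extending the allowlist keeps the multiplication sign.
print(drop_disallowed(sample, DEFAULT_ALLOWED | {"\u00d7"}))
```

Keeping the allowlist as a plain set means that extending it is a set union, in the same spirit as the --allow-chars and --allow-file options.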

Installation

pip install sanitext

By default, sanitext reads its input from the clipboard; pass --string to supply the text directly.

CLI usage example

# Process the clipboard content & copy back to clipboard
sanitext
# Detect characters but don't modify
sanitext --detect
# Process clipboard + show detected characters (most common command)
sanitext -v
# Process clipboard + show input, detected characters & output
sanitext -vv
# Process the provided string and print it
sanitext --string "Héllø, 𝒲𝑜𝓇𝓁𝒹!"
# Allow additional characters (for now, only characters that are a single Unicode code point)
sanitext --allow-chars "αøñç"
# Allow characters from a file
sanitext --allow-file allowed_chars.txt
# Allow single code point emoji
sanitext --allow-emoji
# Prompt user for handling disallowed characters
# y (Yes) -> keep it
# n (No) -> remove it
# r (Replace) -> provide a replacement character
sanitext --interactive

Python library usage example

from sanitext.text_sanitization import (
    sanitize_text,
    detect_suspicious_characters,
    get_allowed_characters,
)

text = "“2×3 – 4 = 5”😎󠅒󠅟󠅣󠅣"

# Detect suspicious characters
suspicious_characters = detect_suspicious_characters(text)
print(f"Suspicious characters: {suspicious_characters}")
# [('“', 'LEFT DOUBLE QUOTATION MARK'), ('×', 'MULTIPLICATION SIGN'), ('–', 'EN DASH'), ('”', 'RIGHT DOUBLE QUOTATION MARK')]

# Sanitize text to all ASCII
sanitized_text = sanitize_text(text)
print(f"Sanitized text: {sanitized_text}")  # "2x3 - 4 = 5"
# Allow the multiplication sign
allowed_characters = get_allowed_characters()
allowed_characters.add("×")
sanitized_text = sanitize_text(text, allowed_characters=allowed_characters)
print(f"Sanitized text: {sanitized_text}")  # "2×3 - 4 = 5"
# Allow the emoji, but strip the hidden message ("boss") encoded in the invisible characters that follow it
allowed_characters = get_allowed_characters(allow_emoji=True)
sanitized_text = sanitize_text(text, allowed_characters=allowed_characters)
print(f"Sanitized text: {sanitized_text}")  # "2x3 - 4 = 5"😎
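
The hidden "boss" message in the example above rides on invisible Unicode variation selectors placed after the emoji. The sketch below shows how such a payload can be decoded and stripped, assuming the commonly used mapping of bytes 0..15 to U+FE00..U+FE0F and bytes 16..255 to U+E0100..U+E01EF. The helper names are hypothetical, and this is not necessarily how sanitext handles these characters internally.

```python
def is_variation_selector(ch):
    """True for code points in either Unicode variation selector block."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def hidden_bytes(text):
    """Decode bytes carried by variation selectors (assumed mapping)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            out.append(cp - 0xFE00)          # bytes 0..15
        elif 0xE0100 <= cp <= 0xE01EF:
            out.append(cp - 0xE0100 + 16)    # bytes 16..255
    return bytes(out)


def strip_variation_selectors(text):
    """Drop all variation selectors, keeping the visible characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# Sunglasses emoji followed by four invisible selectors.
sample = "\U0001F60E\U000E0152\U000E015F\U000E0163\U000E0163"
print(hidden_bytes(sample).decode("ascii"))        # boss
print(len(sample), len(strip_variation_selectors(sample)))  # 5 1
```

Note that U+FE0F also appears legitimately in many emoji sequences, so a real tool would need to decide when a selector is presentation data and when it is a payload; this sketch strips them all.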

Dev setup

# Install dependencies
poetry install
# Use it
poetry run python sanitext/cli.py --help
poetry run python sanitext/cli.py --string "your string"
# Run tests
poetry run pytest
poetry run pytest -s tests/test_cli.py
# Run tests across different Python versions (TODO: set up a GitHub Action)
poetry run tox
# Publish to PyPI
poetry build
poetry publish
