This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
Prompt Groomer
🧹 A lightweight Python library for optimizing and cleaning LLM inputs. Save 10-20% on API costs by removing invisible tokens, stripping HTML, and redacting PII.
⭐ If you find this useful, please star us on GitHub! ⭐
🎯 Perfect for:
RAG Applications • Chatbots • Document Processing • Production LLM Apps • Cost Optimization
Why use Prompt Groomer?
Stop paying for invisible tokens and dirty data.
| Feature | Before (Dirty Input) | After (Groomed) |
|---|---|---|
| HTML Cleaning | `<div><b>Hello</b> world</div>` | `Hello world` |
| Whitespace | `User input\n\n\n here` | `User input here` |
| PII Redaction | `Call me at 555-0199` | `Call me at [PHONE]` |
| Deduplication | `Same text.\n\nSame text.\n\nDifferent.` | `Same text.\n\nDifferent.` |
| Token Cost | ❌ 150 Tokens | ✅ 85 Tokens (Saved 43%) |
📦 It's this easy:
from prompt_groomer import StripHTML, NormalizeWhitespace
cleaned = (StripHTML() | NormalizeWhitespace()).run(dirty_input)
✨ Key Features
- 🪶 Zero Dependencies - Lightweight core with no external dependencies
- 🔧 Modular Design - 4 focused modules: Cleaner, Compressor, Scrubber, Analyzer
- ⚡ Production Ready - Battle-tested operations with comprehensive test coverage
- 🎯 Type Safe - Full type hints for better IDE support and fewer bugs
- 📦 Easy to Use - Modern pipe operator syntax (`|`), compose operations like LEGO blocks
Overview
Prompt Groomer helps you clean and optimize prompts before sending them to LLM APIs. By removing unnecessary whitespace, duplicate characters, and other inefficiencies, you can:
- Reduce token usage and API costs
- Improve prompt quality and consistency
- Process inputs more efficiently
Status
This project is in early development. Features are being added iteratively.
Installation
# Using uv (recommended)
uv pip install prompt-groomer
# Using pip
pip install prompt-groomer
Quick Start
from prompt_groomer import StripHTML, NormalizeWhitespace, TruncateTokens
# ✨ The Pythonic "Pipe" Syntax (Recommended)
pipeline = (
    StripHTML()
    | NormalizeWhitespace()
    | TruncateTokens(max_tokens=1000)
)
raw_input = "<div> User input with <b>lots</b> of spaces... </div>"
clean_prompt = pipeline.run(raw_input)
# Output: "User input with lots of spaces..."
Alternative: Fluent API
Prefer method chaining? Use the traditional fluent API:
from prompt_groomer import Groomer, StripHTML, NormalizeWhitespace, TruncateTokens
pipeline = (
    Groomer()
    .pipe(StripHTML())
    .pipe(NormalizeWhitespace())
    .pipe(TruncateTokens(max_tokens=1000))
)
clean_prompt = pipeline.run(raw_input)
💡 Why the pipe operator? It is more concise, Pythonic, and familiar to LangChain/LangGraph users.
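For readers curious how `|` composition can work in Python, here is a minimal sketch built on `__or__`. It is not Prompt Groomer's actual implementation: the `Operation`, `Pipeline`, `Lower`, and `Strip` classes are hypothetical names invented for this illustration.

```python
from typing import List


class Operation:
    """Hypothetical base class: one text-to-text transformation."""

    def process(self, text: str) -> str:
        raise NotImplementedError

    def run(self, text: str) -> str:
        return self.process(text)

    def __or__(self, other: "Operation") -> "Pipeline":
        # `a | b` builds a pipeline that applies a, then b.
        return Pipeline([self, other])


class Pipeline(Operation):
    """Hypothetical container that applies its steps left to right."""

    def __init__(self, steps: List[Operation]) -> None:
        self.steps = list(steps)

    def process(self, text: str) -> str:
        for step in self.steps:
            text = step.process(text)
        return text

    def __or__(self, other: Operation) -> "Pipeline":
        # Extending a pipeline keeps it flat instead of nesting pipelines.
        return Pipeline(self.steps + [other])


class Lower(Operation):
    def process(self, text: str) -> str:
        return text.lower()


class Strip(Operation):
    def process(self, text: str) -> str:
        return text.strip()


pipeline = Lower() | Strip()
print(pipeline.run("  Hello World  "))  # hello world
```

Because `|` is left-associative, `a | b | c` first builds a two-step pipeline and then appends `c`, so the steps run in the order they are written.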
📊 Proven Effectiveness
We benchmarked Prompt Groomer on 30 real-world test cases (SQuAD + RAG scenarios) to measure token reduction and response quality:
| Strategy | Token Reduction | Quality (Cosine) | Judge Approval | Overall Equivalent |
|---|---|---|---|---|
| Minimal | 4.3% | 0.987 | 86.7% | 86.7% |
| Standard | 4.8% | 0.984 | 90.0% | 86.7% |
| Aggressive | 15.0% | 0.964 | 80.0% | 66.7% |
Key Insights:
- Aggressive strategy achieves about 3.5x the savings of Minimal (15.0% vs 4.3%) while keeping cosine similarity at 0.964
- Individual RAG tests showed 17-74% token savings with aggressive strategy
- Deduplicate (Standard) shows minimal gains on typical RAG contexts
- TruncateTokens (Aggressive) provides the largest cost reduction for long contexts
- Trade-off: More aggressive = more savings, but lower judge approval (80.0% vs 86.7-90.0%) and lower overall equivalence (66.7% vs 86.7%)
Example: RAG with duplicates
- Minimal (HTML + Whitespace): 17% reduction
- Standard (+ Deduplicate): 31% reduction
- Aggressive (+ Truncate 150 tokens): 49% reduction 🎉
💰 Cost Savings: At 1M input tokens/month, a 15% reduction removes 150K tokens per month. Assuming GPT-4 input pricing of $30 per 1M tokens, that is ~$4.50/month (~$54/year).
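The arithmetic behind a savings estimate like this is easy to reproduce. The sketch below assumes a price of $30 per 1M input tokens, which is an illustrative figure only, so substitute your model's current rate. `monthly_savings_usd` is a hypothetical helper, not part of the library. Under that assumed price, 150K tokens saved per month comes to about $4.50 per month, or about $54 over twelve months.

```python
def monthly_savings_usd(
    tokens_per_month: int, reduction: float, usd_per_million: float
) -> float:
    """Dollar value of the input tokens removed in one month."""
    tokens_saved = tokens_per_month * reduction
    return tokens_saved / 1_000_000 * usd_per_million


# Assumed price: $30 per 1M input tokens (illustrative, not a quoted rate).
per_month = monthly_savings_usd(1_000_000, 0.15, 30.0)
per_year = per_month * 12
print(f"${per_month:.2f}/month, ${per_year:.2f}/year")
```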
📖 See full benchmark: benchmark/custom/README.md
🎮 Interactive Demo
Try prompt-groomer in your browser - no installation required!
Play with different strategies, see real-time token savings, and find the perfect configuration for your use case. Features:
- 🎯 6 preset examples (e-commerce, support tickets, docs, RAG, etc.)
- ⚡ Quick strategy presets (Minimal, Standard, Aggressive)
- 💰 Real-time cost savings calculator
- 🔧 All 7 operations configurable
- 📊 Visual metrics dashboard
4 Core Modules
Prompt Groomer is organized into 4 specialized modules:
1. Cleaner - Clean Dirty Data
- `StripHTML()` - Remove HTML tags, convert to Markdown
- `NormalizeWhitespace()` - Collapse excessive whitespace
- `FixUnicode()` - Remove zero-width spaces and problematic Unicode
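To make the Cleaner steps concrete, here is a rough standard-library approximation of the three operations. It is a sketch under simplifying assumptions, not the library's code: the function names are hypothetical, the HTML step only drops tags (no Markdown conversion), and the Unicode step only removes a handful of zero-width characters.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(text: str) -> str:
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def normalize_whitespace(text: str) -> str:
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


# Zero-width space, non-joiner, joiner, word joiner, and BOM.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def fix_unicode(text: str) -> str:
    # Mapping a code point to None in str.translate deletes it.
    return text.translate(_ZERO_WIDTH)


dirty = "<div><b>Hello</b>   world\u200b</div>\n\n\n"
clean = fix_unicode(normalize_whitespace(strip_html(dirty)))
print(clean)  # Hello world
```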
2. Compressor - Reduce Size
- `TruncateTokens()` - Smart truncation with sentence boundaries
  - Strategies: `"head"`, `"tail"`, `"middle_out"`
- `Deduplicate()` - Remove similar content (great for RAG)
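The two Compressor ideas can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the library's algorithm: whitespace-separated words stand in for tokens, sentence boundaries are ignored, `"middle_out"` is assumed to mean "keep the start and the end, drop the middle", and similarity is measured with `difflib.SequenceMatcher` on paragraphs. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def truncate_words(text: str, max_tokens: int, strategy: str = "head") -> str:
    """Keep at most max_tokens whitespace-separated words (a crude token proxy)."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    if strategy == "head":
        kept = words[:max_tokens]
    elif strategy == "tail":
        kept = words[-max_tokens:] if max_tokens > 0 else []
    elif strategy == "middle_out":
        # Assumed meaning: keep the start and the end, dropping the middle.
        head = (max_tokens + 1) // 2
        tail = max_tokens - head
        kept = words[:head] + (words[-tail:] if tail else [])
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return " ".join(kept)


def deduplicate(text: str, similarity_threshold: float = 0.85) -> str:
    """Drop paragraphs that are near-duplicates of an earlier paragraph."""
    kept: List[str] = []
    for para in text.split("\n\n"):
        is_dup = any(
            SequenceMatcher(None, para, seen).ratio() >= similarity_threshold
            for seen in kept
        )
        if not is_dup:
            kept.append(para)
    return "\n\n".join(kept)


print(truncate_words("one two three four five six", 4, "middle_out"))
# one two five six
print(deduplicate("Same text.\n\nSame text.\n\nDifferent.").split("\n\n"))
# ['Same text.', 'Different.']
```

Comparing every paragraph against every kept paragraph is quadratic, which is fine for a handful of retrieved chunks but would need a cheaper similarity measure for large inputs.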
3. Scrubber - Security & Privacy
- `RedactPII()` - Automatically redact emails, phones, IPs, credit cards, URLs, SSNs
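Regex-based redaction of this kind can be sketched in a few lines. The patterns below are deliberately simple and cover only emails and US-style phone numbers. They are assumptions made for illustration, not the patterns the library uses, and `redact_pii` is a hypothetical function name.

```python
import re
from typing import Dict, Iterable, Optional

# Deliberately simple patterns for illustration; real PII detection needs more care.
_PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str, redact_types: Optional[Iterable[str]] = None) -> str:
    """Replace each match of the selected patterns with a [TYPE] placeholder."""
    types = list(_PATTERNS) if redact_types is None else list(redact_types)
    for name in types:
        text = _PATTERNS[name].sub(f"[{name.upper()}]", text)
    return text


print(redact_pii("Call me at 555-0199 or mail bob@example.com"))
# Call me at [PHONE] or mail [EMAIL]
```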
4. Analyzer - Show Value
- `CountTokens()` - Track token savings and optimization impact
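The savings figure an analyzer reports is a before/after comparison. A minimal sketch, assuming whitespace-separated words as a stand-in for tokens (a real tokenizer will give different counts) and using a hypothetical `token_stats` helper:

```python
def token_stats(original: str, cleaned: str) -> str:
    """Report word counts before and after cleaning, plus the percentage saved."""
    # Whitespace-split words stand in for tokens in this sketch.
    before = len(original.split())
    after = len(cleaned.split())
    saved = before - after
    pct = (saved / before * 100) if before else 0.0
    return (
        f"Original: {before} tokens\n"
        f"Cleaned: {after} tokens\n"
        f"Saved: {saved} tokens ({pct:.1f}%)"
    )


print(token_stats("a b c d e f g h", "a b c d e"))
```

With 8 words before and 5 after, the saved share is 3 / 8 = 37.5%.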
Complete Example
from prompt_groomer import (
    # Cleaner
    StripHTML, NormalizeWhitespace, FixUnicode,
    # Compressor
    Deduplicate, TruncateTokens,
    # Scrubber
    RedactPII,
    # Analyzer
    CountTokens,
)
original_text = """<div>Your messy input here...</div>"""
# Create token counter to track savings
counter = CountTokens(original_text=original_text)
# Build the complete pipeline with all 4 modules
pipeline = (
    StripHTML(to_markdown=True)
    | NormalizeWhitespace()
    | FixUnicode()
    | Deduplicate(similarity_threshold=0.85)
    | TruncateTokens(max_tokens=500, strategy="head")
    | RedactPII(redact_types={"email", "phone"})
)
# Run and analyze
result = pipeline.run(original_text)
counter.process(result)
print(counter.format_stats())
# Output:
# Original: 8 tokens
# Cleaned: 5 tokens
# Saved: 3 tokens (37.5%)
Examples
Check out the examples/ folder for detailed examples organized by module:
- `cleaner/` - HTML cleaning, whitespace normalization, Unicode fixing
- `compressor/` - Smart truncation, deduplication
- `scrubber/` - PII redaction
- `analyzer/` - Token counting and cost savings
- `all_modules_demo.py` - Complete demonstration
Development
This project uses uv for dependency management and make for common tasks.
# Install dependencies
make install
# Run tests
make test
# Format code
make format
License
MIT