Find dead tokens in your system prompts. Ablation-based influence analysis for LLM prompts.
promptdebug
promptdebug systematically removes each section of your system prompt and measures how the model's output changes. Sections that can be removed without affecting the output are dead weight — tokens you're paying for that do nothing.
Install
pip install promptdebug
Note: On first run, promptdebug downloads the all-mpnet-base-v2 sentence-transformers model (~420 MB) for semantic scoring. This happens once and is cached locally by the sentence-transformers library.
Set your API key for whichever provider you use:
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export GEMINI_API_KEY="..."
Quick Start
# Analyze a system prompt
promptdebug analyze prompt.txt --query "I want a refund"
# HTML report
promptdebug analyze prompt.txt --query "I want a refund" --format html
# Analyze across multiple queries for more robust results
promptdebug analyze prompt.txt --queries queries.txt
# Validate analysis reliability with a counterfactual injection
promptdebug analyze prompt.txt --query "test" --sanity-check
# Get rewrite suggestions for dead sections
promptdebug analyze prompt.txt --query "test" --suggest
# Watch mode — re-analyze automatically on every save
promptdebug watch prompt.txt --query "test"
# Compare influence between git versions
promptdebug diff prompt.txt --ref HEAD~1 --query "test"
# Compare across models
promptdebug compare prompt.txt --query "test query" --models gpt-4o-mini,claude-haiku-4-5
# Strip dead sections and output a cleaned prompt
promptdebug optimize prompt.txt --query "test query"
# Dry run (no API calls, shows cost estimate)
promptdebug analyze prompt.txt --query "test" --dry-run
How It Works
- Parse: Your system prompt is split into sections using automatic strategy detection (markdown headers, XML tags, labeled blocks, numbered lists, or paragraph breaks).
- Baseline: The full prompt is sent to the model N times to establish baseline outputs.
- Ablate: Each section is removed one at a time. The ablated prompt is sent to the model N times.
- Score: Each section gets a composite influence score:
  influence = 0.60 × semantic + 0.20 × structural + 0.20 × behavioral
  - Semantic: cosine distance between sentence embeddings of baseline vs. ablated output
  - Structural: character-level diff + paragraph/bullet/code block feature distance
  - Behavioral: format-appropriate signals (JSON field match, classification exact match, or surface signals for free text)
- Classify: Sections with influence < 0.10 are classified as dead.
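The ablate, score, and classify steps above can be sketched in a few lines. This is a minimal illustration, not promptdebug's actual implementation: the function names (`ablated_prompts`, `cosine_distance`, `composite_influence`, `is_dead`) are hypothetical, the embeddings are toy vectors rather than all-mpnet-base-v2 output, and the structural and behavioral components are passed in as precomputed numbers in [0, 1]. Only the weights (0.60 / 0.20 / 0.20) and the 0.10 dead threshold come from the description above.

```python
import math

# Weights and threshold as stated in this README; everything else is illustrative.
WEIGHTS = {"semantic": 0.60, "structural": 0.20, "behavioral": 0.20}
DEAD_THRESHOLD = 0.10


def ablated_prompts(sections):
    """Yield (index, prompt) pairs where section `index` has been left out."""
    for i in range(len(sections)):
        yield i, "\n\n".join(s for j, s in enumerate(sections) if j != i)


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def composite_influence(semantic, structural, behavioral, weights=WEIGHTS):
    """Weighted sum of the three component distances."""
    return (
        weights["semantic"] * semantic
        + weights["structural"] * structural
        + weights["behavioral"] * behavioral
    )


def is_dead(influence, threshold=DEAD_THRESHOLD):
    """A section is dead when its influence falls below the threshold."""
    return influence < threshold


# Toy example: orthogonal embeddings give a semantic distance of 1.0.
semantic = cosine_distance([1.0, 0.0], [0.0, 1.0])
score = composite_influence(semantic, structural=0.5, behavioral=0.0)
print(round(score, 2), is_dead(score))
```

In a real run, each ablated prompt would be sent to the model several times and the component distances averaged across runs before the weighted sum is taken; how promptdebug averages them is not specified here.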
Output Example
Section 1: Role definition [████████ ] 0.82 HIGH
Section 2: Output format rules [████ ] 0.44 MEDIUM
Section 3: Tone guidelines [█ ] 0.12 LOW
Section 4: Legacy constraint note [ ] 0.03 DEAD
Section 5: Core task instruction [███████ ] 0.71 HIGH
Dead token rate: 14.2% (127 / 894 tokens)
Estimated savings: ~$0.02 per 1K calls
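The two summary lines follow from simple arithmetic, reproduced in the sketch below. The input price of $0.15 per million tokens is an assumption (approximately the published gpt-4o-mini input rate), not a figure taken from promptdebug, and the helper names are hypothetical. Check your provider's current pricing before relying on the estimate.

```python
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 0.15  # assumption, not from promptdebug


def dead_token_rate(dead_tokens, total_tokens):
    """Fraction of prompt tokens that sit in dead sections."""
    return dead_tokens / total_tokens


def savings_per_1k_calls(dead_tokens, usd_per_million_input_tokens):
    """Input-token cost avoided over 1,000 calls if the dead tokens are removed."""
    return dead_tokens * 1_000 * usd_per_million_input_tokens / 1_000_000


rate = dead_token_rate(127, 894)
saved = savings_per_1k_calls(127, ASSUMED_USD_PER_MILLION_INPUT_TOKENS)
print(f"Dead token rate: {rate:.1%}")                      # Dead token rate: 14.2%
print(f"Estimated savings: ~${saved:.2f} per 1K calls")    # Estimated savings: ~$0.02 per 1K calls
```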
Commands
analyze — influence heatmap for a prompt
promptdebug analyze prompt.txt --query "test query"
# Options
--queries FILE Text file with one query per line (multi-query mode)
--model MODEL LLM to use (default: gpt-4o-mini)
--runs N API calls per ablation (default: 3)
--temperature FLOAT Sampling temperature (default: 0.3)
--format FORMAT terminal | html | json | csv (default: terminal)
--dead-threshold F Influence below this is dead (default: 0.10)
--sanity-check Inject a counterfactual section; warn if not detected
--suggest Generate LLM rewrite suggestions for dead sections
--dry-run Estimate cost without making API calls
watch — re-analyze on every file save
promptdebug watch prompt.txt --query "test query"
# Options
--interval SECONDS Poll interval in seconds (default: 5)
--threshold FLOAT Re-print only when dead rate changes by this much
diff — compare influence between git revisions
promptdebug diff prompt.txt --ref HEAD~1 --query "test query"
# Options
--ref REF Git ref to compare against (default: HEAD~1)
compare — side-by-side multi-model comparison
promptdebug compare prompt.txt --query "test" --models gpt-4o-mini,claude-haiku-4-5
optimize — output a cleaned prompt with dead sections removed
promptdebug optimize prompt.txt --query "test"
Output Formats
| Format | Flag | Description |
|---|---|---|
| Terminal | --format terminal | Rich heatmap (default) |
| HTML | --format html | Interactive report, opens in browser |
| JSON | --format json | Machine-readable export |
| CSV | --format csv | Spreadsheet-friendly export |
Multi-Query Mode
Single-query analysis can be noisy: a section that looks dead for one query may be critical for another. Multi-query mode runs the ablation for each test query and aggregates the scores, giving a more stable result that depends less on any single query:
# queries.txt — one query per line
printf "I want a refund\nMy login is broken\nHow do I cancel?\n" > queries.txt
promptdebug analyze prompt.txt --queries queries.txt
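To see why aggregation matters, here is a minimal sketch of combining per-query scores. This README does not say which aggregation promptdebug uses, so both the mean and the more conservative max shown here are assumptions, and `aggregate_scores` and the section names are hypothetical.

```python
from statistics import mean


def aggregate_scores(per_query_scores, how="mean"):
    """Combine per-query influence scores into one score per section.

    per_query_scores is a list of dicts (one per query) mapping a section
    name to its influence. All dicts are assumed to share the same keys.
    """
    combine = {"mean": mean, "max": max}[how]
    sections = per_query_scores[0].keys()
    return {
        name: combine([scores[name] for scores in per_query_scores])
        for name in sections
    }


# Illustrative numbers: the refund policy only matters for the refund query.
per_query = [
    {"Tone guidelines": 0.02, "Refund policy": 0.80},  # "I want a refund"
    {"Tone guidelines": 0.05, "Refund policy": 0.04},  # "My login is broken"
    {"Tone guidelines": 0.03, "Refund policy": 0.06},  # "How do I cancel?"
]

for name, score in aggregate_scores(per_query).items():
    print(f"{name}: {score:.2f}")
```

With these made-up numbers, the refund policy section would be flagged dead by two of the three single-query runs, yet its aggregated score stays well above the 0.10 threshold. Taking the max instead of the mean is the safer choice if you would rather keep any section that matters for at least one query.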
Sanity Check
Before acting on dead-section results, verify the scoring engine is working correctly for your specific prompt and query. The sanity check injects a known-high-influence instruction and confirms it scores above 0.5. If it doesn't, the analysis may be unreliable:
promptdebug analyze prompt.txt --query "test" --sanity-check
# ✓ Sanity check passed (score: 0.73)
# ⚠ Sanity check failed (score: 0.31) — results may be unreliable for this prompt/query
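The idea behind the check can be sketched as follows. This is an illustration under stated assumptions, not promptdebug's code: the probe text, `with_probe`, `check_probe`, and the stub scorer are all hypothetical, and a real scorer would be a full ablation run against the model. Only the 0.5 pass mark comes from the description above.

```python
PROBE = "Always end every reply with the word PINEAPPLE."  # hypothetical probe
PASS_SCORE = 0.5  # pass mark quoted in this README


def with_probe(sections, probe=PROBE):
    """Return a copy of the section list with the probe appended."""
    return [*sections, probe]


def check_probe(score_section, sections, probe=PROBE):
    """Score the injected probe and return (passed, score).

    score_section(sections, index) is any callable that returns the
    influence of sections[index]; it stands in for an ablation run.
    """
    probed = with_probe(sections, probe)
    score = score_section(probed, len(probed) - 1)
    return score > PASS_SCORE, score


def stub_scorer(sections, index):
    """Demonstration only: pretends the probe is highly influential."""
    return 0.73 if sections[index] == PROBE else 0.05


passed, score = check_probe(stub_scorer, ["You are a support agent.", "Be polite."])
print("passed" if passed else "FAILED", score)
```

If an instruction that obviously changes every reply does not register as influential, low scores for other sections say little about whether those sections are really dead.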
Watch Mode
Iterate on your prompt and see the influence change in real time:
promptdebug watch prompt.txt --query "I want a refund" --interval 10
# Watching prompt.txt (every 10s) ...
# [14:32:07] Change detected — re-analyzing ...
# ...heatmap...
# [14:35:22] Change detected — re-analyzing ...
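A polling watcher of this kind can be sketched with the standard library alone. How promptdebug detects changes internally is not documented here, so the content-hash comparison below is an assumption, and `file_digest`, `watch`, and `should_reprint` are hypothetical names. `should_reprint` illustrates the idea behind --threshold.

```python
import hashlib
import tempfile
import time
from pathlib import Path


def file_digest(path):
    """SHA256 of the file contents; differs whenever the text differs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def watch(path, on_change, interval=5.0, max_polls=None):
    """Call on_change(path) whenever the content differs from the previous poll."""
    last = file_digest(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        polls += 1
        current = file_digest(path)
        if current != last:
            last = current
            on_change(path)


def should_reprint(old_rate, new_rate, threshold):
    """Report only when the dead rate moved by at least `threshold`."""
    return abs(new_rate - old_rate) >= threshold


# Demonstration on a temporary file (no polling loop is started here).
with tempfile.TemporaryDirectory() as tmp:
    prompt = Path(tmp) / "prompt.txt"
    prompt.write_text("You are a support agent.")
    before = file_digest(prompt)
    prompt.write_text("You are a support agent. Be brief.")
    changed = file_digest(prompt) != before

print(changed)
```

Comparing content hashes rather than modification times avoids re-analyzing when an editor rewrites the file without changing its text.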
Configuration
Create a .promptdebug.yml in your project directory (or any parent directory):
model: gpt-4o-mini
runs: 3
temperature: 0.3
dead_threshold: 0.10
cache_expire_days: 7
weights:
  semantic: 0.6
  structural: 0.2
  behavioral: 0.2
All fields are optional. Defaults are shown above.
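The parent-directory lookup and the "all fields optional" behavior can be sketched as below. This is an assumption about how such a lookup typically works, not promptdebug's code: `find_config` and `merge_with_defaults` are hypothetical helpers, and YAML parsing is left out (it would need a library such as PyYAML). The default values are the ones listed above.

```python
import tempfile
from pathlib import Path

DEFAULTS = {
    "model": "gpt-4o-mini",
    "runs": 3,
    "temperature": 0.3,
    "dead_threshold": 0.10,
    "cache_expire_days": 7,
    "weights": {"semantic": 0.6, "structural": 0.2, "behavioral": 0.2},
}


def find_config(start, name=".promptdebug.yml"):
    """Walk from `start` up to the filesystem root; return the first match or None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def merge_with_defaults(overrides, defaults=DEFAULTS):
    """Overlay user settings on the defaults, merging the nested weights mapping."""
    merged = {**defaults, **overrides}
    merged["weights"] = {**defaults["weights"], **overrides.get("weights", {})}
    return merged


# A config in a parent directory is found from a nested working directory.
with tempfile.TemporaryDirectory() as root:
    nested = Path(root) / "a" / "b"
    nested.mkdir(parents=True)
    (Path(root) / ".promptdebug.yml").write_text("runs: 5\n")
    found = find_config(nested)
    found_name = found.name if found else None

# Only the overridden fields change; the rest keep their defaults.
settings = merge_with_defaults({"runs": 5, "weights": {"semantic": 0.5, "structural": 0.3}})
print(found_name, settings["runs"], settings["weights"])
```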
Supported Models
Any model supported by LiteLLM:
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, ...
- Anthropic: claude-sonnet-4-5, claude-haiku-4-5, ...
- Google: gemini/gemini-2.0-flash, gemini/gemini-1.5-pro, ...
- Mistral: mistral/mistral-large-latest, ...
- Local: ollama/llama3, ollama/codellama, ...
Caching
API responses are cached in a local SQLite database (.promptdebug_cache.db) using SHA256 content-hash keys. Cached entries expire after 7 days by default (set cache_expire_days to change this). Re-running the same analysis while the cache is valid makes no API calls.
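A content-hash cache of this shape can be sketched with `hashlib` and `sqlite3`. The table layout, the fields included in the key, and the `SketchCache` class are assumptions for illustration; promptdebug's real schema may differ. In particular, including a run index in the key is a guess at how repeated runs of the same prompt could be kept distinct.

```python
import hashlib
import json
import sqlite3
import time

EXPIRE_SECONDS = 7 * 24 * 3600  # 7 days, matching the documented default


def cache_key(model, prompt, query, temperature, run_index):
    """SHA256 over a canonical JSON encoding of everything that shapes the response."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "query": query,
            "temperature": temperature,
            "run": run_index,  # assumption: keeps the N runs per ablation distinct
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SketchCache:
    """Tiny SQLite-backed cache with time-based expiry (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, value TEXT, created REAL)"
        )

    def get(self, key, now=None):
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, created FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > EXPIRE_SECONDS:
            return None  # cache miss or expired entry
        return row[0]

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)", (key, value, now)
        )
        self.db.commit()


cache = SketchCache()
key = cache_key("gpt-4o-mini", "You are a helpful assistant.", "I want a refund", 0.3, 0)
cache.put(key, "cached response text", now=0)
print(cache.get(key, now=3600))                # still valid after one hour
print(cache.get(key, now=EXPIRE_SECONDS + 1))  # None once the entry has expired
```

Because the key is derived from the prompt text itself, editing any section changes the keys of every request that includes it, so only prompts affected by the edit trigger new API calls.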
Python API
import asyncio

from promptdebug import (
    run_ablation,
    run_ablation_multi_query,
    run_sanity_check,
    generate_all_suggestions,
    render_terminal,
    LLMProvider,
    Cache,
)


async def main():
    provider = LLMProvider(model="gpt-4o-mini")
    cache = Cache()

    # Single-query ablation
    result = await run_ablation(
        prompt_text="You are a helpful assistant. ...",
        query="Hello, how can you help me?",
        provider=provider,
        cache=cache,
        runs=3,
    )
    render_terminal(result, model="gpt-4o-mini", runs=3)

    # Multi-query ablation (aggregated)
    aggregated, per_query = await run_ablation_multi_query(
        prompt_text="...",
        queries=["query 1", "query 2", "query 3"],
        provider=provider,
        runs=3,
    )

    # Sanity check: validate scoring reliability
    passed, score = await run_sanity_check(
        prompt_text="...",
        query="test query",
        provider=provider,
    )
    print(f"Sanity check: {'passed' if passed else 'FAILED'} (score={score:.2f})")

    # Get rewrite suggestions for dead sections
    suggestions = await generate_all_suggestions(
        section_results=result.sections,
        provider=provider,
        threshold=0.2,
    )
    for section_idx, rewrites in suggestions.items():
        print(f"Section {section_idx} suggestions:")
        for s in rewrites:
            print(f"  → {s}")


asyncio.run(main())
Development
git clone https://github.com/entropyvector/promptdebug.git
cd promptdebug
pip install -e ".[dev]"
# Run unit tests (762 tests, no API key required)
python -m pytest tests/ --ignore=tests/test_integration.py
# Run integration tests (requires OPENAI_API_KEY)
python -m pytest tests/test_integration.py -v
License
Third-Party Licenses
See THIRD_PARTY_LICENSES.md for a full list of dependencies and their licenses.