promptdebug

License: MIT | Python 3.10+

Find dead tokens in your system prompts. Ablation-based influence analysis for LLM prompts.

promptdebug systematically removes each section of your system prompt and measures how the model's output changes. Sections that can be removed without affecting the output are dead weight — tokens you're paying for that do nothing.

Install

pip install promptdebug

Note: On first run, promptdebug downloads the all-mpnet-base-v2 sentence-transformers model (~420 MB) for semantic scoring. This happens once and is cached locally by the sentence-transformers library.

Set your API key for whichever provider you use:

export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export GEMINI_API_KEY="..."

Quick Start

# Analyze a system prompt
promptdebug analyze prompt.txt --query "I want a refund"

# HTML report
promptdebug analyze prompt.txt --query "I want a refund" --format html

# Analyze across multiple queries for more robust results
promptdebug analyze prompt.txt --queries queries.txt

# Validate analysis reliability with a counterfactual injection
promptdebug analyze prompt.txt --query "test" --sanity-check

# Get rewrite suggestions for dead sections
promptdebug analyze prompt.txt --query "test" --suggest

# Watch mode — re-analyze automatically on every save
promptdebug watch prompt.txt --query "test"

# Compare influence between git versions
promptdebug diff prompt.txt --ref HEAD~1 --query "test"

# Compare across models
promptdebug compare prompt.txt --query "test query" --models gpt-4o-mini,claude-haiku-4-5

# Strip dead sections and output a cleaned prompt
promptdebug optimize prompt.txt --query "test query"

# Dry run (no API calls, shows cost estimate)
promptdebug analyze prompt.txt --query "test" --dry-run

How It Works

  1. Parse — Your system prompt is split into sections using automatic strategy detection (markdown headers, XML tags, labeled blocks, numbered lists, or paragraph breaks).

  2. Baseline — The full prompt is sent to the model N times to establish baseline outputs.

  3. Ablate — Each section is removed one at a time. The ablated prompt is sent to the model N times.

  4. Score — Each section gets a composite influence score:

influence = 0.60 × semantic + 0.20 × structural + 0.20 × behavioral
  • Semantic — cosine distance between sentence embeddings of baseline vs. ablated output
  • Structural — character-level diff + paragraph/bullet/code block feature distance
  • Behavioral — format-appropriate signals (JSON field match, classification exact match, or surface signals for free text)
  5. Classify — Sections with influence < 0.10 are classified as dead.
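
The ablation and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not promptdebug's actual implementation: the names `split_sections`, `ablate_sections`, `composite_influence`, and `is_dead` are hypothetical, the blank-line split stands in for the automatic strategy detection, and the three component scores are assumed to be already computed and normalized to the range 0 to 1.

```python
# Hypothetical sketch of leave-one-out ablation and composite scoring.
# None of these names come from promptdebug itself.

WEIGHTS = {"semantic": 0.60, "structural": 0.20, "behavioral": 0.20}
DEAD_THRESHOLD = 0.10


def split_sections(prompt: str) -> list[str]:
    """Split on blank lines (the simplest of the strategies listed above)."""
    return [block.strip() for block in prompt.split("\n\n") if block.strip()]


def ablate_sections(sections: list[str]) -> list[str]:
    """Return one prompt per section, each with that single section removed."""
    return [
        "\n\n".join(sections[:i] + sections[i + 1:])
        for i in range(len(sections))
    ]


def composite_influence(semantic: float, structural: float, behavioral: float) -> float:
    """Weighted sum of the three component scores."""
    return (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["structural"] * structural
        + WEIGHTS["behavioral"] * behavioral
    )


def is_dead(influence: float, threshold: float = DEAD_THRESHOLD) -> bool:
    """A section is dead when its influence falls below the threshold."""
    return influence < threshold


sections = split_sections("You are a support agent.\n\nBe polite.\n\nReply in JSON.")
variants = ablate_sections(sections)

# Component scores here are illustrative inputs, not measured values.
score = composite_influence(semantic=0.05, structural=0.10, behavioral=0.0)
print(len(variants), round(score, 2), is_dead(score))
```

In a real run, each entry of `variants` would be sent to the model N times and compared with the baseline outputs to produce the three component scores.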

Output Example

Section 1: Role definition          [████████  ] 0.82  HIGH
Section 2: Output format rules      [████      ] 0.44  MEDIUM
Section 3: Tone guidelines          [█         ] 0.12  LOW
Section 4: Legacy constraint note   [          ] 0.03  DEAD
Section 5: Core task instruction    [███████   ] 0.71  HIGH

Dead token rate: 14.2% (127 / 894 tokens)
Estimated savings: ~$0.02 per 1K calls
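
The two summary lines follow from simple arithmetic, sketched below. The function names are hypothetical, and the input price of $0.15 per million tokens is an assumption used only for illustration; check your provider's current pricing.

```python
# Hypothetical helpers; the price is an assumed example value.

def dead_token_rate(dead_tokens: int, total_tokens: int) -> float:
    """Fraction of prompt tokens that sit in dead sections."""
    return dead_tokens / total_tokens


def savings_per_1k_calls(dead_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Input-token cost avoided over 1,000 calls if the dead tokens are removed."""
    return dead_tokens * 1_000 * usd_per_million_input_tokens / 1_000_000


rate = dead_token_rate(127, 894)
savings = savings_per_1k_calls(127, usd_per_million_input_tokens=0.15)

print(f"Dead token rate: {rate:.1%} (127 / 894 tokens)")
print(f"Estimated savings: ~${savings:.2f} per 1K calls")
```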

Commands

analyze — influence heatmap for a prompt

promptdebug analyze prompt.txt --query "test query"

# Options
--queries FILE       Text file with one query per line (multi-query mode)
--model MODEL        LLM to use (default: gpt-4o-mini)
--runs N             API calls per ablation (default: 3)
--temperature FLOAT  Sampling temperature (default: 0.3)
--format FORMAT      terminal | html | json | csv (default: terminal)
--dead-threshold F   Influence below this is dead (default: 0.10)
--sanity-check       Inject a counterfactual section; warn if not detected
--suggest            Generate LLM rewrite suggestions for dead sections
--dry-run            Estimate cost without making API calls
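
A rough upper bound on API calls follows from the steps in How It Works: N baseline calls plus N calls per ablated section. The sketch below uses a hypothetical `estimate_api_calls` helper and assumes that multi-query mode repeats the whole procedure for each query. It is not necessarily how --dry-run computes its estimate, and it ignores cache hits as well as any extra calls made by --sanity-check or --suggest.

```python
# Hypothetical helper; assumes one full baseline + ablation pass per query.

def estimate_api_calls(num_sections: int, runs: int = 3, num_queries: int = 1) -> int:
    """Baseline runs plus one set of runs per ablated section, for each query."""
    calls_per_query = runs * (1 + num_sections)
    return calls_per_query * num_queries


print(estimate_api_calls(num_sections=5))                 # 3 * (1 + 5) = 18
print(estimate_api_calls(num_sections=5, num_queries=3))  # 18 * 3 = 54
```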

watch — re-analyze on every file save

promptdebug watch prompt.txt --query "test query"

# Options
--interval SECONDS   Poll interval in seconds (default: 5)
--threshold FLOAT    Re-print only when dead rate changes by this much

diff — compare influence between git revisions

promptdebug diff prompt.txt --ref HEAD~1 --query "test query"

# Options
--ref REF   Git ref to compare against (default: HEAD~1)

compare — side-by-side multi-model comparison

promptdebug compare prompt.txt --query "test" --models gpt-4o-mini,claude-haiku-4-5

optimize — output a cleaned prompt with dead sections removed

promptdebug optimize prompt.txt --query "test"

Output Formats

Format    Flag               Description
Terminal  --format terminal  Rich heatmap (default)
HTML      --format html      Interactive report, opens in browser
JSON      --format json      Machine-readable export
CSV       --format csv       Spreadsheet-friendly export

Multi-Query Mode

Single-query analysis can be noisy: a section that looks dead for one query may be critical for another. Multi-query mode runs the ablation across several test queries and aggregates the scores, giving a more stable result that depends less on any single query:

# queries.txt — one query per line
printf "I want a refund\nMy login is broken\nHow do I cancel?\n" > queries.txt
promptdebug analyze prompt.txt --queries queries.txt

Sanity Check

Before acting on dead-section results, verify the scoring engine is working correctly for your specific prompt and query. The sanity check injects a known-high-influence instruction and confirms it scores above 0.5. If it doesn't, the analysis may be unreliable:

promptdebug analyze prompt.txt --query "test" --sanity-check
# ✓ Sanity check passed (score: 0.73)
# ⚠ Sanity check failed (score: 0.31) — results may be unreliable for this prompt/query

Watch Mode

Iterate on your prompt and see the influence change in real time:

promptdebug watch prompt.txt --query "I want a refund" --interval 10
# Watching prompt.txt (every 10s) ...
# [14:32:07] Change detected — re-analyzing ...
# ...heatmap...
# [14:35:22] Change detected — re-analyzing ...

Configuration

Create a .promptdebug.yml in your project directory (or any parent directory):

model: gpt-4o-mini
runs: 3
temperature: 0.3
dead_threshold: 0.10
cache_expire_days: 7
weights:
  semantic: 0.6
  structural: 0.2
  behavioral: 0.2

All fields are optional. Defaults are shown above.

Supported Models

Any model supported by LiteLLM:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, ...
  • Anthropic: claude-sonnet-4-5, claude-haiku-4-5, ...
  • Google: gemini/gemini-2.0-flash, gemini/gemini-1.5-pro, ...
  • Mistral: mistral/mistral-large-latest, ...
  • Local: ollama/llama3, ollama/codellama, ...

Caching

API responses are cached in a local SQLite database (.promptdebug_cache.db) using SHA256 content-hash keys. Cache entries expire after 7 days by default (set cache_expire_days to change this). Re-running the same analysis before the entries expire makes no API calls.

Python API

import asyncio
from promptdebug import (
    run_ablation,
    run_ablation_multi_query,
    run_sanity_check,
    generate_all_suggestions,
    render_terminal,
    LLMProvider,
    Cache,
)

async def main():
    provider = LLMProvider(model="gpt-4o-mini")
    cache = Cache()

    # Single-query ablation
    result = await run_ablation(
        prompt_text="You are a helpful assistant. ...",
        query="Hello, how can you help me?",
        provider=provider,
        cache=cache,
        runs=3,
    )

    render_terminal(result, model="gpt-4o-mini", runs=3)

    # Multi-query ablation (aggregated)
    aggregated, per_query = await run_ablation_multi_query(
        prompt_text="...",
        queries=["query 1", "query 2", "query 3"],
        provider=provider,
        runs=3,
    )

    # Sanity check — validate scoring reliability
    passed, score = await run_sanity_check(
        prompt_text="...",
        query="test query",
        provider=provider,
    )
    print(f"Sanity check: {'passed' if passed else 'FAILED'} (score={score:.2f})")

    # Get rewrite suggestions for dead sections
    suggestions = await generate_all_suggestions(
        section_results=result.sections,
        provider=provider,
        threshold=0.2,
    )
    for section_idx, rewrites in suggestions.items():
        print(f"Section {section_idx} suggestions:")
        for s in rewrites:
            print(f"  → {s}")

asyncio.run(main())

Development

git clone https://github.com/entropyvector/promptdebug.git
cd promptdebug
pip install -e ".[dev]"

# Run unit tests (762 tests, no API key required)
python -m pytest tests/ --ignore=tests/test_integration.py

# Run integration tests (requires OPENAI_API_KEY)
python -m pytest tests/test_integration.py -v

License

MIT

Third-Party Licenses

See THIRD_PARTY_LICENSES.md for a full list of dependencies and their licenses.
