Find dead tokens in your system prompts. Ablation-based influence analysis for LLM prompts.

Maintainers

entropyvector

promptdebug

promptdebug systematically removes each section of your system prompt and measures how the model's output changes. Sections that can be removed without affecting the output are dead weight — tokens you're paying for that do nothing.
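
The removal step can be pictured as leave-one-out ablation. The snippet below is only an illustrative sketch, not promptdebug's implementation: it assumes sections are separated by blank lines, and the helper names `split_sections` and `ablation_variants` are hypothetical.

```python
import re


def split_sections(prompt: str) -> list[str]:
    """Split a prompt into sections on blank lines (paragraph breaks)."""
    return [part.strip() for part in re.split(r"\n\s*\n", prompt) if part.strip()]


def ablation_variants(sections: list[str]) -> list[tuple[int, str]]:
    """Return (removed_index, prompt_without_that_section) for every section."""
    variants = []
    for i in range(len(sections)):
        remaining = sections[:i] + sections[i + 1:]
        variants.append((i, "\n\n".join(remaining)))
    return variants


prompt = "You are a support agent.\n\nAlways answer politely.\n\nNever mention pricing."
sections = split_sections(prompt)
for removed, ablated in ablation_variants(sections):
    print(f"--- without section {removed + 1} ---")
    print(ablated)
```

Each variant would then be sent to the model and its output compared with the output of the full prompt.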

Install

pip install promptdebug

Note: On first run, promptdebug downloads the all-mpnet-base-v2 sentence-transformers model (~420 MB) for semantic scoring. This happens once and is cached locally by the sentence-transformers library.

Set your API key for whichever provider you use:

export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export GEMINI_API_KEY="..."

Quick Start

# Analyze a system prompt
promptdebug analyze prompt.txt --query "I want a refund"

# HTML report
promptdebug analyze prompt.txt --query "I want a refund" --format html

# Analyze across multiple queries for more robust results
promptdebug analyze prompt.txt --queries queries.txt

# Validate analysis reliability with a counterfactual injection
promptdebug analyze prompt.txt --query "test" --sanity-check

# Get rewrite suggestions for dead sections
promptdebug analyze prompt.txt --query "test" --suggest

# Watch mode — re-analyze automatically on every save
promptdebug watch prompt.txt --query "test"

# Compare influence between git versions
promptdebug diff prompt.txt --ref HEAD~1 --query "test"

# Compare across models
promptdebug compare prompt.txt --query "test query" --models gpt-4o-mini,claude-haiku-4-5

# Strip dead sections and output a cleaned prompt
promptdebug optimize prompt.txt --query "test query"

# Dry run (no API calls, shows cost estimate)
promptdebug analyze prompt.txt --query "test" --dry-run

How It Works

1. Parse: Your system prompt is split into sections using automatic strategy detection (markdown headers, XML tags, labeled blocks, numbered lists, or paragraph breaks).
2. Baseline: The full prompt is sent to the model N times to establish baseline outputs.
3. Ablate: Each section is removed one at a time. The ablated prompt is sent to the model N times.
4. Score: Each section gets a composite influence score:

   influence = 0.60 × semantic + 0.20 × structural + 0.20 × behavioral

   - Semantic: cosine distance between sentence embeddings of baseline vs. ablated output
   - Structural: character-level diff + paragraph/bullet/code block feature distance
   - Behavioral: format-appropriate signals (JSON field match, classification exact match, or surface signals for free text)

5. Classify: Sections with influence < 0.10 are classified as dead.
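
The scoring and classification steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the component scores are assumed to lie in [0, 1], the toy two-dimensional vectors stand in for real sentence embeddings, and the function names are hypothetical. How the N runs per ablation are combined is not described here, so the sketch scores a single pair of outputs.

```python
import math

# Default weights and dead threshold as documented above.
WEIGHTS = {"semantic": 0.60, "structural": 0.20, "behavioral": 0.20}
DEAD_THRESHOLD = 0.10


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def influence(semantic: float, structural: float, behavioral: float) -> float:
    """Weighted sum of the three component scores."""
    return (WEIGHTS["semantic"] * semantic
            + WEIGHTS["structural"] * structural
            + WEIGHTS["behavioral"] * behavioral)


def is_dead(score: float, threshold: float = DEAD_THRESHOLD) -> bool:
    """A section is dead when its influence falls below the threshold."""
    return score < threshold


# Toy embeddings: orthogonal vectors give the maximum distance of 1.0.
baseline_vec = [1.0, 0.0]
ablated_vec = [0.0, 1.0]
semantic = cosine_distance(baseline_vec, ablated_vec)
score = influence(semantic, structural=0.5, behavioral=0.0)
print(round(score, 2), is_dead(score))
```

With these inputs the weighted sum is 0.60 × 1.0 + 0.20 × 0.5 + 0.20 × 0.0 = 0.70, well above the dead threshold.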

Output Example

Section 1: Role definition          [████████  ] 0.82  HIGH
Section 2: Output format rules      [████      ] 0.44  MEDIUM
Section 3: Tone guidelines          [█         ] 0.12  LOW
Section 4: Legacy constraint note   [          ] 0.03  DEAD
Section 5: Core task instruction    [███████   ] 0.71  HIGH

Dead token rate: 14.2% (127 / 894 tokens)
Estimated savings: ~$0.02 per 1K calls
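
The two summary figures follow from simple arithmetic. The sketch below reproduces them under one loudly stated assumption: an input price of $0.15 per million tokens, which is not given in this document and was chosen only because it matches the example. The function names are hypothetical.

```python
def dead_token_rate(dead_tokens: int, total_tokens: int) -> float:
    """Fraction of prompt tokens that sit in dead sections."""
    return dead_tokens / total_tokens


def savings_per_1k_calls(dead_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Input cost avoided over 1,000 calls if the dead tokens are removed."""
    tokens_saved = dead_tokens * 1_000
    return tokens_saved * usd_per_million_input_tokens / 1_000_000


ASSUMED_PRICE = 0.15  # assumption: USD per 1M input tokens; check your provider's pricing

rate = dead_token_rate(127, 894)
saved = savings_per_1k_calls(127, ASSUMED_PRICE)
print(f"Dead token rate: {rate:.1%}")
print(f"Estimated savings: ~${saved:.2f} per 1K calls")
```

Here 127 / 894 ≈ 0.142, and 127,000 tokens at the assumed price cost about $0.019, which rounds to the ~$0.02 shown.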

Commands

`analyze` — influence heatmap for a prompt

promptdebug analyze prompt.txt --query "test query"

# Options
--queries FILE       Text file with one query per line (multi-query mode)
--model MODEL        LLM to use (default: gpt-4o-mini)
--runs N             API calls per ablation (default: 3)
--temperature FLOAT  Sampling temperature (default: 0.3)
--format FORMAT      terminal | html | json | csv (default: terminal)
--dead-threshold F   Influence below this is dead (default: 0.10)
--sanity-check       Inject a counterfactual section; warn if not detected
--suggest            Generate LLM rewrite suggestions for dead sections
--dry-run            Estimate cost without making API calls

`watch` — re-analyze on every file save

promptdebug watch prompt.txt --query "test query"

# Options
--interval SECONDS   Poll interval in seconds (default: 5)
--threshold FLOAT    Re-print only when dead rate changes by this much

`diff` — compare influence between git revisions

promptdebug diff prompt.txt --ref HEAD~1 --query "test query"

# Options
--ref REF   Git ref to compare against (default: HEAD~1)

`compare` — side-by-side multi-model comparison

promptdebug compare prompt.txt --query "test" --models gpt-4o-mini,claude-haiku-4-5

`optimize` — output a cleaned prompt with dead sections removed

promptdebug optimize prompt.txt --query "test"

Output Formats

Format	Flag	Description
Terminal	`--format terminal`	Rich heatmap (default)
HTML	`--format html`	Interactive report, opens in browser
JSON	`--format json`	Machine-readable export
CSV	`--format csv`	Spreadsheet-friendly export

Multi-Query Mode

Single-query analysis can be noisy: a section that looks dead for one query may be critical for another. Multi-query mode runs ablation across several test queries and aggregates the scores, giving a more stable result that depends less on any single query:

# queries.txt — one query per line
printf "I want a refund\nMy login is broken\nHow do I cancel?\n" > queries.txt
promptdebug analyze prompt.txt --queries queries.txt
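
How the scores are aggregated is not specified here. As a rough sketch, assuming a per-section arithmetic mean (an assumption, with the maximum shown alongside so a section that matters for only one query stays visible), aggregation might look like this; `aggregate_scores` is a hypothetical helper.

```python
from statistics import mean


def aggregate_scores(per_query: dict[str, list[float]]) -> list[dict[str, float]]:
    """Combine per-section scores across queries.

    per_query maps each query to a list of influence scores, one per section.
    """
    score_lists = list(per_query.values())
    n_sections = len(score_lists[0])
    if any(len(scores) != n_sections for scores in score_lists):
        raise ValueError("every query must score the same number of sections")
    summary = []
    for i in range(n_sections):
        column = [scores[i] for scores in score_lists]
        summary.append({"mean": mean(column), "max": max(column)})
    return summary


# Made-up scores for three sections, purely for illustration.
per_query = {
    "I want a refund": [0.80, 0.02, 0.40],
    "My login is broken": [0.70, 0.30, 0.04],
    "How do I cancel?": [0.90, 0.01, 0.05],
}
for idx, stats in enumerate(aggregate_scores(per_query), start=1):
    print(f"Section {idx}: mean={stats['mean']:.2f} max={stats['max']:.2f}")
```

In this made-up data, the second section looks dead for two of the three queries but clearly matters for the login query, which is the situation multi-query mode is meant to catch.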

Sanity Check

Before acting on dead-section results, verify the scoring engine is working correctly for your specific prompt and query. The sanity check injects a known-high-influence instruction and confirms it scores above 0.5. If it doesn't, the analysis may be unreliable:

promptdebug analyze prompt.txt --query "test" --sanity-check
# ✓ Sanity check passed (score: 0.73)
# ⚠ Sanity check failed (score: 0.31) — results may be unreliable for this prompt/query
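
The logic of the check can be sketched as below. Everything except the 0.5 pass mark is an assumption made for illustration: the probe text is hypothetical (the instruction promptdebug actually injects is not documented here), and `fake_scorer` is a stub standing in for a real ablation run.

```python
from typing import Callable

# Hypothetical probe; chosen because it should visibly change any output.
PROBE = "Always reply in ALL CAPS."
PASS_THRESHOLD = 0.5  # from the description above


def sanity_check(prompt: str, score_section: Callable[[str, str], float]) -> tuple[bool, float]:
    """Inject a probe section and test whether it scores as influential.

    score_section(full_prompt, section) stands in for a real ablation run and
    must return the section's influence score in [0, 1].
    """
    probed_prompt = f"{prompt}\n\n{PROBE}"
    score = score_section(probed_prompt, PROBE)
    return score > PASS_THRESHOLD, score


def fake_scorer(full_prompt: str, section: str) -> float:
    # Stub for illustration only: pretend removing the probe changed the output a lot.
    return 0.73 if section in full_prompt else 0.0


passed, score = sanity_check("You are a helpful assistant.", fake_scorer)
print("passed" if passed else "FAILED", f"(score: {score:.2f})")
```

If a section that obviously should matter scores low, the scoring setup is not sensitive enough for that prompt and query, so low scores for other sections say little.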

Watch Mode

Iterate on your prompt and see the influence change in real time:

promptdebug watch prompt.txt --query "I want a refund" --interval 10
# Watching prompt.txt (every 10s) ...
# [14:32:07] Change detected — re-analyzing ...
# ...heatmap...
# [14:35:22] Change detected — re-analyzing ...
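
Polling of this kind can be approximated with modification-time checks. The sketch below is an assumption about the general technique, not a description of how promptdebug detects changes; `FileWatcher` and `watch` are hypothetical names, and the demo is bounded by `max_polls` so it terminates.

```python
import os
import tempfile
import time
from typing import Callable, Optional


class FileWatcher:
    """Detects changes to a file by comparing modification times."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._last_mtime = os.stat(path).st_mtime_ns

    def changed(self) -> bool:
        """Return True once for each change in the file's mtime."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._last_mtime:
            return False
        self._last_mtime = mtime
        return True


def watch(watcher: FileWatcher, interval: float, on_change: Callable[[str], None],
          max_polls: Optional[int] = None) -> None:
    """Sleep `interval` seconds between polls; run forever if max_polls is None."""
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        polls += 1
        if watcher.changed():
            stamp = time.strftime("%H:%M:%S")
            print(f"[{stamp}] Change detected, re-analyzing ...")
            on_change(watcher.path)


# Bounded demo on a temporary file (a real run would pass max_polls=None).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("You are a helpful assistant.")
    demo_path = handle.name
demo_watcher = FileWatcher(demo_path)
stat = os.stat(demo_path)
# Simulate a save by moving the modification time forward one second.
os.utime(demo_path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
seen = []
watch(demo_watcher, interval=0.01, on_change=seen.append, max_polls=2)
os.remove(demo_path)
print(seen == [demo_path])
```

A longer interval means fewer filesystem checks but a longer wait before a save is noticed.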

Configuration

Create a .promptdebug.yml in your project directory (or any parent directory):

model: gpt-4o-mini
runs: 3
temperature: 0.3
dead_threshold: 0.10
cache_expire_days: 7
weights:
  semantic: 0.6
  structural: 0.2
  behavioral: 0.2

All fields are optional. Defaults are shown above.
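
The lookup in the project directory or any parent can be sketched with `pathlib`. This is an illustration under assumptions: that the nearest file wins, and that a partial `weights` block is merged key by key with the defaults. Neither behavior is confirmed by this document. Parsing the YAML itself is left out to keep the sketch standard-library only.

```python
import copy
import tempfile
from pathlib import Path
from typing import Optional

# Defaults as listed in the configuration example above.
DEFAULTS = {
    "model": "gpt-4o-mini",
    "runs": 3,
    "temperature": 0.3,
    "dead_threshold": 0.10,
    "cache_expire_days": 7,
    "weights": {"semantic": 0.6, "structural": 0.2, "behavioral": 0.2},
}


def find_config(start: Path, name: str = ".promptdebug.yml") -> Optional[Path]:
    """Return the nearest config file in `start` or any parent directory."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def merge_with_defaults(overrides: dict) -> dict:
    """Overlay user settings on DEFAULTS; `weights` is merged key by key (assumption)."""
    merged = copy.deepcopy(DEFAULTS)
    for key, value in overrides.items():
        if key == "weights" and isinstance(value, dict):
            merged["weights"].update(value)
        else:
            merged[key] = value
    return merged


# Demo: a config file two directories above the working directory is still found.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    nested = root / "prompts" / "support"
    nested.mkdir(parents=True)
    (root / ".promptdebug.yml").write_text("runs: 5\n")
    found = find_config(nested)
    print(found == root / ".promptdebug.yml")

settings = merge_with_defaults({"runs": 5})
print(settings["runs"], settings["temperature"])
```

In the demo, only `runs` is overridden, so every other setting keeps its documented default.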

Supported Models

Any model supported by LiteLLM:

OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, ...
Anthropic: claude-sonnet-4-5, claude-haiku-4-5, ...
Google: gemini/gemini-2.0-flash, gemini/gemini-1.5-pro, ...
Mistral: mistral/mistral-large-latest, ...
Local: ollama/llama3, ollama/codellama, ...

Caching

API responses are cached in a local SQLite database (.promptdebug_cache.db) under SHA-256 content-hash keys. Entries expire after 7 days by default (configurable via cache_expire_days). Re-running an identical analysis before the cache expires makes no API calls.
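
A content-hash cache of this kind can be sketched with the standard library. The table layout, the fields that go into the key, and the `ResponseCache` class are all assumptions made for illustration; they are not promptdebug's actual schema. The run index is included in the key here so that repeated runs at non-zero temperature get separate entries.

```python
import hashlib
import json
import sqlite3
import time
from typing import Optional

EXPIRE_SECONDS = 7 * 24 * 60 * 60  # 7 days, matching the documented default


def cache_key(model: str, prompt: str, query: str, temperature: float, run: int) -> str:
    """SHA-256 over a JSON encoding of the request fields."""
    payload = json.dumps([model, prompt, query, temperature, run], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Tiny SQLite-backed cache with time-based expiry."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, response TEXT NOT NULL, created REAL NOT NULL)"
        )

    def put(self, key: str, response: str, now: Optional[float] = None) -> None:
        created = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)", (key, response, created)
        )
        self.db.commit()

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        current = time.time() if now is None else now
        row = self.db.execute(
            "SELECT response, created FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is None or current - row[1] > EXPIRE_SECONDS:
            return None  # cache miss or expired entry
        return row[0]


cache = ResponseCache()
key = cache_key("gpt-4o-mini", "You are a helpful assistant.", "I want a refund", 0.3, 0)
cache.put(key, "Sure, I can help with that refund.")
print(cache.get(key) is not None)
```

Because the key is derived from the request content, editing any part of the prompt produces a new key and therefore a fresh API call for that variant.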

Python API

import asyncio
from promptdebug import (
    run_ablation,
    run_ablation_multi_query,
    run_sanity_check,
    generate_all_suggestions,
    render_terminal,
    LLMProvider,
    Cache,
)

async def main():
    provider = LLMProvider(model="gpt-4o-mini")
    cache = Cache()

    # Single-query ablation
    result = await run_ablation(
        prompt_text="You are a helpful assistant. ...",
        query="Hello, how can you help me?",
        provider=provider,
        cache=cache,
        runs=3,
    )

    render_terminal(result, model="gpt-4o-mini", runs=3)

    # Multi-query ablation (aggregated)
    aggregated, per_query = await run_ablation_multi_query(
        prompt_text="...",
        queries=["query 1", "query 2", "query 3"],
        provider=provider,
        runs=3,
    )

    # Sanity check — validate scoring reliability
    passed, score = await run_sanity_check(
        prompt_text="...",
        query="test query",
        provider=provider,
    )
    print(f"Sanity check: {'passed' if passed else 'FAILED'} (score={score:.2f})")

    # Get rewrite suggestions for dead sections
    suggestions = await generate_all_suggestions(
        section_results=result.sections,
        provider=provider,
        threshold=0.2,
    )
    for section_idx, rewrites in suggestions.items():
        print(f"Section {section_idx} suggestions:")
        for s in rewrites:
            print(f"  → {s}")

asyncio.run(main())

Development

git clone https://github.com/entropyvector/promptdebug.git
cd promptdebug
pip install -e ".[dev]"

# Run unit tests (762 tests, no API key required)
python -m pytest tests/ --ignore=tests/test_integration.py

# Run integration tests (requires OPENAI_API_KEY)
python -m pytest tests/test_integration.py -v

License

MIT

Third-Party Licenses

See THIRD_PARTY_LICENSES.md for a full list of dependencies and their licenses.

Release history

This version

0.2.0

Mar 10, 2026

Download files

Source Distribution

promptdebug-0.2.0.tar.gz (116.1 kB)

Uploaded Mar 10, 2026 Source

Built Distribution

promptdebug-0.2.0-py3-none-any.whl (37.1 kB)

Uploaded Mar 10, 2026 Python 3

File details

Details for the file promptdebug-0.2.0.tar.gz.

File metadata

Download URL: promptdebug-0.2.0.tar.gz
Upload date: Mar 10, 2026
Size: 116.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for promptdebug-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`2ea461516ccc97706919381c04efbd11ad9e7689c9af5e27ce88d0f36e0a6676`
MD5	`ca8dbf63bba266a755424d23e0f79f9c`
BLAKE2b-256	`653b6dcf57f92ec504fae0f858e7dd9e2f288848a49b6cfe05afa4be05290d5a`

Provenance

The following attestation bundles were made for promptdebug-0.2.0.tar.gz:

Publisher: publish.yml on entropyvector/promptdebug

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: promptdebug-0.2.0.tar.gz
- Subject digest: 2ea461516ccc97706919381c04efbd11ad9e7689c9af5e27ce88d0f36e0a6676
- Sigstore transparency entry: 1071939596
- Sigstore integration time: Mar 10, 2026
Source repository:
- Permalink: entropyvector/promptdebug@68c1468ef1c1ec5a5d8c36a6040500e872a554ba
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/entropyvector
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@68c1468ef1c1ec5a5d8c36a6040500e872a554ba
- Trigger Event: release

File details

Details for the file promptdebug-0.2.0-py3-none-any.whl.

File metadata

Download URL: promptdebug-0.2.0-py3-none-any.whl
Upload date: Mar 10, 2026
Size: 37.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for promptdebug-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c9c7c52837f7b3ad23251d2c91e6fc100aadc8b8b54544a334eb0fa03ebc05e`
MD5	`8327bcbaa8e14508653778ce66023a74`
BLAKE2b-256	`df155a3a6eb831737e652d93c09cd80106e172483f66afc697eca21f99071798`

Provenance

The following attestation bundles were made for promptdebug-0.2.0-py3-none-any.whl:

Publisher: publish.yml on entropyvector/promptdebug

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: promptdebug-0.2.0-py3-none-any.whl
- Subject digest: 2c9c7c52837f7b3ad23251d2c91e6fc100aadc8b8b54544a334eb0fa03ebc05e
- Sigstore transparency entry: 1071939718
- Sigstore integration time: Mar 10, 2026
Source repository:
- Permalink: entropyvector/promptdebug@68c1468ef1c1ec5a5d8c36a6040500e872a554ba
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/entropyvector
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@68c1468ef1c1ec5a5d8c36a6040500e872a554ba
- Trigger Event: release
