
Decorator-based integration testing for LLM prompts


prompt-tester

A Python library for testing LLM prompts with statistical reliability using the @prompt_run decorator.

When to use this

prompt-tester is designed for integration tests — tests that run against a real model with real API calls. It is not a mocking or unit-testing framework for LLM code.

The sweet spot is testing prompts that are part of a larger system:

  • Custom tools and function calling — verify the model actually calls the right tool with the right arguments under realistic conditions
  • MCP-connected agents — test prompts that drive an agentic loop against your live MCP server, not a stub
  • RAG pipelines — assert that retrieved context is used correctly in the final response
  • Multi-step workflows — validate that each prompt stage produces output fit for the next stage

Installation

pip install prompt-tester

Setup

API keys

ANTHROPIC_API_KEY=<your key>   # required for Anthropic models
GOOGLE_API_KEY=<your key>      # required for Gemini models

Keys are loaded automatically via python-dotenv from a .env file in your project root, or from environment variables directly.
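A quick pre-flight check can make a missing key obvious before any test runs. The sketch below uses only the standard library; `missing_keys` and `REQUIRED_KEYS` are hypothetical helpers, not part of prompt-tester. It inspects the process environment only, so keys that live solely in a .env file will show as missing unless something has already loaded them.

```python
import os

# Environment variable names taken from the API keys section above.
# The provider labels used as dict keys are made up for this sketch.
REQUIRED_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [
        REQUIRED_KEYS[p]
        for p in providers
        if not env.get(REQUIRED_KEYS[p])
    ]


# Example with an explicit mapping instead of the real environment:
example_env = {"GOOGLE_API_KEY": "dummy-value"}
absent = missing_keys(["anthropic", "google"], env=example_env)
print("Missing API keys:", ", ".join(absent) or "none")
```

Passing `env` explicitly keeps the helper easy to test; in a real test setup you would call `missing_keys([...])` with no `env` argument.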

Judge model

Call configure() once before your tests run — at the top of a test module, or wherever fits your project's test setup. Both parameters are required:

import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

If configure() has not been called, @prompt_run will raise a ConfigurationError with instructions when the test runs.

Function                                Description
configure(judge_model, judge_provider)  Set the judge model; both parameters are required
reset()                                 Clear configuration

Why multiple runs matter

LLM outputs are non-deterministic. A single test run is a point-in-time sample, not a reliable signal — the model may have produced the right answer by chance, or failed due to noise. Running a prompt N times and requiring a minimum pass rate gives you a statistically meaningful signal.
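To put rough numbers on this, the sketch below computes how often a test would pass if every run were independent and passed with a fixed probability p. That independence is a simplifying assumption, and `prob_test_passes` is a hypothetical helper, not part of the library.

```python
from math import ceil, comb


def prob_test_passes(p: float, runs: int, pass_threshold: float) -> float:
    """Probability that at least ceil(runs * pass_threshold) of `runs`
    independent runs pass, when each run passes with probability p."""
    needed = ceil(runs * pass_threshold)
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(needed, runs + 1)
    )


# A prompt that is right 90% of the time:
good_single = prob_test_passes(0.9, runs=1, pass_threshold=1.0)  # 0.9
good_multi = prob_test_passes(0.9, runs=5, pass_threshold=0.8)   # about 0.919

# A prompt that is right only half the time:
flaky_single = prob_test_passes(0.5, runs=1, pass_threshold=1.0)  # 0.5
flaky_multi = prob_test_passes(0.5, runs=5, pass_threshold=0.8)   # 0.1875
```

Under these assumptions, a coin-flip prompt slips through a single-run test half the time but passes the 4-of-5 test less than one time in five, while a mostly reliable prompt passes about as often as before.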

Use the runs and pass_threshold parameters on @prompt_run — each run executes in its own thread, and the decorator asserts the pass rate when all runs are done:

from pathlib import Path
import prompt_tester
from prompt_tester import prompt_run, Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

PROMPT = Path("prompts/compactor.md").read_text()
INPUT  = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
    runs             = 5,
    pass_threshold   = 0.8,   # at least 4 of 5 runs must pass
)
def test_compactor(run):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    alice, budget, deadline = run.ask_all([
        "Is Alice mentioned in the output?",
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert alice.passed,    alice.reasoning
    assert budget.passed,   budget.reasoning
    assert deadline.passed, deadline.reasoning
  • Each run gets its own thread — all 5 fire concurrently, so wall-clock time is roughly one run's worth.
  • Failed runs are printed with their run number before the final assertion so you can see exactly which ones failed and why.
  • pass_threshold uses math.ceil internally — with 5 runs and 0.8, you need at least 4 passes.
  • Tuning: raise pass_threshold toward 1.0 for hard requirements; lower it for prompts with known variance. Start at 0.8 and tighten once you have a baseline.
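The behaviour described in these bullets can be approximated with the standard library. This is an illustrative sketch, not the decorator's actual implementation: `run_with_threshold` and `flaky_check` are hypothetical names, and the check function stands in for a test body.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def run_with_threshold(check, runs: int, pass_threshold: float):
    """Call check(run_number) `runs` times concurrently.

    `check` signals failure by raising AssertionError, like a test body.
    Returns (passed, failures) where failures is a list of (run_number, reason).
    """
    def attempt(n):
        try:
            check(n)
            return None
        except AssertionError as exc:
            return (n, str(exc))

    # One worker per run, so all runs are in flight at the same time.
    with ThreadPoolExecutor(max_workers=runs) as pool:
        results = list(pool.map(attempt, range(1, runs + 1)))

    failures = [r for r in results if r is not None]
    required = math.ceil(runs * pass_threshold)
    passes = runs - len(failures)
    return passes >= required, failures


def flaky_check(n):
    # Stand-in for a test body: run 3 fails, every other run passes.
    assert n != 3, f"run {n}: output too long"


ok, failures = run_with_threshold(flaky_check, runs=5, pass_threshold=0.8)
for n, reason in failures:
    print(f"run {n} failed: {reason}")
print("passed" if ok else "failed")
```

With one failure out of five, 4 passes meet the required `math.ceil(5 * 0.8) = 4`, so the overall result is a pass; at `pass_threshold=1.0` the same run would fail.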

pytest.mark.parametrize alternative

If you want pytest to report each run as a separate test item in the output, use pytest.mark.parametrize with runs=1 (the default):

import pytest

@pytest.mark.parametrize("_", range(5))
@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
)
def test_compactor_always_concise(run, _):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

This gives you individual pass/fail per run in pytest's output but no pass-rate control: pytest reports each run as its own test item, so a single failing run fails the test session. Use it when each run has different inputs, or when you want strict all-or-nothing behaviour.


@prompt_run

Runs a prompt and injects the result into your test function as a PromptRun. Assert on anything — raw output, token counts, costs, or judge verdicts. With runs > 1 each run executes in its own thread and the decorator handles the pass-rate assertion.

from pathlib import Path
from prompt_tester import prompt_run, Model, Provider

PROMPT = Path("prompts/compactor.md").read_text()
INPUT  = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
    runs             = 5,
    pass_threshold   = 0.8,
)
def test_compactor_is_concise(run):
    # Assert on raw output
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    # Ask the judge a single yes/no question (one API call)
    alice = run.ask("Is Alice mentioned in the output?")
    assert alice.passed, alice.reasoning

    # Ask multiple questions in one API call
    budget, deadline = run.ask_all([
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert budget.passed,   budget.reasoning
    assert deadline.passed, deadline.reasoning

    # Assert on API metadata
    assert run.cost_usd is not None
    assert run.stop_reason in ("end_turn", "STOP")

Parameters

Parameter         Type             Default   Description
target_prompt     str              required  Prompt text. Use {key} placeholders for template_vars.
subject_model     Model | str      required  Model to run the prompt against.
subject_provider  Provider | str   required  Provider for subject_model: Provider.ANTHROPIC or Provider.GOOGLE.
template_vars     dict             {}        Values substituted into {key} placeholders before the call.
max_tokens        int              2048      Maximum output tokens.
runs              int              1         Number of times to run the prompt. Each run executes in its own thread.
pass_threshold    float            1.0       Fraction of runs that must pass (runs > 1 only). 0.8 = 80% must pass. Uses math.ceil so 5 runs × 0.8 requires 4 passes.
run_fn            callable | None  None      Custom executor for tool use, MCP, or agentic loops. See below.
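The description does not say how {key} placeholders are substituted. The sketch below assumes Python's str.format rules, which may differ from what the library does; `render` is a hypothetical helper used only to show the idea and one consequence of that assumption.

```python
def render(template: str, template_vars: dict) -> str:
    """Substitute {key} placeholders, assuming str.format semantics."""
    return template.format(**template_vars)


prompt = render(
    "Summarise the following text in one sentence:\n{text}",
    {"text": "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."},
)

# Under str.format, literal braces (for example a JSON sample in the prompt)
# must be doubled, otherwise they are parsed as placeholders.
json_prompt = render(
    'Reply as JSON like {{"summary": "..."}}. Text: {text}',
    {"text": "Deadline: Q3."},
)
print(json_prompt)
```

If your prompt files contain literal braces, it is worth checking the rendered text on `run.prompt` to confirm what was actually sent.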

Tool use and MCP agents

@prompt_run sends a single prompt and records the response. If your prompt drives an agentic loop — calling tools, querying an MCP server, or doing multiple model turns before producing a final answer — use the run_fn parameter.

run_fn replaces the built-in provider call entirely. It receives the rendered prompt and is responsible for running the full loop. When it returns, run.output holds the final model text and the judge evaluates that output via run.ask() as normal.

When run_fn is set, subject_provider is metadata only — it is recorded on the PromptRun for observability but does not control which SDK is invoked. Your run_fn owns the actual API calls.

run_fn must have the signature:

(prompt: str, model: str, max_tokens: int) -> CompletionResult

The point is not just "did a tool get called" — it is asserting on the quality of the final answer that emerged from the whole agentic process. Tool call checks are one assertion among many; the judge evaluates the end result.

Example — Anthropic + MCP tools

import anthropic
from prompt_tester import prompt_run, Model, Provider
from prompt_tester import CompletionResult

MCP_TOOLS = [...]   # tool schemas from your MCP server

def run_with_mcp(prompt: str, model: str, max_tokens: int) -> CompletionResult:
    client   = anthropic.Anthropic()
    messages = [{"role": "user", "content": prompt}]

    total_input = total_output = 0

    while True:
        response = client.messages.create(
            model      = model,
            max_tokens = max_tokens,
            tools      = MCP_TOOLS,
            messages   = messages,
        )
        total_input  += response.usage.input_tokens
        total_output += response.usage.output_tokens

        if response.stop_reason != "tool_use":
            # Take the first text block; fall back to "" if the reply has none
            final_text = next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )
            return CompletionResult(
                text          = final_text,
                input_tokens  = total_input,
                output_tokens = total_output,
                stop_reason   = response.stop_reason,
                model_used    = response.model,
                request_id    = response.id,
            )

        # Execute tool calls and feed results back
        # mcp_client = your MCP client connection (e.g. via mcp.ClientSession)
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = mcp_client.call_tool(block.name, block.input)  # noqa: F821
                tool_results.append({
                    "type":        "tool_result",
                    "tool_use_id": block.id,
                    "content":     result.content,
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user",      "content": tool_results})


PROMPT = "Use the search tool to find the capital of France, then summarise what you found."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.CLAUDE_SONNET_4_6,
    subject_provider = Provider.ANTHROPIC,
    run_fn           = run_with_mcp,
)
def test_agent_answers_correctly(run):
    # The judge evaluates the final answer — after all tool calls have completed.
    correct, cited = run.ask_all([
        "Does the response state that Paris is the capital of France?",
        "Does the response mention a source or search result?",
    ])
    assert correct.passed, correct.reasoning
    assert cited.passed,   cited.reasoning

    assert run.stop_reason == "end_turn"
    assert run.input_tokens > 0

What run_fn receives and must return

Name        Type              Description
prompt      str               Rendered prompt, with template_vars already substituted
model       str               subject_model value (string)
max_tokens  int               max_tokens from the decorator
return      CompletionResult  Import from prompt_tester

CompletionResult only requires text, input_tokens, and output_tokens. All other fields (stop_reason, model_used, request_id, etc.) are optional and appear on the PromptRun for metadata and assertions.
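The Anthropic example above loops with `while True`, so a model that keeps requesting tools would never return. One possible guard is a turn cap. The sketch below shows the shape of such a loop without any SDK: `ResultSketch` is a hypothetical stand-in for CompletionResult, and `bounded_loop` and `fake_step` are made-up names where `fake_step` replaces a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in for CompletionResult (required fields plus stop_reason)."""
    text: str
    input_tokens: int
    output_tokens: int
    stop_reason: Optional[str] = None


def bounded_loop(step: Callable[[int], dict], max_turns: int = 10) -> ResultSketch:
    """Call step(turn) until it stops requesting tools or max_turns is reached.

    `step` stands in for one model call and returns a dict with
    'text', 'input_tokens', 'output_tokens' and 'stop_reason'.
    """
    total_in = total_out = 0
    for turn in range(max_turns):
        reply = step(turn)
        # Accumulate usage across every turn, not just the last one.
        total_in += reply["input_tokens"]
        total_out += reply["output_tokens"]
        if reply["stop_reason"] != "tool_use":
            return ResultSketch(reply["text"], total_in, total_out, reply["stop_reason"])
    raise RuntimeError(f"no final answer after {max_turns} turns")


def fake_step(turn: int) -> dict:
    # Two tool-use turns, then a final answer.
    done = turn >= 2
    return {
        "text": "Paris is the capital of France." if done else "",
        "input_tokens": 100,
        "output_tokens": 20,
        "stop_reason": "end_turn" if done else "tool_use",
    }


result = bounded_loop(fake_step)
print(result.text, result.input_tokens, result.output_tokens)
```

Raising on the cap makes a runaway loop show up as a test error instead of a hang; returning a partial result would be another reasonable choice.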


PromptRun fields

Input

Field          Type            Description
prompt         str             Rendered prompt text, with template_vars placeholders already substituted.
template_vars  dict[str, Any]  Key/value pairs substituted into the prompt.

Model response

Field            Type        Description
output           str         The model's response text.
stop_reason      str | None  Why generation stopped. Anthropic: "end_turn", "max_tokens", "stop_sequence". Gemini: "STOP", "MAX_TOKENS", "SAFETY", "RECITATION", "OTHER".
safety_filtered  bool        True if the provider blocked the response. Output will be empty.

Model identity

Field          Type        Description
model          str         The subject_model value you passed.
provider       str         The subject_provider value you passed.
model_used     str | None  Actual model ID reported by the provider. May differ if the provider resolves an alias.
model_version  str | None  Provider version string. Populated by Gemini; None for Anthropic.
request_id     str | None  Provider request ID for log correlation.

Token usage

Field                  Type  Description
input_tokens           int   Tokens in the prompt.
output_tokens          int   Tokens in the response.
cached_input_tokens    int   Input tokens served from cache. 0 when not used.
cache_creation_tokens  int   Tokens written to cache (Anthropic only).
thoughts_tokens        int   Reasoning tokens (Gemini thinking models only).

Cost

Field     Type          Description
cost_usd  float | None  Total cost in USD. None if the model is not in the pricing table.

run.to_dict()   # all fields as a plain dict — safe to log or serialise
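
Because to_dict() returns a plain dict, one simple way to keep a history of runs is a JSON Lines file. The sketch below is an assumption about how you might do that, not a library feature: `append_jsonl` and `read_jsonl` are hypothetical helpers, and `sample` is a hand-written stand-in for what `run.to_dict()` might contain.

```python
import json
import tempfile
from pathlib import Path


def append_jsonl(path, record: dict) -> None:
    """Append one record as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        # default=str covers values json cannot encode natively (e.g. enums).
        fh.write(json.dumps(record, default=str) + "\n")


def read_jsonl(path) -> list:
    """Read every non-empty line back as a dict."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Stand-in for run.to_dict(); field names follow the tables above.
sample = {
    "output": "Alice leads Phoenix.",
    "input_tokens": 42,
    "output_tokens": 7,
    "cost_usd": None,
}

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runs.jsonl"
    append_jsonl(log, sample)
    append_jsonl(log, sample)
    records = read_jsonl(log)

print(len(records), records[0]["input_tokens"])
```

In a test you would call something like `append_jsonl("runs.jsonl", run.to_dict())`. Since runs with `runs > 1` execute in separate threads, concurrent appends to one file may need a lock.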

Judge verdicts — ask(), ask_all(), ask_parallel()

# One question — one API call
verdict = run.ask("Is Alice mentioned?")
assert verdict.passed, verdict.reasoning

# Multiple questions in one API call (cheapest, slight cross-contamination risk)
alice, budget = run.ask_all([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed,  alice.reasoning
assert budget.passed, budget.reasoning

# Multiple questions fired in parallel — one API call each, concurrently
alice, budget = run.ask_parallel([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed,  alice.reasoning
assert budget.passed, budget.reasoning

                     ask(q)        ask_all(qs)         ask_parallel(qs)
API calls            1             1 for all           1 per question, concurrent
Wall-clock time      -             fastest             same as 1 call
Cost                 -             lowest              higher (N calls)
Cross-contamination  none          low                 none
Return type          JudgeVerdict  list[JudgeVerdict]  list[JudgeVerdict]

When to use which:

  • ask_all — many independent checks where cost matters and cross-contamination is acceptable.
  • ask_parallel — questions that must be fully isolated from each other, without the latency of sequential calls.
  • ask — a single question, or when you need to branch on the result before asking the next.
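The difference between batching and fan-out can be seen with a toy judge that counts calls instead of contacting a model. Everything here is hypothetical: `FakeJudge` and its scoring rule are invented for this sketch and say nothing about how the library's judge works internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeJudge:
    """Stand-in judge that counts calls instead of contacting a model."""

    def __init__(self, text: str):
        self.text = text
        self.calls = 0
        self._lock = threading.Lock()

    def _call(self, questions):
        with self._lock:
            self.calls += 1
        # Toy rule: a question "passes" if its last word appears in the text.
        return [q.rstrip("?").split()[-1] in self.text for q in questions]

    def ask_all(self, questions):
        # One call carrying every question, so answers share one context.
        return self._call(questions)

    def ask_parallel(self, questions):
        # One call per question, issued concurrently; map() preserves order.
        with ThreadPoolExecutor(max_workers=len(questions)) as pool:
            return [r[0] for r in pool.map(lambda q: self._call([q]), questions)]


questions = ["Does it mention Alice?", "Does it mention Bob?"]
batched = FakeJudge("Alice leads Project Phoenix.")
isolated = FakeJudge("Alice leads Project Phoenix.")

batched_result = batched.ask_all(questions)
isolated_result = isolated.ask_parallel(questions)
print(batched.calls, isolated.calls)
```

Both paths return verdicts in question order; the batched judge makes one call while the isolated one makes one per question, which is where the cost and cross-contamination trade-off comes from.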

JudgeVerdict fields

Field                Type          Description
question             str           The question you asked.
answer               str           "yes" or "no".
passed               bool          True when the answer is "yes".
snippet              str | None    Verbatim excerpt the judge cited as evidence.
reasoning            str | None    The judge's stated reasoning.
judge_model          str           Model ID of the judge.
judge_provider       str | None    Provider of the judge.
judge_input_tokens   int           Input tokens for this verdict.
judge_output_tokens  int           Output tokens for this verdict.
judge_cost_usd       float | None  Cost for this verdict.

verdict.to_dict()   # all fields as a plain dict

Judge configuration

Call configure() once before your tests run. Both parameters are required — there is no default judge:

import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

Cost tracking

Token counts and USD cost are available on every PromptRun:

run.input_tokens / run.output_tokens / run.cost_usd    # prompt call
verdict.judge_input_tokens / verdict.judge_cost_usd    # judge call

Prices are defined in prompt_tester/config.py in the _PRICES table, keyed by Model enum. To add a model not in the table, add a Model member and a corresponding _PRICES entry.
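As an illustration of how a price table can turn token counts into a cost, here is a minimal sketch. The real _PRICES table is keyed by Model enum and its value format is not shown in this description, so the structure below (USD per million tokens, string keys) and the numbers in it are assumptions and placeholders, not real provider pricing.

```python
from typing import Optional

# Hypothetical prices in USD per million tokens: (input, output).
# Placeholder numbers and model names, invented for this sketch.
PRICES = {
    "example-model-small": (0.10, 0.40),
    "example-model-large": (3.00, 15.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Return the call cost in USD, or None when the model has no price entry."""
    entry = PRICES.get(model)
    if entry is None:
        return None
    input_price, output_price = entry
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


known = cost_usd("example-model-large", 1_000, 200)
unknown = cost_usd("unknown-model", 1_000, 200)
print(known, unknown)
```

Returning None for an unpriced model mirrors the documented behaviour of cost_usd, which is why tests that assert on cost should check for None first.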


Architecture

prompt_tester/
  __init__.py   Public API: prompt_run, PromptRun, JudgeVerdict, configure, reset
  config.py     Model/Provider enums, pricing table, configure() / reset()
  decorator.py  @prompt_run — runs prompt, injects PromptRun into test function
  models.py     Pydantic types: PromptRun, JudgeVerdict, UsageMetrics
  judge.py      Judge.evaluate() and evaluate_all() — LLM-as-judge
  llm/
    providers/
      base.py        LLMProvider ABC + CompletionResult (unified return type)
      anthropic.py   Anthropic SDK wrapper
      gemini.py      Google GenAI SDK wrapper
