Decorator-based integration testing for LLM prompts

prompt-tester

A Python library for testing LLM prompts with statistical reliability using the @prompt_run decorator.

When to use this

prompt-tester is designed for integration tests — tests that run against a real model with real API calls. It is not a mocking or unit-testing framework for LLM code.

The sweet spot is testing prompts that are part of a larger system:

Custom tools and function calling — verify the model actually calls the right tool with the right arguments under realistic conditions
MCP-connected agents — test prompts that drive an agentic loop against your live MCP server, not a stub
RAG pipelines — assert that retrieved context is used correctly in the final response
Multi-step workflows — validate that each prompt stage produces output fit for the next stage

Installation

pip install prompt-tester

Setup

API keys

ANTHROPIC_API_KEY=<your key>   # required for Anthropic models
GOOGLE_API_KEY=<your key>      # required for Gemini models

Keys are loaded automatically via python-dotenv from a .env file in your project root, or from environment variables directly.
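A missing key only surfaces once a real API call is attempted, so a small pre-flight check can make the failure easier to read. The sketch below is not part of prompt-tester: `missing_keys` and the `REQUIRED_KEYS` mapping are hypothetical helpers, and the provider names are plain strings chosen for illustration. Only the two environment variable names come from the setup notes above.

```python
import os

# Environment variable each provider needs (names taken from the setup notes above).
REQUIRED_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [
        REQUIRED_KEYS[p]
        for p in providers
        if not env.get(REQUIRED_KEYS[p])
    ]


if __name__ == "__main__":
    absent = missing_keys(["anthropic", "google"])
    if absent:
        print("Missing API keys:", ", ".join(absent))
    else:
        print("All provider keys are set.")
```

The check only inspects the process environment, so keys that live solely in a .env file would need to be loaded before calling it.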

Judge model

Call configure() once before your tests run — at the top of a test module, or wherever fits your project's test setup. Both parameters are required:

import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

If configure() has not been called, @prompt_run will raise a ConfigurationError with instructions when the test runs.

Function	Description
`configure(judge_model, judge_provider)`	Set the judge model — both params required
`reset()`	Clear configuration

Why multiple runs matter

LLM outputs are non-deterministic. A single test run is a point-in-time sample, not a reliable signal — the model may have produced the right answer by chance, or failed due to noise. Running a prompt N times and requiring a minimum pass rate gives you a statistically meaningful signal.
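To make that concrete, the following sketch computes how likely a test is to pass overall when each run passes independently with some true probability. It is ordinary binomial arithmetic, not part of prompt-tester: `prob_test_passes` is a hypothetical helper, the independence of runs is an assumption, and `required` stands for the minimum number of passing runs.

```python
from math import comb


def prob_test_passes(p: float, runs: int, required: int) -> float:
    """Probability that at least `required` of `runs` independent runs pass,
    when each run passes with probability `p` (binomial tail sum)."""
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(required, runs + 1)
    )


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):
        single = prob_test_passes(p, runs=1, required=1)
        multi = prob_test_passes(p, runs=5, required=4)
        print(f"true pass rate {p:.2f}: 1 run -> {single:.3f}, 4-of-5 -> {multi:.3f}")
```

Under these assumptions, a prompt that is right only half the time passes a single run 50% of the time but clears a 4-of-5 bar about 19% of the time, which is the practical benefit of repeated runs.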

Use the runs and pass_threshold parameters on @prompt_run — each run executes in its own thread, and the decorator asserts the pass rate when all runs are done:

from pathlib import Path
import prompt_tester
from prompt_tester import prompt_run, Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

PROMPT = Path("prompts/compactor.md").read_text()
INPUT  = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
    runs             = 5,
    pass_threshold   = 0.8,   # at least 4 of 5 runs must pass
)
def test_compactor(run):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    alice, budget, deadline = run.ask_all([
        "Is Alice mentioned in the output?",
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert alice.passed,    alice.reasoning
    assert budget.passed,   budget.reasoning
    assert deadline.passed, deadline.reasoning

Each run gets its own thread — all 5 fire concurrently, so wall-clock time is roughly one run's worth.
Failed runs are printed with their run number before the final assertion so you can see exactly which ones failed and why.
pass_threshold uses math.ceil internally — with 5 runs and 0.8, you need at least 4 passes.
Tuning: raise pass_threshold toward 1.0 for hard requirements; lower it for prompts with known variance. Start at 0.8 and tighten once you have a baseline.
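The behaviour described above (concurrent runs, then a pass-rate check rounded up with math.ceil) can be sketched with the standard library alone. This illustrates the pattern and is not prompt-tester's actual implementation: `run_with_threshold` and its `check` callback are hypothetical names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def run_with_threshold(check, runs: int, pass_threshold: float) -> tuple[int, int]:
    """Call check(run_number) once per run, each in its own worker thread.

    A run passes if check() returns without raising AssertionError.
    Returns (passed, required); raises AssertionError if passed < required.
    """
    required = math.ceil(runs * pass_threshold)

    def attempt(run_number: int) -> bool:
        try:
            check(run_number)
            return True
        except AssertionError as exc:
            # Report the failing run before the final pass-rate assertion.
            print(f"run {run_number} failed: {exc}")
            return False

    with ThreadPoolExecutor(max_workers=runs) as pool:
        results = list(pool.map(attempt, range(1, runs + 1)))

    passed = sum(results)
    assert passed >= required, f"{passed}/{runs} runs passed, need {required}"
    return passed, required


if __name__ == "__main__":
    def flaky(run_number: int) -> None:
        # Stand-in for a real test body: only run 3 fails.
        assert run_number != 3, "output too long"

    print(run_with_threshold(flaky, runs=5, pass_threshold=0.8))
```

With one failure in five runs the sketch still passes at 0.8, since four passes are required, but the same input fails at a threshold of 1.0.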

pytest.mark.parametrize alternative

If you want pytest to report each run as a separate test item in the output, use pytest.mark.parametrize with runs=1 (the default):

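import pytest

# PROMPT, INPUT, and the prompt_tester imports are the same as in the previous example.
@pytest.mark.parametrize("_", range(5))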
@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
)
def test_compactor_always_concise(run, _):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

This gives you an individual pass/fail per run in pytest's output but no pass-rate control: every parametrized item must pass, so a single failing run fails the test session. Use it when each run has different inputs, or when you want strict all-or-nothing behaviour.

`@prompt_run`

Runs a prompt and injects the result into your test function as a PromptRun. Assert on anything — raw output, token counts, costs, or judge verdicts. With runs > 1 each run executes in its own thread and the decorator handles the pass-rate assertion.

from pathlib import Path
from prompt_tester import prompt_run, Model, Provider

PROMPT = Path("prompts/compactor.md").read_text()
INPUT  = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
    runs             = 5,
    pass_threshold   = 0.8,
)
def test_compactor_is_concise(run):
    # Assert on raw output
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    # Ask the judge a single yes/no question (one API call)
    alice = run.ask("Is Alice mentioned in the output?")
    assert alice.passed, alice.reasoning

    # Ask multiple questions in one API call
    budget, deadline = run.ask_all([
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert budget.passed,   budget.reasoning
    assert deadline.passed, deadline.reasoning

    # Assert on API metadata
    assert run.cost_usd is not None
    assert run.stop_reason in ("end_turn", "STOP")

Parameters

Parameter	Type	Default	Description
`target_prompt`	`str`	required	Prompt text. Use `{key}` placeholders for `template_vars`.
`subject_model`	`Model \| str`	required	Model to run the prompt against.
`subject_provider`	`Provider \| str`	required	Provider for `subject_model`: `Provider.ANTHROPIC` or `Provider.GOOGLE`.
`template_vars`	`dict`	`{}`	Values substituted into `{key}` placeholders before the call.
`max_tokens`	`int`	`2048`	Maximum output tokens.
`runs`	`int`	`1`	Number of times to run the prompt. Each run executes in its own thread.
`pass_threshold`	`float`	`1.0`	Fraction of runs that must pass (`runs > 1` only). `0.8` = 80% must pass. Uses `math.ceil` so 5 runs × 0.8 requires 4 passes.
`run_fn`	`callable \| None`	`None`	Custom executor for tool use, MCP, or agentic loops. See below.
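As an illustration of what "rendered" means for `target_prompt` and `template_vars`, the sketch below substitutes `{key}` placeholders by plain string replacement. How prompt-tester performs the substitution internally is not stated above, so treat `render_prompt` as a hypothetical stand-in, including its choice to leave unknown placeholders untouched.

```python
def render_prompt(template: str, template_vars: dict) -> str:
    """Replace each {key} placeholder with str(value); other text is left as-is."""
    rendered = template
    for key, value in template_vars.items():
        rendered = rendered.replace("{" + key + "}", str(value))
    return rendered


if __name__ == "__main__":
    template = "Compress the following text:\n\n{text}"
    print(render_prompt(template, {"text": "Alice leads Project Phoenix."}))
```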

Tool use and MCP agents

@prompt_run sends a single prompt and records the response. If your prompt drives an agentic loop — calling tools, querying an MCP server, or doing multiple model turns before producing a final answer — use the run_fn parameter.

run_fn replaces the built-in provider call entirely. It receives the rendered prompt and is responsible for running the full loop. When it returns, run.output holds the final model text and the judge evaluates that output via run.ask() as normal.

When run_fn is set, subject_provider is metadata only — it is recorded on the PromptRun for observability but does not control which SDK is invoked. Your run_fn owns the actual API calls.

Expected signature: (prompt: str, model: str, max_tokens: int) -> CompletionResult

The point is not just "did a tool get called" — it is asserting on the quality of the final answer that emerged from the whole agentic process. Tool call checks are one assertion among many; the judge evaluates the end result.

Example — Anthropic + MCP tools

import anthropic
from prompt_tester import prompt_run, Model, Provider
from prompt_tester import CompletionResult

MCP_TOOLS = [...]   # tool schemas from your MCP server

def run_with_mcp(prompt: str, model: str, max_tokens: int) -> CompletionResult:
    client   = anthropic.Anthropic()
    messages = [{"role": "user", "content": prompt}]

    total_input = total_output = 0

    while True:
        response = client.messages.create(
            model      = model,
            max_tokens = max_tokens,
            tools      = MCP_TOOLS,
            messages   = messages,
        )
        total_input  += response.usage.input_tokens
        total_output += response.usage.output_tokens

        if response.stop_reason != "tool_use":
            final_text = next(
                (b.text for b in response.content if hasattr(b, "text")),
                "",  # fall back to empty text if the reply contains no text block
            )
            return CompletionResult(
                text          = final_text,
                input_tokens  = total_input,
                output_tokens = total_output,
                stop_reason   = response.stop_reason,
                model_used    = response.model,
                request_id    = response.id,
            )

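        # Execute tool calls and feed results back.
        # mcp_client is a placeholder for your MCP client connection; it is not
        # defined in this snippet. mcp.ClientSession.call_tool is a coroutine, so
        # a synchronous wrapper around it is assumed here.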
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = mcp_client.call_tool(block.name, block.input)  # noqa: F821
                tool_results.append({
                    "type":        "tool_result",
                    "tool_use_id": block.id,
                    "content":     result.content,
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user",      "content": tool_results})


PROMPT = "Use the search tool to find the capital of France, then summarise what you found."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.CLAUDE_SONNET_4_6,
    subject_provider = Provider.ANTHROPIC,
    run_fn           = run_with_mcp,
)
def test_agent_answers_correctly(run):
    # The judge evaluates the final answer — after all tool calls have completed.
    correct, cited = run.ask_all([
        "Does the response state that Paris is the capital of France?",
        "Does the response mention a source or search result?",
    ])
    assert correct.passed, correct.reasoning
    assert cited.passed,   cited.reasoning

    assert run.stop_reason == "end_turn"
    assert run.input_tokens > 0

What `run_fn` receives and must return

Name	Type	Description
`prompt`	`str`	Rendered prompt — `template_vars` already substituted
`model`	`str`	`subject_model` value (string)
`max_tokens`	`int`	`max_tokens` from the decorator
return	`CompletionResult`	Import from `prompt_tester`

CompletionResult only requires text, input_tokens, and output_tokens. All other fields (stop_reason, model_used, request_id, etc.) are optional and appear on the PromptRun for metadata and assertions.
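The contract can be exercised without any network access. The sketch below uses a local dataclass as a stand-in for `CompletionResult` so that it runs on its own; in a real test you would import the class from `prompt_tester` instead. `Turn` and `combine_turns` are hypothetical names, and the stand-in models only the three required fields plus `stop_reason`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletionResult:
    """Local stand-in mirroring the required fields described above."""
    text: str
    input_tokens: int
    output_tokens: int
    stop_reason: Optional[str] = None


@dataclass
class Turn:
    """One model call inside an agentic loop (hypothetical helper type)."""
    text: str
    input_tokens: int
    output_tokens: int
    stop_reason: str


def combine_turns(turns: list[Turn]) -> CompletionResult:
    """Fold a multi-turn loop into one result: final text, summed token usage."""
    if not turns:
        raise ValueError("at least one turn is required")
    last = turns[-1]
    return CompletionResult(
        text=last.text,
        input_tokens=sum(t.input_tokens for t in turns),
        output_tokens=sum(t.output_tokens for t in turns),
        stop_reason=last.stop_reason,
    )


if __name__ == "__main__":
    turns = [
        Turn("", 120, 30, "tool_use"),
        Turn("Paris is the capital of France.", 180, 12, "end_turn"),
    ]
    print(combine_turns(turns))
```

Summing usage across turns mirrors what the Anthropic example does with `total_input` and `total_output`, so the token counts on the resulting PromptRun reflect the whole loop rather than only the last call.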

`PromptRun` fields

Input

Field	Type	Description
`prompt`	`str`	Rendered prompt text — `template_vars` placeholders already substituted.
`template_vars`	`dict[str, Any]`	Key/value pairs substituted into the prompt.

Model response

Field	Type	Description
`output`	`str`	The model's response text.
`stop_reason`	`str \| None`	Why generation stopped. Anthropic: `"end_turn"`, `"max_tokens"`, `"stop_sequence"`. Gemini: `"STOP"`, `"MAX_TOKENS"`, `"SAFETY"`, `"RECITATION"`, `"OTHER"`.
`safety_filtered`	`bool`	`True` if the provider blocked the response. Output will be empty.
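Since the two providers report different strings, a test that should work with either can normalise them first. The helper below is a hypothetical convenience, not part of prompt-tester; the string values are the ones listed in the `stop_reason` row above.

```python
from typing import Optional

# Stop reasons that mean the model finished on its own.
NATURAL_STOPS = {"end_turn", "STOP"}
# Stop reasons that mean the output was cut off by the token limit.
TRUNCATED_STOPS = {"max_tokens", "MAX_TOKENS"}


def classify_stop(stop_reason: Optional[str]) -> str:
    """Map a provider-specific stop reason to 'complete', 'truncated', or 'other'."""
    if stop_reason in NATURAL_STOPS:
        return "complete"
    if stop_reason in TRUNCATED_STOPS:
        return "truncated"
    return "other"


if __name__ == "__main__":
    for reason in ("end_turn", "STOP", "MAX_TOKENS", "SAFETY", None):
        print(reason, "->", classify_stop(reason))
```

Inside a test this would read `assert classify_stop(run.stop_reason) == "complete"`.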

Model identity

Field	Type	Description
`model`	`str`	The `subject_model` value you passed.
`provider`	`str`	The `subject_provider` value you passed.
`model_used`	`str \| None`	Actual model ID reported by the provider. May differ if the provider resolves an alias.
`model_version`	`str \| None`	Provider version string. Populated by Gemini; `None` for Anthropic.
`request_id`	`str \| None`	Provider request ID for log correlation.

Token usage

Field	Type	Description
`input_tokens`	`int`	Tokens in the prompt.
`output_tokens`	`int`	Tokens in the response.
`cached_input_tokens`	`int`	Input tokens served from cache. `0` when not used.
`cache_creation_tokens`	`int`	Tokens written to cache (Anthropic only).
`thoughts_tokens`	`int`	Reasoning tokens (Gemini thinking models only).

Cost

Field	Type	Description
`cost_usd`	`float \| None`	Total cost in USD. `None` if the model is not in the pricing table.

run.to_dict()   # all fields as a plain dict — safe to log or serialise

Judge verdicts — `ask()`, `ask_all()`, `ask_parallel()`

# One question — one API call
verdict = run.ask("Is Alice mentioned?")
assert verdict.passed, verdict.reasoning

# Multiple questions in one API call (cheapest, slight cross-contamination risk)
alice, budget = run.ask_all([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed,  alice.reasoning
assert budget.passed, budget.reasoning

# Multiple questions fired in parallel — one API call each, concurrently
alice, budget = run.ask_parallel([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed,  alice.reasoning
assert budget.passed, budget.reasoning

Property	`ask(q)`	`ask_all(qs)`	`ask_parallel(qs)`
API calls	1	1 for all	1 per question, concurrent
Wall-clock time	—	fastest	same as 1 call
Cost	—	lowest	higher (N calls)
Cross-contamination	none	low	none
Return type	`JudgeVerdict`	`list[JudgeVerdict]`	`list[JudgeVerdict]`

When to use which:

ask_all — many independent checks where cost matters and cross-contamination is acceptable.
ask_parallel — questions that must be fully isolated from each other, without the latency of sequential calls.
ask — a single question, or when you need to branch on the result before asking the next.
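The cost difference grows with `runs`, because every run asks its own questions. The sketch below gives a rough count of judge API calls per test, based only on the "API calls" row above. `judge_calls` is a hypothetical helper, and retries or provider-side batching are ignored.

```python
def judge_calls(runs: int, questions: int, method: str) -> int:
    """Judge API calls for one test, given how its questions are asked."""
    per_run = {
        "ask": questions,           # one call per question, sequential
        "ask_all": 1,               # one call covering every question
        "ask_parallel": questions,  # one call per question, concurrent
    }
    if method not in per_run:
        raise ValueError(f"unknown method: {method}")
    return runs * per_run[method]


if __name__ == "__main__":
    for method in ("ask", "ask_all", "ask_parallel"):
        print(method, judge_calls(runs=5, questions=3, method=method))
```

With 5 runs and 3 questions, `ask_all` makes 5 judge calls where the other two make 15.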

`JudgeVerdict` fields

Field	Type	Description
`question`	`str`	The question you asked.
`answer`	`str`	`"yes"` or `"no"`.
`passed`	`bool`	`True` when the answer is `"yes"`.
`snippet`	`str \| None`	Verbatim excerpt the judge cited as evidence.
`reasoning`	`str \| None`	The judge's stated reasoning.
`judge_model`	`str`	Model ID of the judge.
`judge_provider`	`str \| None`	Provider of the judge.
`judge_input_tokens`	`int`	Input tokens for this verdict.
`judge_output_tokens`	`int`	Output tokens for this verdict.
`judge_cost_usd`	`float \| None`	Cost for this verdict.

verdict.to_dict()   # all fields as a plain dict

Judge configuration

Call configure() once before your tests run. Both parameters are required — there is no default judge:

import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

Cost tracking

Token counts and USD cost are available on every PromptRun:

run.input_tokens / run.output_tokens / run.cost_usd    # prompt call
verdict.judge_input_tokens / verdict.judge_cost_usd    # judge call
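To total the spend of a test, the prompt cost and the judge costs have to be added together, keeping in mind that either can be `None` when a model is missing from the pricing table. A minimal sketch, with `total_cost_usd` as a hypothetical helper that takes plain values rather than library objects:

```python
from typing import Iterable, Optional


def total_cost_usd(
    run_cost: Optional[float],
    judge_costs: Iterable[Optional[float]],
) -> Optional[float]:
    """Sum prompt and judge costs; return None if any component is unknown."""
    costs = [run_cost, *judge_costs]
    if any(c is None for c in costs):
        return None
    return sum(costs)


if __name__ == "__main__":
    print(total_cost_usd(0.0012, [0.0003, 0.0004]))
    print(total_cost_usd(0.0012, [None]))
```

In a test this might be called as `total_cost_usd(run.cost_usd, [v.judge_cost_usd for v in verdicts])`, where `verdicts` is the list returned by `ask_all` or `ask_parallel`.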

Prices are defined in prompt_tester/config.py in the _PRICES table, keyed by Model enum. To add a model not in the table, add a Model member and a corresponding _PRICES entry.

Architecture

prompt_tester/
  __init__.py   Public API: prompt_run, PromptRun, JudgeVerdict, configure, reset
  config.py     Model/Provider enums, pricing table, configure() / reset()
  decorator.py  @prompt_run — runs prompt, injects PromptRun into test function
  models.py     Pydantic types: PromptRun, JudgeVerdict, UsageMetrics
  judge.py      Judge.evaluate() and evaluate_all() — LLM-as-judge
  llm/
    providers/
      base.py        LLMProvider ABC + CompletionResult (unified return type)
      anthropic.py   Anthropic SDK wrapper
      gemini.py      Google GenAI SDK wrapper

Release history

0.2.0

May 21, 2026

This version

0.1.0

May 21, 2026

Download files

Source Distribution

prompt_tester-0.1.0.tar.gz (34.9 kB view details)

Uploaded May 21, 2026 Source

Built Distribution

prompt_tester-0.1.0-py3-none-any.whl (25.2 kB view details)

Uploaded May 21, 2026 Python 3

File details

Details for the file prompt_tester-0.1.0.tar.gz.

File metadata

Download URL: prompt_tester-0.1.0.tar.gz
Upload date: May 21, 2026
Size: 34.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for prompt_tester-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6a6950d87389aa47891f6ef5cf666fa2a62f7771b50e851c7792273391a2eb4e`
MD5	`7a3f3b39f5ea5bc1ed90e4e604383d8d`
BLAKE2b-256	`3dcf4685375fc2e3dd2d6ceda2a03c58635d8ffe6d8bb6d366c9d72649834d35`

File details

Details for the file prompt_tester-0.1.0-py3-none-any.whl.

File metadata

Download URL: prompt_tester-0.1.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 25.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for prompt_tester-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8c9ff19203e3b637ad147d31e110c4e18f43f9c4bae5026dd157ac2e1b3043c1`
MD5	`bf729657498b6d0e10d398698bbd0f0d`
BLAKE2b-256	`95c97e3b324dc274c4b6ed9ef9cfce135061870e0859f49f7b1abc4857030dc6`
