Decorator-based integration testing for LLM prompts
prompt-tester
A Python library for testing LLM prompts with statistical reliability using the @prompt_run decorator.
When to use this
prompt-tester is designed for integration tests — tests that run against a real model with real API calls. It is not a mocking or unit-testing framework for LLM code.
The sweet spot is testing prompts that are part of a larger system:
- Custom tools and function calling — verify the model actually calls the right tool with the right arguments under realistic conditions
- MCP-connected agents — test prompts that drive an agentic loop against your live MCP server, not a stub
- RAG pipelines — assert that retrieved context is used correctly in the final response
- Multi-step workflows — validate that each prompt stage produces output fit for the next stage
Installation
```bash
pip install prompt-tester
```
Setup
API keys
```
ANTHROPIC_API_KEY=<your key>  # required for Anthropic models
GOOGLE_API_KEY=<your key>     # required for Gemini models
```
Keys are loaded automatically via python-dotenv from a .env file in your project root, or from environment variables directly.
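A missing key otherwise only shows up once a real API call is attempted, so it can help to check the environment up front. The following is a minimal sketch and not part of prompt-tester: the helper name `missing_api_keys` is hypothetical, and only the two variable names come from the list above.

```python
import os
from typing import Mapping, Sequence

# Variable names taken from the API keys list above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_KEYS,
) -> list[str]:
    """Return the required key names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]
```

If you only use one provider, pass just that provider's variable as `required`, for example `missing_api_keys(required=["GOOGLE_API_KEY"])`.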
Judge model
Call configure() once before your tests run — at the top of a test module, or wherever fits your project's test setup. Both parameters are required:
```python
import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model=Model.GEMINI_2_5_FLASH,
    judge_provider=Provider.GOOGLE,
)
```
If configure() has not been called, @prompt_run will raise a ConfigurationError with instructions when the test runs.
| Function | Description |
|---|---|
| `configure(judge_model, judge_provider)` | Set the judge model; both parameters are required |
| `reset()` | Clear configuration |
Why multiple runs matter
LLM outputs are non-deterministic. A single test run is a point-in-time sample, not a reliable signal — the model may have produced the right answer by chance, or failed due to noise. Running a prompt N times and requiring a minimum pass rate gives you a statistically meaningful signal.
Use the runs and pass_threshold parameters on @prompt_run — each run executes in its own thread, and the decorator asserts the pass rate when all runs are done:
```python
from pathlib import Path

import prompt_tester
from prompt_tester import prompt_run, Model, Provider

prompt_tester.configure(
    judge_model=Model.GEMINI_2_5_FLASH,
    judge_provider=Provider.GOOGLE,
)

PROMPT = Path("prompts/compactor.md").read_text()
INPUT = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."


@prompt_run(
    target_prompt=PROMPT,
    subject_model=Model.GEMINI_3_1_FLASH_LITE,
    subject_provider=Provider.GOOGLE,
    template_vars={"text": INPUT},
    runs=5,
    pass_threshold=0.8,  # at least 4 of 5 runs must pass
)
def test_compactor(run):
    # The compacted output must be under 60% of the input length
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    # One judge call covering all three questions
    alice, budget, deadline = run.ask_all([
        "Is Alice mentioned in the output?",
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert alice.passed, alice.reasoning
    assert budget.passed, budget.reasoning
    assert deadline.passed, deadline.reasoning
```
- Each run gets its own thread — all 5 fire concurrently, so wall-clock time is roughly one run's worth.
- Failed runs are printed with their run number before the final assertion so you can see exactly which ones failed and why.
- `pass_threshold` uses `math.ceil` internally: with 5 runs and 0.8, you need at least 4 passes.
- Tuning: raise `pass_threshold` toward `1.0` for hard requirements; lower it for prompts with known variance. Start at `0.8` and tighten once you have a baseline.
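To make the threshold arithmetic concrete, here is a small sketch of the rounding rule described above, plus a binomial estimate of how often a test would pass for a given per-run success rate. Both function names are hypothetical, and the estimate assumes runs are independent with the same success rate, which is a simplification.

```python
import math


def required_passes(runs: int, pass_threshold: float) -> int:
    """Minimum number of passing runs, rounding up as described above."""
    return math.ceil(runs * pass_threshold)


def chance_of_passing(per_run_success: float, runs: int, pass_threshold: float) -> float:
    """Probability that at least required_passes() of `runs` succeed.

    Assumes independent runs that all share the same success rate.
    """
    k = required_passes(runs, pass_threshold)
    p = per_run_success
    return sum(
        math.comb(runs, i) * p**i * (1 - p) ** (runs - i)
        for i in range(k, runs + 1)
    )
```

With `runs=5` and `pass_threshold=0.8`, a prompt that succeeds 90% of the time passes the test with probability of roughly 0.92, while one that succeeds 60% of the time passes with probability of roughly 0.34. That gap is what a single run cannot show.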
pytest.mark.parametrize alternative
If you want pytest to report each run as a separate test item in the output, use pytest.mark.parametrize with runs=1 (the default):
```python
import pytest


@pytest.mark.parametrize("_", range(5))
@prompt_run(
    target_prompt=PROMPT,
    subject_model=Model.GEMINI_3_1_FLASH_LITE,
    subject_provider=Provider.GOOGLE,
    template_vars={"text": INPUT},
)
def test_compactor_always_concise(run, _):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6
```
This gives you individual pass/fail per run in pytest's output but no pass-rate control — one failure fails the whole parametrized group. Use it when each run has different inputs, or when you want strict all-or-nothing behaviour.
@prompt_run
Runs a prompt and injects the result into your test function as a PromptRun. Assert on anything — raw output, token counts, costs, or judge verdicts. With runs > 1 each run executes in its own thread and the decorator handles the pass-rate assertion.
```python
from pathlib import Path

from prompt_tester import prompt_run, Model, Provider

PROMPT = Path("prompts/compactor.md").read_text()
INPUT = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."


@prompt_run(
    target_prompt=PROMPT,
    subject_model=Model.GEMINI_3_1_FLASH_LITE,
    subject_provider=Provider.GOOGLE,
    template_vars={"text": INPUT},
    runs=5,
    pass_threshold=0.8,
)
def test_compactor_is_concise(run):
    # Assert on raw output
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    # Ask the judge a single yes/no question (one API call)
    alice = run.ask("Is Alice mentioned in the output?")
    assert alice.passed, alice.reasoning

    # Ask multiple questions in one API call
    budget, deadline = run.ask_all([
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert budget.passed, budget.reasoning
    assert deadline.passed, deadline.reasoning

    # Assert on API metadata
    assert run.cost_usd is not None
    assert run.stop_reason in ("end_turn", "STOP")
```
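The run-in-threads-then-assert-pass-rate behaviour described above can be pictured with the standard library alone. This is an illustrative sketch under stated assumptions, not the decorator's actual implementation: `run_with_pass_rate` and its `check` callback are hypothetical names, and the real decorator also performs the model call and injects a PromptRun.

```python
from concurrent.futures import ThreadPoolExecutor
import math
from typing import Callable


def run_with_pass_rate(
    check: Callable[[int], None], runs: int, pass_threshold: float
) -> list[tuple[int, str]]:
    """Call `check(run_number)` once per run, each in its own thread.

    Returns (run_number, message) for every failed run and raises
    AssertionError if fewer than ceil(runs * pass_threshold) runs passed.
    """

    def attempt(n: int):
        try:
            check(n)
            return None
        except AssertionError as exc:
            return (n, str(exc))

    # One worker per run, so all runs start concurrently
    with ThreadPoolExecutor(max_workers=runs) as pool:
        outcomes = list(pool.map(attempt, range(1, runs + 1)))

    failures = [o for o in outcomes if o is not None]
    for n, message in failures:
        print(f"run {n} failed: {message}")

    passed = runs - len(failures)
    needed = math.ceil(runs * pass_threshold)
    assert passed >= needed, f"{passed}/{runs} runs passed, need {needed}"
    return failures
```

Collecting failures instead of raising on the first one is what allows a single noisy run to be reported without failing the whole test.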
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `target_prompt` | `str` | required | Prompt text. Use `{key}` placeholders for `template_vars`. |
| `subject_model` | `Model \| str` | required | Model to run the prompt against. |
| `subject_provider` | `Provider \| str` | required | Provider for `subject_model`: `Provider.ANTHROPIC` or `Provider.GOOGLE`. |
| `template_vars` | `dict` | `{}` | Values substituted into `{key}` placeholders before the call. |
| `max_tokens` | `int` | `2048` | Maximum output tokens. |
| `runs` | `int` | `1` | Number of times to run the prompt. Each run executes in its own thread. |
| `pass_threshold` | `float` | `1.0` | Fraction of runs that must pass (`runs > 1` only). `0.8` = 80% must pass. Uses `math.ceil`, so 5 runs × 0.8 requires 4 passes. |
| `run_fn` | `callable \| None` | `None` | Custom executor for tool use, MCP, or agentic loops. See below. |
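Since every run costs a real API call, it can be worth checking that each `{key}` placeholder has a matching entry in `template_vars` before the test starts. The sketch below assumes the placeholders follow Python's `str.format` syntax, which this README does not state explicitly; both helper names are hypothetical.

```python
from string import Formatter


def placeholder_names(prompt: str) -> set[str]:
    """Names of the {key} placeholders in `prompt` (str.format syntax assumed)."""
    return {name for _, name, _, _ in Formatter().parse(prompt) if name}


def missing_template_vars(prompt: str, template_vars: dict) -> set[str]:
    """Placeholders that have no matching entry in `template_vars`."""
    return placeholder_names(prompt) - set(template_vars)
```

For example, `missing_template_vars("Summarise: {text} for {audience}", {"text": "..."})` returns `{"audience"}`. Under the same `str.format` assumption, literal braces in a prompt would need to be doubled (`{{` and `}}`).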
Tool use and MCP agents
@prompt_run sends a single prompt and records the response. If your prompt drives an agentic loop — calling tools, querying an MCP server, or doing multiple model turns before producing a final answer — use the run_fn parameter.
run_fn replaces the built-in provider call entirely. It receives the rendered prompt and is responsible for running the full loop. When it returns, run.output holds the final model text and the judge evaluates that output via run.ask() as normal.
When run_fn is set, subject_provider is metadata only — it is recorded on the PromptRun for observability but does not control which SDK is invoked. Your run_fn owns the actual API calls.
run_fn must accept and return the following:

```
(prompt: str, model: str, max_tokens: int) -> CompletionResult
```
The point is not just "did a tool get called" — it is asserting on the quality of the final answer that emerged from the whole agentic process. Tool call checks are one assertion among many; the judge evaluates the end result.
Example — Anthropic + MCP tools
```python
import anthropic
from prompt_tester import prompt_run, Model, Provider
from prompt_tester import CompletionResult

MCP_TOOLS = [...]  # tool schemas from your MCP server


def run_with_mcp(prompt: str, model: str, max_tokens: int) -> CompletionResult:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": prompt}]
    total_input = total_output = 0

    while True:
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            tools=MCP_TOOLS,
            messages=messages,
        )
        # Accumulate usage across every turn of the loop
        total_input += response.usage.input_tokens
        total_output += response.usage.output_tokens

        if response.stop_reason != "tool_use":
            # First text block of the final turn; empty string if there is none
            final_text = next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )
            return CompletionResult(
                text=final_text,
                input_tokens=total_input,
                output_tokens=total_output,
                stop_reason=response.stop_reason,
                model_used=response.model,
                request_id=response.id,
            )

        # Execute tool calls and feed results back
        # mcp_client = your MCP client connection (e.g. via mcp.ClientSession)
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = mcp_client.call_tool(block.name, block.input)  # noqa: F821
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result.content,
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})


PROMPT = "Use the search tool to find the capital of France, then summarise what you found."


@prompt_run(
    target_prompt=PROMPT,
    subject_model=Model.CLAUDE_SONNET_4_6,
    subject_provider=Provider.ANTHROPIC,
    run_fn=run_with_mcp,
)
def test_agent_answers_correctly(run):
    # The judge evaluates the final answer, after all tool calls have completed.
    correct, cited = run.ask_all([
        "Does the response state that Paris is the capital of France?",
        "Does the response mention a source or search result?",
    ])
    assert correct.passed, correct.reasoning
    assert cited.passed, cited.reasoning
    assert run.stop_reason == "end_turn"
    assert run.input_tokens > 0
```
What run_fn receives and must return
| Name | Type | Description |
|---|---|---|
| `prompt` | `str` | Rendered prompt, with `template_vars` already substituted |
| `model` | `str` | `subject_model` value (string) |
| `max_tokens` | `int` | `max_tokens` from the decorator |
| return | `CompletionResult` | Import from `prompt_tester` |
CompletionResult only requires text, input_tokens, and output_tokens. All other fields (stop_reason, model_used, request_id, etc.) are optional and appear on the PromptRun for metadata and assertions.
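The shape of that minimal return value can be shown without any provider SDK. In the sketch below, `StandInResult` is a hypothetical stand-in that carries only the three required fields; a real run_fn must return prompt_tester's own CompletionResult. The helper `summarise_turns` is also hypothetical and mirrors the Anthropic example: the last turn supplies the text, and token counts are summed over all turns.

```python
from dataclasses import dataclass


@dataclass
class StandInResult:
    """Hypothetical stand-in for CompletionResult: required fields only."""

    text: str
    input_tokens: int
    output_tokens: int


def summarise_turns(turns: list[dict]) -> StandInResult:
    """Collapse per-turn records into one result.

    Each turn is a dict with 'text', 'input_tokens' and 'output_tokens'.
    """
    if not turns:
        raise ValueError("an agentic loop must produce at least one turn")
    return StandInResult(
        text=turns[-1]["text"],
        input_tokens=sum(t["input_tokens"] for t in turns),
        output_tokens=sum(t["output_tokens"] for t in turns),
    )
```

Summing usage across turns keeps the token counts representative of the whole loop rather than only the final call.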
PromptRun fields
Input
| Field | Type | Description |
|---|---|---|
| `prompt` | `str` | Rendered prompt text, with `template_vars` placeholders already substituted. |
| `template_vars` | `dict[str, Any]` | Key/value pairs substituted into the prompt. |
Model response
| Field | Type | Description |
|---|---|---|
| `output` | `str` | The model's response text. |
| `stop_reason` | `str \| None` | Why generation stopped. Anthropic: `"end_turn"`, `"max_tokens"`, `"stop_sequence"`. Gemini: `"STOP"`, `"MAX_TOKENS"`, `"SAFETY"`, `"RECITATION"`, `"OTHER"`. |
| `safety_filtered` | `bool` | `True` if the provider blocked the response. Output will be empty. |
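Because the two providers report different strings, assertions on `stop_reason` are easier to share across providers with a small mapping. The sketch below uses only the values listed in the table; the function name and the three category labels are hypothetical, and counting `"stop_sequence"` as a normal completion is a judgment call you may want to change.

```python
from typing import Optional

# Values from the table above; anything else is treated as "other".
NORMAL_STOP_REASONS = {"end_turn", "stop_sequence", "STOP"}
TRUNCATED_STOP_REASONS = {"max_tokens", "MAX_TOKENS"}


def classify_stop(stop_reason: Optional[str]) -> str:
    """Map a provider stop reason to 'complete', 'truncated' or 'other'."""
    if stop_reason in NORMAL_STOP_REASONS:
        return "complete"
    if stop_reason in TRUNCATED_STOP_REASONS:
        return "truncated"
    return "other"
```

A test can then write `assert classify_stop(run.stop_reason) == "complete"` regardless of which provider ran the prompt.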
Model identity
| Field | Type | Description |
|---|---|---|
| `model` | `str` | The `subject_model` value you passed. |
| `provider` | `str` | The `subject_provider` value you passed. |
| `model_used` | `str \| None` | Actual model ID reported by the provider. May differ if the provider resolves an alias. |
| `model_version` | `str \| None` | Provider version string. Populated by Gemini; `None` for Anthropic. |
| `request_id` | `str \| None` | Provider request ID for log correlation. |
Token usage
| Field | Type | Description |
|---|---|---|
| `input_tokens` | `int` | Tokens in the prompt. |
| `output_tokens` | `int` | Tokens in the response. |
| `cached_input_tokens` | `int` | Input tokens served from cache. `0` when not used. |
| `cache_creation_tokens` | `int` | Tokens written to cache (Anthropic only). |
| `thoughts_tokens` | `int` | Reasoning tokens (Gemini thinking models only). |
Cost
| Field | Type | Description |
|---|---|---|
| `cost_usd` | `float \| None` | Total cost in USD. `None` if the model is not in the pricing table. |

```python
run.to_dict()  # all fields as a plain dict; safe to log or serialise
```
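One use for the plain dict is an append-only JSON Lines log, which makes it easy to compare runs over time. This is a sketch using only the standard library; `write_jsonl` is a hypothetical helper, and `default=str` is a defensive assumption in case a field is not directly JSON-serialisable.

```python
import json
from typing import Any, TextIO


def write_jsonl(record: dict[str, Any], stream: TextIO) -> None:
    """Write one record as a single JSON line; unknown types become strings."""
    stream.write(json.dumps(record, default=str, sort_keys=True) + "\n")
```

Inside a test this could look like `write_jsonl(run.to_dict(), log_file)`, where `log_file` is a file you opened in append mode.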
Judge verdicts — ask(), ask_all(), ask_parallel()
```python
# One question, one API call
verdict = run.ask("Is Alice mentioned?")
assert verdict.passed, verdict.reasoning

# Multiple questions in one API call (cheapest, slight cross-contamination risk)
alice, budget = run.ask_all([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed, alice.reasoning
assert budget.passed, budget.reasoning

# Multiple questions fired in parallel: one API call each, concurrently
alice, budget = run.ask_parallel([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed, alice.reasoning
assert budget.passed, budget.reasoning
```
| | `ask(q)` | `ask_all(qs)` | `ask_parallel(qs)` |
|---|---|---|---|
| API calls | 1 | 1 for all | 1 per question, concurrent |
| Wall-clock time | n/a | fastest | same as 1 call |
| Cost | n/a | lowest | higher (N calls) |
| Cross-contamination | none | low | none |
| Return type | `JudgeVerdict` | `list[JudgeVerdict]` | `list[JudgeVerdict]` |
When to use which:
- `ask_all`: many independent checks where cost matters and cross-contamination is acceptable.
- `ask_parallel`: questions that must be fully isolated from each other, without the latency of sequential calls.
- `ask`: a single question, or when you need to branch on the result before asking the next.
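The branching case for `ask` can be sketched as a plain function. It relies only on what is documented here, namely that `run.ask(question)` returns a verdict with a `passed` attribute; the function name and the two questions are illustrative.

```python
def check_budget(run):
    """Ask a follow-up question only if the first verdict passed.

    `run` is any object with an ask(question) method whose return value
    has a `passed` attribute, as described for PromptRun.
    """
    verdicts = [run.ask("Is a budget mentioned in the output?")]
    if verdicts[0].passed:
        # Only spend a second judge call when the first answer was "yes"
        verdicts.append(run.ask("Is the budget stated as $2M?"))
    return verdicts
```

This saves a judge call whenever the first answer already settles the outcome, which neither `ask_all` nor `ask_parallel` can do.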
JudgeVerdict fields
| Field | Type | Description |
|---|---|---|
| `question` | `str` | The question you asked. |
| `answer` | `str` | `"yes"` or `"no"`. |
| `passed` | `bool` | `True` when the answer is `"yes"`. |
| `snippet` | `str \| None` | Verbatim excerpt the judge cited as evidence. |
| `reasoning` | `str \| None` | The judge's stated reasoning. |
| `judge_model` | `str` | Model ID of the judge. |
| `judge_provider` | `str \| None` | Provider of the judge. |
| `judge_input_tokens` | `int` | Input tokens for this verdict. |
| `judge_output_tokens` | `int` | Output tokens for this verdict. |
| `judge_cost_usd` | `float \| None` | Cost for this verdict. |

```python
verdict.to_dict()  # all fields as a plain dict
```
Judge configuration
Call configure() once before your tests run. Both parameters are required — there is no default judge:
```python
import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model=Model.GEMINI_2_5_FLASH,
    judge_provider=Provider.GOOGLE,
)
```
Cost tracking
Token counts and USD cost are available on every PromptRun:
```python
run.input_tokens, run.output_tokens, run.cost_usd   # prompt call
verdict.judge_input_tokens, verdict.judge_cost_usd  # judge call
```
Prices are defined in prompt_tester/config.py in the _PRICES table, keyed by Model enum. To add a model not in the table, add a Model member and a corresponding _PRICES entry.
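To get one figure per test, the prompt cost and the judge costs can be added together. Since both `cost_usd` and `judge_cost_usd` may be `None`, the sketch below returns `None` whenever any component is unknown rather than under-reporting; `total_cost_usd` is a hypothetical helper, not part of the library.

```python
from typing import Iterable, Optional


def total_cost_usd(costs: Iterable[Optional[float]]) -> Optional[float]:
    """Sum costs; return None if any component is unknown (None)."""
    total = 0.0
    for cost in costs:
        if cost is None:
            return None
        total += cost
    return total
```

For a run with a list of verdicts, this could be called as `total_cost_usd([run.cost_usd, *(v.judge_cost_usd for v in verdicts)])`.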
Architecture
```
prompt_tester/
  __init__.py       Public API: prompt_run, PromptRun, JudgeVerdict, configure, reset
  config.py         Model/Provider enums, pricing table, configure() / reset()
  decorator.py      @prompt_run: runs prompt, injects PromptRun into test function
  models.py         Pydantic types: PromptRun, JudgeVerdict, UsageMetrics
  judge.py          Judge.evaluate() and evaluate_all(): LLM-as-judge
  llm/
    providers/
      base.py       LLMProvider ABC + CompletionResult (unified return type)
      anthropic.py  Anthropic SDK wrapper
      gemini.py     Google GenAI SDK wrapper
```