A lightweight, code-first evaluation framework for testing AI agents and LLM applications
Project description
EZVals
Unit Testing for AI agents and LLM apps. Write Python functions, use EvalContext to track results, and EZVals handles storage, scoring, and a small web UI.
Installation
EZVals is intended as a development dependency.
pip install ezvals
# or with uv
uv add --dev ezvals
Quick start
Look at the examples directory for runnable snippets. Run the demo suite and open the UI:
ezvals serve examples
UI highlights
- Expand rows to see inputs, outputs, metadata, scores, and annotations.
- Edit scores or annotations inline; changes persist to JSON.
- Export dropdown: JSON, CSV (raw data), PDF, Markdown (filtered view with charts).
Authoring evals
Write evals like tests. Add a ctx: EvalContext parameter, and EZVals auto-injects a mutable context object.
from ezvals import eval, EvalContext
@eval(input="I want a refund", dataset="customer_service")
async def test_refund(ctx: EvalContext):
ctx.output = await run_agent(ctx.input)
assert "refund" in ctx.output.lower(), "Should acknowledge refund"
EvalContext
EvalContext is a mutable builder for constructing eval results. When your function has a parameter with type annotation : EvalContext, EZVals automatically injects an instance.
Key features:
- Auto-injection: Just add
ctx: EvalContextparameter - Direct assignment: Set
ctx.output,ctx.input,ctx.referencedirectly - Assertion-based scoring: Use
assertstatements like pytest - Auto-return: No explicit return needed
- Exception safety: Partial data preserved on errors
Direct field access:
ctx.input = "test input"
ctx.output = "model response"
ctx.reference = "expected output"
ctx.metadata["model"] = "gpt-4"
Scoring with assertions:
assert ctx.output is not None, "Got no output"
assert "expected" in ctx.output.lower(), "Missing expected content"
Manual scoring (when needed):
ctx.add_score(True, "Test passed") # Boolean
ctx.add_score(0.95, "High score", key="similarity") # Numeric
Writing your first eval
Set context fields in the decorator when possible:
from ezvals import eval, EvalContext
@eval(
input="I want a refund",
reference="I'll help you process your refund request.",
dataset="customer_service",
metadata={"model": "gpt-4"}
)
async def test_refund_request(ctx: EvalContext):
ctx.output = await run_agent(ctx.input)
assert ctx.output == ctx.reference
Common patterns
1) Assertions (preferred):
@eval(input="What is 2+2?", reference="4", dataset="math")
async def test_arithmetic(ctx: EvalContext):
ctx.output = await calculator(ctx.input)
assert ctx.output == ctx.reference
2) Multiple assertions:
@eval(input="Explain quantum computing", dataset="qa")
async def test_explanation(ctx: EvalContext):
ctx.output = await my_agent(ctx.input)
assert len(ctx.output) > 50, "Response too short"
assert "quantum" in ctx.output.lower(), "Should mention quantum"
3) Multiple named scores:
@eval(input="Classify this text", dataset="classification")
async def test_classifier(ctx: EvalContext):
result = await classifier(ctx.input)
ctx.output = result["label"]
ctx.add_score(result["confidence"] > 0.8, "High confidence", key="confidence")
ctx.add_score("positive" in result["label"], "Sentiment detected", key="sentiment")
@eval decorator
Wraps a function and records evaluation results.
Parameters:
input(any): Pre-populate ctx.inputreference(any): Pre-populate ctx.referencedataset(str): Groups related evals (defaults to filename)labels(list): Filtering tagsmetadata(dict): Pre-populate ctx.metadatadefault_score_key(str): Default key foradd_score()timeout(float): Maximum execution time in secondstarget(callable): Pre-hook that runs before the evalevaluators(list): Callables that add scores to a result
Examples:
# Minimal
@eval(input="test")
def test(ctx: EvalContext):
ctx.output = process(ctx.input)
assert ctx.output
# With timeout
@eval(input="complex task", timeout=5.0, dataset="performance")
async def test_with_timeout(ctx: EvalContext):
ctx.output = await slow_agent(ctx.input)
# Target hook to run your agent
def call_agent(ctx: EvalContext):
ctx.output = my_agent(ctx.input)
@eval(input="What is the weather?", target=call_agent, dataset="agent")
def test_with_target(ctx: EvalContext):
assert "weather" in ctx.output.lower()
File-level defaults
Set global properties for all tests in a file using ezvals_defaults:
ezvals_defaults = {
"dataset": "sentiment_analysis",
"labels": ["production", "nlp"],
"metadata": {"model": "gpt-4"}
}
@eval(input="I love this!")
def test_positive(ctx: EvalContext):
ctx.output = analyze(ctx.input)
assert ctx.output == "positive"
@eval(input="This is terrible", labels=["experimental"]) # Override labels
def test_negative(ctx: EvalContext):
ctx.output = analyze(ctx.input)
assert ctx.output == "negative"
Priority: Decorator parameters > File defaults > Built-in defaults
Cases (cases=)
Generate multiple evals from one function with the cases= argument on @eval.
Cases are list-of-dict overrides for the same fields you can pass to @eval (plus id).
@eval(
dataset="sentiment",
cases=[
{"input": "I love this!", "reference": "positive"},
{"input": "This is terrible", "reference": "negative"},
{"input": "It's okay I guess", "reference": "neutral"},
],
)
def test_sentiment(ctx: EvalContext):
ctx.output = analyze_sentiment(ctx.input)
assert ctx.output == ctx.reference
Custom case data:
@eval(
dataset="math",
cases=[
{"input": {"a": 2, "b": 3}, "reference": 5},
{"input": {"a": 4, "b": 7}, "reference": 28},
],
)
def test_calculator(ctx: EvalContext):
ctx.output = ctx.input["a"] + ctx.input["b"]
assert ctx.output == ctx.reference
Explicit grids:
@eval(
dataset="models",
cases=[
{"input": {"model": "gpt-4", "temperature": 0.0}},
{"input": {"model": "gpt-4", "temperature": 0.7}},
{"input": {"model": "gpt-4", "temperature": 1.0}},
{"input": {"model": "gpt-3.5", "temperature": 0.0}},
{"input": {"model": "gpt-3.5", "temperature": 0.7}},
{"input": {"model": "gpt-3.5", "temperature": 1.0}},
],
)
def test_model_grid(ctx: EvalContext):
ctx.output = run_model(ctx.input["model"], ctx.input["temperature"])
assert ctx.output is not None
Reference
EvalResult schema
EvalContext automatically builds an EvalResult when the evaluation completes. You can also return EvalResult directly:
from ezvals import EvalResult
@eval(dataset="test")
def test_direct():
return EvalResult(
input="...",
output="...",
reference="...", # optional
latency=0.123, # optional (auto-calculated if not provided)
metadata={"model": "gpt-4"}, # optional
run_data={"trace": [...]}, # optional
scores=[{"key": "exact", "passed": True}],
)
Score schema
{
"key": "metric_name", # required
"value": 0.95, # optional: numeric score
"passed": True, # optional: boolean pass/fail
"notes": "...", # optional: justification
}
Evaluators
Callables that add scores to results after execution:
def check_length(result):
return {"key": "length", "passed": len(result.output) > 50}
@eval(input="Explain recursion", evaluators=[check_length], dataset="qa")
async def test_response(ctx: EvalContext):
ctx.output = await my_agent(ctx.input)
CLI
# Run evals headlessly
ezvals run path/to/evals
# Run with web UI
ezvals serve path/to/evals
# Run specific function
ezvals run path/to/evals.py::function_name
Common flags:
-d, --dataset TEXT Filter by dataset(s)
-l, --label TEXT Filter by label(s)
-c, --concurrency INT Number of concurrent evals
--timeout FLOAT Global timeout in seconds
-v, --verbose Show stdout from eval functions
Run flags:
-o, --output FILE Save JSON summary
--visual Show progress dots and results table
--no-save Output JSON to stdout instead of saving
Serve flags:
--session TEXT Session name to group runs
--run-name TEXT Name for this run
--port INT Port (default 8000)
Sessions and runs
Group related eval runs together:
# Named session and run
ezvals serve examples --session model-upgrade --run-name baseline
# Auto-generated friendly names (e.g., "swift-falcon")
ezvals serve examples
Results are saved to .ezvals/runs/ with the pattern {run_name}_{timestamp}.json.
Agent Skill
EZVals includes a skill that teaches AI coding agents how to write and analyze evals.
Install from package (version-matched)
ezvals skills add
Install from marketplace (latest)
npx skills add camronh/evals-skill
Check installation
ezvals skills doctor
The skill installs to .claude/skills/evals/, .cursor/skills/evals/, etc., with symlinks ensuring all agents share the same source. Invoke with /evals in your AI coding agent.
Contributing
uv sync
uv run pytest -q
uv run ruff check ezvals tests
Demo:
uv run ezvals serve examples
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ezvals-0.1.3.tar.gz.
File metadata
- Download URL: ezvals-0.1.3.tar.gz
- Upload date:
- Size: 263.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3f08c9fd5064fb62cd0f916a31cbacb28139d2f4bf98a74cf7727a17a20aff72
|
|
| MD5 |
b455a5ccfdddcda59555d65ef1837b81
|
|
| BLAKE2b-256 |
59b77864c6e08ee880c1a4cefdfabc5c75cef2f43c5b1b3db19012db950dad68
|
File details
Details for the file ezvals-0.1.3-py3-none-any.whl.
File metadata
- Download URL: ezvals-0.1.3-py3-none-any.whl
- Upload date:
- Size: 275.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d15f0088c9cad160bc1995775894f253d3f31ce0c62aa10a556e55e8a95dd76c
|
|
| MD5 |
4ccc0b569173354a4b2e328423b358f8
|
|
| BLAKE2b-256 |
c1f13509fac5f6f2720487dc177fe33b735ce241459adeb35bae8c8328c52440
|