A lightweight, code-first evaluation framework for testing AI agents and LLM applications

EZVals

Unit testing for AI agents and LLM apps. Write Python functions, use EvalContext to track results, and EZVals handles storage, scoring, and a small web UI.

Installation

EZVals is intended as a development dependency.

pip install ezvals
# or with uv
uv add --dev ezvals

Quick start

Look at the examples directory for runnable snippets. Run the demo suite and open the UI:

ezvals serve examples

UI screenshot

UI highlights

Expand rows to see inputs, outputs, metadata, scores, and annotations.
Edit scores or annotations inline; changes persist to JSON.
Export dropdown: JSON and CSV (raw data); PDF and Markdown (filtered view with charts).

Authoring evals

Write evals like tests. Add a ctx: EvalContext parameter, and EZVals auto-injects a mutable context object.

from ezvals import eval, EvalContext

@eval(input="I want a refund", dataset="customer_service")
async def test_refund(ctx: EvalContext):
    ctx.output = await run_agent(ctx.input)
    assert "refund" in ctx.output.lower(), "Should acknowledge refund"
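The snippets in this README call run_agent without defining it; it stands for your own application code. To try the example locally, a hypothetical stand-in like the one below is enough. The function body and the canned reply are assumptions made for illustration, not part of EZVals.

```python
import asyncio


async def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent; returns a canned reply."""
    await asyncio.sleep(0)  # yield to the event loop once, like an I/O call would
    return f"I can help with your refund. You said: {prompt}"


# Call the stand-in directly to confirm it behaves as the eval expects.
reply = asyncio.run(run_agent("I want a refund"))
print(reply)
```

Swap the body for a call into your real agent once the eval wiring works.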

EvalContext

EvalContext is a mutable builder for constructing eval results. When your function has a parameter annotated with EvalContext (for example, ctx: EvalContext), EZVals automatically injects an instance.

Key features:

Auto-injection: Just add ctx: EvalContext parameter
Direct assignment: Set ctx.output, ctx.input, ctx.reference directly
Assertion-based scoring: Use assert statements like pytest
Auto-return: No explicit return needed
Exception safety: Partial data preserved on errors

Direct field access:

ctx.input = "test input"
ctx.output = "model response"
ctx.reference = "expected output"
ctx.metadata["model"] = "gpt-4"

Scoring with assertions:

assert ctx.output is not None, "Got no output"
assert "expected" in ctx.output.lower(), "Missing expected content"

Manual scoring (when needed):

ctx.add_score(True, "Test passed")  # Boolean
ctx.add_score(0.95, "High score", key="similarity")  # Numeric
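To make the "assertion-based scoring" and "exception safety" ideas concrete, here is a minimal sketch of how a runner could turn an assert into a pass/fail score while keeping fields that were set before the failure. ToyContext, run_toy_eval, and the "correctness" key are all hypothetical names for this illustration; this is not EZVals' actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyContext:
    """Illustrative stand-in for EvalContext; not the real class."""
    input: object = None
    output: object = None
    scores: list = field(default_factory=list)


def run_toy_eval(fn: Callable[[ToyContext], None], ctx: ToyContext) -> ToyContext:
    """Run fn and record a score based on whether an assertion failed."""
    try:
        fn(ctx)
    except AssertionError as exc:
        # The assertion message becomes the note; fields set earlier survive.
        ctx.scores.append({"key": "correctness", "passed": False, "notes": str(exc)})
    else:
        ctx.scores.append({"key": "correctness", "passed": True})
    return ctx


def check_refund(ctx: ToyContext) -> None:
    ctx.output = "Sorry, I cannot help."
    assert "refund" in ctx.output.lower(), "Should acknowledge refund"


result = run_toy_eval(check_refund, ToyContext(input="I want a refund"))
print(result.output)
print(result.scores)
```

The output assigned before the failing assert is still on the context, which is the behavior the "partial data preserved on errors" bullet describes.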

Writing your first eval

Set context fields in the decorator when possible:

from ezvals import eval, EvalContext

@eval(
    input="I want a refund",
    reference="I'll help you process your refund request.",
    dataset="customer_service",
    metadata={"model": "gpt-4"}
)
async def test_refund_request(ctx: EvalContext):
    ctx.output = await run_agent(ctx.input)
    assert ctx.output == ctx.reference

Common patterns

1) Assertions (preferred):

@eval(input="What is 2+2?", reference="4", dataset="math")
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await calculator(ctx.input)
    assert ctx.output == ctx.reference

2) Multiple assertions:

@eval(input="Explain quantum computing", dataset="qa")
async def test_explanation(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)

    assert len(ctx.output) > 50, "Response too short"
    assert "quantum" in ctx.output.lower(), "Should mention quantum"

3) Multiple named scores:

@eval(input="Classify this text", dataset="classification")
async def test_classifier(ctx: EvalContext):
    result = await classifier(ctx.input)
    ctx.output = result["label"]

    ctx.add_score(result["confidence"] > 0.8, "High confidence", key="confidence")
    ctx.add_score("positive" in result["label"], "Sentiment detected", key="sentiment")

`@eval` decorator

Wraps a function and records evaluation results.

Parameters:

input (any): Pre-populate ctx.input
reference (any): Pre-populate ctx.reference
dataset (str): Groups related evals (defaults to filename)
labels (list): Filtering tags
metadata (dict): Pre-populate ctx.metadata
default_score_key (str): Default key for add_score()
timeout (float): Maximum execution time in seconds
target (callable): Pre-hook that runs before the eval
evaluators (list): Callables that add scores to a result

Examples:

# Minimal
@eval(input="test")
def test(ctx: EvalContext):
    ctx.output = process(ctx.input)
    assert ctx.output

# With timeout
@eval(input="complex task", timeout=5.0, dataset="performance")
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)

# Target hook to run your agent
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)

@eval(input="What is the weather?", target=call_agent, dataset="agent")
def test_with_target(ctx: EvalContext):
    assert "weather" in ctx.output.lower()

File-level defaults

Set global properties for all tests in a file using ezvals_defaults:

ezvals_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "metadata": {"model": "gpt-4"}
}

@eval(input="I love this!")
def test_positive(ctx: EvalContext):
    ctx.output = analyze(ctx.input)
    assert ctx.output == "positive"

@eval(input="This is terrible", labels=["experimental"])  # Override labels
def test_negative(ctx: EvalContext):
    ctx.output = analyze(ctx.input)
    assert ctx.output == "negative"

Priority: Decorator parameters > File defaults > Built-in defaults
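The priority order can be pictured as a layered dictionary merge in which later layers win. The sketch below is only an illustration: the built-in default values are invented, and it assumes values are replaced wholesale (as the "Override labels" comment suggests) rather than merged key by key, which the README does not specify for metadata.

```python
# Assumed built-in defaults, for illustration only.
BUILT_IN_DEFAULTS = {"dataset": "my_evals", "labels": [], "metadata": {}}


def resolve_settings(file_defaults: dict, decorator_kwargs: dict) -> dict:
    """Later dicts win: decorator > file defaults > built-in defaults."""
    return {**BUILT_IN_DEFAULTS, **file_defaults, **decorator_kwargs}


file_defaults = {"dataset": "sentiment_analysis", "labels": ["production", "nlp"]}

# The decorator passes only labels, so dataset falls back to the file default
# and metadata falls back to the built-in default.
settings = resolve_settings(file_defaults, {"labels": ["experimental"]})
print(settings)
```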

`@parametrize`

Generate multiple evals from one function. Place @eval above @parametrize.

When parameter names match EvalContext fields (input, reference, metadata, etc.), they automatically populate the context:

from ezvals import parametrize

@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay I guess", "neutral"),
])
def test_sentiment(ctx: EvalContext):
    ctx.output = analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference
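One way to picture the name matching: each row is zipped with the comma-separated names, and names that match context fields go to the context while the rest become function arguments. The helper below is a hypothetical sketch of that idea, using only the three field names mentioned above; EZVals' real logic may differ.

```python
# Subset of context field names mentioned in this README (assumption: not exhaustive).
CONTEXT_FIELDS = {"input", "reference", "metadata"}


def split_case(argnames: str, row: tuple) -> tuple[dict, dict]:
    """Split one parametrize row into context values and function arguments."""
    names = [name.strip() for name in argnames.split(",")]
    values = dict(zip(names, row))
    context_values = {k: v for k, v in values.items() if k in CONTEXT_FIELDS}
    function_args = {k: v for k, v in values.items() if k not in CONTEXT_FIELDS}
    return context_values, function_args


# Names that match context fields populate the context.
ctx_values, fn_args = split_case("input,reference", ("I love this!", "positive"))
print(ctx_values, fn_args)

# Custom names are passed to the function instead.
custom_ctx, custom_args = split_case("a,b,expected", (2, 3, 5))
print(custom_ctx, custom_args)
```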

Custom parameters:

@eval(dataset="math")
@parametrize("a,b,expected", [
    (2, 3, 5),
    (4, 7, 11),
])
def test_calculator(ctx: EvalContext, a, b, expected):
    ctx.input = {"a": a, "b": b}
    ctx.output = a + b
    assert ctx.output == expected

Cartesian product (stacked parametrize):

@eval(dataset="models")
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("temperature", [0.0, 0.7, 1.0])
def test_model_grid(ctx: EvalContext, model, temperature):
    ctx.input = {"model": model, "temperature": temperature}
    ctx.output = run_model(model, temperature)
    assert ctx.output is not None
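Stacking two decorators yields every combination of the two lists, so the grid above produces 2 × 3 = 6 evals. The standard-library sketch below enumerates the same combinations; the order in which EZVals generates them is an assumption here.

```python
from itertools import product

models = ["gpt-4", "gpt-3.5"]
temperatures = [0.0, 0.7, 1.0]

# One case per (model, temperature) pair.
cases = [{"model": m, "temperature": t} for m, t in product(models, temperatures)]

for case in cases:
    print(case)
print(len(cases))
```

Keep an eye on the total when stacking: each additional decorator multiplies the number of evals.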

Reference

EvalResult schema

EvalContext automatically builds an EvalResult when the evaluation completes. You can also return an EvalResult directly:

from ezvals import EvalResult

@eval(dataset="test")
def test_direct():
    return EvalResult(
        input="...",
        output="...",
        reference="...",      # optional
        latency=0.123,        # optional (auto-calculated if not provided)
        metadata={"model": "gpt-4"},  # optional
        run_data={"trace": [...]},     # optional
        scores=[{"key": "exact", "passed": True}],
    )

Score schema

{
    "key": "metric_name",    # required
    "value": 0.95,           # optional: numeric score
    "passed": True,          # optional: boolean pass/fail
    "notes": "...",          # optional: justification
}
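If you build score dicts by hand, a small local check can catch shape mistakes before a run. The validate_score helper below is hypothetical and not part of EZVals; it only encodes the rules listed in the schema above (key required, value numeric, passed boolean).

```python
def validate_score(score: dict) -> dict:
    """Check a score dict against the schema above; return it unchanged if valid."""
    key = score.get("key")
    if not isinstance(key, str) or not key:
        raise ValueError("score needs a non-empty string 'key'")
    if "value" in score:
        value = score["value"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("'value' must be numeric")
    if "passed" in score and not isinstance(score["passed"], bool):
        raise ValueError("'passed' must be a boolean")
    return score


good = {"key": "similarity", "value": 0.95, "notes": "High score"}
print(validate_score(good))
```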

Evaluators

Callables that add scores to results after execution:

def check_length(result):
    return {"key": "length", "passed": len(result.output) > 50}

@eval(input="Explain recursion", evaluators=[check_length], dataset="qa")
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
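Because an evaluator is a plain callable that reads the result and returns a score dict, it can be tried out without running a full eval. The sketch below returns both a numeric value and a pass/fail flag; keyword_coverage, its keyword list, and the 0.5 threshold are made up for illustration, and a SimpleNamespace stands in for the real result object.

```python
from types import SimpleNamespace


def keyword_coverage(result, keywords=("base case", "calls itself")):
    """Hypothetical evaluator: fraction of expected keywords found in the output."""
    text = (result.output or "").lower()
    hits = sum(1 for kw in keywords if kw in text)
    value = hits / len(keywords)
    return {
        "key": "keyword_coverage",
        "value": value,
        "passed": value >= 0.5,
        "notes": f"{hits}/{len(keywords)} keywords found",
    }


# Stand-in for a result; only the output attribute is needed here.
fake_result = SimpleNamespace(
    output="A recursive function calls itself until it reaches a base case."
)
score = keyword_coverage(fake_result)
print(score)
```

Pass it the same way as check_length, for example evaluators=[check_length, keyword_coverage].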

CLI

# Run evals headlessly
ezvals run path/to/evals

# Run with web UI
ezvals serve path/to/evals

# Run specific function
ezvals run path/to/evals.py::function_name

Common flags:

-d, --dataset TEXT      Filter by dataset(s)
-l, --label TEXT        Filter by label(s)
-c, --concurrency INT   Number of concurrent evals
--timeout FLOAT         Global timeout in seconds
-v, --verbose           Show stdout from eval functions

Run flags:

-o, --output FILE       Save JSON summary
--visual                Show progress dots and results table
--no-save               Output JSON to stdout instead of saving

Serve flags:

--session TEXT          Session name to group runs
--run-name TEXT         Name for this run
--port INT              Port (default 8000)

Sessions and runs

Group related eval runs together:

# Named session and run
ezvals serve examples --session model-upgrade --run-name baseline

# Auto-generated friendly names (e.g., "swift-falcon")
ezvals serve examples

Results are saved to .ezvals/runs/ with the pattern {run_name}_{timestamp}.json.
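Since runs are plain JSON files, they can be read back with the standard library. The load_latest_run helper below is a hypothetical sketch that relies only on the directory and filename pattern stated above; it picks the newest file by modification time and makes no assumption about the JSON's internal structure.

```python
import json
from pathlib import Path


def load_latest_run(run_name: str, runs_dir: Path = Path(".ezvals/runs")):
    """Return the parsed JSON of the newest file for run_name, or None if none exist."""
    candidates = sorted(
        runs_dir.glob(f"{run_name}_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text(encoding="utf-8"))


latest = load_latest_run("baseline")
print("no saved run found" if latest is None else type(latest).__name__)
```

Note that the glob also matches run names that merely start with the same prefix (for example baseline_v2), so choose distinct run names if you rely on this.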

Contributing

uv sync
uv run pytest -q
uv run ruff check ezvals tests

Demo:

uv run ezvals serve examples
