A lightweight, code-first evaluation framework for testing AI agents and LLM applications

Twevals

Lightweight evals for AI agents and LLM apps. Write Python functions, use EvalContext to track results, and Twevals handles storage, scoring, and a small web UI.

Installation

Twevals is intended as a development dependency.

pip install twevals
# or with uv
uv add --dev twevals

Quick start

Look at the examples directory for runnable snippets. Run the demo suite and open the UI:

twevals examples --serve

UI screenshot

UI highlights

  • Expand rows to see inputs, outputs, metadata, scores, and annotations.
  • Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
  • Actions menu: refresh, rerun the suite, export JSON/CSV.

Common flags: --dataset, --label, -c/--concurrency, -q/--quiet, -v/--verbose. Serve-specific: --serve, --dev, --host, --port.

Authoring evals

Write evals like tests. Add a ctx: EvalContext parameter, and Twevals auto-injects a mutable context object for building your evaluation.

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    dataset="customer_service",
    default_score_key="correctness"
)
async def test_refund(ctx: EvalContext):
    # ctx.input already set from decorator
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == "expected refund response", "Validation")
    # No return needed - decorator auto-returns!

EvalContext

EvalContext is a mutable builder that makes writing evals clean and intuitive. When your function has a ctx, context, or carrier parameter, Twevals automatically injects an EvalContext instance.

Key features:

  • Auto-injection: Just add ctx: EvalContext parameter
  • Smart methods: add_output(), add_score(), set_params()
  • Auto-return: No explicit return needed
  • IDE support: Full type hints and autocomplete
  • Incremental building: Set fields as you get them
  • Exception safety: Partial data preserved on errors
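
To picture auto-injection, a runner could inspect the function signature and pass a fresh context whenever it finds one of the recognized parameter names. The sketch below is only an illustration under that assumption and does not use Twevals internals; FakeContext and call_with_context are hypothetical names.

```python
import inspect

# Parameter names that trigger injection, per the description above.
INJECTED_NAMES = ("ctx", "context", "carrier")


class FakeContext:
    """Hypothetical stand-in for EvalContext; not the real class."""

    def __init__(self):
        self.input = None
        self.output = None


def call_with_context(func):
    """Call func, injecting a FakeContext if it declares a known parameter."""
    params = inspect.signature(func).parameters
    for name in INJECTED_NAMES:
        if name in params:
            ctx = FakeContext()
            func(**{name: ctx})
            return ctx
    # No recognized parameter: call the function unchanged.
    func()
    return None


def sample_eval(ctx):
    ctx.input = "hello"
    ctx.output = ctx.input.upper()


result = call_with_context(sample_eval)
print(result.output)
```

Matching on the parameter name rather than the annotation is a design choice in this sketch; the real library may also consult type hints.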

Core methods:

# Smart output extraction
ctx.add_output({"output": "result", "latency": 0.5, "run_data": {...}})
# Or simple value
ctx.add_output("simple output")

# Flexible scoring
ctx.add_score(True, "Test passed")  # Boolean with default key
ctx.add_score(0.95, "High score", key="similarity")  # Numeric with custom key
ctx.add_score(key="detailed", passed=True, value=0.98, notes="...")  # Full control

# Note: add_score() is optional! If you never call it, the test automatically
# passes with the default score key. Just like pytest - if your test runs
# through without errors, it passes.

# Helper for parametrize
ctx.set_params(model="gpt-4", temperature=0.7)  # Sets both input and metadata
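
The three add_score() call shapes above can be read as one rule: a boolean sets passed, a number sets value, and keyword arguments override either. A minimal sketch of that rule follows; make_score and its default_key value are hypothetical and are not part of the Twevals API.

```python
def make_score(value=None, notes=None, *, key=None, passed=None,
               default_key="correctness"):
    """Build a score dict: a bool sets 'passed', a number sets 'value'."""
    score = {"key": key or default_key}
    if isinstance(value, bool):
        # Check bool first, because bool is a subclass of int in Python.
        score["passed"] = value
    elif value is not None:
        score["value"] = value
    if passed is not None:
        score["passed"] = passed
    if notes is not None:
        score["notes"] = notes
    return score


boolean_score = make_score(True, "Test passed")
numeric_score = make_score(0.95, "High score", key="similarity")
full_score = make_score(key="detailed", passed=True, value=0.98, notes="...")
```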

Direct field access:

ctx.input = "test input"
ctx.reference = "expected output"
ctx.metadata = {"model": "gpt-4"}
# ... and more: output, latency, run_data, error

Writing your first eval

The cleanest pattern sets everything you can in the decorator:

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    reference="I'll help you process your refund request.",
    dataset="customer_service",
    default_score_key="correctness",
    metadata={"model": "gpt-4", "version": "1.0"}
)
async def test_refund_request(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Output validation")

Common patterns

1) Set input in function (more dynamic):

@eval(dataset="greetings", default_score_key="politeness")
async def test_greeting(ctx: EvalContext):
    ctx.input = "Hello there"
    ctx.reference = fetch_expected_greeting()

    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Match check")

2) Smart field extraction:

@eval(dataset="qa", default_score_key="accuracy")
async def test_question(ctx: EvalContext):
    ctx.input = "What is the capital of France?"
    ctx.reference = "Paris"

    # Extracts output, latency, run_data, metadata from dict
    ctx.add_output(await run_agent(ctx.input))

    ctx.add_score(ctx.reference.lower() in ctx.output.lower(), "Contains answer")

3) Multiple scores:

@eval(dataset="qa", default_score_key="exact_match")
async def test_multi_score(ctx: EvalContext):
    ctx.input = "What is 2+2?"
    ctx.reference = "4"
    ctx.add_output(await run_agent(ctx.input))

    # Boolean score with default key (exact_match)
    ctx.add_score(ctx.output == ctx.reference, "Exact match")

    # Numeric score with custom key
    similarity = calculate_similarity(ctx.output, ctx.reference)
    ctx.add_score(similarity, "Similarity score", key="similarity")

    # Full control
    ctx.add_score(
        key="confidence",
        value=0.95,
        passed=True,
        notes="High confidence prediction"
    )

4) Explicit return (optional):

@eval(dataset="test")
async def test_explicit(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("output")
    ctx.add_score(True, "Passed", key="test")
    return ctx  # Optional - decorator auto-converts to EvalResult

@eval decorator

Wraps a function and records evaluation results.

Parameters:

  • dataset (str, optional): Groups related evals (defaults to filename)
  • labels (list, optional): Filtering tags
  • evaluators (list, optional): Callables that add scores to a result
  • target (callable, optional): Pre-hook that runs before the eval, populating the EvalContext
  • input (any, optional): Pre-populate ctx.input
  • reference (any, optional): Pre-populate ctx.reference
  • default_score_key (str, optional): Default key for add_score()
  • metadata (dict, optional): Pre-populate ctx.metadata
  • metadata_from_params (list, optional): Auto-extract params to metadata
  • timeout (float, optional): Maximum execution time in seconds for the evaluation
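
The timeout parameter can be pictured with asyncio.wait_for from the standard library. This is a sketch of the general technique, not the mechanism Twevals necessarily uses; run_with_timeout and the returned dict shape are hypothetical.

```python
import asyncio


async def run_with_timeout(coro_func, timeout):
    """Run an async eval body; report an error if it exceeds the timeout."""
    try:
        output = await asyncio.wait_for(coro_func(), timeout=timeout)
        return {"output": output, "error": None}
    except asyncio.TimeoutError:
        return {"output": None, "error": f"timed out after {timeout}s"}


async def fast():
    await asyncio.sleep(0.01)
    return "done"


async def slow():
    await asyncio.sleep(1.0)
    return "never returned"


fast_result = asyncio.run(run_with_timeout(fast, timeout=0.5))
slow_result = asyncio.run(run_with_timeout(slow, timeout=0.05))
```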

Examples:

# Minimal
@eval()
def test(ctx: EvalContext):
    ...

# With defaults
@eval(
    dataset="my_tests",
    default_score_key="correctness",
    metadata={"version": "1.0"}
)
def test(ctx: EvalContext):
    ...

# Pre-populated input/reference
@eval(
    input="test input",
    reference="expected",
    dataset="static_tests"
)
def test(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ...

# Target hook to run your agent and inject results
def call_agent(ctx: EvalContext):
    # Use any attributes you like on the context
    ctx.trace_id = "abc123"
    ctx.add_output(my_agent(ctx.input), metadata={"trace_id": ctx.trace_id})

@eval(
    target=call_agent,
    input="What is the weather?",
    dataset="agent_calls",
)
def test_with_target(ctx: EvalContext):
    # ctx.output comes from the target hook, ctx.trace_id is preserved
    ctx.add_score("weather" in ctx.output.lower(), notes="Contains answer")
    return ctx.build()

# With timeout to prevent long-running evals
@eval(
    input="complex task",
    timeout=5.0,  # Fails if execution exceeds 5 seconds
    dataset="performance"
)
async def test_with_timeout(ctx: EvalContext):
    ctx.add_output(await slow_agent(ctx.input))
    ctx.add_score(ctx.output is not None, "Completed in time")

If your target returns a value, it is treated as ctx.output by default (dicts are passed to ctx.add_output()).
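
A rough model of that rule is sketched below. SketchContext and run_target are hypothetical stand-ins, and the dict handling is an assumption about how add_output() might unpack fields, not the library's actual logic.

```python
class SketchContext:
    """Hypothetical stand-in for EvalContext with only the fields used here."""

    def __init__(self, input=None):
        self.input = input
        self.output = None
        self.latency = None

    def add_output(self, value):
        # Assumption: dicts with an "output" key are unpacked into fields;
        # anything else is stored as the output itself.
        if isinstance(value, dict) and "output" in value:
            self.output = value["output"]
            self.latency = value.get("latency", self.latency)
        else:
            self.output = value


def run_target(target, ctx):
    """Run the pre-hook, then fold any return value into the context."""
    returned = target(ctx)
    if returned is not None:
        ctx.add_output(returned)
    return ctx


def target_plain(ctx):
    return f"echo: {ctx.input}"


def target_dict(ctx):
    return {"output": "sunny", "latency": 0.2}


plain = run_target(target_plain, SketchContext(input="hi"))
structured = run_target(target_dict, SketchContext(input="weather?"))
```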

File-level defaults

Set defaults for every eval in a file with a module-level twevals_defaults dict (similar to pytest's pytestmark):

# Set defaults at the top of your file
twevals_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "default_score_key": "accuracy",
    "metadata": {"model": "gpt-4", "version": "v1.0"}
}

@eval  # Inherits all defaults
def test_positive():
    ...

@eval(labels=["experimental"])  # Override just labels
def test_edge_case():
    ...

Priority: Decorator parameters > File defaults > Built-in defaults

Supported parameters: All @eval decorator parameters including dataset, labels, evaluators, target, input, reference, default_score_key, metadata, and metadata_from_params.

Deep merge: When both file and decorator specify metadata, they are merged (decorator values win on conflicts).
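
The priority and merge rules can be sketched in a few lines. BUILTIN_DEFAULTS, deep_merge, and resolve_options are hypothetical names chosen for illustration; the real built-in defaults are not documented here.

```python
# Hypothetical built-in defaults; the real values may differ.
BUILTIN_DEFAULTS = {
    "dataset": None,
    "labels": [],
    "default_score_key": None,
    "metadata": {},
}


def deep_merge(base, override):
    """Merge two dicts recursively; values from override win on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_options(file_defaults, decorator_kwargs):
    """Apply decorator > file > built-in priority, merging metadata."""
    resolved = {**BUILTIN_DEFAULTS, **file_defaults, **decorator_kwargs}
    resolved["metadata"] = deep_merge(
        file_defaults.get("metadata", {}),
        decorator_kwargs.get("metadata", {}),
    )
    return resolved


file_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "metadata": {"model": "gpt-4", "version": "v1.0"},
}
options = resolve_options(
    file_defaults,
    {"labels": ["experimental"], "metadata": {"version": "v2.0"}},
)
```

In this sketch labels are replaced outright while metadata is merged key by key, which matches the override example above.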

@parametrize

Generate multiple evals from one function. Place @eval above @parametrize.

Auto-mapping magic:

When parameter names match EvalContext fields (input, reference, metadata, etc.), they automatically populate the context:

from twevals import eval, EvalContext, parametrize

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay I guess", "neutral"),
])
def test_sentiment(ctx: EvalContext):
    # ctx.input and ctx.reference auto-populated! ✨

    detected = analyze_sentiment(ctx.input)
    ctx.add_output(detected)
    ctx.add_score(ctx.output == ctx.reference, f"Detected: {detected}")

# Parametrize + targets: param sets are available to the target via ctx.input/ctx.metadata
def call_agent(ctx: EvalContext):
    ctx.add_output(my_agent(ctx.input["prompt"]))

@eval(target=call_agent)
@parametrize("prompt", ["hello", "world"])
def test_prompt(ctx: EvalContext):
    assert "prompt" in ctx.input  # set before target runs
    return ctx.build()

Custom parameters:

@eval(dataset="math", default_score_key="correctness")
@parametrize("operation,a,b,expected", [
    ("add", 2, 3, 5),
    ("multiply", 4, 7, 28),
])
def test_calculator(ctx: EvalContext, operation, a, b, expected):
    ctx.input = {"operation": operation, "a": a, "b": b}
    ctx.reference = expected

    ops = {"add": lambda x, y: x + y, "multiply": lambda x, y: x * y}
    result = ops[operation](a, b)

    ctx.add_output(result)
    ctx.add_score(result == expected, f"{a} {operation} {b} = {result}")

Common patterns:

# 1) Single parameter with IDs
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_threshold(ctx: EvalContext, threshold):
    ctx.input = threshold
    ctx.add_output(evaluate(threshold))
    ctx.add_score(ctx.output > threshold, "Above threshold")

# 2) Cartesian product (stacked parametrize)
@eval(dataset="models", default_score_key="quality")
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("temperature", [0.0, 0.7, 1.0])
def test_model_grid(ctx: EvalContext, model, temperature):
    ctx.set_params(model=model, temperature=temperature)  # Sets input and metadata
    ctx.add_output(run_model(model, temperature))
    ctx.add_score(score_output(ctx.output), f"Model: {model}")

# 3) Dictionaries for named arguments
@eval(dataset="auth")
@parametrize("username,password,should_succeed", [
    {"username": "alice", "password": "correct", "should_succeed": True},
    {"username": "alice", "password": "wrong", "should_succeed": False},
])
def test_login(ctx: EvalContext, username, password, should_succeed):
    ctx.input = {"username": username}
    result = login(username, password)
    ctx.add_output(result)
    ctx.add_score(result.success == should_succeed, "Login check", key="auth")

Notes:

  • Accepts tuples, dicts, or single values
  • Works with sync or async functions
  • Put @eval above @parametrize
  • Parameter names matching input, reference, etc. auto-populate context
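
The expansion behind these notes can be approximated with itertools.product. The helpers below are hypothetical and only illustrate how rows and stacked parameter sets might turn into keyword arguments; the ordering Twevals uses for stacked decorators is not specified here.

```python
from itertools import product


def split_names(argnames, row):
    """Map one parametrize row (tuple, dict, or single value) to kwargs."""
    names = [name.strip() for name in argnames.split(",")]
    if isinstance(row, dict):
        return {name: row[name] for name in names}
    if len(names) == 1:
        return {names[0]: row}
    return dict(zip(names, row))


def expand_grid(*param_sets):
    """Combine stacked (name, values) pairs into one dict per combination."""
    names = [name for name, _ in param_sets]
    value_lists = [values for _, values in param_sets]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


tuple_row = split_names("input,reference", ("I love this!", "positive"))
single_row = split_names("threshold", 0.2)
dict_row = split_names(
    "username,password", {"username": "alice", "password": "wrong"}
)
grid = expand_grid(
    ("model", ["gpt-4", "gpt-3.5"]),
    ("temperature", [0.0, 0.7, 1.0]),
)
```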

See more patterns in examples/new_demo.py.

Advanced patterns

Assertion preservation

Assertions are treated as validation failures and create failing scores:

@eval(dataset="validation", default_score_key="correctness")
async def test_with_assertion(ctx: EvalContext):
    ctx.input = "test"
    ctx.reference = "expected"
    ctx.metadata = {"model": "gpt-4"}

    ctx.add_output(await run_agent(ctx.input))

    # If this fails, a failing score is added with the assertion message
    # All data (input/output/reference/metadata) is preserved
    assert ctx.output == ctx.reference, "Output mismatch"

    ctx.add_score(True, "All checks passed")
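
A runner could implement this behavior by catching AssertionError around the eval body. The sketch below assumes that approach; run_and_capture and the plain-dict record are hypothetical and are not Twevals code.

```python
def run_and_capture(eval_body, default_score_key="correctness"):
    """Run eval_body(record); turn an AssertionError into a failing score."""
    record = {"input": None, "output": None, "scores": []}
    try:
        eval_body(record)
    except AssertionError as exc:
        # Fields written before the assertion stay in the record.
        record["scores"].append(
            {"key": default_score_key, "passed": False, "notes": str(exc)}
        )
    else:
        # Mirrors the rule that an eval with no scores passes by default.
        if not record["scores"]:
            record["scores"].append({"key": default_score_key, "passed": True})
    return record


def failing_body(record):
    record["input"] = "test"
    record["output"] = "actual"
    assert record["output"] == "expected", "Output mismatch"


def passing_body(record):
    record["input"] = "test"
    record["output"] = "expected"


failed = run_and_capture(failing_body)
passed = run_and_capture(passing_body)
```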

Context manager pattern

For explicit control:

@eval(dataset="test")
async def test_with_context_manager():
    with EvalContext(input="test", default_score_key="accuracy") as ctx:
        ctx.add_output(await run_agent(ctx.input))
        ctx.add_score(True, "Passed")
        return ctx  # Explicit return

Ultra-minimal pattern

The shortest useful eval has a two-line body:

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [("I love this!", "positive"), ("Terrible!", "negative")])
def test(ctx: EvalContext):
    ctx.add_output(analyze(ctx.input))
    ctx.add_score(ctx.output == ctx.reference)

Reference

EvalResult schema

EvalContext automatically builds an EvalResult object when the evaluation completes. You can also return EvalResult directly if you prefer:

from twevals import EvalResult

@eval(dataset="test")
def test_direct():
    return EvalResult(
        input="...",          # required: test input
        output="...",         # required: system output
        reference="...",      # optional: expected output
        error=None,           # optional: error message
        latency=0.123,        # optional: execution time (auto-calculated if not provided)
        metadata={"model": "gpt-4"},  # optional: metadata for filtering
        run_data={"trace": [...]},     # optional: debug data
        scores={"key": "exact", "passed": True},  # scores dict or list
    )

Score schema

{
    "key": "metric_name",    # required: Name of the metric
    "value": 0.95,           # optional: Numeric score
    "passed": True,          # optional: Boolean pass/fail
    "notes": "...",          # optional: Justification
}
# Must provide at least one of: value or passed

scores accepts a single dict, a list of dicts, or a list of Score objects.
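
The schema rules can be expressed as a small validator. This sketch handles dicts only (Score objects are omitted), and normalize_scores is a hypothetical helper, not a Twevals function.

```python
def normalize_scores(scores):
    """Accept one score dict or a list of them; return a validated list."""
    items = [scores] if isinstance(scores, dict) else list(scores)
    for score in items:
        if "key" not in score:
            raise ValueError("score is missing required 'key'")
        if "value" not in score and "passed" not in score:
            raise ValueError(f"score {score['key']!r} needs 'value' or 'passed'")
    return items


single = normalize_scores({"key": "exact", "passed": True})
several = normalize_scores([
    {"key": "exact", "passed": True},
    {"key": "similarity", "value": 0.95, "notes": "close match"},
])

# A score with neither 'value' nor 'passed' is rejected.
try:
    normalize_scores({"key": "incomplete"})
except ValueError as exc:
    error_message = str(exc)
```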

Evaluators

Callables that add scores to results after execution:

def custom_evaluator(result):
    """Returns Score object, dict, or list of either"""
    # Guard against a missing reference and compare case-insensitively
    reference = str(result.reference or "").lower()
    passed = bool(reference) and reference in str(result.output).lower()
    return {"key": "contains_ref", "passed": passed}

@eval(dataset="test", evaluators=[custom_evaluator])
def test_with_evaluator(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("test output")
    # custom_evaluator runs after, adds score

Headless runs

Skip the UI and save results to disk:

twevals path/to/evals
# Run specific function: twevals path/to/evals.py::function_name
# Run parametrized variant: twevals path/to/evals.py::function_name[param_id]
# Filtering and other common flags work here as well

Run-only flags: -o/--output (save JSON summary), --csv (save CSV), --json (output compact JSON to stdout), --list (list evaluations without running), --limit (limit number of evals).
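
The path::function_name[param_id] selector format shown above can be split with plain string operations. parse_selector is a hypothetical helper written for illustration; it is not how the Twevals CLI is known to parse its arguments.

```python
def parse_selector(selector):
    """Split 'path::function[param_id]' into its three optional parts."""
    path, sep, rest = selector.partition("::")
    if not sep:
        return {"path": path, "function": None, "param_id": None}
    function, bracket, tail = rest.partition("[")
    # Keep the text between the brackets, if a bracketed id is present.
    param_id = tail[:-1] if bracket and tail.endswith("]") else None
    return {"path": path, "function": function, "param_id": param_id}


whole_dir = parse_selector("path/to/evals")
one_function = parse_selector("path/to/evals.py::function_name")
one_variant = parse_selector("path/to/evals.py::function_name[param_id]")
```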

CLI reference

twevals <path>                  # run evals (default behavior)
twevals <path> --serve          # run evals and launch web UI
twevals <path>::<function>      # run specific function (e.g., tests.py::my_eval)

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  --timeout FLOAT         Global timeout in seconds (overrides individual test timeouts)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

Run flags:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
  --json                  Output compact JSON to stdout (machine-readable)
  --list                  List evaluations without running
  --limit INT             Limit number of evaluations to run

Serve flags (use with --serve):
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

Contributing

uv sync
uv run pytest -q
uv run ruff check twevals tests
uv run black .

Helpful demo:

uv run twevals examples --serve
