A lightweight, code-first evaluation framework for testing AI agents and LLM applications

Twevals

Lightweight evals for AI agents and LLM apps. Write Python functions, use EvalContext to track results, and Twevals handles storage, scoring, and a small web UI.

Installation

Twevals is intended as a development dependency.

pip install twevals
# or with uv
uv add --dev twevals

Quick start

Look at the examples directory for runnable snippets. Run the demo suite and open the UI:

twevals examples --serve

UI highlights

  • Expand rows to see inputs, outputs, metadata, scores, and annotations.
  • Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
  • Actions menu: refresh, rerun the suite, export JSON/CSV.

Common flags: --dataset, --label, -c/--concurrency, -q/--quiet, -v/--verbose. Serve-specific: --serve, --dev, --host, --port.

Authoring evals

Write evals like tests. Add a ctx: EvalContext parameter, and Twevals auto-injects a mutable context object for building your evaluation.

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    dataset="customer_service",
    default_score_key="correctness"
)
async def test_refund(ctx: EvalContext):
    # ctx.input already set from decorator
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == "expected refund response", "Validation")
    # No return needed - decorator auto-returns!

EvalContext

EvalContext is a mutable builder for assembling an evaluation result step by step. When your function has a parameter annotated with EvalContext (for example, ctx: EvalContext), Twevals automatically injects an instance.

Key features:

  • Auto-injection: Just add ctx: EvalContext parameter
  • Smart methods: add_output(), add_score(), set_params()
  • Auto-return: No explicit return needed
  • IDE support: Full type hints and autocomplete
  • Incremental building: Set fields as you get them
  • Exception safety: Partial data preserved on errors

Core methods:

# Smart output extraction
ctx.add_output({"output": "result", "latency": 0.5, "run_data": {...}})
# Or simple value
ctx.add_output("simple output")

# Flexible scoring
ctx.add_score(True, "Test passed")  # Boolean with default key
ctx.add_score(0.95, "High score", key="similarity")  # Numeric with custom key
ctx.add_score(key="detailed", passed=True, value=0.98, notes="...")  # Full control

# Note: add_score() is optional! If you never call it, the test automatically
# passes with the default score key. Just like pytest - if your test runs
# through without errors, it passes.

# Helper for parametrize
ctx.set_params(model="gpt-4", temperature=0.7)  # Sets both input and metadata

Direct field access:

ctx.input = "test input"
ctx.reference = "expected output"
ctx.metadata = {"model": "gpt-4"}
# ... and more: output, latency, run_data, error
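
To make the behavior described above concrete, here is a minimal stand-in written with only the standard library. It is not Twevals' implementation: ToyContext, its fields, and the rule for dicts that lack an "output" key are assumptions made for illustration. It mimics two documented behaviors: add_output() pulling known fields out of a dict, and an eval with no explicit score passing under the default score key.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Hypothetical stand-in for illustration only; not the real EvalContext.
@dataclass
class ToyContext:
    input: Any = None
    output: Any = None
    latency: Optional[float] = None
    run_data: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    default_score_key: str = "correctness"
    scores: list = field(default_factory=list)

    def add_output(self, value: Any) -> None:
        # Assumption: a dict carrying an "output" key is unpacked field by field;
        # any other value is stored as the output unchanged.
        if isinstance(value, dict) and "output" in value:
            self.output = value["output"]
            self.latency = value.get("latency", self.latency)
            self.run_data.update(value.get("run_data", {}))
            self.metadata.update(value.get("metadata", {}))
        else:
            self.output = value

    def add_score(self, result, notes=None, key=None) -> None:
        # Booleans become pass/fail; other numbers become a numeric value.
        entry = {"key": key or self.default_score_key, "notes": notes}
        if isinstance(result, bool):
            entry["passed"] = result
        else:
            entry["value"] = result
        self.scores.append(entry)

    def build(self) -> dict:
        # Mirrors the documented rule: no explicit score means the eval passed.
        scores = self.scores or [{"key": self.default_score_key, "passed": True}]
        return {"input": self.input, "output": self.output,
                "latency": self.latency, "scores": scores}


ctx = ToyContext(input="hi")
ctx.add_output({"output": "hello", "latency": 0.5, "run_data": {"tokens": 3}})
result = ctx.build()
print(result)
```

The real EvalContext may handle edge cases differently; treat this only as a mental model for the call styles listed above.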

Writing your first eval

The cleanest pattern sets everything you can in the decorator:

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    reference="I'll help you process your refund request.",
    dataset="customer_service",
    default_score_key="correctness",
    metadata={"model": "gpt-4", "version": "1.0"}
)
async def test_refund_request(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Output validation")

Common patterns

1) Set input in function (more dynamic):

@eval(dataset="greetings", default_score_key="politeness")
async def test_greeting(ctx: EvalContext):
    ctx.input = "Hello there"
    ctx.reference = fetch_expected_greeting()

    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Match check")

2) Smart field extraction:

@eval(dataset="qa", default_score_key="accuracy")
async def test_question(ctx: EvalContext):
    ctx.input = "What is the capital of France?"
    ctx.reference = "Paris"

    # If run_agent returns a dict, add_output extracts output, latency, run_data, and metadata from it
    ctx.add_output(await run_agent(ctx.input))

    ctx.add_score(ctx.reference.lower() in ctx.output.lower(), "Contains answer")

3) Multiple scores:

@eval(dataset="qa", default_score_key="exact_match")
async def test_multi_score(ctx: EvalContext):
    ctx.input = "What is 2+2?"
    ctx.reference = "4"
    ctx.add_output(await run_agent(ctx.input))

    # Boolean score with default key
    ctx.add_score(ctx.reference in ctx.output, "Exact match")

    # Numeric score with custom key
    similarity = calculate_similarity(ctx.output, ctx.reference)
    ctx.add_score(similarity, "Similarity score", key="similarity")

    # Full control
    ctx.add_score(
        key="confidence",
        value=0.95,
        passed=True,
        notes="High confidence prediction"
    )

4) Explicit return (optional):

@eval(dataset="test")
async def test_explicit(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("output")
    ctx.add_score(True, "Passed", key="test")
    return ctx  # Optional - decorator auto-converts to EvalResult

@eval decorator

Wraps a function and records evaluation results.

Parameters:

  • dataset (str, optional): Groups related evals (defaults to filename)
  • labels (list, optional): Filtering tags
  • evaluators (list, optional): Callables that add scores to a result
  • target (callable, optional): Pre-hook that runs before the eval, populating the EvalContext
  • input (any, optional): Pre-populate ctx.input
  • reference (any, optional): Pre-populate ctx.reference
  • default_score_key (str, optional): Default key for add_score()
  • metadata (dict, optional): Pre-populate ctx.metadata
  • metadata_from_params (list, optional): Auto-extract params to metadata
  • timeout (float, optional): Maximum execution time in seconds for the evaluation

Examples:

# Minimal
@eval()
def test(ctx: EvalContext):
    ...

# With defaults
@eval(
    dataset="my_tests",
    default_score_key="correctness",
    metadata={"version": "1.0"}
)
def test(ctx: EvalContext):
    ...

# Pre-populated input/reference
@eval(
    input="test input",
    reference="expected",
    dataset="static_tests"
)
def test(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ...

# Target hook to run your agent and inject results
def call_agent(ctx: EvalContext):
    # Use any attributes you like on the context
    ctx.trace_id = "abc123"
    ctx.add_output(my_agent(ctx.input), metadata={"trace_id": ctx.trace_id})

@eval(
    target=call_agent,
    input="What is the weather?",
    dataset="agent_calls",
)
def test_with_target(ctx: EvalContext):
    # ctx.output comes from the target hook, ctx.trace_id is preserved
    ctx.add_score("weather" in ctx.output.lower(), notes="Contains answer")
    return ctx.build()

# With timeout to prevent long-running evals
@eval(
    input="complex task",
    timeout=5.0,  # Fails if execution exceeds 5 seconds
    dataset="performance"
)
async def test_with_timeout(ctx: EvalContext):
    ctx.add_output(await slow_agent(ctx.input))
    ctx.add_score(ctx.output is not None, "Completed in time")

If your target returns a value, it is treated as ctx.output by default (dicts are passed to ctx.add_output()).
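
For the timeout parameter, one common way a runner can bound an async eval is asyncio.wait_for. The sketch below shows that general technique under stated assumptions; it does not describe how Twevals enforces timeouts internally, and run_with_timeout and the shape of the returned dict are hypothetical.

```python
import asyncio


async def run_with_timeout(coro_fn, timeout):
    """Run an async eval body, converting a timeout into an error record."""
    try:
        output = await asyncio.wait_for(coro_fn(), timeout=timeout)
        return {"output": output, "error": None}
    except asyncio.TimeoutError:
        # The awaited coroutine is cancelled by wait_for before we get here.
        return {"output": None, "error": f"timed out after {timeout}s"}


async def fast_agent():
    await asyncio.sleep(0.01)
    return "done"


async def slow_agent():
    await asyncio.sleep(1.0)
    return "never returned"


fast = asyncio.run(run_with_timeout(fast_agent, timeout=0.5))
slow = asyncio.run(run_with_timeout(slow_agent, timeout=0.05))
print(fast)
print(slow)
```

Recording the timeout as an error rather than re-raising keeps one slow eval from aborting the rest of a suite, which is one plausible reason to handle it this way.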

File-level defaults

Set global properties for all tests in a file using twevals_defaults (similar to pytest's pytestmark):

# Set defaults at the top of your file
twevals_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "default_score_key": "accuracy",
    "metadata": {"model": "gpt-4", "version": "v1.0"}
}

@eval  # Inherits all defaults
def test_positive():
    ...

@eval(labels=["experimental"])  # Override just labels
def test_edge_case():
    ...

Priority: Decorator parameters > File defaults > Built-in defaults

Supported parameters: All @eval decorator parameters including dataset, labels, evaluators, target, input, reference, default_score_key, metadata, and metadata_from_params.

Deep merge: When both file and decorator specify metadata, they are merged (decorator values win on conflicts).
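
The priority order and the metadata merge can be pictured with a small standard-library sketch. This is an illustration of the documented rules, not Twevals' own code; deep_merge, BUILTIN_DEFAULTS, and the sample values are hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict: nested dicts are merged, override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Non-dict values (including lists such as labels) are replaced outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical built-in defaults, lowest priority.
BUILTIN_DEFAULTS = {"dataset": None, "labels": [], "metadata": {}}

file_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "default_score_key": "accuracy",
    "metadata": {"model": "gpt-4", "version": "v1.0"},
}

decorator_kwargs = {"labels": ["experimental"], "metadata": {"version": "v2.0"}}

# Decorator parameters > file defaults > built-in defaults.
resolved = deep_merge(deep_merge(BUILTIN_DEFAULTS, file_defaults), decorator_kwargs)
print(resolved)
```

Here the decorator replaces labels entirely, while metadata keeps "model" from the file and takes "version" from the decorator.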

@parametrize

Generate multiple evals from one function. Place @eval above @parametrize.

Auto-mapping magic:

When parameter names match EvalContext fields (input, reference, metadata, etc.), they automatically populate the context:

from twevals import eval, parametrize, EvalContext

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay I guess", "neutral"),
])
def test_sentiment(ctx: EvalContext):
    # ctx.input and ctx.reference auto-populated! ✨

    detected = analyze_sentiment(ctx.input)
    ctx.add_output(detected)
    ctx.add_score(ctx.output == ctx.reference, f"Detected: {detected}")

# Parametrize + targets: param sets are available to the target via ctx.input/ctx.metadata
def call_agent(ctx: EvalContext):
    ctx.add_output(my_agent(ctx.input["prompt"]))

@eval(target=call_agent)
@parametrize("prompt", ["hello", "world"])
def test_prompt(ctx: EvalContext):
    assert "prompt" in ctx.input  # set before target runs
    return ctx.build()

Custom parameters:

@eval(dataset="math", default_score_key="correctness")
@parametrize("operation,a,b,expected", [
    ("add", 2, 3, 5),
    ("multiply", 4, 7, 28),
])
def test_calculator(ctx: EvalContext, operation, a, b, expected):
    ctx.input = {"operation": operation, "a": a, "b": b}
    ctx.reference = expected

    ops = {"add": lambda x, y: x + y, "multiply": lambda x, y: x * y}
    result = ops[operation](a, b)

    ctx.add_output(result)
    ctx.add_score(result == expected, f"{a} {operation} {b} = {result}")

Common patterns:

# 1) Single parameter with IDs
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_threshold(ctx: EvalContext, threshold):
    ctx.input = threshold
    ctx.add_output(evaluate(threshold))
    ctx.add_score(ctx.output > threshold, "Above threshold")

# 2) Cartesian product (stacked parametrize)
@eval(dataset="models", default_score_key="quality")
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("temperature", [0.0, 0.7, 1.0])
def test_model_grid(ctx: EvalContext, model, temperature):
    ctx.set_params(model=model, temperature=temperature)  # Sets input and metadata
    ctx.add_output(run_model(model, temperature))
    ctx.add_score(score_output(ctx.output), f"Model: {model}")

# 3) Dictionaries for named arguments
@eval(dataset="auth")
@parametrize("username,password,should_succeed", [
    {"username": "alice", "password": "correct", "should_succeed": True},
    {"username": "alice", "password": "wrong", "should_succeed": False},
])
def test_login(ctx: EvalContext, username, password, should_succeed):
    ctx.input = {"username": username}
    result = login(username, password)
    ctx.add_output(result)
    ctx.add_score(result.success == should_succeed, "Login check", key="auth")

Notes:

  • Accepts tuples, dicts, or single values
  • Works with sync or async functions
  • Put @eval above @parametrize
  • Parameter names matching input, reference, etc. auto-populate context
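
To see how many evals a stacked @parametrize produces, the grid from the Cartesian product pattern can be enumerated with itertools.product. This is only a counting aid: the order in which Twevals generates the variants is an assumption here, and nothing below calls the library.

```python
from itertools import product

models = ["gpt-4", "gpt-3.5"]
temperatures = [0.0, 0.7, 1.0]

# Each (model, temperature) pair corresponds to one generated eval.
grid = [{"model": m, "temperature": t} for m, t in product(models, temperatures)]

for params in grid:
    print(params)
print(len(grid), "evals")
```

Two models and three temperatures give six evals, so grids grow multiplicatively with each stacked decorator.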

See more patterns in examples/new_demo.py.

Advanced patterns

Assertion preservation

Assertions are treated as validation failures and create failing scores:

@eval(dataset="validation", default_score_key="correctness")
async def test_with_assertion(ctx: EvalContext):
    ctx.input = "test"
    ctx.reference = "expected"
    ctx.metadata = {"model": "gpt-4"}

    ctx.add_output(await run_agent(ctx.input))

    # If this fails, a failing score is added with the assertion message
    # All data (input/output/reference/metadata) is preserved
    assert ctx.output == ctx.reference, "Output mismatch"

    ctx.add_score(True, "All checks passed")
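
The idea behind assertion preservation can be sketched without the library: a runner catches AssertionError, records a failing score carrying the message, and keeps whatever was collected before the failure. The names run_eval and failing_eval and the plain-dict record are hypothetical; Twevals' actual mechanism may differ.

```python
def run_eval(fn, default_score_key="correctness"):
    """Call fn(record); turn an AssertionError into a failing score."""
    record = {"input": None, "output": None, "scores": []}
    try:
        fn(record)
    except AssertionError as exc:
        # Data collected before the failure stays in the record.
        record["scores"].append(
            {"key": default_score_key, "passed": False, "notes": str(exc)}
        )
    return record


def failing_eval(record):
    record["input"] = "test"
    record["output"] = "actual"
    assert record["output"] == "expected", "Output mismatch"
    # Not reached when the assertion above fails.
    record["scores"].append({"key": "correctness", "passed": True})


result = run_eval(failing_eval)
print(result)
```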

Context manager pattern

For explicit control:

@eval(dataset="test")
async def test_with_context_manager():
    with EvalContext(input="test", default_score_key="accuracy") as ctx:
        ctx.add_output(await run_agent(ctx.input))
        ctx.add_score(True, "Passed")
        return ctx  # Explicit return

Ultra-minimal pattern

The absolute shortest eval (2 lines!):

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [("I love this!", "positive"), ("Terrible!", "negative")])
def test(ctx: EvalContext):
    ctx.add_output(analyze(ctx.input))
    ctx.add_score(ctx.output == ctx.reference)

Reference

EvalResult schema

EvalContext automatically builds an EvalResult object when the evaluation completes. You can also return EvalResult directly if you prefer:

from twevals import EvalResult

@eval(dataset="test")
def test_direct():
    return EvalResult(
        input="...",          # required: test input
        output="...",         # required: system output
        reference="...",      # optional: expected output
        error=None,           # optional: error message
        latency=0.123,        # optional: execution time (auto-calculated if not provided)
        metadata={"model": "gpt-4"},  # optional: metadata for filtering
        run_data={"trace": [...]},     # optional: debug data
        scores={"key": "exact", "passed": True},  # scores dict or list
    )

Score schema

{
    "key": "metric_name",    # required: Name of the metric
    "value": 0.95,           # optional: Numeric score
    "passed": True,          # optional: Boolean pass/fail
    "notes": "...",          # optional: Justification
}
# Must provide at least one of: value or passed

scores accepts a single dict, a list of dicts, or a list of Score objects.
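
If you build score dicts by hand, a small checker for the rules above can catch mistakes early. This is a sketch using plain dicts only; validate_score and normalize_scores are hypothetical helpers, not part of Twevals, and Score objects are not handled.

```python
def validate_score(score: dict) -> dict:
    """Check a score dict against the schema above and return it unchanged."""
    if not isinstance(score.get("key"), str) or not score["key"]:
        raise ValueError("score needs a non-empty string 'key'")
    if score.get("value") is None and score.get("passed") is None:
        raise ValueError("score needs at least one of 'value' or 'passed'")
    return score


def normalize_scores(scores) -> list:
    """Accept a single dict or a list of dicts and always return a list."""
    items = [scores] if isinstance(scores, dict) else list(scores)
    return [validate_score(item) for item in items]


print(normalize_scores({"key": "exact", "passed": True}))
print(normalize_scores([{"key": "similarity", "value": 0.95, "notes": "close"}]))
```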

Evaluators

Callables that add scores to results after execution:

def custom_evaluator(result):
    """Returns Score object, dict, or list of either"""
    # Compare case-insensitively on both sides; a missing reference counts as a failure
    reference = (result.reference or "").lower()
    passed = bool(reference) and reference in result.output.lower()
    return {"key": "contains_ref", "passed": passed}

@eval(dataset="test", evaluators=[custom_evaluator])
def test_with_evaluator(ctx: EvalContext):
    ctx.input = "test"
    ctx.reference = "test"  # the evaluator compares the output against this
    ctx.add_output("test output")
    # custom_evaluator runs after, adds score

Headless runs

Skip the UI and save results to disk:

twevals path/to/evals
# Run specific function: twevals path/to/evals.py::function_name
# Run parametrized variant: twevals path/to/evals.py::function_name[param_id]
# Filtering and other common flags work here as well

Run-only flags: -o/--output (save JSON summary), --csv (save CSV), --json (output compact JSON to stdout), --list (list evaluations without running), --limit (limit number of evals).
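
For scripting headless runs (for example in CI), the selector syntax and flags above can be assembled programmatically. The helpers below are hypothetical and use only the standard library; run_headless assumes twevals is on PATH and that stdout contains only JSON when --json is passed, which is an assumption about the output rather than a documented guarantee.

```python
import json
import subprocess


def build_selector(path, function=None, param_id=None):
    """Compose path, path::function, or path::function[param_id]."""
    selector = path
    if function:
        selector += f"::{function}"
        if param_id:
            selector += f"[{param_id}]"
    return selector


def build_command(selector, limit=None):
    """Build the argv list for a headless run that prints JSON to stdout."""
    cmd = ["twevals", selector, "--json"]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


def run_headless(selector, limit=None):
    """Run twevals and parse its stdout (not called in this sketch)."""
    completed = subprocess.run(
        build_command(selector, limit), capture_output=True, text=True, check=True
    )
    return json.loads(completed.stdout)


selector = build_selector("path/to/evals.py", "function_name", "param_id")
print(build_command(selector, limit=5))
```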

Sessions and runs

Group related eval runs together using sessions. This enables workflows like model comparison, iterative debugging, and tracking progress across multiple runs.

Basic usage

# Named session and run
twevals examples --serve --session model-upgrade --run-name gpt5-baseline

# Continue a session (same name = same session)
twevals examples --serve --session model-upgrade --run-name gpt5-tuned

# Auto-generated friendly names (e.g., "swift-falcon", "bright-flame")
twevals examples --serve

How it works

  • Session: A grouping of related runs identified by name. Same --session X = same session.
  • Run: A single execution of evals. Each run creates a new JSON file.
  • File naming: {run_name}_{timestamp}.json (e.g., gpt5-baseline_2025-11-29T15-30-00Z.json)
  • Auto-naming: When not specified, friendly adjective-noun names are generated.
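
The file naming pattern can be reproduced with the standard library, which is handy when you need to predict or match run files from a script. The sketch assumes the timestamp is UTC (suggested by the trailing Z) and that colons are replaced with hyphens, as in the example name; run_filename is a hypothetical helper.

```python
from datetime import datetime, timezone


def run_filename(run_name: str, when: datetime) -> str:
    """Format {run_name}_{timestamp}.json with a filesystem-safe UTC timestamp."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"{run_name}_{stamp}.json"


name = run_filename(
    "gpt5-baseline", datetime(2025, 11, 29, 15, 30, 0, tzinfo=timezone.utc)
)
print(name)  # gpt5-baseline_2025-11-29T15-30-00Z.json
```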

UI display

The stats bar shows the current session and run:

SESSION model-upgrade · RUN gpt5-baseline | TESTS 50 | ACCURACY 45/50 | ...

Storage structure

.twevals/runs/
  gpt5-baseline_2025-11-29T15-30-00Z.json   # named run
  swift-falcon_2025-11-29T15-35-00Z.json    # auto-generated name
  latest.json                                # copy of most recent

JSON schema

Each run file includes session metadata:

{
  "session_name": "model-upgrade",
  "run_name": "gpt5-baseline",
  "run_id": "2025-11-29T15-30-00Z",
  "total_evaluations": 50,
  "results": [...]
}
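
Because run files are plain JSON, they can be inspected with a few lines of Python. The sketch below reads only the top-level keys shown above and makes no assumption about what each entry in results contains; summarize_run and the sample data are hypothetical, and a temporary file stands in for a real run under .twevals/runs/.

```python
import json
import tempfile
from pathlib import Path


def summarize_run(path: Path) -> str:
    """Return a one-line summary built from the run file's top-level keys."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return (
        f"{data['session_name']} / {data['run_name']} ({data['run_id']}): "
        f"{data['total_evaluations']} evaluations, "
        f"{len(data['results'])} results stored"
    )


# Stand-in data so the sketch runs anywhere; real files come from a Twevals run.
sample = {
    "session_name": "model-upgrade",
    "run_name": "gpt5-baseline",
    "run_id": "2025-11-29T15-30-00Z",
    "total_evaluations": 2,
    "results": [{}, {}],
}

with tempfile.TemporaryDirectory() as tmp:
    run_file = Path(tmp) / "latest.json"
    run_file.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_run(run_file)

print(summary)
```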

API endpoints

When running in serve mode, these endpoints are available:

  • GET /api/sessions - List all unique session names
  • GET /api/sessions/{name}/runs - List runs for a session
  • PATCH /api/runs/{run_id} - Update run metadata (e.g., rename)
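
A minimal client for the two GET endpoints might look like the following. It uses only the standard library and the default host and port from the serve flags. That the endpoints return JSON is an assumption, the response shapes are not documented here, and get_json is defined but not called, so the sketch runs without a server. All helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # default --host and --port


def sessions_url(base=BASE_URL):
    """URL for GET /api/sessions."""
    return f"{base}/api/sessions"


def session_runs_url(name, base=BASE_URL):
    """URL for GET /api/sessions/{name}/runs, with the name percent-encoded."""
    return f"{base}/api/sessions/{quote(name, safe='')}/runs"


def get_json(url):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


print(sessions_url())
print(session_runs_url("model-upgrade"))
# With the server running: get_json(session_runs_url("model-upgrade"))
```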

CLI reference

twevals <path>                  # run evals (default behavior)
twevals <path> --serve          # run evals and launch web UI
twevals <path>::<function>      # run specific function (e.g., tests.py::my_eval)

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  --timeout FLOAT         Global timeout in seconds (overrides individual test timeouts)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

Run flags:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
  --json                  Output compact JSON to stdout (machine-readable)
  --list                  List evaluations without running
  --limit INT             Limit number of evaluations to run

Session flags (use with --serve):
  --session TEXT          Session name to group runs together
  --run-name TEXT         Name for this run (used as file prefix)

Serve flags (use with --serve):
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

Contributing

uv sync
uv run pytest -q
uv run ruff check twevals tests
uv run black .

Helpful demo:

uv run twevals examples --serve
