A lightweight, code-first evaluation framework for testing AI agents and LLM applications

Twevals

Lightweight evals for AI agents and LLM apps. Write Python functions, use EvalContext to track results, and Twevals handles storage, scoring, and a small web UI.

Installation

Twevals is intended as a development dependency.

pip install twevals
# or with uv
uv add --dev twevals

Quick start

Look at the examples directory for runnable snippets. Run the demo suite and open the UI:

twevals serve examples

UI highlights

  • Expand rows to see inputs, outputs, metadata, scores, and annotations.
  • Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
  • Actions menu: refresh, rerun the suite, export JSON/CSV.

Common serve flags: --dataset, --label, -c/--concurrency, --dev, --host, --port, -q/--quiet, -v/--verbose.

Authoring evals

Write evals like tests. Add a ctx: EvalContext parameter, and Twevals auto-injects a mutable context object for building your evaluation.

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    dataset="customer_service",
    default_score_key="correctness"
)
async def test_refund(ctx: EvalContext):
    # ctx.input already set from decorator
    # run_agent is a placeholder for your own async agent call
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == "expected refund response", "Validation")
    # No return needed - decorator auto-returns!

EvalContext

EvalContext is a mutable builder for assembling an evaluation result field by field. When your function has a parameter named ctx, context, or carrier, Twevals automatically injects an EvalContext instance.

Key features:

  • Auto-injection: Just add ctx: EvalContext parameter
  • Smart methods: add_output(), add_score(), set_params()
  • Auto-return: No explicit return needed
  • IDE support: Full type hints and autocomplete
  • Incremental building: Set fields as you get them
  • Exception safety: Partial data preserved on errors

Core methods:

# Smart output extraction
ctx.add_output({"output": "result", "latency": 0.5, "run_data": {...}})
# Or simple value
ctx.add_output("simple output")

# Flexible scoring
ctx.add_score(True, "Test passed")  # Boolean with default key
ctx.add_score(0.95, "High score", key="similarity")  # Numeric with custom key
ctx.add_score(key="detailed", passed=True, value=0.98, notes="...")  # Full control

# Note: add_score() is optional! If you never call it, the test automatically
# passes with the default score key. Just like pytest - if your test runs
# through without errors, it passes.

# Helper for parametrize
ctx.set_params(model="gpt-4", temperature=0.7)  # Sets both input and metadata

Direct field access:

ctx.input = "test input"
ctx.reference = "expected output"
ctx.metadata = {"model": "gpt-4"}
# ... and more: output, latency, run_data, error

Writing your first eval

The cleanest pattern sets everything you can in the decorator:

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    reference="I'll help you process your refund request.",
    dataset="customer_service",
    default_score_key="correctness",
    metadata={"model": "gpt-4", "version": "1.0"}
)
async def test_refund_request(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Output validation")

Common patterns

1) Set input in function (more dynamic):

@eval(dataset="greetings", default_score_key="politeness")
async def test_greeting(ctx: EvalContext):
    ctx.input = "Hello there"
    ctx.reference = fetch_expected_greeting()

    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Match check")

2) Smart field extraction:

@eval(dataset="qa", default_score_key="accuracy")
async def test_question(ctx: EvalContext):
    ctx.input = "What is the capital of France?"
    ctx.reference = "Paris"

    # If run_agent returns a dict, add_output() extracts output, latency,
    # run_data, and metadata from it
    ctx.add_output(await run_agent(ctx.input))

    ctx.add_score(ctx.reference.lower() in ctx.output.lower(), "Contains answer")
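
The examples in this section call run_agent without defining it. As a rough sketch, a stand-in could look like the following. Everything here is an assumption for illustration: run_agent is your own code rather than part of Twevals, the canned answer replaces a real model call, and the dict keys simply mirror the output, latency, and run_data fields shown under Core methods.

```python
import asyncio
import time


async def run_agent(prompt: str) -> dict:
    """Hypothetical stand-in for your agent; not part of Twevals."""
    started = time.perf_counter()
    await asyncio.sleep(0)  # placeholder for the real model or tool call
    canned = {"What is the capital of France?": "The capital of France is Paris."}
    text = canned.get(prompt, "I don't know.")
    return {
        "output": text,                             # the answer text
        "latency": time.perf_counter() - started,   # seconds spent in the call
        "run_data": {"prompt": prompt},             # debug data for later inspection
    }


result = asyncio.run(run_agent("What is the capital of France?"))
print(result["output"])
```

Swapping the canned lookup for a real client call keeps the same return shape, so the eval functions above would not need to change.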

3) Multiple scores:

@eval(dataset="qa", default_score_key="exact_match")
async def test_multi_score(ctx: EvalContext):
    ctx.input = "What is 2+2?"
    ctx.reference = "4"
    ctx.add_output(await run_agent(ctx.input))

    # Boolean score with default key
    ctx.add_score(ctx.reference in ctx.output, "Exact match")

    # Numeric score with custom key
    similarity = calculate_similarity(ctx.output, ctx.reference)
    ctx.add_score(similarity, "Similarity score", key="similarity")

    # Full control
    ctx.add_score(
        key="confidence",
        value=0.95,
        passed=True,
        notes="High confidence prediction"
    )
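
The calculate_similarity call above is a user-supplied helper, not a Twevals function. One possible sketch, assuming plain string inputs and using the standard library's difflib, is shown below; any metric that returns a number would work with add_score in the same way.

```python
from difflib import SequenceMatcher


def calculate_similarity(output: str, reference: str) -> float:
    """Hypothetical helper: case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


print(calculate_similarity("Paris", "paris"))
print(calculate_similarity("abcd", "abef"))
```

A character-level ratio is cheap and deterministic, which makes it a reasonable first numeric score before moving to embedding-based or model-graded metrics.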

4) Explicit return (optional):

@eval(dataset="test")
async def test_explicit(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("output")
    ctx.add_score(True, "Passed", key="test")
    return ctx  # Optional - decorator auto-converts to EvalResult

@eval decorator

Wraps a function and records evaluation results.

Parameters:

  • dataset (str, optional): Groups related evals (defaults to filename)
  • labels (list, optional): Filtering tags
  • evaluators (list, optional): Callables that add scores to a result
  • input (any, optional): Pre-populate ctx.input
  • reference (any, optional): Pre-populate ctx.reference
  • default_score_key (str, optional): Default key for add_score()
  • metadata (dict, optional): Pre-populate ctx.metadata
  • metadata_from_params (list, optional): Auto-extract params to metadata

Examples:

# Minimal
@eval()
def test(ctx: EvalContext):
    ...

# With defaults
@eval(
    dataset="my_tests",
    default_score_key="correctness",
    metadata={"version": "1.0"}
)
def test(ctx: EvalContext):
    ...

# Pre-populated input/reference
@eval(
    input="test input",
    reference="expected",
    dataset="static_tests"
)
def test(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ...

@parametrize

Generate multiple evals from one function. Place @eval above @parametrize.

Auto-mapping:

When parameter names match EvalContext fields (input, reference, metadata, etc.), they automatically populate the context:

from twevals import eval, parametrize, EvalContext

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay I guess", "neutral"),
])
def test_sentiment(ctx: EvalContext):
    # ctx.input and ctx.reference auto-populated! ✨

    detected = analyze_sentiment(ctx.input)
    ctx.add_output(detected)
    ctx.add_score(ctx.output == ctx.reference, f"Detected: {detected}")

Custom parameters:

@eval(dataset="math", default_score_key="correctness")
@parametrize("operation,a,b,expected", [
    ("add", 2, 3, 5),
    ("multiply", 4, 7, 28),
])
def test_calculator(ctx: EvalContext, operation, a, b, expected):
    ctx.input = {"operation": operation, "a": a, "b": b}
    ctx.reference = expected

    ops = {"add": lambda x, y: x + y, "multiply": lambda x, y: x * y}
    result = ops[operation](a, b)

    ctx.add_output(result)
    ctx.add_score(result == expected, f"{a} {operation} {b} = {result}")
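
The split between auto-mapped and custom parameters can be pictured with a small sketch. This is not Twevals's implementation: split_params and CONTEXT_FIELDS are hypothetical names, and the field set is an assumed subset of the fields listed above. It only illustrates the idea that names matching context fields go to the context while the rest are passed to the function as arguments.

```python
# Assumed subset of EvalContext field names, for illustration only
CONTEXT_FIELDS = {"input", "reference", "metadata"}


def split_params(names: str, values: tuple) -> tuple[dict, dict]:
    """Split one parametrize row into context fields and extra arguments."""
    params = dict(zip((n.strip() for n in names.split(",")), values))
    context = {k: v for k, v in params.items() if k in CONTEXT_FIELDS}
    extra = {k: v for k, v in params.items() if k not in CONTEXT_FIELDS}
    return context, extra


print(split_params("input,reference", ("I love this!", "positive")))
print(split_params("operation,a,b,expected", ("add", 2, 3, 5)))
```

In the first call every name matches a context field, so nothing is left over; in the second, none match, so all four values would arrive as function arguments, as in test_calculator.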

Common patterns:

# 1) Single parameter with IDs
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_threshold(ctx: EvalContext, threshold):
    ctx.input = threshold
    ctx.add_output(evaluate(threshold))
    ctx.add_score(ctx.output > threshold, "Above threshold")

# 2) Cartesian product (stacked parametrize)
@eval(dataset="models", default_score_key="quality")
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("temperature", [0.0, 0.7, 1.0])
def test_model_grid(ctx: EvalContext, model, temperature):
    ctx.set_params(model=model, temperature=temperature)  # Sets input and metadata
    ctx.add_output(run_model(model, temperature))
    ctx.add_score(score_output(ctx.output), f"Model: {model}")

# 3) Dictionaries for named arguments
@eval(dataset="auth")
@parametrize("username,password,should_succeed", [
    {"username": "alice", "password": "correct", "should_succeed": True},
    {"username": "alice", "password": "wrong", "should_succeed": False},
])
def test_login(ctx: EvalContext, username, password, should_succeed):
    ctx.input = {"username": username}
    result = login(username, password)
    ctx.add_output(result)
    ctx.add_score(result.success == should_succeed, "Login check", key="auth")
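
For the stacked @parametrize in pattern 2, it may help to see how many cases a Cartesian product generates. The sketch below uses itertools.product on the same values; it shows the combinations only and makes no claim about the order in which Twevals generates or runs them.

```python
from itertools import product

models = ["gpt-4", "gpt-3.5"]
temperatures = [0.0, 0.7, 1.0]

# Every model paired with every temperature: 2 x 3 combinations
cases = [{"model": m, "temperature": t} for m, t in product(models, temperatures)]

for case in cases:
    print(case)
print(len(cases), "cases")
```

Each added stacked decorator multiplies the case count, so a third parameter with four values would bring this grid to 24 evals.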

Notes:

  • Accepts tuples, dicts, or single values
  • Works with sync or async functions
  • Put @eval above @parametrize
  • Parameter names matching input, reference, etc. auto-populate context

See more patterns in examples/new_demo.py.

Advanced patterns

Assertion preservation

A failed assertion raises an exception, but the data already recorded on the context is preserved in the result:

@eval(dataset="validation", default_score_key="correctness")
async def test_with_assertion(ctx: EvalContext):
    ctx.input = "test"
    ctx.reference = "expected"
    ctx.metadata = {"model": "gpt-4"}

    ctx.add_output(await run_agent(ctx.input))

    # If this fails, you still get input/output/reference/metadata in the result!
    assert ctx.output == ctx.reference, "Output mismatch"

    ctx.add_score(True, "All checks passed")
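
The idea behind this behavior can be sketched without Twevals at all. In the sketch below, a plain dict stands in for EvalContext and run_with_partial_result is a hypothetical runner; it is an illustration of the pattern under those assumptions, not the library's implementation.

```python
def run_with_partial_result(func, **fields):
    """Run func(ctx) and keep whatever ctx holds if an assertion fails."""
    ctx = dict(fields)  # plain dict standing in for EvalContext
    try:
        func(ctx)
    except AssertionError as exc:
        ctx["error"] = str(exc)  # record the failure; earlier fields stay intact
    return ctx


def check(ctx):
    ctx["output"] = "actual"
    assert ctx["output"] == ctx["reference"], "Output mismatch"
    ctx["passed"] = True  # never reached when the assertion fails


result = run_with_partial_result(check, input="test", reference="expected")
print(result)
```

Because the context is mutated in place, everything written before the failing assert survives, which is what makes the failed row useful for debugging.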

Context manager pattern

For explicit control:

@eval(dataset="test")
async def test_with_context_manager():
    with EvalContext(input="test", default_score_key="accuracy") as ctx:
        ctx.add_output(await run_agent(ctx.input))
        ctx.add_score(True, "Passed")
        return ctx  # Explicit return

Ultra-minimal pattern

The shortest form is a two-line function body, with inputs and references supplied by @parametrize:

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [("I love this!", "positive"), ("Terrible!", "negative")])
def test(ctx: EvalContext):
    ctx.add_output(analyze(ctx.input))
    ctx.add_score(ctx.output == ctx.reference)

Reference

EvalResult schema

EvalContext automatically builds an EvalResult object when the evaluation completes. You can also return EvalResult directly if you prefer:

from twevals import eval, EvalResult

@eval(dataset="test")
def test_direct():
    return EvalResult(
        input="...",          # required: test input
        output="...",         # required: system output
        reference="...",      # optional: expected output
        error=None,           # optional: error message
        latency=0.123,        # optional: execution time (auto-calculated if not provided)
        metadata={"model": "gpt-4"},  # optional: metadata for filtering
        run_data={"trace": [...]},     # optional: debug data
        scores={"key": "exact", "passed": True},  # scores dict or list
    )

Score schema

{
    "key": "metric_name",    # required: Name of the metric
    "value": 0.95,           # optional: Numeric score
    "passed": True,          # optional: Boolean pass/fail
    "notes": "...",          # optional: Justification
}
# Must provide at least one of: value or passed

scores accepts a single dict, a list of dicts, or a list of Score objects.
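
The schema rules above can be expressed as a small check. This is a sketch only: validate_score and normalize_scores are hypothetical helpers, they handle plain dicts but not Score objects, and Twevals's own validation may differ.

```python
def validate_score(score: dict) -> dict:
    """Check the two schema rules: a key, plus at least one of value/passed."""
    if not score.get("key"):
        raise ValueError("score needs a 'key'")
    if score.get("value") is None and score.get("passed") is None:
        raise ValueError("score needs at least one of 'value' or 'passed'")
    return score


def normalize_scores(scores) -> list[dict]:
    """Accept a single dict or a list of dicts and return a validated list."""
    if isinstance(scores, dict):
        scores = [scores]
    return [validate_score(s) for s in scores]


print(normalize_scores({"key": "exact", "passed": True}))
print(normalize_scores([{"key": "similarity", "value": 0.95}]))
```

Running a check like this on hand-written score dicts before returning an EvalResult can catch a missing key early.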

Evaluators

Callables that add scores to results after execution:

def custom_evaluator(result):
    """Returns Score object, dict, or list of either"""
    # Guard against a missing reference and compare case-insensitively
    reference = str(result.reference or "").lower()
    passed = bool(reference) and reference in str(result.output).lower()
    return {"key": "contains_ref", "passed": passed}

@eval(dataset="test", evaluators=[custom_evaluator])
def test_with_evaluator(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("test output")
    # custom_evaluator runs after, adds score
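
Since an evaluator is just a callable that takes a result, it can be exercised on its own before being attached to an eval. The sketch below assumes only that the result exposes an output attribute; length_evaluator and the SimpleNamespace stand-in are hypothetical and not part of Twevals.

```python
from types import SimpleNamespace


def length_evaluator(result, max_chars: int = 200) -> dict:
    """Hypothetical evaluator: score whether the output fits a length budget."""
    length = len(str(result.output))
    return {
        "key": "length",
        "value": length,
        "passed": length <= max_chars,
        "notes": f"{length} chars (limit {max_chars})",
    }


# Stand-in for a finished result, so the evaluator can be tried in isolation
fake_result = SimpleNamespace(input="test", output="test output", reference=None)
score = length_evaluator(fake_result)
print(score)
```

The returned dict follows the Score schema above, so the same function could be passed in the evaluators list of @eval.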

Headless runs

Skip the UI and save results to disk:

twevals run path/to/evals
# Filtering and other common flags work here as well

run-only flags: -o/--output (save JSON summary), --csv (save CSV).

CLI reference

twevals serve <path>   # run evals once and launch the web UI
twevals run <path>     # run without UI

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

serve-only:
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

run-only:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
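
When a headless run is launched from a script or CI job, the argument list can be assembled in Python. The sketch below uses only flags listed in the reference above; build_run_command is a hypothetical helper, it passes a single dataset and label, and the subprocess call is left commented out because it requires twevals to be installed.

```python
from typing import Optional


def build_run_command(
    path: str,
    dataset: Optional[str] = None,
    label: Optional[str] = None,
    concurrency: Optional[int] = None,
    output: Optional[str] = None,
    csv: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list for `twevals run` from optional settings."""
    cmd = ["twevals", "run", path]
    if dataset:
        cmd += ["--dataset", dataset]
    if label:
        cmd += ["--label", label]
    if concurrency is not None:  # 0 is meaningful (sequential), so test for None
        cmd += ["--concurrency", str(concurrency)]
    if output:
        cmd += ["--output", output]
    if csv:
        cmd += ["--csv", csv]
    return cmd


argv = build_run_command(
    "path/to/evals", dataset="customer_service", concurrency=4, output="results.json"
)
print(" ".join(argv))

# To execute it (requires twevals on PATH):
# import subprocess
# subprocess.run(argv)
```

Passing a list rather than a single shell string avoids quoting problems when paths or labels contain spaces.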

Contributing

uv sync
uv run pytest -q
uv run ruff check twevals tests
uv run black .

Helpful demo:

uv run twevals serve examples
