A lightweight, code-first evaluation framework for testing AI agents and LLM applications

Twevals

Lightweight evals for AI agents and LLM apps. Write Python functions, use EvalContext to track results, and Twevals handles storage, scoring, and a small web UI.

Installation

Twevals is intended as a development dependency.

pip install twevals
# or with uv
uv add --dev twevals

Quick start

Look at the examples directory for runnable snippets. Run the demo suite and open the UI:

twevals examples --serve

UI highlights

Expand rows to see inputs, outputs, metadata, scores, and annotations.
Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
Actions menu: refresh, rerun the suite, export JSON/CSV.

Common flags: -d/--dataset, -l/--label, -c/--concurrency, -q/--quiet, -v/--verbose. Serve-specific: --serve, --dev, --host, --port.

Authoring evals

Write evals like tests. Add a ctx: EvalContext parameter, and Twevals auto-injects a mutable context object for building your evaluation.

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    dataset="customer_service",
    default_score_key="correctness"
)
async def test_refund(ctx: EvalContext):
    # ctx.input already set from decorator
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == "expected refund response", "Validation")
    # No return needed - decorator auto-returns!

EvalContext

EvalContext is a mutable builder for an evaluation result: you set fields on it as your eval runs instead of assembling a result object at the end. When your function has a parameter annotated with EvalContext, Twevals automatically injects an instance.

Key features:

Auto-injection: Just add ctx: EvalContext parameter
Smart methods: add_output(), add_score(), set_params()
Auto-return: No explicit return needed
IDE support: Full type hints and autocomplete
Incremental building: Set fields as you get them
Exception safety: Partial data preserved on errors
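
To make the auto-injection idea concrete, here is a minimal sketch of how a runner could find a parameter by its type annotation and pass a context to it. It is not Twevals' implementation: SketchContext and call_with_context are hypothetical names invented for this illustration, and the real EvalContext has more fields and methods.

```python
import inspect


class SketchContext:
    """Hypothetical stand-in for EvalContext; not the Twevals class."""

    def __init__(self):
        self.input = None
        self.output = None


def call_with_context(func):
    """Call func, passing a fresh SketchContext to any parameter annotated with it."""
    ctx = SketchContext()
    kwargs = {
        name: ctx
        for name, param in inspect.signature(func).parameters.items()
        if param.annotation is SketchContext
    }
    func(**kwargs)
    # The caller gets the context back, so the function needs no return value.
    return ctx


def my_eval(ctx: SketchContext):
    ctx.input = "I want a refund"
    ctx.output = "refund started"


result = call_with_context(my_eval)
print(result.input, "->", result.output)
```

Because the runner keeps its own reference to the context, whatever the function set before returning (or before raising) stays available, which is one plausible reading of the "auto-return" and "exception safety" features above.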

Core methods:

# Smart output extraction
ctx.add_output({"output": "result", "latency": 0.5, "run_data": {...}})
# Or simple value
ctx.add_output("simple output")

# Flexible scoring
ctx.add_score(True, "Test passed")  # Boolean with default key
ctx.add_score(0.95, "High score", key="similarity")  # Numeric with custom key
ctx.add_score(key="detailed", passed=True, value=0.98, notes="...")  # Full control

# Note: add_score() is optional! If you never call it, the test automatically
# passes with the default score key. Just like pytest - if your test runs
# through without errors, it passes.

# Helper for parametrize
ctx.set_params(model="gpt-4", temperature=0.7)  # Sets both input and metadata

Direct field access:

ctx.input = "test input"
ctx.reference = "expected output"
ctx.metadata = {"model": "gpt-4"}
# ... and more: output, latency, run_data, error

Writing your first eval

The cleanest pattern sets everything you can in the decorator:

from twevals import eval, EvalContext

@eval(
    input="I want a refund",
    reference="I'll help you process your refund request.",
    dataset="customer_service",
    default_score_key="correctness",
    metadata={"model": "gpt-4", "version": "1.0"}
)
async def test_refund_request(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Output validation")

Common patterns

1) Set input in function (more dynamic):

@eval(dataset="greetings", default_score_key="politeness")
async def test_greeting(ctx: EvalContext):
    ctx.input = "Hello there"
    ctx.reference = fetch_expected_greeting()

    ctx.add_output(await run_agent(ctx.input))
    ctx.add_score(ctx.output == ctx.reference, "Match check")

2) Smart field extraction:

@eval(dataset="qa", default_score_key="accuracy")
async def test_question(ctx: EvalContext):
    ctx.input = "What is the capital of France?"
    ctx.reference = "Paris"

    # Extracts output, latency, run_data, metadata from dict
    ctx.add_output(await run_agent(ctx.input))

    ctx.add_score(ctx.reference.lower() in ctx.output.lower(), "Contains answer")

3) Multiple scores:

@eval(dataset="qa", default_score_key="exact_match")
async def test_multi_score(ctx: EvalContext):
    ctx.input = "What is 2+2?"
    ctx.reference = "4"
    ctx.add_output(await run_agent(ctx.input))

    # Boolean score with default key (substring check against the reference)
    ctx.add_score(ctx.reference in ctx.output, "Reference found in output")

    # Numeric score with custom key
    similarity = calculate_similarity(ctx.output, ctx.reference)
    ctx.add_score(similarity, "Similarity score", key="similarity")

    # Full control
    ctx.add_score(
        key="confidence",
        value=0.95,
        passed=True,
        notes="High confidence prediction"
    )

4) Explicit return (optional):

@eval(dataset="test")
async def test_explicit(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("output")
    ctx.add_score(True, "Passed", key="test")
    return ctx  # Optional - decorator auto-converts to EvalResult

`@eval` decorator

Wraps a function and records evaluation results.

Parameters:

dataset (str, optional): Groups related evals (defaults to filename)
labels (list, optional): Filtering tags
evaluators (list, optional): Callables that add scores to a result
target (callable, optional): Pre-hook that runs before the eval, populating the EvalContext
input (any, optional): Pre-populate ctx.input
reference (any, optional): Pre-populate ctx.reference
default_score_key (str, optional): Default key for add_score()
metadata (dict, optional): Pre-populate ctx.metadata
metadata_from_params (list, optional): Auto-extract params to metadata
timeout (float, optional): Maximum execution time in seconds for the evaluation

Examples:

# Minimal
@eval()
def test(ctx: EvalContext):
    ...

# With defaults
@eval(
    dataset="my_tests",
    default_score_key="correctness",
    metadata={"version": "1.0"}
)
def test(ctx: EvalContext):
    ...

# Pre-populated input/reference
@eval(
    input="test input",
    reference="expected",
    dataset="static_tests"
)
def test(ctx: EvalContext):
    # ctx.input and ctx.reference already set!
    ...

# Target hook to run your agent and inject results
def call_agent(ctx: EvalContext):
    # Use any attributes you like on the context
    ctx.trace_id = "abc123"
    ctx.add_output(my_agent(ctx.input), metadata={"trace_id": ctx.trace_id})

@eval(
    target=call_agent,
    input="What is the weather?",
    dataset="agent_calls",
)
def test_with_target(ctx: EvalContext):
    # ctx.output comes from the target hook, ctx.trace_id is preserved
    ctx.add_score("weather" in ctx.output.lower(), notes="Contains answer")
    return ctx.build()

# With timeout to prevent long-running evals
@eval(
    input="complex task",
    timeout=5.0,  # Fails if execution exceeds 5 seconds
    dataset="performance"
)
async def test_with_timeout(ctx: EvalContext):
    ctx.add_output(await slow_agent(ctx.input))
    ctx.add_score(ctx.output is not None, "Completed in time")

If your target returns a value, it is treated as ctx.output by default (dicts are passed to ctx.add_output()).
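
The timeout parameter shown above fails an eval that runs too long. As a rough sketch of how such a limit can be enforced for async functions, the standard library's asyncio.wait_for cancels the awaited call once the deadline passes. The names run_with_timeout, fast_agent, and slow_agent are hypothetical, and how Twevals records a timeout internally is an assumption here.

```python
import asyncio


async def run_with_timeout(coro_func, timeout):
    """Run an async callable; report a timeout as an error string instead of raising."""
    try:
        output = await asyncio.wait_for(coro_func(), timeout=timeout)
        return {"output": output, "error": None}
    except asyncio.TimeoutError:
        return {"output": None, "error": f"timed out after {timeout} seconds"}


async def fast_agent():
    # Hypothetical agent that answers immediately.
    return "done"


async def slow_agent():
    # Hypothetical agent that sleeps longer than the limit used below.
    await asyncio.sleep(1.0)
    return "too late"


fast_result = asyncio.run(run_with_timeout(fast_agent, timeout=0.5))
slow_result = asyncio.run(run_with_timeout(slow_agent, timeout=0.05))
print(fast_result)
print(slow_result)
```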

File-level defaults

Set global properties for all tests in a file using twevals_defaults (similar to pytest's pytestmark):

# Set defaults at the top of your file
twevals_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "default_score_key": "accuracy",
    "metadata": {"model": "gpt-4", "version": "v1.0"}
}

@eval  # Inherits all defaults
def test_positive():
    ...

@eval(labels=["experimental"])  # Override just labels
def test_edge_case():
    ...

Priority: Decorator parameters > File defaults > Built-in defaults

Supported parameters: All @eval decorator parameters including dataset, labels, evaluators, target, input, reference, default_score_key, metadata, and metadata_from_params.

Deep merge: When both file and decorator specify metadata, they are merged (decorator values win on conflicts).
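
The priority and merge rules above can be sketched in plain Python. This is an illustration only: merge_options is a hypothetical helper, the built-in defaults shown are made up, and the merge here goes one level deep, which may differ from what Twevals does with nested metadata.

```python
def merge_options(builtin, file_defaults, decorator_kwargs):
    """Combine option layers; later layers win, and 'metadata' dicts are merged key by key."""
    merged = {}
    for layer in (builtin, file_defaults, decorator_kwargs):
        for key, value in layer.items():
            if key == "metadata" and isinstance(value, dict):
                merged["metadata"] = {**(merged.get("metadata") or {}), **value}
            else:
                merged[key] = value
    return merged


# Hypothetical built-in defaults, followed by the file-level defaults from the example above.
builtin = {"dataset": "my_file", "labels": [], "metadata": {}}
file_defaults = {
    "dataset": "sentiment_analysis",
    "labels": ["production", "nlp"],
    "metadata": {"model": "gpt-4", "version": "v1.0"},
}
decorator_kwargs = {"labels": ["experimental"], "metadata": {"version": "v2.0"}}

options = merge_options(builtin, file_defaults, decorator_kwargs)
print(options)
```

In this sketch labels are replaced outright by the decorator, while metadata keeps the file-level model and takes the decorator's version.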

`@parametrize`

Generate multiple evals from one function. Place @eval above @parametrize.

Auto-mapping magic:

When parameter names match EvalContext fields (input, reference, metadata, etc.), they automatically populate the context:

from twevals import eval, EvalContext, parametrize

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay I guess", "neutral"),
])
def test_sentiment(ctx: EvalContext):
    # ctx.input and ctx.reference auto-populated! ✨

    detected = analyze_sentiment(ctx.input)
    ctx.add_output(detected)
    ctx.add_score(ctx.output == ctx.reference, f"Detected: {detected}")

# Parametrize + targets: param sets are available to the target via ctx.input/ctx.metadata
def call_agent(ctx: EvalContext):
    ctx.add_output(my_agent(ctx.input["prompt"]))

@eval(target=call_agent)
@parametrize("prompt", ["hello", "world"])
def test_prompt(ctx: EvalContext):
    assert "prompt" in ctx.input  # set before target runs
    return ctx.build()

Custom parameters:

@eval(dataset="math", default_score_key="correctness")
@parametrize("operation,a,b,expected", [
    ("add", 2, 3, 5),
    ("multiply", 4, 7, 28),
])
def test_calculator(ctx: EvalContext, operation, a, b, expected):
    ctx.input = {"operation": operation, "a": a, "b": b}
    ctx.reference = expected

    ops = {"add": lambda x, y: x + y, "multiply": lambda x, y: x * y}
    result = ops[operation](a, b)

    ctx.add_output(result)
    ctx.add_score(result == expected, f"{a} {operation} {b} = {result}")

Common patterns:

# 1) Single parameter with IDs
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_threshold(ctx: EvalContext, threshold):
    ctx.input = threshold
    ctx.add_output(evaluate(threshold))
    ctx.add_score(ctx.output > threshold, "Above threshold")

# 2) Cartesian product (stacked parametrize)
@eval(dataset="models", default_score_key="quality")
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("temperature", [0.0, 0.7, 1.0])
def test_model_grid(ctx: EvalContext, model, temperature):
    ctx.set_params(model=model, temperature=temperature)  # Sets input and metadata
    ctx.add_output(run_model(model, temperature))
    ctx.add_score(score_output(ctx.output), f"Model: {model}")

# 3) Dictionaries for named arguments
@eval(dataset="auth")
@parametrize("username,password,should_succeed", [
    {"username": "alice", "password": "correct", "should_succeed": True},
    {"username": "alice", "password": "wrong", "should_succeed": False},
])
def test_login(ctx: EvalContext, username, password, should_succeed):
    ctx.input = {"username": username}
    result = login(username, password)
    ctx.add_output(result)
    ctx.add_score(result.success == should_succeed, "Login check", key="auth")

Notes:

Accepts tuples, dicts, or single values
Works with sync or async functions
Put @eval above @parametrize
Parameter names matching input, reference, etc. auto-populate context

See more patterns in examples/new_demo.py.
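
As a rough mental model for the notes above, tuples, dicts, and single values can all be reduced to keyword dictionaries, and stacked decorators then behave like a Cartesian product of those dictionaries. The sketch below uses hypothetical helpers (normalize and stack) and is not how Twevals is implemented. The order of generated cases is an assumption, and a tuple passed for a single name is treated as positional values, which is a simplification.

```python
from itertools import product


def normalize(names, values):
    """Turn a parametrize-style spec into a list of keyword dicts."""
    keys = [name.strip() for name in names.split(",")]
    cases = []
    for value in values:
        if isinstance(value, dict):
            cases.append({key: value[key] for key in keys})
        elif isinstance(value, tuple):
            cases.append(dict(zip(keys, value)))
        else:
            # A single value is only meaningful when there is one name.
            cases.append({keys[0]: value})
    return cases


def stack(*specs):
    """Combine several normalized specs into their Cartesian product."""
    return [
        {key: val for case in combo for key, val in case.items()}
        for combo in product(*specs)
    ]


models = normalize("model", ["gpt-4", "gpt-3.5"])
temperatures = normalize("temperature", [0.0, 0.7, 1.0])
grid = stack(models, temperatures)
print(len(grid))  # 2 models x 3 temperatures
```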

Advanced patterns

Assertion preservation

Assertions are treated as validation failures and create failing scores:

@eval(dataset="validation", default_score_key="correctness")
async def test_with_assertion(ctx: EvalContext):
    ctx.input = "test"
    ctx.reference = "expected"
    ctx.metadata = {"model": "gpt-4"}

    ctx.add_output(await run_agent(ctx.input))

    # If this fails, a failing score is added with the assertion message
    # All data (input/output/reference/metadata) is preserved
    assert ctx.output == ctx.reference, "Output mismatch"

    ctx.add_score(True, "All checks passed")
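
One way to picture assertion preservation is a runner that catches AssertionError and converts it into a failing score while leaving already-recorded data untouched. The following is a minimal sketch with a plain dict in place of EvalContext. The helper run_preserving_data and the record layout are hypothetical, not Twevals internals.

```python
def run_preserving_data(func):
    """Run func(record); turn an AssertionError into a failing score, keeping the record."""
    record = {"input": None, "output": None, "scores": []}
    try:
        func(record)
    except AssertionError as exc:
        record["scores"].append(
            {"key": "correctness", "passed": False, "notes": str(exc)}
        )
    return record


def check_output(record):
    record["input"] = "test"
    record["output"] = "actual"
    assert record["output"] == "expected", "Output mismatch"
    # Not reached when the assertion above fails.
    record["scores"].append({"key": "correctness", "passed": True})


result = run_preserving_data(check_output)
print(result)
```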

Context manager pattern

For explicit control:

@eval(dataset="test")
async def test_with_context_manager():
    with EvalContext(input="test", default_score_key="accuracy") as ctx:
        ctx.add_output(await run_agent(ctx.input))
        ctx.add_score(True, "Passed")
        return ctx  # Explicit return

Ultra-minimal pattern

The shortest useful eval has a two-line body:

@eval(dataset="sentiment", default_score_key="accuracy")
@parametrize("input,reference", [("I love this!", "positive"), ("Terrible!", "negative")])
def test(ctx: EvalContext):
    ctx.add_output(analyze(ctx.input))
    ctx.add_score(ctx.output == ctx.reference)

Reference

EvalResult schema

EvalContext automatically builds an EvalResult object when the evaluation completes. You can also return EvalResult directly if you prefer:

from twevals import EvalResult

@eval(dataset="test")
def test_direct():
    return EvalResult(
        input="...",          # required: test input
        output="...",         # required: system output
        reference="...",      # optional: expected output
        error=None,           # optional: error message
        latency=0.123,        # optional: execution time (auto-calculated if not provided)
        metadata={"model": "gpt-4"},  # optional: metadata for filtering
        run_data={"trace": [...]},     # optional: debug data
        scores={"key": "exact", "passed": True},  # scores dict or list
    )

Score schema

{
    "key": "metric_name",    # required: Name of the metric
    "value": 0.95,           # optional: Numeric score
    "passed": True,          # optional: Boolean pass/fail
    "notes": "...",          # optional: Justification
}
# Must provide at least one of: value or passed

scores accepts a single dict, a list of dicts, or a list of Score objects.
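
The schema rules above (a required key, at least one of value or passed, and several accepted input shapes) can be sketched as a small validator. SketchScore and normalize_scores are hypothetical names for this illustration. The actual Score class in Twevals may validate differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchScore:
    """Hypothetical mirror of the score fields listed above; not the Twevals Score class."""

    key: str
    value: Optional[float] = None
    passed: Optional[bool] = None
    notes: Optional[str] = None

    def __post_init__(self):
        # Enforce "must provide at least one of: value or passed".
        if self.value is None and self.passed is None:
            raise ValueError(f"score {self.key!r} needs 'value' or 'passed'")


def normalize_scores(scores):
    """Accept a dict, a SketchScore, or a list of either; return a list of SketchScore."""
    if isinstance(scores, (dict, SketchScore)):
        scores = [scores]
    return [s if isinstance(s, SketchScore) else SketchScore(**s) for s in scores]


single = normalize_scores({"key": "exact", "passed": True})
mixed = normalize_scores(
    [{"key": "similarity", "value": 0.95}, SketchScore("confidence", passed=True)]
)
print(single)
print(mixed)
```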

Evaluators

Callables that add scores to results after execution:

def custom_evaluator(result):
    """Returns Score object, dict, or list of either"""
    # Guard against a missing reference and compare case-insensitively on both sides
    reference = str(result.reference or "").lower()
    passed = bool(reference) and reference in str(result.output).lower()
    return {"key": "contains_ref", "passed": passed}

@eval(dataset="test", evaluators=[custom_evaluator])
def test_with_evaluator(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("test output")
    # custom_evaluator runs after, adds score

Headless runs

Skip the UI and save results to disk:

twevals path/to/evals
# Run specific function: twevals path/to/evals.py::function_name
# Run parametrized variant: twevals path/to/evals.py::function_name[param_id]
# Filtering and other common flags work here as well

Run-only flags: -o/--output (save JSON summary), --csv (save CSV), --json (output compact JSON to stdout), --list (list evaluations without running), --limit (limit number of evals).
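
For scripting headless runs (for example in CI), the --json flag can be combined with a small wrapper. The sketch below only uses flags listed in this README. The helper names are hypothetical, and it assumes stdout contains a single JSON document whose structure is not specified here. Exit-code behaviour is also not documented above, so the sketch does not rely on it.

```python
import json
import subprocess


def build_command(path, dataset=None, limit=None):
    """Assemble a twevals invocation using only flags listed in this README."""
    command = ["twevals", path, "--json"]
    if dataset is not None:
        command += ["--dataset", dataset]
    if limit is not None:
        command += ["--limit", str(limit)]
    return command


def run_headless(path, **options):
    """Run the CLI and parse stdout as JSON (assumes stdout holds one JSON document)."""
    completed = subprocess.run(
        build_command(path, **options), capture_output=True, text=True
    )
    return json.loads(completed.stdout)


print(build_command("path/to/evals", dataset="customer_service", limit=5))
```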

Sessions and runs

Group related eval runs together using sessions. This enables workflows like model comparison, iterative debugging, and tracking progress across multiple runs.

Basic usage

# Named session and run
twevals examples --serve --session model-upgrade --run-name gpt5-baseline

# Continue a session (same name = same session)
twevals examples --serve --session model-upgrade --run-name gpt5-tuned

# Auto-generated friendly names (e.g., "swift-falcon", "bright-flame")
twevals examples --serve

How it works

Session: A grouping of related runs identified by name. Same --session X = same session.
Run: A single execution of evals. Each run creates a new JSON file.
File naming: {run_name}_{timestamp}.json (e.g., gpt5-baseline_2025-11-29T15-30-00Z.json)
Auto-naming: When not specified, friendly adjective-noun names are generated.
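
The file naming pattern above can be reproduced with the standard library. This is a sketch under the assumption that the timestamp is UTC (suggested by the trailing Z) and that colons are replaced with hyphens, as in the example name. The function run_filename is hypothetical.

```python
from datetime import datetime, timezone


def run_filename(run_name, started_at):
    """Build '{run_name}_{timestamp}.json' with a colon-free UTC timestamp."""
    stamp = started_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"{run_name}_{stamp}.json"


started = datetime(2025, 11, 29, 15, 30, 0, tzinfo=timezone.utc)
print(run_filename("gpt5-baseline", started))
# prints gpt5-baseline_2025-11-29T15-30-00Z.json
```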

UI display

The stats bar shows the current session and run:

SESSION model-upgrade · RUN gpt5-baseline | TESTS 50 | ACCURACY 45/50 | ...

Storage structure

.twevals/runs/
  gpt5-baseline_2025-11-29T15-30-00Z.json   # named run
  swift-falcon_2025-11-29T15-35-00Z.json    # auto-generated name
  latest.json                                # copy of most recent

JSON schema

Each run file includes session metadata:

{
  "session_name": "model-upgrade",
  "run_name": "gpt5-baseline",
  "run_id": "2025-11-29T15-30-00Z",
  "total_evaluations": 50,
  "results": [...]
}
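
Since each run file carries its own session metadata, runs can be grouped by session with a few lines of standard-library code. The sketch below relies only on the session_name and run_name fields shown above. The helper runs_by_session is hypothetical, and the demo writes to a temporary directory rather than a real .twevals/runs folder.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def runs_by_session(runs_dir):
    """Map each session_name to the run_names found in runs_dir, skipping latest.json."""
    sessions = defaultdict(list)
    for path in sorted(Path(runs_dir).glob("*.json")):
        if path.name == "latest.json":
            continue  # a copy of the most recent run, so it would be counted twice
        data = json.loads(path.read_text(encoding="utf-8"))
        sessions[data.get("session_name")].append(data.get("run_name"))
    return dict(sessions)


# Demo against a throwaway directory so the sketch runs without real run files.
with tempfile.TemporaryDirectory() as tmp:
    run = {
        "session_name": "model-upgrade",
        "run_name": "gpt5-baseline",
        "run_id": "2025-11-29T15-30-00Z",
        "total_evaluations": 50,
        "results": [],
    }
    for name in ("gpt5-baseline_2025-11-29T15-30-00Z.json", "latest.json"):
        Path(tmp, name).write_text(json.dumps(run), encoding="utf-8")
    grouped = runs_by_session(tmp)

print(grouped)
```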

API endpoints

When running in serve mode, these endpoints are available:

GET /api/sessions - List all unique session names
GET /api/sessions/{name}/runs - List runs for a session
PATCH /api/runs/{run_id} - Update run metadata (e.g., rename)
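
A small client for these endpoints might look like the sketch below, using only the standard library. The paths and the default host and port come from this README. Everything else is an assumption: the helper names are hypothetical, the responses are assumed to be JSON, and the run_name field in the PATCH body is a guess at the payload shape rather than documented behaviour.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8000"  # default --host and --port from the CLI reference


def session_runs_url(session_name, base_url=BASE_URL):
    """URL for GET /api/sessions/{name}/runs, with the name percent-encoded."""
    return f"{base_url}/api/sessions/{quote(session_name, safe='')}/runs"


def rename_request(run_id, new_name, base_url=BASE_URL):
    """Build (but do not send) a PATCH request; the 'run_name' body field is an assumption."""
    body = json.dumps({"run_name": new_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/runs/{quote(run_id, safe='')}",
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


def get_json(url):
    """Fetch a URL and decode the body as JSON (assumes the server replies with JSON)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


request = rename_request("2025-11-29T15-30-00Z", "gpt5-tuned")
print(request.get_method(), request.full_url)
print(session_runs_url("model-upgrade"))
```

With the server running in serve mode, get_json(session_runs_url("model-upgrade")) would perform the actual GET; the sketch stops short of sending anything so it can run without a server.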

CLI reference

twevals <path>                  # run evals (default behavior)
twevals <path> --serve          # run evals and launch web UI
twevals <path>::<function>      # run specific function (e.g., tests.py::my_eval)

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  --timeout FLOAT         Global timeout in seconds (overrides individual test timeouts)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

Run flags:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
  --json                  Output compact JSON to stdout (machine-readable)
  --list                  List evaluations without running
  --limit INT             Limit number of evaluations to run

Session flags (use with --serve):
  --session TEXT          Session name to group runs together
  --run-name TEXT         Name for this run (used as file prefix)

Serve flags (use with --serve):
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

Contributing

uv sync
uv run pytest -q
uv run ruff check twevals tests
uv run black .

Helpful demo:

uv run twevals examples --serve

twevals 0.0.0.dev20251201015127
