A lightweight, code-first evaluation framework for testing AI agents and LLM applications

Twevals

Lightweight evals for AI agents and LLM apps. Write Python functions alongside your code, return an EvalResult, and Twevals handles storage, scoring, and a small web UI.

Installation

Twevals is intended as a development dependency.

pip install twevals
# or with Poetry
poetry add --group dev twevals

Quick start

Look at the examples directory for runnable snippets. Run the demo suite and open the UI:

twevals serve examples

UI screenshot

UI highlights

Expand rows to see inputs, outputs, metadata, scores, and annotations.
Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
Actions menu: refresh, rerun the suite, export JSON/CSV.

Common serve flags: --dataset, --label, -c/--concurrency, --dev, --host, --port, -q/--quiet, -v/--verbose.

Authoring evals

Write evals like tests: each function runs your code and returns an EvalResult.

from twevals import eval, EvalResult

@eval(dataset="customer_service")
def test_refund_request():
    output = run_agent("I want a refund")  # run_agent stands in for your own agent or LLM call
    return EvalResult(
        input="I want a refund",
        output=output,
        reference="refund",
        scores={"key": "keyword", "passed": "refund" in output.lower()},
    )

EvalResult

The EvalResult object stores the result of an eval. Return it from a function decorated with @eval.

EvalResult(
    input="...",                   # required: prompt or test input; a string, dict, or list
    output="...",                  # required: model/agent output; a string, dict, or list
    reference="...",               # optional: expected output
    error=None,                    # optional: error message; assertion errors are added automatically
    latency=0.123,                 # optional: execution time; calculated automatically if not provided
    metadata={"model": "gpt-4"},   # optional: metadata for filtering and tracking
    run_data={"trace": [...]},     # optional: extra JSON stored with the result, e.g. a trace for debugging
    scores={"key": "exact", "passed": True},  # optional: a score dict or a list of score dicts
)

Twevals allows you to use a pass/fail score, a numeric score, or a combination of both. You can also add justification to the score in the notes field.

Each item in scores follows the Score schema:

{
    "key": "metric",        # required: Name of the metric
    "value": 0.42,           # optional numeric metric
    "passed": True,          # optional boolean metric
    "notes": "optional",     # optional notes
}
# Provide at least one of: value or passed

scores accepts a single dict or a list of dicts/Score objects; Twevals normalizes both forms.
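
To make the normalization rule concrete, here is a minimal sketch of what "accept a dict or a list of dicts" plus "at least one of value or passed" could look like. It does not use Twevals itself: `normalize_scores` is a hypothetical helper written for illustration, and the library's real implementation may differ.

```python
from typing import Any, Dict, List, Union

ScoreDict = Dict[str, Any]


def normalize_scores(scores: Union[ScoreDict, List[ScoreDict], None]) -> List[ScoreDict]:
    """Hypothetical helper: coerce one score dict or a list of them into a list."""
    if scores is None:
        return []
    # A single dict becomes a one-item list; a list is copied as-is.
    items = [scores] if isinstance(scores, dict) else list(scores)
    for item in items:
        if "key" not in item:
            raise ValueError("each score needs a 'key'")
        # Mirrors the schema rule: provide at least one of value or passed.
        if "value" not in item and "passed" not in item:
            raise ValueError(f"score {item['key']!r} needs 'value' or 'passed'")
    return items


single = normalize_scores({"key": "exact", "passed": True})
many = normalize_scores([
    {"key": "exact", "passed": False, "notes": "missing keyword"},
    {"key": "similarity", "value": 0.42},
])
```

Both calls return a list, so downstream code only has to handle one shape.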

`@eval` decorator

Wraps a function and records returned EvalResult objects.

Parameters:

dataset (defaults to filename)
labels (filtering tags)
evaluators (callables that add scores to a result)
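
The exact signature Twevals expects from an evaluator is not spelled out here, so the following is only a sketch of the idea. It assumes an evaluator is a callable that receives a result and produces one score dict; `FakeResult` and `contains_reference` are hypothetical names used so the example runs without Twevals installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeResult:
    """Stand-in for EvalResult, used only so this sketch is self-contained."""
    input: Any
    output: Any
    reference: Any = None
    scores: List[Dict[str, Any]] = field(default_factory=list)


def contains_reference(result: FakeResult) -> Dict[str, Any]:
    """Assumed evaluator shape: take a result, return one score dict."""
    passed = str(result.reference).lower() in str(result.output).lower()
    return {"key": "contains_reference", "passed": passed}


result = FakeResult(
    input="I want a refund",
    output="Your refund is on its way.",
    reference="refund",
)
# An evaluator "adds a score to a result"; here that is done by hand.
result.scores.append(contains_reference(result))
```

Keeping scoring logic in a separate callable like this lets several evals share one check instead of repeating it in each function body.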

`@parametrize`

Generate multiple evals from one function. Place @eval above @parametrize.

from twevals import eval, EvalResult, parametrize

@eval(dataset="customer_service")
@parametrize("prompt,expected", [
    ("I want a refund", "refund"),
    ("Can I get my money back?", "refund"),
])
def test_refund(prompt, expected):
    output = run_agent(prompt)
    return EvalResult(
        input=prompt,
        output=output,
        reference=expected,
    )

Common patterns:

# 1) Single parameter values (with optional ids)
@eval(dataset="math")
@parametrize("n", [1, 2, 3], ids=["small", "medium", "large"])
def test_square(n):
    out = n * n
    return EvalResult(input=n, output=out, reference=n**2,
                      scores={"key": "exact", "passed": out == n**2})

# 2) Multiple parameters via tuples
@eval(dataset="auth")
@parametrize("username,password,ok", [
    ("alice", "correct", True),
    ("alice", "wrong", False),
])
def test_login(username, password, ok):
    out = fake_login(username, password)
    return EvalResult(input={"u": username}, output=out,
                      scores={"key": "ok", "passed": out is ok})

# 3) Dictionaries for named argument sets
@eval(dataset="calc")
@parametrize("op,a,b,expected", [
    {"op": "add", "a": 2, "b": 3, "expected": 5},
    {"op": "mul", "a": 4, "b": 7, "expected": 28},
])
def test_calc(op, a, b, expected):
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    result = ops[op](a, b)
    return EvalResult(input={"op": op, "a": a, "b": b}, output=result,
                      reference=expected,
                      scores=[{"key": "correct", "passed": result == expected}])

# 4) Stacked parametrize (cartesian product); ids combine like "model-temp"
@eval(dataset="models")
@parametrize("model", ["gpt-4", "gpt-3.5"], ids=["g4", "g35"])
@parametrize("temperature", [0.0, 0.7])
def test_model_grid(model, temperature):
    out = run(model=model, temperature=temperature)
    return EvalResult(input={"model": model, "temperature": temperature}, output=out)

# 5) Single-name shorthand accepts single values
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8])
def test_threshold(threshold=0.5):
    out = evaluate(threshold=threshold)
    return EvalResult(input=threshold, output=out)

Notes:

Accepts tuples, dicts, or single values (for one parameter).
Works with sync or async functions.
Put @eval above @parametrize so Twevals can attach dataset/labels.

See more patterns in examples/demo_eval_paramatrize.py.
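
For the stacked case (pattern 4), the cartesian product can be pictured with `itertools.product`. This is an illustration of the combinatorics only, not Twevals' internals; the order in which Twevals generates the cases and how it names them are not assumed here.

```python
from itertools import product

models = ["gpt-4", "gpt-3.5"]
temperatures = [0.0, 0.7]

# Two stacked parameters with two values each yield 2 x 2 = 4 argument sets.
cases = [
    {"model": model, "temperature": temperature}
    for model, temperature in product(models, temperatures)
]

for case in cases:
    print(case)
```

Each dict corresponds to one call of the decorated function, so adding a third value to either list adds two more evals.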

Headless runs

Skip the UI and save results to disk:

twevals run path/to/evals
# Filtering and other common flags work here as well

run-only flags: -o/--output (save JSON summary), --csv (save CSV).

CLI reference

twevals serve <path>   # run evals once and launch the web UI
twevals run <path>     # run without UI

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

serve-only:
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

run-only:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
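
If you launch headless runs from a script, for example in CI, one option is to assemble the command from the flags listed above. The sketch below only builds the argument list; `build_run_command` is a hypothetical helper, and the dataset name and output file are placeholder values.

```python
from typing import List, Optional


def build_run_command(
    path: str,
    dataset: Optional[str] = None,
    label: Optional[str] = None,
    concurrency: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argv list for `twevals run`."""
    cmd = ["twevals", "run", path]
    if dataset is not None:
        cmd += ["--dataset", dataset]
    if label is not None:
        cmd += ["--label", label]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_run_command(
    "path/to/evals",
    dataset="customer_service",
    concurrency=4,
    output="results.json",
)
```

In an environment where Twevals is installed, the list can be passed to `subprocess.run(cmd, check=True)`.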

Contributing

poetry install
poetry run pytest -q
poetry run ruff check twevals tests
poetry run black .

Helpful demo:

poetry run twevals serve examples
