A lightweight, code-first evaluation framework for testing AI agents and LLM applications

Twevals

Code‑first evaluation for AI agents and LLM apps. This README focuses on concrete, copy‑pasteable examples and CLI usage. See examples/ for runnable demos.

Install

pip install twevals   # Python 3.10+

Develop this repo (Poetry):

poetry install

Quick Start: Serve the Web UI (recommended)

Spin up the UI, run evals once, and browse the results:

poetry run twevals serve examples

Useful options:

# Filter by dataset/labels
poetry run twevals serve examples --dataset customer_service
poetry run twevals serve examples --label production --label test

# Concurrency and dev hot‑reload
poetry run twevals serve examples -c 4 --dev

# Quiet logs (access logs off)
poetry run twevals serve examples --quiet

Tip: Use the in‑UI Actions ▾ menu to Refresh, Rerun the full suite, and Export JSON/CSV.

Browse Results in a Web UI

Launch a lightweight FastAPI app to view results in your browser. Each serve run executes evals once, saves a fresh JSON under .twevals/runs/, and the UI reads from that file on refresh:

poetry run twevals serve examples
# Options:
#   -d, --dataset TEXT      Filter by dataset(s) (comma-separated)
#   -l, --label TEXT        Filter by label(s) (repeatable)
#   -c, --concurrency INT   Number of concurrent evals (0 = sequential)
#       --dev               Enable hot‑reload (dev UX; watches repo)
#       --host TEXT         Host interface (default 127.0.0.1)
#       --port INT          Port (default 8000)
#   -v, --verbose           Verbose server logs
#   -q, --quiet             Reduce logs; hide access logs
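
One way to picture the -c/--concurrency option is as a cap on how many evals are in flight at once, with 0 meaning one at a time. The sketch below is an illustration only, assuming an asyncio-based runner; run_all, demo, and fake_eval are hypothetical names and not twevals internals.

```python
import asyncio


async def run_all(evals, concurrency=0):
    """Run zero-argument coroutine functions; concurrency=0 runs them one by one."""
    if concurrency <= 0:
        # Sequential mode: each eval finishes before the next one starts.
        return [await fn() for fn in evals]

    sem = asyncio.Semaphore(concurrency)

    async def guarded(fn):
        # At most `concurrency` evals hold the semaphore at any moment.
        async with sem:
            return await fn()

    return list(await asyncio.gather(*(guarded(fn) for fn in evals)))


async def demo(concurrency, n=6):
    """Run n fake evals and report (results, peak number running at once)."""
    state = {"active": 0, "peak": 0}

    async def fake_eval():
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulate a model call
        state["active"] -= 1
        return "ok"

    results = await run_all([fake_eval] * n, concurrency)
    return results, state["peak"]


limited = asyncio.run(demo(concurrency=2))
sequential = asyncio.run(demo(concurrency=0))
```

In this sketch, results come back in submission order in both modes; only the amount of overlap changes.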

Results storage and UI:

Saves to .twevals/runs/<YYYY-MM-DDTHH-MM-SSZ>.json and a portable copy at .twevals/runs/latest.json.
UI loads results from JSON on every refresh; external edits are reflected.
Inline editing via API: PATCH /api/runs/{run_id}/results/{index} for dataset, labels, and result.{scores,metadata,error,reference,annotation}.
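
The file naming and the PATCH endpoint above can be sketched with the standard library alone. This is a hedged illustration: run_filename and build_patch_request are hypothetical helpers, the request body shape and the idea that run_id matches the timestamped file stem are assumptions, and the request is only built here, never sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def run_filename(ts: datetime) -> str:
    """Format a timestamp in the YYYY-MM-DDTHH-MM-SSZ.json pattern (UTC)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ") + ".json"


def build_patch_request(base_url: str, run_id: str, index: int, changes: dict):
    """Build (but do not send) a PATCH for /api/runs/{run_id}/results/{index}."""
    url = f"{base_url}/api/runs/{run_id}/results/{index}"
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Assumption: the run id is the timestamped file name without ".json".
run_id = run_filename(datetime(2025, 9, 4, 21, 45, 7, tzinfo=timezone.utc))[:-5]

# Assumption: editable fields are sent as a JSON object mirroring the names above.
req = build_patch_request(
    "http://127.0.0.1:8000",
    run_id,
    0,
    {"result": {"annotation": "checked by hand"}},
)
```

With the server running, passing req to urllib.request.urlopen would send it; check the actual payload format against the server before relying on this shape.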

UI features:

Expandable rows with rich detail panels (input/output/reference, metadata JSON, run data, scores, annotation).
Inline editing: edit dataset, labels, metadata JSON, scores (key/value/passed/notes), and a free‑form annotation; changes persist to JSON.
Actions menu: Refresh, Rerun full suite, Export JSON, Export CSV.
Sortable headers: click to sort; Shift+click to multi‑sort.
Column toggles + resizable columns; choices and widths persist via localStorage; quick reset controls for columns/sorting/widths.
Polished table styling with latency badges and label chips.

Dev mode:

poetry run twevals serve examples --dev enables hot‑reload for code/templates; useful while iterating on evals or the UI.

Minimal Eval (sync)

Create any .py file and decorate a function. Dataset defaults to the filename; labels are optional.

from twevals import eval, EvalResult

@eval  # dataset inferred from file name
def test_single_case():
    return EvalResult(
        input="Hi there",
        output="Hello! How can I help you today?",
    )

Run it:

poetry run twevals run path/to/that_file.py

Or browse it in the UI:

poetry run twevals serve path/to/that_file.py

Returning Many Results From One Function

Return a list of EvalResult if you want to iterate your own test cases inside a single eval (good for small, hand‑rolled suites).

from twevals import eval, EvalResult

@eval(dataset="customer_service", labels=["production"])
def test_refund_requests():
    test_cases = [
        ("I want a refund", "refund"),
        ("Money back please", "refund"),
        ("Cancel and refund", "refund"),
    ]

    results = []
    for prompt, expected_keyword in test_cases:
        output = f"Processing {prompt} …"
        results.append(
            EvalResult(
                input=prompt,
                output=output,
                reference=expected_keyword,
                scores={"key": "keyword_match", "passed": expected_keyword in output.lower()},
            )
        )
    return results

See: examples/demo_eval.py (also shows async + custom latency).

Parametrized Evals (pytest‑style)

Use @parametrize to automatically generate one eval per case (helps with reporting and filters). Stack @parametrize decorators for a cartesian product.

from twevals import eval, EvalResult, parametrize

@eval(dataset="sentiment_analysis")
@parametrize("text,expected", [
    ("I love this product!", "positive"),
    ("This is terrible", "negative"),
])
def test_sentiment(text, expected):
    detected = "positive" if "love" in text.lower() else "negative"
    return EvalResult(
        input=text,
        output=detected,
        reference=expected,
        scores={"key": "accuracy", "passed": detected == expected},
    )

Use dicts and IDs:

@eval(dataset="math_operations", labels=["unit_test"])
@parametrize("operation,a,b,expected", [
    {"operation": "add", "a": 2, "b": 3, "expected": 5},
    {"operation": "multiply", "a": 4, "b": 7, "expected": 28},
])
def test_calculator(operation, a, b, expected):
    ops = {"add": lambda x, y: x + y, "multiply": lambda x, y: x * y}
    result = ops[operation](a, b)
    return EvalResult(
        input={"operation": operation, "a": a, "b": b},
        output=result,
        reference=expected,
        scores={"key": "correctness", "passed": result == expected},
    )

@eval(dataset="qa_system")
@parametrize(
    "question,context,expected",
    [
        ("What is the capital of France?", "France …", "Paris"),
        ("Who wrote Romeo and Juliet?", "Shakespeare …", "Shakespeare"),
    ],
    ids=["geography", "literature"],
)
def test_qa(question, context, expected):
    # …
    return EvalResult(input={"q": question, "ctx": context}, output="Paris", reference=expected)

Cartesian product and async:

@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4.1", "gpt-5"])
@parametrize("temperature", [0.0, 1.0])
async def test_model_temperatures(model, temperature):
    return EvalResult(
        input={"model": model, "temperature": temperature},
        output=f"Response from {model} at {temperature}",
        scores={"key": "creativity", "value": min(temperature * 0.8 + 0.2, 1.0)},
        metadata={"model": model, "temperature": temperature},
    )
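
To see which cases the stacked decorators above stand for, the cartesian product can be spelled out with itertools.product. This is a plain-Python illustration of the combinations only; the order in which twevals enumerates or names the generated evals is not stated here and may differ.

```python
from itertools import product

models = ["gpt-4.1", "gpt-5"]
temperatures = [0.0, 1.0]

# Every (model, temperature) pair becomes one case: 2 x 2 = 4 in total.
cases = [
    {"model": model, "temperature": temperature}
    for model, temperature in product(models, temperatures)
]

for case in cases:
    print(case)
```

Adding a third stacked parameter with three values would multiply the count again, giving twelve cases.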

See: examples/demo_eval_paramatrize.py.

Async Evals and Latency

@eval works with sync and async functions. The decorator measures the function's total execution time and fills latency if you don’t set it. To time only your model/agent call rather than the whole function, set latency on each EvalResult yourself.

import asyncio
import time
from twevals import eval, EvalResult

async def run_agent(prompt: str):
    start = time.perf_counter()  # monotonic clock, suited to measuring elapsed time
    await asyncio.sleep(0.2)  # simulate call
    return EvalResult(input=prompt, output="ok", latency=time.perf_counter() - start)

@eval(dataset="customer_service")
async def test_refund_requests():
    return [await run_agent("I want a refund")]  # list of EvalResult
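
The "fill latency only if you don't set it" behaviour can be sketched as a small timing decorator. This is not the twevals implementation: FakeResult is a stand-in for EvalResult, timed is a hypothetical decorator, and the sketch handles a single returned result rather than a list.

```python
import asyncio
import functools
import inspect
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeResult:
    """Stand-in for EvalResult with only the fields this sketch needs."""
    input: Any
    output: Any
    latency: Optional[float] = None


def timed(fn):
    """Time fn and store the elapsed seconds unless the result already has latency."""

    def fill(result, elapsed):
        if result.latency is None:
            result.latency = elapsed
        return result

    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = await fn(*args, **kwargs)
            return fill(result, time.perf_counter() - start)
        return async_wrapper

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        return fill(result, time.perf_counter() - start)
    return wrapper


@timed
def auto():
    time.sleep(0.01)
    return FakeResult("hi", "hello")  # latency left unset, so the wrapper fills it


@timed
async def manual():
    await asyncio.sleep(0.01)
    return FakeResult("hi", "hello", latency=0.5)  # explicit value is kept


auto_result = auto()
manual_result = asyncio.run(manual())
```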

Custom Evaluators (attach scores programmatically)

You can provide evaluators=[...] to @eval(...). Each evaluator receives an EvalResult and can return a Score-like dict, a list of scores, or a new EvalResult. Returned scores are appended to the result.

from twevals import eval, EvalResult

def reference_match(result: EvalResult):
    ok = result.reference and str(result.reference).lower() in str(result.output).lower()
    return {"key": "reference_match", "passed": bool(ok)}

@eval(dataset="sentiment_analysis", evaluators=[reference_match])
def test_case():
    return EvalResult(input="I love it", output="positive", reference="positive")
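
The three return shapes an evaluator may use (a score dict, a list of scores, or a new result) can be pictured with a small merge loop. This is a sketch under assumptions: FakeResult stands in for EvalResult, and apply_evaluators and length_scores are hypothetical names; how twevals itself orders or merges scores is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeResult:
    """Stand-in for EvalResult with only the fields this sketch needs."""
    input: Any
    output: Any
    reference: Optional[Any] = None
    scores: list = field(default_factory=list)


def apply_evaluators(result, evaluators):
    """Run each evaluator and merge whatever it returns into the result."""
    for evaluator in evaluators:
        returned = evaluator(result)
        if isinstance(returned, FakeResult):
            result = returned  # evaluator produced a replacement result
        elif isinstance(returned, dict):
            result.scores.append(returned)  # a single score
        elif isinstance(returned, list):
            result.scores.extend(returned)  # several scores at once
    return result


def reference_match(result):
    ok = result.reference and str(result.reference).lower() in str(result.output).lower()
    return {"key": "reference_match", "passed": bool(ok)}


def length_scores(result):
    # Hypothetical second evaluator that returns a list of scores.
    return [{"key": "output_length", "value": len(str(result.output))}]


scored = apply_evaluators(
    FakeResult("I love it", "positive", reference="positive"),
    [reference_match, length_scores],
)
```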

See: examples/demo_eval_paramatrize.py (first example).

Organizing and Discovering Evals

Put evals in .py files; discovery ignores files starting with _.
If you don’t pass dataset=..., it defaults to the filename.
Use short, lowercase labels (e.g., prod, smoke) for filtering.
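
The discovery rules above can be sketched with pathlib. This is an illustration only: discover is a hypothetical helper, and two details are assumptions rather than documented behaviour, namely that discovery recurses into subdirectories and that the default dataset is the file name without its .py suffix.

```python
import tempfile
from pathlib import Path


def discover(root: Path) -> dict:
    """Map each discovered eval file name to its assumed default dataset."""
    found = {}
    for path in sorted(root.rglob("*.py")):
        if path.name.startswith("_"):
            continue  # files starting with "_" are skipped
        found[path.name] = path.stem  # assumption: dataset = name without ".py"
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("customer_service.py", "_helpers.py", "notes.txt"):
        (root / name).write_text("")
    discovered = discover(root)
```

Here only customer_service.py is picked up: _helpers.py is ignored because of its leading underscore, and notes.txt is not a .py file.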

Run by directory or file:

poetry run twevals run tests/
poetry run twevals run examples/demo_eval.py

Filter by dataset/label:

poetry run twevals run tests/ --dataset my_dataset
poetry run twevals run tests/ --label prod --label smoke

Save results and inspect:

poetry run twevals run tests/ -o results.json
cat results.json  # contains summary + all results

Result Shape (EvalResult)

EvalResult is Pydantic‑validated. Scores can be a single score (dict) or a list of scores.

from twevals import EvalResult

EvalResult(
    input="...",          # required: Input that was used to generate the output
    output="...",         # required: Output that was generated
    reference="...",      # optional: Expected output
    scores=[               # optional (dict or list of dicts): Evaluation results
        {"key": "accuracy", "value": 0.93},
        {"key": "pass", "passed": True, "notes": "ok"},
    ],
    error=None,            # optional: Error message (string)
    latency=0.123,         # optional: Execution time in seconds
    metadata={"model": "gpt-4"},  # optional: Additional custom data
    run_data={"attempts": 3},     # optional: Extra run‑specific JSON stored and shown in UI
)

Notes:

Each score must include either value (numeric) or passed (boolean). notes is optional and shown in the UI.
The UI also supports saving a single free‑form annotation per result via the web editor/API; this is persisted in the JSON results.
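
The score rules above (a single dict or a list, each with a numeric value or a boolean passed) can be expressed as a small pre-check. This is a hedged sketch, not the library's Pydantic validation: normalize_scores is a hypothetical helper, and "either" is read here as "at least one of the two".

```python
from numbers import Real


def normalize_scores(scores):
    """Return scores as a list, raising if an entry has no value or passed field."""
    if scores is None:
        return []
    items = [scores] if isinstance(scores, dict) else list(scores)
    for score in items:
        value = score.get("value")
        # bool is a subclass of int, so exclude it from the numeric check.
        has_value = isinstance(value, Real) and not isinstance(value, bool)
        has_passed = isinstance(score.get("passed"), bool)
        if not (has_value or has_passed):
            raise ValueError(
                f"score needs a numeric 'value' or a boolean 'passed': {score!r}"
            )
    return items


single = normalize_scores({"key": "accuracy", "value": 0.93})
several = normalize_scores([
    {"key": "accuracy", "value": 0.93},
    {"key": "pass", "passed": True, "notes": "ok"},
])
```

A check like this can catch malformed scores in hand-rolled test cases before they reach EvalResult.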

CLI Reference (common)

# Serve and browse (recommended)
twevals serve examples --quiet
# Open http://127.0.0.1:8000

# Headless run (save to files, no UI)
twevals run path/or/file.py

# Filter
twevals run tests/ --dataset my_dataset
twevals run tests/ --label prod --label smoke

# Concurrency, verbose, save JSON/CSV
twevals run tests/ -c 4 -v -o results.json --csv results.csv

Developing

poetry install
poetry run pytest -q
poetry run pytest --cov=twevals  # coverage
poetry run ruff check twevals tests
poetry run black .

Helpful demo entry-point:

poetry run twevals serve examples

For deeper module internals, see twevals/README.md. The tests under tests/ demonstrate discovery, filtering, CLI options, async handling, and formatting.

This version: 0.0.0.dev20250904214507 (pre-release), released Sep 4, 2025.

