A lightweight, code-first evaluation framework for testing AI agents and LLM applications
Twevals
Code‑first evaluation for AI agents and LLM apps. This README focuses on concrete, copy‑pasteable examples and CLI usage. See examples/ for runnable demos.
Install
pip install twevals # Python 3.10+
Develop this repo (Poetry):
poetry install
Quick Start: Serve the Web UI (recommended)
Spin up the UI, run evals once, and browse the results:
poetry run twevals serve examples
Useful options:
# Filter by dataset/labels
poetry run twevals serve examples --dataset customer_service
poetry run twevals serve examples --label production --label test
# Concurrency and dev hot‑reload
poetry run twevals serve examples -c 4 --dev
# Quiet logs (access logs off)
poetry run twevals serve examples --quiet
Tip: Use the in‑UI Actions ▾ menu to Refresh, Rerun the full suite, and Export JSON/CSV.
Browse Results in a Web UI
Launch a lightweight FastAPI app to view results in your browser. Each serve run executes evals once, saves a fresh JSON under .twevals/runs/, and the UI reads from that file on refresh:
poetry run twevals serve examples
# Options:
# -d, --dataset TEXT Filter by dataset(s) (comma-separated)
# -l, --label TEXT Filter by label(s) (repeatable)
# -c, --concurrency INT Number of concurrent evals (0 = sequential)
# --dev Enable hot‑reload (dev UX; watches repo)
# --host TEXT Host interface (default 127.0.0.1)
# --port INT Port (default 8000)
# -v, --verbose Verbose server logs
# -q, --quiet Reduce logs; hide access logs
Results storage and UI:
- Saves to .twevals/runs/<YYYY-MM-DDTHH-MM-SSZ>.json and a portable copy at .twevals/runs/latest.json.
- UI loads results from JSON on every refresh; external edits are reflected.
- Inline editing via API: PATCH /api/runs/{run_id}/results/{index} for dataset, labels, and result.{scores,metadata,error,reference,annotation}.
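The PATCH endpoint above can also be driven from a script. The sketch below only builds the request with the standard library and does not send it. The payload shape (a `result` object holding `annotation`), the JSON content type, the helper name `build_patch_request`, and the run ID `latest` are assumptions for illustration, not confirmed API details; the host and port are the serve defaults listed above.

```python
import json
import urllib.request


def build_patch_request(base_url: str, run_id: str, index: int, changes: dict) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for one stored result."""
    url = f"{base_url}/api/runs/{run_id}/results/{index}"
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Assumed payload shape: nested under "result", mirroring result.{...} above.
req = build_patch_request(
    "http://127.0.0.1:8000",
    "latest",  # hypothetical run_id
    0,
    {"result": {"annotation": "Reviewed by hand"}},
)
print(req.get_method(), req.full_url)
```

With the server running, `urllib.request.urlopen(req)` would send it; check the response before assuming the edit was persisted.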
UI features:
- Expandable rows with rich detail panels (input/output/reference, metadata JSON, run data, scores, annotation).
- Inline editing: edit dataset, labels, metadata JSON, scores (key/value/passed/notes), and a free‑form annotation; changes persist to JSON.
- Actions menu: Refresh, Rerun full suite, Export JSON, Export CSV.
- Sortable headers: click to sort; Shift+click to multi‑sort.
- Column toggles + resizable columns; choices and widths persist via localStorage; quick reset controls for columns/sorting/widths.
- Polished table styling with latency badges and label chips.
Dev mode:
poetry run twevals serve examples --dev enables hot‑reload for code/templates; useful while iterating on evals or the UI.
Minimal Eval (sync)
Create any .py file and decorate a function. Dataset defaults to the filename; labels are optional.
from twevals import eval, EvalResult

@eval  # dataset inferred from file name
def test_single_case():
    return EvalResult(
        input="Hi there",
        output="Hello! How can I help you today?",
    )
Run it:
poetry run twevals run path/to/that_file.py
Or browse it in the UI:
poetry run twevals serve path/to/that_file.py
Returning Many Results From One Function
Return a list of EvalResult if you want to iterate your own test cases inside a single eval (good for small, hand‑rolled suites).
from twevals import eval, EvalResult

@eval(dataset="customer_service", labels=["production"])
def test_refund_requests():
    test_cases = [
        ("I want a refund", "refund"),
        ("Money back please", "refund"),
        ("Cancel and refund", "refund"),
    ]
    results = []
    for prompt, expected_keyword in test_cases:
        output = f"Processing {prompt} …"
        results.append(
            EvalResult(
                input=prompt,
                output=output,
                reference=expected_keyword,
                scores={"key": "keyword_match", "passed": expected_keyword in output.lower()},
            )
        )
    return results
See: examples/demo_eval.py (also shows async + custom latency).
Parametrized Evals (pytest‑style)
Use @parametrize to automatically generate one eval per case (helps with reporting and filters). Stack @parametrize decorators for a cartesian product.
from twevals import eval, EvalResult, parametrize

@eval(dataset="sentiment_analysis")
@parametrize("text,expected", [
    ("I love this product!", "positive"),
    ("This is terrible", "negative"),
])
def test_sentiment(text, expected):
    detected = "positive" if "love" in text.lower() else "negative"
    return EvalResult(
        input=text,
        output=detected,
        reference=expected,
        scores={"key": "accuracy", "passed": detected == expected},
    )
Use dicts and IDs:
from twevals import eval, EvalResult, parametrize

@eval(dataset="math_operations", labels=["unit_test"])
@parametrize("operation,a,b,expected", [
    {"operation": "add", "a": 2, "b": 3, "expected": 5},
    {"operation": "multiply", "a": 4, "b": 7, "expected": 28},
])
def test_calculator(operation, a, b, expected):
    ops = {"add": lambda x, y: x + y, "multiply": lambda x, y: x * y}
    result = ops[operation](a, b)
    return EvalResult(
        input={"operation": operation, "a": a, "b": b},
        output=result,
        reference=expected,
        scores={"key": "correctness", "passed": result == expected},
    )

@eval(dataset="qa_system")
@parametrize(
    "question,context,expected",
    [
        ("What is the capital of France?", "France …", "Paris"),
        ("Who wrote Romeo and Juliet?", "Shakespeare …", "Shakespeare"),
    ],
    ids=["geography", "literature"],
)
def test_qa(question, context, expected):
    # …
    return EvalResult(input={"q": question, "ctx": context}, output="Paris", reference=expected)
Cartesian product and async:
from twevals import eval, EvalResult, parametrize

@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4.1", "gpt-5"])
@parametrize("temperature", [0.0, 1.0])
async def test_model_temperatures(model, temperature):
    return EvalResult(
        input={"model": model, "temperature": temperature},
        output=f"Response from {model} at {temperature}",
        scores={"key": "creativity", "value": min(temperature * 0.8 + 0.2, 1.0)},
        metadata={"model": model, "temperature": temperature},
    )
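Stacking two @parametrize decorators as above yields one eval per combination of values. As a rough illustration of which cases that produces, independent of twevals itself, the same product can be enumerated with itertools.product. The order in which twevals generates the cases is not stated here, so treat the ordering below as an assumption.

```python
from itertools import product

models = ["gpt-4.1", "gpt-5"]
temperatures = [0.0, 1.0]

# Every (model, temperature) pair: 2 x 2 = 4 generated cases.
cases = [
    {"model": model, "temperature": temperature}
    for model, temperature in product(models, temperatures)
]

for case in cases:
    print(case)
```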
See: examples/demo_eval_paramatrize.py.
Async Evals and Latency
@eval works with sync and async functions. The decorator measures the function's execution time and fills latency if you don’t set it. To time only your model/agent call, set latency on each EvalResult yourself.
import asyncio
import time
from twevals import eval, EvalResult

async def run_agent(prompt: str):
    start = time.time()
    await asyncio.sleep(0.2)  # simulate call
    return EvalResult(input=prompt, output="ok", latency=time.time() - start)

@eval(dataset="customer_service")
async def test_refund_requests():
    return [await run_agent("I want a refund")]  # list of EvalResult
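To make the default behaviour concrete, here is a minimal sketch of a timing decorator that fills in latency only when the function did not set it. It is not twevals' implementation: `FakeResult` and `fill_latency` are hypothetical stand-ins, and only sync functions returning a single result are handled.

```python
import time
from dataclasses import dataclass
from functools import wraps
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for EvalResult, reduced to the fields used here."""
    input: str
    output: str
    latency: Optional[float] = None


def fill_latency(func):
    """Time the wrapped call and set latency only if it is still unset."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if result.latency is None:
            result.latency = elapsed
        return result
    return wrapper


@fill_latency
def whole_function_timed():
    time.sleep(0.01)  # simulate work; the decorator times all of it
    return FakeResult(input="hi", output="ok")


@fill_latency
def explicit_latency():
    # The function sets latency itself, so the decorator leaves it alone.
    return FakeResult(input="hi", output="ok", latency=0.5)


auto = whole_function_timed()
manual = explicit_latency()
print(auto.latency, manual.latency)
```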
Custom Evaluators (attach scores programmatically)
You can provide evaluators=[...] to @eval(...). Each evaluator receives an EvalResult and can return a Score-like dict, a list of scores, or a new EvalResult. Returned scores are appended to the result.
from twevals import eval, EvalResult

def reference_match(result: EvalResult):
    ok = result.reference and str(result.reference).lower() in str(result.output).lower()
    return {"key": "reference_match", "passed": bool(ok)}

@eval(dataset="sentiment_analysis", evaluators=[reference_match])
def test_case():
    return EvalResult(input="I love it", output="positive", reference="positive")
See: examples/demo_eval_paramatrize.py (first example).
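An evaluator may also return a list of scores. The sketch below shows one such evaluator. To stay runnable without twevals installed, it is exercised with a `types.SimpleNamespace` stand-in, on the assumption that an EvalResult exposes `output` and `reference` as attributes (as the example above uses them). The name `exact_and_length` and the score keys are illustrative only.

```python
from types import SimpleNamespace


def exact_and_length(result):
    """Return two score dicts: an exact-match check and the output length."""
    output = str(result.output)
    exact = result.reference is not None and output == str(result.reference)
    return [
        {"key": "exact_match", "passed": exact},
        {"key": "output_length", "value": len(output)},
    ]


# Stand-in object with the same attribute names used above.
fake = SimpleNamespace(input="I love it", output="positive", reference="positive")
scores = exact_and_length(fake)
print(scores)
```

In an eval file it would be passed the same way as above, for example evaluators=[exact_and_length].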
Organizing and Discovering Evals
- Put evals in .py files; discovery ignores files starting with _.
- If you don’t pass dataset=..., it defaults to the filename.
- Use short, lowercase labels (e.g., prod, smoke) for filtering.
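As a rough approximation of the discovery rule (not the library's actual code), the sketch below lists the .py files under a directory while skipping names that start with an underscore, and derives a default dataset name from the file stem. Whether twevals recurses into subdirectories, and whether the default dataset is exactly the stem, are assumptions here; both helper names are hypothetical.

```python
from pathlib import Path


def discover_eval_files(root: Path) -> list[Path]:
    """Return .py files under root, skipping names that start with '_'."""
    return sorted(
        path for path in root.rglob("*.py")
        if not path.name.startswith("_")
    )


def default_dataset(path: Path) -> str:
    """Assumed default dataset: the file name without its .py suffix."""
    return path.stem
```

Calling discover_eval_files(Path("examples")) would list the candidate files for a directory laid out like this repository's examples/ folder.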
Run by path or file:
poetry run twevals run tests/
poetry run twevals run examples/demo_eval.py
Filter by dataset/label:
poetry run twevals run tests/ --dataset my_dataset
poetry run twevals run tests/ --label prod --label smoke
Save results and inspect:
poetry run twevals run tests/ -o results.json
cat results.json # contains summary + all results
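The saved file can also be summarized from a script. The sketch below counts passed and failed scores per dataset. The JSON layout is an assumption (a top-level `results` list whose entries carry `dataset` and a nested `result.scores`, inferred from the field names in the API description above), so inspect the file you actually get before relying on it. `pass_counts` is a hypothetical helper.

```python
import json
from collections import Counter


def pass_counts(run: dict) -> dict[str, Counter]:
    """Count passed/failed boolean scores per dataset in a parsed run."""
    counts: dict[str, Counter] = {}
    for entry in run.get("results", []):  # assumed top-level key
        dataset = entry.get("dataset", "unknown")
        scores = entry.get("result", {}).get("scores") or []
        if isinstance(scores, dict):  # a single score may be a bare dict
            scores = [scores]
        bucket = counts.setdefault(dataset, Counter())
        for score in scores:
            if "passed" in score:
                bucket["passed" if score["passed"] else "failed"] += 1
    return counts


# Inline sample in the assumed layout; swap in json.load(...) on results.json.
sample = json.loads("""
{"results": [
  {"dataset": "customer_service", "result": {"scores": [{"key": "keyword_match", "passed": true}]}},
  {"dataset": "customer_service", "result": {"scores": {"key": "keyword_match", "passed": false}}},
  {"dataset": "qa_system", "result": {"scores": [{"key": "accuracy", "value": 0.93}]}}
]}
""")
summary = pass_counts(sample)
print(summary)
```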
Result Shape (EvalResult)
EvalResult is Pydantic‑validated. Scores can be a single score (dict) or a list of scores.
from twevals import EvalResult
EvalResult(
    input="...",  # required: Input that was used to generate the output
    output="...",  # required: Output that was generated
    reference="...",  # optional: Expected output
    scores=[  # optional (dict or list of dicts): Evaluation results
        {"key": "accuracy", "value": 0.93},
        {"key": "pass", "passed": True, "notes": "ok"},
    ],
    error=None,  # optional: Error message (string)
    latency=0.123,  # optional: Execution time in seconds
    metadata={"model": "gpt-4"},  # optional: Additional custom data
    run_data={"attempts": 3},  # optional: Extra run‑specific JSON stored and shown in UI
)
Notes:
- Each score must include either value (numeric) or passed (boolean). notes is optional and shown in the UI.
- The UI also supports saving a single free‑form annotation per result via the web editor/API; this is persisted in the JSON results.
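The score rule in the notes above can be written as a small check. `check_score` is a hypothetical helper, not part of twevals, and the library's own Pydantic validation may differ in detail; the sketch only mirrors the rule as stated: a score needs a numeric value or a boolean passed.

```python
from numbers import Real


def check_score(score: dict) -> bool:
    """Return True if the score has a numeric 'value' or a boolean 'passed'."""
    value = score.get("value")
    passed = score.get("passed")
    # bool is a subclass of int, so exclude it from the numeric check.
    has_value = isinstance(value, Real) and not isinstance(value, bool)
    has_passed = isinstance(passed, bool)
    return has_value or has_passed


print(check_score({"key": "accuracy", "value": 0.93}))
print(check_score({"key": "pass", "passed": True, "notes": "ok"}))
print(check_score({"key": "empty"}))
```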
CLI Reference (common)
# Serve and browse (recommended)
twevals serve examples --quiet
# Open http://127.0.0.1:8000
# Headless run (save to files, no UI)
twevals run path/or/file.py
# Filter
twevals run tests/ --dataset my_dataset
twevals run tests/ --label prod --label smoke
# Concurrency, verbose, save JSON/CSV
twevals run tests/ -c 4 -v -o results.json --csv results.csv
Developing
poetry install
poetry run pytest -q
poetry run pytest --cov=twevals # coverage
poetry run ruff check twevals tests
poetry run black .
Helpful demo entry-point:
poetry run twevals serve examples
For deeper module internals, see twevals/README.md. The tests under tests/ demonstrate discovery, filtering, CLI options, async handling, and formatting.