# evals-viewer-io
Pydantic schemas and a writer for the evals-viewer on-disk format. This is the Python writer side of the framework — it produces the JSON tree that @ideonate/evals-viewer-server reads and the Vue frontend @ideonate/evals-viewer-core renders.
## Install

```
pip install evals-viewer-io
```
Requires Python 3.10+ and Pydantic 2.
## What's in the box
| Symbol | Purpose |
|---|---|
| `RunMetadata`, `EvalSummary`, `CaseSummary`, `AggregateStats` | Pydantic models matching the on-disk format |
| `TokenUsage` | Token / cost model with addition, `from_pydantic_ai` adapter, per-model breakdown |
| `save_run_metadata`, `save_eval_results` | Filesystem writers — given models and dicts, write JSON in the layout the viewer expects |
| `compute_aggregates(cases)` | Group `case.scores[evaluator]` across cases → `{evaluator: {mean, min, max}}` |
| `compute_token_totals(cases)` | Sum token usage / cost / per-model breakdown across cases |
| `eval_run_dir` (pytest fixture) | Optional fixture creating a fresh run directory under `EVALS_RESULTS_DIR` |
## Quickstart: minimal end-to-end
```python
from evals_viewer_io import (
    RunMetadata, EvalSummary, CaseSummary, TokenUsage,
    compute_aggregates, compute_token_totals,
    save_eval_results,
)

# 1. Build per-case rows. The output_summary dict is a free-form bag of
#    fields the viewer can show in the eval-detail table; token fields
#    use the canonical input_tokens / output_tokens / cost_usd / usage_by_model.
cases = [
    CaseSummary(
        name="case_001",
        scores={"Accuracy": 0.9, "Coverage": 0.8},
        judge_reasons={"Accuracy": "All key facts present."},
        output_summary={
            "input_tokens": 1234,
            "output_tokens": 567,
            "cost_usd": 0.012,
        },
    ),
    CaseSummary(
        name="case_002",
        scores={"Accuracy": 0.7, "Coverage": 0.9},
        output_summary={"input_tokens": 980, "output_tokens": 440, "cost_usd": 0.009},
    ),
    CaseSummary(name="case_003", success=False, error="Timeout"),
]

# 2. Compute the per-eval aggregates and write the run.
summary = EvalSummary(
    timestamp="2026-04-07T10:30:00Z",
    aggregates=compute_aggregates(cases),
    cases=cases,
)

save_eval_results(
    results_dir="./tests/test-results/evals",
    run_id="2026-04-07_103000",
    eval_name="my_eval",
    summary=summary,
    outputs={
        "case_001": {"answer": "...", "input_tokens": 1234, "output_tokens": 567, "cost_usd": 0.012},
        "case_002": {"answer": "...", "input_tokens": 980, "output_tokens": 440, "cost_usd": 0.009},
    },
    run=RunMetadata(timestamp="2026-04-07T10:30:00Z", git_commit="abc1234"),
)
```
That writes:
```
tests/test-results/evals/2026-04-07_103000/
├── run.json
└── my_eval/
    ├── summary.json
    └── outputs/
        ├── case_001.json
        └── case_002.json
```
Open the viewer and the run shows up.
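If one run holds several evals, you can also write `run.json` once up front and then call `save_eval_results` per eval. A minimal sketch, assuming `save_run_metadata` mirrors the `results_dir` / `run_id` / `run` arguments shown above:

```python
from evals_viewer_io import RunMetadata, save_run_metadata

# Write run.json once; each eval then adds its own {eval_name}/ subtree.
save_run_metadata(
    results_dir="./tests/test-results/evals",
    run_id="2026-04-07_103000",
    run=RunMetadata(timestamp="2026-04-07T10:30:00Z", git_commit="abc1234"),
)
```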
## Token usage
`TokenUsage` is a normal Pydantic model with `__add__`, so you can sum across cases or across model calls:
```python
from evals_viewer_io import TokenUsage

opus_call = TokenUsage(input_tokens=1200, output_tokens=300, cost_usd=0.018)
haiku_call = TokenUsage(input_tokens=800, output_tokens=200, cost_usd=0.0009)

# Per-model breakdown for one case
case_total = TokenUsage(
    input_tokens=opus_call.input_tokens + haiku_call.input_tokens,
    output_tokens=opus_call.output_tokens + haiku_call.output_tokens,
    cost_usd=(opus_call.cost_usd or 0) + (haiku_call.cost_usd or 0),
    usage_by_model={"opus": opus_call, "haiku": haiku_call},
)

# Or just use sum() across multiple cases:
total = sum([case1_usage, case2_usage, case3_usage])
```
The viewer reads `input_tokens`, `output_tokens`, `cost_usd`, and `usage_by_model` both from each case's full output JSON and from the per-case row in `summary.json`'s `output_summary`.
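One way to populate both places is to dump a `TokenUsage` straight into `output_summary`. A minimal sketch reusing `case_total` from the block above (only standard Pydantic 2 `model_dump` is assumed):

```python
from evals_viewer_io import CaseSummary

# The dumped dict carries input_tokens, output_tokens, cost_usd and the
# nested usage_by_model breakdown under the canonical field names.
row = CaseSummary(
    name="case_001",
    scores={"Accuracy": 0.9},
    output_summary=case_total.model_dump(exclude_none=True),
)
```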
### Pydantic-AI adapter
If you use pydantic-ai, there's a one-liner to convert its `Usage` / `RunUsage` objects (which use `request_tokens` / `response_tokens` rather than input / output):
```python
from evals_viewer_io import TokenUsage

usage = TokenUsage.from_pydantic_ai(result.usage(), cost_usd=my_cost_calc(result))
```
The adapter uses `getattr`, so this package never imports pydantic-ai itself. Other frameworks (OpenAI SDK, Anthropic SDK, …) can be mapped just as easily:
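For example, hand-mapping the OpenAI and Anthropic SDKs (`resp` and `msg` stand for hypothetical response objects from each SDK):

```python
from evals_viewer_io import TokenUsage

# OpenAI chat completions: usage lives on resp.usage.
openai_usage = TokenUsage(
    input_tokens=resp.usage.prompt_tokens,
    output_tokens=resp.usage.completion_tokens,
)

# Anthropic messages: usage lives on msg.usage.
anthropic_usage = TokenUsage(
    input_tokens=msg.usage.input_tokens,
    output_tokens=msg.usage.output_tokens,
)
```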
Cost is the caller's responsibility. Pricing tables go stale fast and don't belong in this package.
### Aggregating tokens across cases
```python
from evals_viewer_io import compute_token_totals

totals = compute_token_totals(cases)
print(totals.input_tokens, totals.output_tokens, totals.cost_usd)
print(totals.usage_by_model)  # per-model breakdown summed across all cases
```
The function reads `input_tokens` / `output_tokens` / `cost_usd` / `usage_by_model` from each case's `output_summary`. Cases that don't have those fields contribute zero.
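Conceptually, the fold looks roughly like this (a behavioural sketch of the description above, not the package's actual implementation):

```python
from evals_viewer_io import TokenUsage

def token_totals_sketch(cases) -> TokenUsage:
    total = TokenUsage(input_tokens=0, output_tokens=0)
    for case in cases:
        summary = case.output_summary or {}  # missing fields contribute zero
        total = total + TokenUsage(
            input_tokens=summary.get("input_tokens", 0),
            output_tokens=summary.get("output_tokens", 0),
            cost_usd=summary.get("cost_usd"),
            usage_by_model={
                model: TokenUsage(**usage)
                for model, usage in (summary.get("usage_by_model") or {}).items()
            },
        )
    return total
```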
## pytest fixture
```python
# tests/conftest.py
from evals_viewer_io.pytest import eval_run_dir  # noqa: F401
```

```python
# tests/test_my_eval.py
def test_my_eval(eval_run_dir):
    # eval_run_dir is a pathlib.Path under EVALS_RESULTS_DIR (or a tmp dir),
    # and run.json has already been written.
    ...
    save_eval_results(
        results_dir=eval_run_dir.parent,
        run_id=eval_run_dir.name,
        eval_name="my_eval",
        summary=summary,
        outputs=outputs,
    )
```
Set `EVALS_RESULTS_DIR=tests/test-results/evals` (or wherever your project keeps them) so the run lands somewhere the viewer can find it.
## What this package deliberately does not do
This is intentionally a small package — schemas plus the smallest set of helpers that every consumer would need to write themselves. It does not include:

- Token field extraction from arbitrary model outputs. Different LLM SDKs name fields differently; the caller knows their own output schema.
- A pricing table. Costs are pricing × tokens; pricing changes weekly. You compute it, you pass it in via `cost_usd`.
- Pydantic→dict serialization. If your case output is a Pydantic model, call `.model_dump()` yourself before passing it to `save_eval_results` (see the sketch after this list). Hiding that behind a wrapper would just suppress errors.
- Coupling to a specific eval framework like `pydantic-evals` or `inspect_ai`. The writer takes plain dicts. Frameworks can be added as adapters when there's demand.
- Schema versioning. The on-disk format is forward-compatible by design (`extra="allow"` everywhere). If a breaking change ever lands, that's the time for a `schema_version` field, not now.
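For the serialization point, the caller-side pattern is short (a sketch; `MyCaseOutput` stands in for your own output model):

```python
from pydantic import BaseModel

class MyCaseOutput(BaseModel):  # hypothetical caller-side output model
    answer: str
    input_tokens: int
    output_tokens: int

case_outputs = {
    "case_001": MyCaseOutput(answer="...", input_tokens=1234, output_tokens=567),
}

# Serialize to plain dicts yourself before calling save_eval_results(outputs=...).
outputs = {name: output.model_dump() for name, output in case_outputs.items()}
```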
## On-disk contract
See `docs/data-layout.md` in the monorepo for the full directory tree and per-file schemas. The TL;DR:
```
{results_dir}/{run_id}/
├── run.json                           (RunMetadata)
└── {eval_name}/
    ├── summary.json                   (EvalSummary: aggregates + per-case rows)
    ├── outputs/{case_name}.json       (full per-case output)
    ├── inputs/{case_name}.json        (optional; saved input fixture)
    └── case-scores/{case_name}.json   (optional; per-question scores)
```
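Because each file is just a serialized model, reading a run back for ad-hoc scripting is straightforward. A minimal sketch, assuming the models expose standard Pydantic 2 `model_validate_json`:

```python
from pathlib import Path
from evals_viewer_io import EvalSummary, RunMetadata

run_dir = Path("tests/test-results/evals/2026-04-07_103000")
run = RunMetadata.model_validate_json((run_dir / "run.json").read_text())
summary = EvalSummary.model_validate_json((run_dir / "my_eval" / "summary.json").read_text())
print(run.git_commit, summary.aggregates)
```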
## License
MIT