
Project description

evals-viewer-io

Pydantic schemas and a writer for the evals-viewer on-disk format. This is the Python writer side of the framework — it produces the JSON tree that @ideonate/evals-viewer-server reads and the Vue frontend @ideonate/evals-viewer-core renders.

Install

pip install evals-viewer-io

Requires Python 3.10+ and Pydantic 2.

What's in the box

  • RunMetadata, EvalSummary, CaseSummary, AggregateStats: Pydantic models matching the on-disk format
  • TokenUsage: token/cost model with addition, a from_pydantic_ai adapter, and a per-model breakdown
  • save_run_metadata, save_eval_results: filesystem writers that, given models and dicts, write JSON in the layout the viewer expects
  • compute_aggregates(cases): groups case.scores[evaluator] across cases into {evaluator: {mean, min, max}}
  • compute_token_totals(cases): sums token usage, cost, and the per-model breakdown across cases
  • eval_run_dir (pytest fixture): optional fixture creating a fresh run directory under EVALS_RESULTS_DIR
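The aggregation semantics can be sketched in plain Python. Dicts stand in for the real CaseSummary models here, which is a simplification to keep the example self-contained:

```python
# A plain-dict sketch of the compute_aggregates semantics: group each
# case's scores by evaluator, then reduce to mean/min/max per evaluator.
def compute_aggregates_sketch(cases):
    by_evaluator = {}
    for case in cases:
        for evaluator, score in case.get("scores", {}).items():
            by_evaluator.setdefault(evaluator, []).append(score)
    return {
        evaluator: {
            "mean": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for evaluator, scores in by_evaluator.items()
    }
```

Cases without a given evaluator (or without any scores at all, such as failed cases) simply don't contribute to that evaluator's statistics.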

Quickstart: minimal end-to-end

from evals_viewer_io import (
    RunMetadata, EvalSummary, CaseSummary, TokenUsage,
    compute_aggregates, compute_token_totals,
    save_eval_results,
)

# 1. Build per-case rows. The output_summary dict is a free-form bag of
#    fields the viewer can show in the eval-detail table; token fields
#    use the canonical input_tokens / output_tokens / cost_usd / usage_by_model.
cases = [
    CaseSummary(
        name="case_001",
        scores={"Accuracy": 0.9, "Coverage": 0.8},
        judge_reasons={"Accuracy": "All key facts present."},
        output_summary={
            "input_tokens": 1234,
            "output_tokens": 567,
            "cost_usd": 0.012,
        },
    ),
    CaseSummary(
        name="case_002",
        scores={"Accuracy": 0.7, "Coverage": 0.9},
        output_summary={"input_tokens": 980, "output_tokens": 440, "cost_usd": 0.009},
    ),
    CaseSummary(name="case_003", success=False, error="Timeout"),
]

# 2. Compute the per-eval aggregates and write the run.
summary = EvalSummary(
    timestamp="2026-04-07T10:30:00Z",
    aggregates=compute_aggregates(cases),
    cases=cases,
)

save_eval_results(
    results_dir="./tests/test-results/evals",
    run_id="2026-04-07_103000",
    eval_name="my_eval",
    summary=summary,
    outputs={
        "case_001": {"answer": "...", "input_tokens": 1234, "output_tokens": 567, "cost_usd": 0.012},
        "case_002": {"answer": "...", "input_tokens": 980, "output_tokens": 440, "cost_usd": 0.009},
    },
    run=RunMetadata(timestamp="2026-04-07T10:30:00Z", git_commit="abc1234"),
)

That writes:

tests/test-results/evals/2026-04-07_103000/
├── run.json
└── my_eval/
    ├── summary.json
    └── outputs/
        ├── case_001.json
        └── case_002.json

Open the viewer and the run shows up.
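Since the layout is plain JSON on disk, it can also be reproduced with nothing but the stdlib, which is handy when debugging what the viewer reads. A minimal sketch (the file contents here are illustrative, not the exact schema):

```python
import json
from pathlib import Path
from tempfile import mkdtemp

# Mirror the directory tree shown above using only the stdlib.
root = Path(mkdtemp()) / "2026-04-07_103000"
eval_dir = root / "my_eval"
(eval_dir / "outputs").mkdir(parents=True)

(root / "run.json").write_text(json.dumps(
    {"timestamp": "2026-04-07T10:30:00Z", "git_commit": "abc1234"}))
(eval_dir / "summary.json").write_text(json.dumps(
    {"timestamp": "2026-04-07T10:30:00Z", "aggregates": {}, "cases": []}))
(eval_dir / "outputs" / "case_001.json").write_text(json.dumps(
    {"answer": "...", "input_tokens": 1234, "output_tokens": 567}))
```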

Token usage

TokenUsage is a normal Pydantic model with __add__ so you can sum across cases or across model calls:

from evals_viewer_io import TokenUsage

opus_call = TokenUsage(input_tokens=1200, output_tokens=300, cost_usd=0.018)
haiku_call = TokenUsage(input_tokens=800, output_tokens=200, cost_usd=0.0009)

# Per-model breakdown for one case
case_total = TokenUsage(
    input_tokens=opus_call.input_tokens + haiku_call.input_tokens,
    output_tokens=opus_call.output_tokens + haiku_call.output_tokens,
    cost_usd=(opus_call.cost_usd or 0) + (haiku_call.cost_usd or 0),
    usage_by_model={"opus": opus_call, "haiku": haiku_call},
)

# Or just use sum() across multiple cases:
total = sum([case1_usage, case2_usage, case3_usage])

The viewer reads input_tokens, output_tokens, cost_usd, and usage_by_model both from each case's full output JSON and from the per-case row in summary.json's output_summary.
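The addition semantics can be sketched with a plain dataclass. How the real TokenUsage merges usage_by_model, and that it defines __radd__ so sum()'s default start of 0 works as shown above, are assumptions here:

```python
from dataclasses import dataclass, field

@dataclass
class UsageSketch:
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    usage_by_model: dict = field(default_factory=dict)

    def __add__(self, other):
        # Merge per-model breakdowns key-by-key (an assumed behaviour).
        merged = dict(self.usage_by_model)
        for model, usage in other.usage_by_model.items():
            merged[model] = merged[model] + usage if model in merged else usage
        return UsageSketch(
            self.input_tokens + other.input_tokens,
            self.output_tokens + other.output_tokens,
            self.cost_usd + other.cost_usd,
            merged,
        )

    def __radd__(self, other):
        # Lets sum() start from its default 0.
        return self if other == 0 else NotImplemented

total = sum([UsageSketch(1200, 300, 0.018), UsageSketch(800, 200, 0.0009)])
```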

Pydantic-AI adapter

If you use pydantic-ai, there's a one-liner to convert its Usage / RunUsage objects (which use request_tokens / response_tokens rather than input / output):

from evals_viewer_io import TokenUsage

usage = TokenUsage.from_pydantic_ai(result.usage(), cost_usd=my_cost_calc(result))

The adapter uses getattr so this package never imports pydantic-ai itself. Other frameworks (OpenAI SDK, Anthropic SDK, …) can be mapped just as easily — TokenUsage(input_tokens=resp.usage.prompt_tokens, output_tokens=resp.usage.completion_tokens) etc.
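A getattr-based mapping of that shape is easy to sketch. The field names and fallbacks below are assumptions based on the description above, not the adapter's exact code:

```python
from types import SimpleNamespace

def usage_from_pydantic_ai_sketch(usage, cost_usd=None):
    # Read pydantic-ai-style fields via getattr, so nothing from
    # pydantic-ai needs to be imported; missing fields default to 0.
    return {
        "input_tokens": getattr(usage, "request_tokens", 0) or 0,
        "output_tokens": getattr(usage, "response_tokens", 0) or 0,
        "cost_usd": cost_usd,
    }

# A stand-in for a pydantic-ai Usage object.
fake_usage = SimpleNamespace(request_tokens=1200, response_tokens=300)
row = usage_from_pydantic_ai_sketch(fake_usage, cost_usd=0.018)
```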

Cost is the caller's responsibility. Pricing tables go stale fast and don't belong in this package.

Aggregating tokens across cases

from evals_viewer_io import compute_token_totals

totals = compute_token_totals(cases)
print(totals.input_tokens, totals.output_tokens, totals.cost_usd)
print(totals.usage_by_model)  # per-model breakdown summed across all cases

The function reads input_tokens / output_tokens / cost_usd / usage_by_model from each case's output_summary. Cases that don't have those fields contribute zero.
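That summing behaviour can be sketched in plain Python, with dicts standing in for CaseSummary (a simplification made to keep the example self-contained):

```python
def compute_token_totals_sketch(cases):
    # Sum the canonical token fields from each case's output_summary;
    # cases missing a field, or missing output_summary, contribute zero.
    totals = {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    for case in cases:
        summary = case.get("output_summary") or {}
        totals["input_tokens"] += summary.get("input_tokens", 0)
        totals["output_tokens"] += summary.get("output_tokens", 0)
        totals["cost_usd"] += summary.get("cost_usd") or 0
    return totals
```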

pytest fixture

# tests/conftest.py
from evals_viewer_io.pytest import eval_run_dir  # noqa: F401

# tests/test_my_eval.py
def test_my_eval(eval_run_dir):
    # eval_run_dir is a pathlib.Path under EVALS_RESULTS_DIR (or a tmp dir),
    # and run.json has already been written.
    ...
    save_eval_results(
        results_dir=eval_run_dir.parent,
        run_id=eval_run_dir.name,
        eval_name="my_eval",
        summary=summary,
        outputs=outputs,
    )

Set EVALS_RESULTS_DIR=tests/test-results/evals (or wherever your project keeps them) so the run lands somewhere the viewer can find it.
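Per the description above, the fixture's behaviour amounts to roughly the following: resolve EVALS_RESULTS_DIR (falling back to a temp dir), create a timestamped run directory, and write run.json. This is a sketch with a hypothetical helper name, not the fixture's real implementation:

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def make_run_dir_sketch(results_dir=None):
    # Resolve the results root from EVALS_RESULTS_DIR, falling back to a
    # temporary directory, mirroring the fixture's described behaviour.
    root = Path(results_dir or os.environ.get("EVALS_RESULTS_DIR")
                or tempfile.mkdtemp())
    run_id = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    run_dir = root / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run.json").write_text(json.dumps(
        {"timestamp": datetime.now(timezone.utc).isoformat()}))
    return run_dir
```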

What this package deliberately does not do

This is intentionally a small package — schemas plus the smallest set of helpers that every consumer would need to write themselves. It does not include:

  • Token field extraction from arbitrary model outputs. Different LLM SDKs name fields differently; the caller knows their own output schema.
  • A pricing table. Costs are pricing × tokens; pricing changes weekly. You compute it, you pass it in via cost_usd.
  • Pydantic→dict serialization. If your case output is a Pydantic model, call .model_dump() yourself before passing it to save_eval_results. Hiding that behind a wrapper would just suppress errors.
  • Coupling to a specific eval framework like pydantic-evals or inspect_ai. The writer takes plain dicts. Frameworks can be added as adapters when there's demand.
  • Schema versioning. The on-disk format is forward-compatible by design (extra="allow" everywhere). If a breaking change ever lands, that's the time for a schema_version field, not now.

On-disk contract

See docs/data-layout.md in the monorepo for the full directory tree and per-file schemas. The TL;DR:

{results_dir}/{run_id}/
├── run.json                       (RunMetadata)
└── {eval_name}/
    ├── summary.json               (EvalSummary: aggregates + per-case rows)
    ├── outputs/{case_name}.json   (full per-case output)
    ├── inputs/{case_name}.json    (optional; saved input fixture)
    └── case-scores/{case_name}.json (optional; per-question scores)

License

MIT

Download files

Download the file for your platform.

Source Distribution

evals_viewer_io-0.0.4.tar.gz (7.9 kB)

Uploaded Source

Built Distribution


evals_viewer_io-0.0.4-py3-none-any.whl (10.1 kB)

Uploaded Python 3

File details

Details for the file evals_viewer_io-0.0.4.tar.gz.

File metadata

  • Download URL: evals_viewer_io-0.0.4.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evals_viewer_io-0.0.4.tar.gz
Algorithm Hash digest
SHA256 ed9618abbc849993af8c981a613f31c87951c41ff2050e646c4c8cca9092351f
MD5 cfc03fc7f657034c6fcd96f78937b8ba
BLAKE2b-256 d94461b28d08b52c62c8aecabf646d1f2da21226408ac9a15ddf222359120613


Provenance

The following attestation bundles were made for evals_viewer_io-0.0.4.tar.gz:

Publisher: publish.yml on ideonate/evals-viewer


File details

Details for the file evals_viewer_io-0.0.4-py3-none-any.whl.

File metadata

File hashes

Hashes for evals_viewer_io-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 64d0ed21dc8d74d72a2235cb32c9325ca80494278ec01d42de768b3786604274
MD5 f7a34182d97d5bf69c2b2a1d0a4f51d4
BLAKE2b-256 21c80529ef76ce53b8e0c074026ee8e4c7cd3f2ea86abf02c340b8f863d3348e


Provenance

The following attestation bundles were made for evals_viewer_io-0.0.4-py3-none-any.whl:

Publisher: publish.yml on ideonate/evals-viewer

