
evals-viewer-io

Pydantic schemas and a writer for the evals-viewer on-disk format. This is the Python writer side of the framework — it produces the JSON tree that @ideonate/evals-viewer-server reads and the Vue frontend @ideonate/evals-viewer-core renders.

Install

pip install evals-viewer-io

Requires Python 3.10+ and Pydantic 2.

What's in the box

  • RunMetadata, EvalSummary, CaseSummary, AggregateStats: Pydantic models matching the on-disk format
  • TokenUsage: token/cost model with addition, a from_pydantic_ai adapter, and a per-model breakdown
  • save_run_metadata, save_eval_results: filesystem writers; given models and dicts, they write JSON in the layout the viewer expects
  • compute_aggregates(cases): groups case.scores[evaluator] across cases into {evaluator: {mean, min, max}}
  • compute_token_totals(cases): sums token usage, cost, and per-model breakdown across cases
  • eval_run_dir (pytest fixture): optional fixture creating a fresh run directory under EVALS_RESULTS_DIR

Quickstart: minimal end-to-end

from evals_viewer_io import (
    RunMetadata, EvalSummary, CaseSummary, TokenUsage,
    compute_aggregates, compute_token_totals,
    save_eval_results,
)

# 1. Build per-case rows. The output_summary dict is a free-form bag of
#    fields the viewer can show in the eval-detail table; token fields
#    use the canonical input_tokens / output_tokens / cost_usd / usage_by_model.
cases = [
    CaseSummary(
        name="case_001",
        scores={"Accuracy": 0.9, "Coverage": 0.8},
        judge_reasons={"Accuracy": "All key facts present."},
        output_summary={
            "input_tokens": 1234,
            "output_tokens": 567,
            "cost_usd": 0.012,
        },
    ),
    CaseSummary(
        name="case_002",
        scores={"Accuracy": 0.7, "Coverage": 0.9},
        output_summary={"input_tokens": 980, "output_tokens": 440, "cost_usd": 0.009},
    ),
    CaseSummary(name="case_003", success=False, error="Timeout"),
]

# 2. Compute the per-eval aggregates and write the run.
summary = EvalSummary(
    timestamp="2026-04-07T10:30:00Z",
    aggregates=compute_aggregates(cases),
    cases=cases,
)

save_eval_results(
    results_dir="./tests/test-results/evals",
    run_id="2026-04-07_103000",
    eval_name="my_eval",
    summary=summary,
    outputs={
        "case_001": {"answer": "...", "input_tokens": 1234, "output_tokens": 567, "cost_usd": 0.012},
        "case_002": {"answer": "...", "input_tokens": 980, "output_tokens": 440, "cost_usd": 0.009},
    },
    run=RunMetadata(timestamp="2026-04-07T10:30:00Z", git_commit="abc1234"),
)

That writes:

tests/test-results/evals/2026-04-07_103000/
├── run.json
└── my_eval/
    ├── summary.json
    └── outputs/
        ├── case_001.json
        └── case_002.json

Open the viewer and the run shows up.
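
The files are plain JSON, so you can sanity-check a run without the viewer. A minimal read-back, assuming the tree above and the default EvalSummary field names:

import json
from pathlib import Path

run_dir = Path("tests/test-results/evals/2026-04-07_103000")
summary_data = json.loads((run_dir / "my_eval" / "summary.json").read_text())
print(summary_data["aggregates"])  # {"Accuracy": {"mean": ..., "min": ..., "max": ...}, ...}
print(len(summary_data["cases"]))  # 3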

Token usage

TokenUsage is a normal Pydantic model with __add__ so you can sum across cases or across model calls:

from evals_viewer_io import TokenUsage

opus_call = TokenUsage(input_tokens=1200, output_tokens=300, cost_usd=0.018)
haiku_call = TokenUsage(input_tokens=800, output_tokens=200, cost_usd=0.0009)

# Per-model breakdown for one case
case_total = TokenUsage(
    input_tokens=opus_call.input_tokens + haiku_call.input_tokens,
    output_tokens=opus_call.output_tokens + haiku_call.output_tokens,
    cost_usd=(opus_call.cost_usd or 0) + (haiku_call.cost_usd or 0),
    usage_by_model={"opus": opus_call, "haiku": haiku_call},
)

# Or just use sum() across multiple cases:
total = sum([case1_usage, case2_usage, case3_usage])

The viewer reads input_tokens, output_tokens, cost_usd, and usage_by_model both from each case's full output JSON and from the per-case row in summary.json's output_summary.
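
If you want that per-model breakdown visible in the summary table too, one option (a sketch, not a requirement of the format) is to serialize it into output_summary next to the scalar fields, reusing case_total from above:

from evals_viewer_io import CaseSummary

case = CaseSummary(
    name="case_004",
    scores={"Accuracy": 0.85},
    output_summary={
        "input_tokens": case_total.input_tokens,
        "output_tokens": case_total.output_tokens,
        "cost_usd": case_total.cost_usd,
        # plain dicts keep summary.json JSON-serializable
        "usage_by_model": {name: u.model_dump() for name, u in case_total.usage_by_model.items()},
    },
)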

Pydantic-AI adapter

If you use pydantic-ai, there's a one-liner to convert its Usage / RunUsage objects (which use request_tokens / response_tokens rather than input / output):

from evals_viewer_io import TokenUsage

usage = TokenUsage.from_pydantic_ai(result.usage(), cost_usd=my_cost_calc(result))

The adapter uses getattr so this package never imports pydantic-ai itself. Other frameworks (OpenAI SDK, Anthropic SDK, …) can be mapped just as easily — TokenUsage(input_tokens=resp.usage.prompt_tokens, output_tokens=resp.usage.completion_tokens) etc.
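
Spelled out for the OpenAI SDK (field names as in the sentence above; the cost helper is your own, sketched just below):

from evals_viewer_io import TokenUsage

# resp is an OpenAI chat completions response
usage = TokenUsage(
    input_tokens=resp.usage.prompt_tokens,
    output_tokens=resp.usage.completion_tokens,
    cost_usd=estimate_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens),  # your helper, see below
)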

Cost is the caller's responsibility. Pricing tables go stale fast and don't belong in this package.
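
As an illustration of what that caller-side calculation might look like (the rates below are placeholders, not real prices):

# Hypothetical per-million-token rates; look up current pricing yourself.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000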

Aggregating tokens across cases

from evals_viewer_io import compute_token_totals

totals = compute_token_totals(cases)
print(totals.input_tokens, totals.output_tokens, totals.cost_usd)
print(totals.usage_by_model)  # per-model breakdown summed across all cases

The function reads input_tokens / output_tokens / cost_usd / usage_by_model from each case's output_summary. Cases that don't have those fields contribute zero.

pytest fixture

# tests/conftest.py
from evals_viewer_io.pytest import eval_run_dir  # noqa: F401

# tests/test_my_eval.py
from evals_viewer_io import save_eval_results

def test_my_eval(eval_run_dir):
    # eval_run_dir is a pathlib.Path under EVALS_RESULTS_DIR (or a tmp dir),
    # and run.json has already been written.
    ...  # build `summary` and `outputs` as in the quickstart
    save_eval_results(
        results_dir=eval_run_dir.parent,
        run_id=eval_run_dir.name,
        eval_name="my_eval",
        summary=summary,
        outputs=outputs,
    )

Set EVALS_RESULTS_DIR=tests/test-results/evals (or wherever your project keeps results) so the run lands somewhere the viewer can find it.
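
For a one-off local run, that can be as simple as prefixing the pytest invocation:

EVALS_RESULTS_DIR=tests/test-results/evals pytest tests/test_my_eval.py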

What this package deliberately does not do

This is intentionally a small package — schemas plus the smallest set of helpers that every consumer would otherwise have to write themselves. It does not include:

  • Token field extraction from arbitrary model outputs. Different LLM SDKs name fields differently; the caller knows their own output schema.
  • A pricing table. Costs are pricing × tokens; pricing changes weekly. You compute it, you pass it in via cost_usd.
  • Pydantic→dict serialization. If your case output is a Pydantic model, call .model_dump() yourself before passing it to save_eval_results (see the sketch after this list). Hiding that behind a wrapper would just suppress errors.
  • Coupling to a specific eval framework like pydantic-evals or inspect_ai. The writer takes plain dicts. Frameworks can be added as adapters when there's demand.
  • Schema versioning. The on-disk format is forward-compatible by design (extra="allow" everywhere). If a breaking change ever lands, that's the time for a schema_version field, not now.
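
For the serialization point above, the caller-side step is a one-liner (MyCaseOutput is a hypothetical Pydantic model of your own):

from pydantic import BaseModel

class MyCaseOutput(BaseModel):
    answer: str
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float | None = None

result = MyCaseOutput(answer="...", input_tokens=1234, output_tokens=567, cost_usd=0.012)
outputs = {"case_001": result.model_dump()}  # plain dict, ready for save_eval_results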

On-disk contract

See docs/data-layout.md in the monorepo for the full directory tree and per-file schemas. The TL;DR:

{results_dir}/{run_id}/
├── run.json                       (RunMetadata)
└── {eval_name}/
    ├── summary.json               (EvalSummary: aggregates + per-case rows)
    ├── outputs/{case_name}.json   (full per-case output)
    ├── inputs/{case_name}.json    (optional; saved input fixture)
    └── case-scores/{case_name}.json (optional; per-question scores)

License

MIT
