Verifiable Evals: an open-source benchmark runner that emits signed, reproducible, Merkle-anchored records so anyone can re-run and verify a leaderboard entry.

These details have not been verified by PyPI

Project links

Project description

Verifiable Evals (`verievals`)

Every eval run produces a signed, reproducible record — model version, prompts, outputs, scores, seed, and config — appended to a public Merkle log. Anyone can re-run and cryptographically verify any leaderboard entry.

The problem
The idea: the evidence is the artifact
Installation
Quickstart
Integrate your own model & benchmark
Architecture
The RunRecord
Cryptographic design
The verification protocol
Trust Score leaderboard
Determinism & the reproducibility model
The logging SDK
Models, benchmarks, and scorers
GSM8K MVP: a locally-run open model, end to end
Command-line interface
Threat model
Project layout
Development
Project status
Roadmap
License

The problem

AI evaluations today are frequently not transparent, accessible, or verifiable. When a lab reports a benchmark number, it is often impossible to recover the exact prompts, decoding parameters, model snapshot, or scoring rules that produced it. The consequences are well documented:

Irreproducibility. The same benchmark, run by two parties, yields different numbers because the setup differed in ways nobody recorded.
Conflicting claims. Different labs publish different scores for the same benchmark and the same model, with no way to adjudicate.
Self-reported trust. A leaderboard entry is, in practice, a claim backed by a PDF — there is no mechanism for a third party to confirm it.

This project takes the position that a benchmark result should be an auditable artifact, not a claim. It implements the machinery to make that true.

The idea: the evidence is the artifact

verievals is a benchmark runner in which the output of a run is a canonical, content-addressed, Ed25519-signed, Merkle-anchored record that contains everything needed to understand and reproduce the result:

the model identity (provider, name, resolved version) and every decoding parameter;
the benchmark identity plus a content hash of its exact tasks and scorer;
the random seed and sampling configuration;
every prompt, every raw output, and every per-task score, verbatim;
the aggregate metrics.

That record is signed and its content hash is appended to an append-only Merkle ledger. The ledger's single root hash commits to every result it contains, and a compact inclusion proof lets anyone confirm a specific record is committed under a published root. No blockchain is required — the root is 32 bytes and can be published in a README, a git tag, or a tweet (an optional L2 anchor can be layered on later).

A verifier can then, with only the public artifacts:

confirm the record has not been altered (content hash matches the body),
confirm who signed it (Ed25519 signature),
confirm it is committed under the published root (Merkle inclusion proof), and
for a deterministic model, re-run the eval and confirm the outputs and scores reproduce bit-for-bit.

prompts + config + seed ──▶ model ──▶ outputs ──▶ scorer ──▶ scores
        │                                                       │
        └──────────────── canonical RunRecord (body) ───────────┘
                                  │
                       sha256 (content address = record id)
                                  │
                 Ed25519 signature   +   Merkle leaf
                                  │
                          published root ◀── inclusion proof ──▶ anyone verifies

Installation

pip install verievals

Optional model backends and dev tooling:

pip install "verievals[anthropic]"   # Claude adapter
pip install "verievals[openai]"      # OpenAI adapter
pip install "verievals[dev]"         # tests, lint, types, build/publish tools

From source (for contributing, or before the PyPI release):

git clone https://github.com/KaushikKC/verievals && cd verievals
pip install -e ".[dev]"

Quickstart

# 1. Generate a signing keypair (the private key stays local, chmod 0600)
verievals keygen --out ./keys/runner

# 2. Run a benchmark deterministically and emit a signed record
verievals run \
    --benchmark benchmarks/arithmetic \
    --model echo \
    --key ./keys/runner.key \
    --out ./records \
    --signer "your-name"

# 3. Append the record to the public Merkle ledger
verievals ledger --ledger ./ledger append ./records/<record-id>.json

# 4. Anyone can verify the record reproduces and is included under the root
verievals verify ./records/<record-id>.json \
    --ledger ./ledger --benchmark benchmarks/arithmetic --model echo

# 5. Build a trust-scored leaderboard
verievals leaderboard --records ./records --ledger ./ledger --trust

A fully offline, deterministic walkthrough of the whole pipeline:

python examples/run_arithmetic.py     # run → sign → ledger → verify → leaderboard
python examples/run_gsm8k.py          # the local-model MVP, replayed from a fixture

Integrate your own model & benchmark

You install verievals as a dependency in your own project — you don't fork or edit its source. Then you plug in two things: your model and your benchmark.

Your model — three options, easiest first:

Already supported (no code): a spec string — ollama:llama3:8b, anthropic:claude-opus-4-8, openai:gpt-4o.

Your own / proprietary model — wrap it in a ~15-line adapter:

from verievals.models.base import ModelAdapter, ModelInfo

class MyModelAdapter(ModelAdapter):
    def generate(self, prompt: str) -> str:
        return my_model(prompt)          # your transformers pipeline / API call
    def info(self) -> ModelInfo:
        return ModelInfo("my-org", "my-model", "v1.0", {})

You already have an eval loop — use the SDK (EvalRecorder) and just rec.log(...) inside your existing loop.

Your benchmark — a folder with benchmark.yaml (id, version, scorer) and tasks.jsonl ({id, prompt, expected} per line). Drop in the full GSM8K split, or your own tasks, in the same format.

Then run it to get a signed, verifiable record:

from verievals.benchmarks.loader import load_benchmark
from verievals.crypto.signing import generate_signing_key
from verievals.runner.engine import run_eval

record = run_eval(load_benchmark("./my-benchmark"), MyModelAdapter(),
                  signing_key=generate_signing_key(), signer="my-org")
print(record.body.aggregate.metrics["accuracy"])

Full walkthrough: docs/INTEGRATION.md · runnable template: examples/custom_model.py.

Architecture

The package is a set of small, single-responsibility modules with a strictly downward dependency flow (no cycles):

verievals
├── crypto       canonical JSON · SHA-256 content addressing · RFC 6962 Merkle · Ed25519
├── records      RunRecord schema (body / envelope / signature) · filesystem store
├── models       adapter interface · echo + fixture (deterministic)
│                                  · ollama (local) · anthropic + openai (hosted)
├── benchmarks   Benchmark/Task models (+ content hash) · yaml+jsonl loader
├── scoring      exact-match · regex · numeric · gsm8k · registry
├── runner       deterministic eval engine → signed RunRecord
├── sdk          EvalRecorder context manager + @logged_eval decorator
├── ledger       append-only Merkle log · 4-stage verification protocol
├── leaderboard  accuracy board · trust-scored board (verification tiers)
└── cli          keygen · run · ledger · verify · leaderboard

Dependency direction: crypto depends on nothing internal → records → runner (also uses benchmarks, models, scoring) → ledger → leaderboard / cli / sdk on top.

Module	Responsibility
`crypto`	Deterministic JSON encoding, content hashing, Merkle tree + inclusion proofs, signing/verification — the trust primitives.
`records`	The `RunRecord` schema and a content-addressed filesystem store.
`models`	A uniform `ModelAdapter` interface and concrete adapters (deterministic, local, hosted).
`benchmarks`	Immutable, versioned benchmark/task definitions with a verifiable content hash, and a yaml+jsonl loader.
`scoring`	Deterministic scorers behind an id registry, so the scorer is pinned by the record.
`runner`	The deterministic engine that runs a benchmark against a model with a scorer and produces a signed record.
`sdk`	Drop-in instrumentation that emits the same signed record from inside an existing eval loop.
`ledger`	The append-only Merkle log and the verification protocol.
`leaderboard`	Ranked boards built only from verifying, ledger-included records.
`cli`	The `verievals` command-line entry point.

The `RunRecord`

The record is the central artifact. It splits cleanly into three parts:

Part	Hashed?	Contents
body	✅	Everything deterministic and reproducible: benchmark spec (with content hash), model spec + params, runner config (seed, sample limit), per-task results (prompt / output / expected / score / passed), aggregate metrics.
envelope	❌	Non-deterministic metadata: wall-clock `created_at`, optional signer label.
signature	—	Ed25519 signature over the content hash, plus the signer's public key.

The body's canonical SHA-256 is both the record id and the Merkle leaf. This split is what lets a record be both signable and reproducible: re-running the body yields the same content hash regardless of when or by whom it was run, while the timestamp and signature live outside the hashed payload. Records are stored as human-readable JSON named <record-id>.json — the filename is itself a verifiable claim about the contents.

Schema version: verievals/runrecord/v1. Full reference: docs/record-format.md.

Cryptographic design

Canonical JSON (crypto/canonical.py). A pragmatic subset of RFC 8785: keys sorted, no insignificant whitespace, UTF-8, NaN/Infinity rejected. This makes the byte serialization of a logical value a pure function of that value, so two parties hashing "the same" record get identical bytes.
Content addressing (crypto/hashing.py). content_hash(x) = sha256(canonical_json(x)), lowercase hex. A stable identifier for any logical object.
Merkle tree (crypto/merkle.py). The RFC 6962 (Certificate Transparency) construction: domain-separated leaf (0x00) and internal-node (0x01) hashing, and the split-at-largest-power-of-two tree shape. This is widely reviewed and avoids the second-preimage weakness of naive "duplicate the last leaf" trees. Inclusion proofs use the RFC 6962 audit-path verification algorithm. Verified by property tests across tree sizes 1–24, plus hand-computed known-answer vectors.
Signatures (crypto/signing.py). Ed25519 via the cryptography library: deterministic signatures, 32-byte public keys. Keys are hex-encoded; private keys are git-ignored and written with 0600 permissions.

The verification protocol

Four independent checks, in increasing order of strength. A verifier runs as many as it has inputs for; the result is "verified" iff every performed check passes.

#	Check	Proves	Inputs needed
1	Integrity	the body has not been altered	the record alone
2	Signature	the result is attributable to a key	the record alone
3	Inclusion	the record is committed under a published root	the ledger + root
4	Reproduction	the outputs/scores follow from the inputs	the benchmark + a deterministic model

$ verievals verify <record>.json --ledger ./ledger --benchmark benchmarks/arithmetic --model echo
Record:       56c7984e…
Integrity:    ok
Signature:    ok
Inclusion:    ok
Reproduction: ok
Overall:      VERIFIED          # exit code 0 (1 if not verified) — CI-gateable

Details: docs/verification.md.

Trust Score leaderboard

A plain leaderboard ranks by accuracy. A trust-scored leaderboard answers the question the project exists for — how much should you believe this number? — by assigning each entry the strongest verification tier that holds for it:

Tier	Score	Earned when
🟢 `reproduced`	1.00	re-running the eval reproduced the exact signed body hash
🔵 `verified`	0.80	valid signature and committed under the ledger root
🟡 `signed`	0.50	valid signature, but not in the ledger
🔴 `self_reported`	0.20	no valid signature — a bare claim

Entries sort by tier first, then accuracy — a verified result with lower accuracy outranks a self-reported one with a higher claimed number, by design. This makes the gap between cryptographically-backed and self-reported results explicit and rankable.

| # | Trust         | Score | Benchmark           | Model              | Accuracy | Tasks | Signer  |
|---|---------------|-------|---------------------|--------------------|----------|-------|---------|
| 1 | 🟢 reproduced | 1.00  | arithmetic@1.0.0    | mock:solver        | 1.000    | 15    | solver  |
| 2 | 🔵 verified   | 0.80  | gsm8k@sample-1.0.0  | ollama:gemma3:1b   | 0.917    | 12    | kaushik |

Details: docs/trust-score.md.

Determinism & the reproducibility model

Everything that can change a result lives in the hashed body: the benchmark tasks (via their content hash), the model identity and decoding params, the scorer id, the seed, and the sample limit.

Deterministic models (echo, fixture) re-run bit-for-bit and earn the reproduced tier directly.
Non-deterministic models (hosted APIs, local GPU inference) are not bit-reproducible across hardware/driver versions; the adapters report deterministic = False. The record still pins exactly what was requested, and the recommended pattern is to capture a run's prompt→output pairs into a FixtureModel, which makes that exact run replayable and verifiable offline by anyone, with no API key, server, or model download.

This capture-to-fixture pattern is what turns "we ran gemma3:1b and got 0.917" into a claim a CI job can re-check (see the GSM8K MVP below).

The logging SDK

For teams that already have an evaluation loop, the SDK provides verifiable provenance without handing control to the runner. You wrap your loop and log each item; on clean exit you get a signed RunRecord that is byte-identical in structure to one the runner would produce for the same data (enforced by a test).

from verievals.sdk import EvalRecorder

with EvalRecorder("gsm8k", "1.0", model_info, signing_key, scorer="gsm8k") as rec:
    for task in dataset:
        output = my_model(task.prompt)
        rec.log(task.id, task.prompt, output, expected=task.answer)
        # or, with your own grader:  rec.log(..., score=0.5, passed=False)

record = rec.record          # signed, ready to append to a ledger

A @logged_eval(...) decorator wraps a logging function and returns the finished record. If the loop raises, the record is not finalized — partial runs never produce a misleading signed artifact. Details: docs/sdk.md.

Models, benchmarks, and scorers

Model adapters resolve from a spec string via verievals.models.load_model:

Spec	Adapter	Notes
`echo`	`EchoModel`	deterministic; extracts the answer from the prompt — for plumbing/repro
`fixture:<path>`	`FixtureModel`	deterministic; replays recorded prompt→output pairs
`ollama[:<tag>]`	`OllamaAdapter`	local open models via Ollama; stdlib-only, no ML deps
`anthropic[:<model>]`	`AnthropicAdapter`	Claude via the Anthropic SDK (default `claude-opus-4-8`)
`openai[:<model>]`	`OpenAIAdapter`	OpenAI Chat Completions — lab-neutral by design

Bundled benchmarks (each a directory of benchmark.yaml + tasks.jsonl):

Benchmark	Scorer	Tasks	Demonstrates
`arithmetic`	`numeric`	15	numeric answer extraction
`capitals`	`exact_match`	10	normalized non-numeric matching
`gsm8k`	`gsm8k`	12	chain-of-thought, final-number extraction (sample subset)

Scorers (deterministic, resolved by id so they are pinned in the record): exact_match (case/whitespace-normalized), regex_match, numeric (first number, tolerance), and gsm8k (final number — the chain-of-thought convention).

GSM8K MVP: a locally-run open model, end to end

The milestone-1 MVP evaluates a small open-source model on grade-school math entirely on your own machine, then makes the run independently re-verifiable.

Model: any Ollama model via --model ollama:<tag>; defaults to gemma3:1b (~815 MB, runs in seconds on Apple Silicon, no download if pulled). Point it at ollama:llama3:8b or ollama:deepseek-r1:7b for stronger models.
Benchmark: benchmarks/gsm8k — a 12-item GSM8K-style sample with integer answers (drop-in replaceable with the official split).
Scorer: gsm8k (final-number extraction), tolerant of full chain-of-thought.

Measured result (Apple M3, 16 GB): gemma3:1b scored 11/12 (0.917) on the sample. The single miss is a genuine reasoning error by the 1B model, not a harness issue. The exact outputs are captured into examples/fixtures/gsm8k_gemma3_1b.json, and a CI test replays them offline to the same 11/12 — so the local-model result is reproducible by anyone with no server and no download.

ollama pull gemma3:1b
verievals run --benchmark benchmarks/gsm8k --model ollama:gemma3:1b \
    --key ./keys/runner.key --out ./records --signer me
python examples/run_gsm8k.py          # offline replay of the captured run

Details: docs/gsm8k-mvp.md.

Command-line interface

Command	Purpose
`verievals keygen --out <prefix>`	generate an Ed25519 keypair (`.key` private 0600, `.pub` public)
`verievals run --benchmark <dir> --model <spec> --key <file> --out <dir>`	run + emit a signed record
`verievals ledger --ledger <dir> append <record>`	verify & append a record to the Merkle log
`verievals ledger --ledger <dir> root`	print the current Merkle root
`verievals verify <record> [--ledger <dir>] [--benchmark <dir> --model <spec>]`	run the verification protocol
`verievals leaderboard --records <dir> --ledger <dir> [--trust]`	build an accuracy or trust-scored board

Full reference: docs/cli.md.

Threat model

Defends against: result tampering (any edit changes the content hash and breaks the signature), benchmark swapping (the tasks are content-hashed and reproduction requires the matching benchmark), cherry-picked configs (decoding params, seed, sampler, scorer are all in the hashed body), ledger back-dating (a record not under the published root fails inclusion), and forged authorship (only the key holder can sign).

Explicit non-goals: key-to-identity binding (we prove a key signed it, not who owns the key — distribute trusted public keys out of band), hosted/GPU model bit-reproducibility (use a captured fixture), benchmark quality (verifiability is not validity), private-key custody, and ledger availability/consensus. Full analysis: docs/threat-model.md.

Project layout

verifiable-evals/
├── src/verievals/              # the package (9 modules, see Architecture)
├── benchmarks/                 # bundled benchmark data (arithmetic, capitals, gsm8k)
├── examples/                   # end-to-end demos + captured fixtures
├── docs/                       # architecture, record-format, verification, sdk,
│                               #   trust-score, gsm8k-mvp, integration, releasing,
│                               #   threat-model, cli
├── tests/                      # mirrors the package; 204 tests, 97% coverage
├── .github/workflows/ci.yml    # lint + mypy + pytest matrix (py3.10–3.12)
├── pyproject.toml  Makefile  CHANGELOG.md  CONTRIBUTING.md  LICENSE

Development

make install     # pip install -e ".[dev]"
make lint        # ruff check + format check
make type        # mypy (disallow_untyped_defs)
make test        # pytest
make cov         # pytest with coverage
make check       # lint + type + test

Standards enforced in CI (Python 3.10/3.11/3.12 matrix): ruff lint + format, mypy with disallow_untyped_defs, and the full test suite. The package ships a PEP 561 py.typed marker.

Publishing to PyPI (so pip install verievals works for everyone) is documented step-by-step in docs/RELEASING.md.

Project status

Stage: working v0.1.0 — the cryptographic core, runner, ledger, verifier, SDK, trust leaderboard, CLI, and the GSM8K local-model MVP are complete and tested.
Tests: 204 passing · coverage: 97% · mypy and ruff clean · wheel builds.
Three foundational milestones delivered:
1. MVP for a single open-source model — gemma3:1b on a GSM8K sample via Ollama, with a fixture for offline reproduction (llama3:8b is one flag away).
2. Open-source logging SDK — EvalRecorder / @logged_eval, byte-identical to the runner's output.
3. Public Trust Score leaderboard — verification-tiered ranking distinguishing cryptographically-backed from self-reported numbers.

Roadmap

Wire the official full GSM8K split (1,319 problems) behind the existing loader.
A transformers/HuggingFace adapter for offline weights without a server.
Publish a hosted, continuously-updated Trust Score board from a public ledger.
Optional L2 anchoring of the Merkle root for external timestamping.
Witness/co-signing and multi-party attestation of records.

License

Apache 2.0 — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jun 12, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

verievals-0.1.0.tar.gz (75.1 kB view details)

Uploaded Jun 12, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

verievals-0.1.0-py3-none-any.whl (56.7 kB view details)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file verievals-0.1.0.tar.gz.

File metadata

Download URL: verievals-0.1.0.tar.gz
Upload date: Jun 12, 2026
Size: 75.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for verievals-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`41cfc1724c81335683774fd1b55356c54a3baf42600e7f80d06a38686b0e0296`
MD5	`b501b177721a5949f86e77e8817832ec`
BLAKE2b-256	`d282412b8846160b2971ed1276cea93cbd21354a02db0bf48681818bcc742250`

See more details on using hashes here.

File details

Details for the file verievals-0.1.0-py3-none-any.whl.

File metadata

Download URL: verievals-0.1.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 56.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for verievals-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`550776f95474b6b1376eebd169da5994c28ec38fb57eaa848e42df45b15564d7`
MD5	`1764440f9c070ae6720b69b31d43dc91`
BLAKE2b-256	`5a4781f63e5a923ac61142c596c8c46b8a70ab7ee9d5a23b785840396f8cf229`

See more details on using hashes here.

verievals 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Verifiable Evals (verievals)

Table of contents

The problem

The idea: the evidence is the artifact

Installation

Quickstart

Integrate your own model & benchmark

Architecture

The RunRecord

Cryptographic design

The verification protocol

Trust Score leaderboard

Determinism & the reproducibility model

The logging SDK

Models, benchmarks, and scorers

GSM8K MVP: a locally-run open model, end to end

Command-line interface

Threat model

Project layout

Development

Project status

Roadmap

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Verifiable Evals (`verievals`)

The `RunRecord`