
Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.


touchstone

A touchstone is the dark stone jewelers rub gold against to read its purity from the streak it leaves — telling true gold from convincing fakes. That is this benchmark's whole job: telling apart models that look identical on paper, by the marks they leave on real work.

A personal eval benchmark for answering one question: for my use cases, which model works best?

Each eval (a case) bundles its own task, its own input source files, its own AI artifacts (skills / commands / plugins / MCP), and its own definition of a correct outcome. A run executes a matrix of cells — one cell per (case × harness × model × trial) — fully isolated and persisted independently, then aggregates everything into a single report.
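As a rough illustration of the matrix described above, the sketch below expands the four axes into cells with a plain Cartesian product. The `Cell` class, the `expand_matrix` helper, and the id scheme are hypothetical names chosen for this example; they are not taken from touchstone's source.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One unit of work: a single (case, harness, model, trial) combination."""
    case: str
    harness: str
    model: str
    trial: int

    @property
    def cell_id(self) -> str:
        # Hypothetical id scheme, shown only to make the point that each
        # cell can be addressed (and persisted) independently of the others.
        return f"{self.case}__{self.harness}__{self.model}__t{self.trial}"


def expand_matrix(cases, harnesses, models, trials):
    """Return one Cell per (case x harness x model x trial)."""
    return [
        Cell(case, harness, model, trial)
        for case, harness, model, trial in product(
            cases, harnesses, models, range(1, trials + 1)
        )
    ]


cells = expand_matrix(["my-case"], ["claude-code"], ["opus", "sonnet", "haiku"], trials=3)
print(len(cells))  # 1 case x 1 harness x 3 models x 3 trials = 9
```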

Core model

Case (one eval)            Matrix axes            Cell (unit of work + persistence)
  task / prompt      ×   harnesses[]        =     sandbox + transcript + output
  source/ files          models[]                 + grader scores + metrics + status
  artifacts/             trials (k)
  graders[]
  • Harness — the swappable thing that turns a task into an output, behind one interface (harness/base.py). echo (fake) and claude-code are output-only. For rich runs (a Trace of tool calls / tokens / cost) there are two paths: claude-code-stream drives Claude natively over --output-format stream-json (no ACP, no Node; Tracing-only, autonomous via skip-permissions — see docs/adr/0006), and the ACP adapter drives any Agent Client Protocol agent (droid, gemini, codex, claude-acp, devin-cli) with full observation and bidirectional interaction. ACP is one rich path, not the only one — the Trace is the contract.
  • Graders: command (run tests/build), files (expected files / grep patterns), model_judge (LLM-as-judge), and trace (assert over observed tool usage / token & cost budgets). All run; their results are combined per the case's expect.pass_threshold.
  • Observation & interaction (opt-in per case via the observe: key): capture a normalized Trace (tool calls, tokens, cost, permission events) and answer the agent's mid-run requests with an Interaction Policy (auto-approve / auto-deny / scripted / llm-based / manual). See CONTEXT.md + docs/adr/.
  • Resumability & parallelism — each cell's result.json is the source of truth (the manifest is a derived index), so cells run in parallel (--workers) without contention and run --resume <id> continues after a crash.
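The resumability idea in the last bullet can be sketched as follows: if each cell's result.json is the source of truth, a resume only needs to scan the cell directories and re-run whatever has no readable result. This is a minimal sketch under assumptions. The one-directory-per-cell layout, the helper names, and the `status` field are illustrative, not touchstone's actual on-disk format.

```python
import json
import tempfile
from pathlib import Path


def completed_cells(cells_dir: Path) -> set[str]:
    """Ids of cells whose result.json exists and parses as JSON."""
    done = set()
    for result in cells_dir.glob("*/result.json"):
        try:
            json.loads(result.read_text())
        except json.JSONDecodeError:
            continue  # half-written file left by a crash: treat as not done
        done.add(result.parent.name)
    return done


def pending_cells(all_ids: list[str], cells_dir: Path) -> list[str]:
    """Cells that still need to run, in their original order."""
    done = completed_cells(cells_dir)
    return [cell_id for cell_id in all_ids if cell_id not in done]


with tempfile.TemporaryDirectory() as tmp:
    cells_dir = Path(tmp)
    (cells_dir / "a").mkdir()
    (cells_dir / "a" / "result.json").write_text('{"status": "passed"}')
    (cells_dir / "b").mkdir()
    (cells_dir / "b" / "result.json").write_text('{"status": ')  # truncated write
    remaining = pending_cells(["a", "b", "c"], cells_dir)

print(remaining)  # ['b', 'c']
```

Because no shared file is written while cells run, parallel workers in this scheme never contend; an index like the manifest can be rebuilt from the same scan.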

Install

The published package is touchstone-eval; the command it installs is touchstone (the bare touchstone name on PyPI belongs to an unrelated, abandoned project).

uvx touchstone-eval --help          # run without installing (recommended)
pipx install touchstone-eval        # or install as an isolated tool
pip install touchstone-eval         # or into the current environment

Add the optional extras when you need them: touchstone-eval[judge] (Anthropic SDK for model_judge), [langfuse] (export), [dev] (pytest).

For local development from a checkout:

pip install -e ".[judge,dev]"   # judge = Anthropic SDK for model_judge; dev = pytest

Usage

touchstone validate                 # schema-check every evals/<case>/case.yaml
touchstone list                     # list cases and past runs
touchstone run                      # run the whole evals/ suite
touchstone run --eval example-case --harness echo --trials 2
touchstone run --harness droid --with-model A --with-model B  # compare models, same harness
touchstone run --workers 4          # run cells in parallel
touchstone run --resume <run_id>    # continue an interrupted run
touchstone report <run_id>          # (re)generate runs/<run_id>/report.md
touchstone export <run_id> [--push] # write runs/<id>/langfuse.json (and optionally push)

Comparing models on the same harness

The matrix is what answers "which model for my use cases?": distinct models become distinct cells, and the report ranks them in a per-case matrix plus a leaderboard (score, cost, time, tools, tokens). A case can declare the models inline (matrix.models / matrix.entries[].models), or you can hold a harness fixed and push models through it at run time without editing the case:

# Run these models on droid even if the cases declared only one — they replace the
# case's models for that harness. Each becomes its own row in the comparison.
touchstone run --harness droid \
  --with-model custom:glm-5.1:cloud-0 --with-model custom:glm-4.6:cloud

--with-model replaces the declared models (so you can introduce new ones); --model only filters the models a case already declares. Models are agent-specific opaque strings, so prefix HARNESS= (--with-model droid=A) to scope an override to one harness when a run spans several.
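The difference between the two flags can be sketched as a small resolution function. The function name and argument shapes are hypothetical, and the precedence when both flags are given (overrides win, and the filter applies only to declared models) is an assumption made for this example rather than documented behavior.

```python
def resolve_models(declared, harness, with_model=(), model_filter=()):
    """Pick the models to run for one harness.

    with_model entries are either "MODEL" (applies to every harness) or
    "HARNESS=MODEL" (applies to that harness only).
    """
    overrides = []
    for entry in with_model:
        scope, sep, name = entry.partition("=")
        if not sep:
            overrides.append(entry)   # unscoped: applies to every harness
        elif scope == harness:
            overrides.append(name)    # scoped to this harness
    if overrides:
        return overrides              # --with-model replaces declared models
    if model_filter:
        # --model can only narrow what the case already declares
        return [m for m in declared if m in model_filter]
    return list(declared)


print(resolve_models(["opus"], "droid", with_model=["A", "B"]))
print(resolve_models(["opus"], "gemini", with_model=["droid=A"]))
print(resolve_models(["opus", "sonnet"], "claude-code", model_filter=["sonnet"]))
```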

ACP agents are configured in acp_agents.yaml (see acp_agents.yaml.example); the built-in profiles (droid, gemini, codex, claude-acp, devin-cli) work out of the box once the agent's CLI is on PATH. evals/observed-droid/ is a worked example of a fully observed, interactive, multi-turn case.

Real harnesses (e.g. claude-code) cost money and require their CLI on PATH. The built-in echo harness runs the full loop with no network/API spend — use it for testing the framework itself.

Defining a case

See evals/example-case/case.yaml for a worked example. Schema:

id: my-case
description: ...
task:
  prompt: |
    What the model/agent must accomplish.
source:                  # optional; copied fresh into every cell sandbox
  path: ./source         # ...or  {repo: owner/name, commit: <sha>}  (pinned clone)
  # repo form may add `subdir: <dir>` to use just one sub-directory of the clone as the
  # sandbox — lets one fixtures repo hold many cases (see "Source fixtures repo" below).
artifacts:               # optional AI artifacts injected into the harness
  skills:   [./artifacts/skills/foo]
  commands: [./artifacts/commands/bar.md]
  mcp:      ./artifacts/.mcp.json
environment:             # optional per-cell dependency setup (the "broader sandbox")
  kind: pip-venv               # pip-venv (default) | uv | command  — how deps are provisioned
  requirements: [markupsafe]   # (pip-venv/uv) installed into an isolated venv per cell
  install: editable            # (pip-venv/uv) `pip install -e .` (src-layout pkg + its deps)
  # kind: command → run shell installs for project-local ecosystems, e.g.
  #   commands: ["npm ci"]      # node_modules / target/ etc. live in the sandbox
setup:                   # optional; introduce the task state after clone, before the agent
  stub: [{file: pkg/mod.py, function: target}]   # blank a fn body -> NotImplementedError
  run:  ["rm -rf .git"]                           # shell commands in the sandbox
matrix:
  harnesses: [claude-code]
  models:    [opus, sonnet, haiku]
  trials:    3
graders:
  - {type: command, cmd: "pytest -q", weight: 1.0}
  - {type: files, patterns: ["retry", "backoff"]}
  - {type: model_judge, rubric: ./graders/rubric.md, model: opus, pass_threshold: 0.8}
expect:
  pass_threshold: 1.0

Source fixtures repo

A case's bulky, hand-written assets (synthetic codebases to debug and the hidden/ oracle test suites) live outside this repo, in a separate fixtures repo (krimvp/touchstone-eval-fixtures), so they don't pollute the runner/eval tree. The eval repo keeps only the contract (task, graders, expectations); the fixtures repo holds the code. Each case has one directory there, split by visibility:

<case-id>/
  source/   # agent-VISIBLE input  → promoted to the sandbox before the agent runs
  hidden/   # grader ORACLE         → injected at grade time only; the agent never sees it

A case wires the two halves with two independent pins (both default-pinned by commit):

source:    {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
fixtures:  {repo: krimvp/touchstone-eval-fixtures, commit: <sha>}   # subdir defaults to <case-id>
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"]}   # resolved under <case-id>/hidden/
  • source clones the repo, checks out the commit, and promotes <case-id>/source/ into the sandbox (no .git, like copy). SWE-bench-style cases point source at the real upstream repo instead, so they have only a hidden/ in the fixtures repo (no source/).
  • fixtures names the repo that graders resolve inject: paths against — Case.asset() pulls each hidden file from a host-cached clone (src/touchstone/fixtures.py), at grade time, after the agent has stopped. Because source/ and hidden/ are sibling directories and only source/ is promoted, the oracle can never leak into the agent's sandbox.
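The leak-prevention argument above rests on copying only the source/ sibling into the sandbox. A minimal sketch of that promotion step, assuming a local clone is already on disk, might look like this; `promote_source` is a hypothetical helper, not the function touchstone uses.

```python
import shutil
import tempfile
from pathlib import Path


def promote_source(clone: Path, case_id: str, sandbox: Path) -> None:
    """Copy only <case-id>/source/ into the sandbox, leaving out any .git directory."""
    shutil.copytree(
        clone / case_id / "source",
        sandbox,
        ignore=shutil.ignore_patterns(".git"),
        dirs_exist_ok=True,
    )


with tempfile.TemporaryDirectory() as tmp:
    clone, sandbox = Path(tmp, "clone"), Path(tmp, "sandbox")
    (clone / "my-case" / "source").mkdir(parents=True)
    (clone / "my-case" / "hidden").mkdir(parents=True)
    (clone / "my-case" / "source" / "app.py").write_text("def target(): ...\n")
    (clone / "my-case" / "hidden" / "test_x.py").write_text("def test_x(): ...\n")

    promote_source(clone, "my-case", sandbox)
    visible = sorted(p.name for p in sandbox.rglob("*"))

print(visible)  # ['app.py']
```

Since hidden/ is never under the copied root, nothing in the promotion step has to know which files are secret.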

Keep the fixtures repo private for the anti-memorization cases. evals/example-case/ stays local (source: path) as the offline worked example / integration fixture.

Real-repo (SWE-bench-style) cases

A case can pin a real GitHub repo at a commit (source: {repo, commit}), setup.stub a function to blank its body, and inject hidden tests (oracle = the real function) only at grade time — so the agent reimplements real library code and the pytest grader scores the fraction of FAIL→PASS tests. See evals/repo-*-droid/.
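As an illustration of partial-credit scoring, the sketch below computes the fraction of tests that failed before the agent ran and pass afterwards. The function and the dict-of-booleans input are assumptions for this example; how the pytest grader actually collects per-test outcomes is not described here.

```python
def fail_to_pass_score(before: dict[str, bool], after: dict[str, bool]) -> float:
    """Fraction of tests that failed before the agent ran and pass afterwards."""
    failing_before = [name for name, passed in before.items() if not passed]
    if not failing_before:
        return 0.0  # nothing to fix, so no credit to award
    fixed = sum(1 for name in failing_before if after.get(name, False))
    return fixed / len(failing_before)


# With the function body stubbed out, three hidden tests fail; the agent fixes two.
before = {"test_a": False, "test_b": False, "test_c": False, "test_d": True}
after = {"test_a": True, "test_b": True, "test_c": False, "test_d": True}
score = fail_to_pass_score(before, after)
print(round(score, 2))  # 0.67
```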

When a repo needs third-party dependencies or isn't importable from its root (a src/ layout), declare an environment: each cell gets its own throwaway virtualenv, into which requirements are pip-installed and — with install: editable — the repo itself (pip install -e ., which resolves a src-layout package and pulls its deps). Every subprocess the cell spawns (harness, setup, and the command/pytest graders) runs under that venv via an explicit env, so dependency-bearing cases stay reproducible and parallel-safe (no shared site-packages). Worked examples: repo-smarttruncate-droid (a requirements dep) and repo-securefilename-droid (install: editable, src-layout).
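Running "under that venv via an explicit env" usually means building an environment dict that puts the venv's bin directory first on PATH, instead of relying on shell activation. The sketch below shows one common way to do that; `venv_env` is a hypothetical helper and the exact variables touchstone sets are not confirmed by this document.

```python
import os
from pathlib import Path


def venv_env(venv_dir: Path, base_env=None) -> dict[str, str]:
    """Environment for subprocesses so they resolve python/pip from venv_dir."""
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = venv_dir / ("Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = str(venv_dir)
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    env.pop("PYTHONHOME", None)  # a stray PYTHONHOME would defeat the venv
    return env


env = venv_env(Path("/tmp/cell-1/.venv"), {"PATH": "/usr/bin", "PYTHONHOME": "/opt/py"})
print(env["PATH"])
```

A grader or setup step would then be launched with something like `subprocess.run(["pytest", "-q"], cwd=sandbox, env=env)`, so two cells running in parallel never share a site-packages directory.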

Non-Python projects

Cases aren't Python-specific. The command, files, model_judge, and trace graders are language-agnostic, and the tests grader gives the same partial-credit scoring as pytest for any runner whose results it can read. Two substrates, XML primary with a console fallback:

  • JUnit XML (junit_xml: <glob>): the universal report format every framework/build tool can emit (Maven Surefire, Gradle, pytest --junitxml, vitest/jest/mocha reporters, go-junit-report, cargo2junit). Deterministic, exact per-test counts, framework-agnostic.
  • Console summary (_parse_counts): scraped when no XML report is produced. Covers pytest/unittest, node --test/TAP, Maven Surefire, go test -v (--- PASS:/--- FAIL:), and cargo test (test result: … N passed; M failed).
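To show why the XML substrate gives exact per-test counts, here is a minimal sketch that tallies testcase elements from a JUnit report with the standard library. The `junit_counts` helper is hypothetical, and counting skipped tests as not passed is an assumption for this example, not a statement about how the tests grader treats them.

```python
import xml.etree.ElementTree as ET


def junit_counts(xml_text: str) -> tuple[int, int]:
    """Return (passed, total) across every <testcase> in a JUnit XML report."""
    root = ET.fromstring(xml_text.strip())
    passed = total = 0
    # iter() walks the whole tree, so both <testsuite> and <testsuites> roots work.
    for case in root.iter("testcase"):
        total += 1
        # Passed means: no failure, error, or skipped child element.
        if not any(case.find(tag) is not None for tag in ("failure", "error", "skipped")):
            passed += 1
    return passed, total


REPORT = """
<testsuite name="demo" tests="3" failures="1">
  <testcase classname="t" name="a"/>
  <testcase classname="t" name="b"><failure message="boom"/></testcase>
  <testcase classname="t" name="c"/>
</testsuite>
"""

passed, total = junit_counts(REPORT)
print(passed, total)  # 2 3
```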

A tests grader with gate: true is a validity gate (never adds credit; disqualifies the cell to 0 on failure) — use it to mirror SWE-bench's PASS_TO_PASS regression gate in any language. inject takes either a bare filename (dropped at the sandbox root) or {src, dest} to place a hidden test at a runner-specific path (e.g. Maven's src/test/java/...). Use setup.run to blank the function (the AST-based setup.stub is Python-only); the implemented gate works on any language when pointed at explicit files. Worked examples: repo-js-wordwrap-droid (CommonJS, node --test), repo-java-camelcase-droid (Maven, Surefire), and the repo-swebench-* battery — real recent GitHub issues across Python, Go (go test), Java (Surefire + JUnit XML), JS/TS (mocha/ava/TAP), and Rust (cargo test).
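The gate semantics (never adds credit, zeroes the cell on failure) can be sketched on top of a weighted combination of the scoring graders. Both the weighted mean and the rule that a gate fails whenever its score is below 1.0 are assumptions made for illustration; `GraderResult` and `cell_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    score: float          # in [0, 1]
    weight: float = 1.0
    gate: bool = False    # True: validity gate, contributes no credit


def cell_score(results: list[GraderResult]) -> float:
    """Weighted mean of non-gate graders, or 0.0 if any gate failed."""
    if any(r.gate and r.score < 1.0 for r in results):
        return 0.0  # a failed gate disqualifies the whole cell
    scored = [r for r in results if not r.gate]
    total_weight = sum(r.weight for r in scored)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in scored) / total_weight


ok = cell_score([GraderResult(0.5, weight=4.0), GraderResult(1.0, gate=True)])
broken = cell_score([GraderResult(1.0, weight=4.0), GraderResult(0.9, gate=True)])
print(ok, broken)  # 0.5 0.0
```

Note that a passing gate leaves the score at 0.5 rather than lifting it, which mirrors the "never adds credit" property.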

Dependencies aren't special — how they're isolated is. Real projects have dependencies; the question is only whether installing them safely needs the environment venv. It depends on where the ecosystem puts deps:

Ecosystem        | Where deps go                                      | Isolation                  | How to declare
Python           | shared site-packages (mutable)                     | needs the per-cell venv    | environment: kind: pip-venv (or uv) + requirements / install: editable
Node / Rust / Go | project-local (node_modules, target/, build cache) | per-cell for free          | environment: kind: command + commands: ["npm ci"] etc.
Java / Maven     | shared ~/.m2 (versioned, immutable artifacts)      | safe to share across cells | resolved by the build (mvn test)

The environment.kind is the one declarative knob (mirroring the Sandbox's Isolation Mode): pip-venv and uv build an isolated venv and install into it; command runs your install commands for ecosystems whose deps are project-local.

OS-level isolation + OS packages (containers)

For cases that need OS packages or a pinned, reproducible build/grade environment, declare a container: provisioning, setup.run, and the command/tests/pytest graders then run inside it (via docker exec), with the cell bind-mounted at its same path.

container:
  image: python:3.12-slim          # pin by digest (…@sha256:…) for full reproducibility
  setup: ["apt-get update -qq", "apt-get install -y -qq libxml2"]   # OS packages, once at start
  caches: [".cache/pip"]           # share the host's cache so cells don't re-download deps
environment:
  kind: pip-venv                   # the venv is now built *inside* the container
  requirements: [lxml, pytest]
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"], weight: 4.0}     # runs in the container

caches mounts a home-relative dir (e.g. .cache/pip, .m2) shared with the host and across cells, so a fresh container per cell reuses already-downloaded dependencies instead of re-fetching them — the same shared-cache benefit the host's ~/.m2 gives today. The suite uses this on its dependency-bearing cases: repo-js-wordwrap (node:20-slim, zero-dep), repo-smarttruncate / repo-securefilename (python:3.12-slim + pip cache), and repo-java-camelcase (maven:3.9-eclipse-temurin-21 + shared ~/.m2).

Every provisioner and grader runs through the Cell's Executor: LocalExecutor (host subprocess) by default, ContainerExecutor when a container is declared. The same recipe therefore runs under either backend (the container backend needs the docker daemon running). The Harness (the agent under test) still runs on the host against the bind-mounted Sandbox; running the agent itself in-container is future work. See docs/adr/0005.
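The two-backend idea can be pictured as two classes that turn the same command string into different argv lists. The class names come from the text above, but the `wrap` method, its return shape, the `pick_executor` helper, and the container name are hypothetical; only the `docker exec --workdir` invocation is a real CLI form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalExecutor:
    """Runs the command as a host subprocess in the cell directory."""

    def wrap(self, command: str, cell_dir: str) -> tuple[list[str], Optional[str]]:
        # The caller passes cell_dir as cwd to subprocess.run.
        return ["sh", "-c", command], cell_dir


@dataclass
class ContainerExecutor:
    """Runs the same command inside an already-started container."""
    container: str

    def wrap(self, command: str, cell_dir: str) -> tuple[list[str], Optional[str]]:
        # The cell is bind-mounted at the same path, so cell_dir is valid inside too.
        argv = ["docker", "exec", "--workdir", cell_dir, self.container, "sh", "-c", command]
        return argv, None


def pick_executor(case: dict, container_name: str = "cell-container"):
    """ContainerExecutor when the case declares a container, else LocalExecutor."""
    return ContainerExecutor(container_name) if "container" in case else LocalExecutor()


local_argv, local_cwd = pick_executor({}).wrap("pytest -q", "/work/cell")
boxed_argv, boxed_cwd = pick_executor({"container": {"image": "python:3.12-slim"}}, "c1").wrap(
    "pytest -q", "/work/cell"
)
print(local_argv)
print(boxed_argv)
```

Either result can be handed to `subprocess.run(argv, cwd=cwd)`, so provisioners and graders do not need to know which backend is active.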

So the earlier zero-dep examples were picked to keep the demo offline, not because deps are rare. repo-java-camelcase-droid is a genuinely dependency-bearing non-Python case: commons-text's source needs commons-lang3, which Maven resolves from Maven Central.

Bring your own private repos (reachability & fallback)

touchstone is an engine + a public sample battery. The verdict you can actually trust for "which model is best for me" comes from your own tasks, so the design is built to pull case material from external git repos you own — both the agent-visible source: {repo, commit} and the hidden oracle in fixtures: {repo, commit} — some of them private. Auth is just your normal git credentials (SSH agent / gh / a credential helper); nothing extra to configure.

Because a given host may not have access to every referenced repo (a teammate's private fixtures, a CI box without keys, an offline laptop), a run probes each case's external repos before doing any work (git ls-remote, cached per URL) and applies a policy:

touchstone run                                  # default: FAIL FAST if any required repo is unreachable
touchstone run --on-unavailable skip            # degrade: skip unreachable cases, run the rest
touchstone validate --check-access              # preflight only: report what a run would skip/fail on
  • Fail by default. A missing repo on a host you expected to be complete is a loud, early error — never a silently smaller benchmark (which would corrupt cross-model comparisons).
  • --on-unavailable skip degrades the unreachable cases to a skipped status: excluded from every score and the leaderboard, surfaced in a "Skipped (unavailable)" report section, and not counted as failures. Resume re-probes, so a transient outage is retried.
  • Per-case availability: optional marks a case that may reference a repo you might not have — it degrades to skipped even under the default fail mode.
  • Only access failures (no auth / no network / not found) are degradable; a bad commit or schema error is a defect and still fails loudly.
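The policy bullets above can be sketched as a probe plus a planning function. `git ls-remote` and `GIT_TERMINAL_PROMPT` are real git features; everything else (the function names, the shape of the `cases` dict, the 30-second timeout, raising `RuntimeError`) is a hypothetical arrangement for this example. The sketch does not distinguish access failures from bad commits, which the real policy treats differently.

```python
import os
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def git_reachable(url: str) -> bool:
    """Probe a repo with `git ls-remote`; cached so each URL is checked once."""
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")  # never block on a password prompt
    try:
        proc = subprocess.run(
            ["git", "ls-remote", url], env=env, capture_output=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def plan(cases: dict, on_unavailable: str = "fail", reachable=git_reachable):
    """Split cases into (to_run, skipped), or raise if a required repo is missing."""
    to_run, skipped = [], []
    for case_id, spec in cases.items():
        missing = [url for url in spec["repos"] if not reachable(url)]
        if not missing:
            to_run.append(case_id)
        elif on_unavailable == "skip" or spec.get("availability") == "optional":
            skipped.append(case_id)  # excluded from scores, not counted as a failure
        else:
            raise RuntimeError(f"{case_id}: unreachable repo(s): {missing}")
    return to_run, skipped


# Demo with a fake probe so nothing touches the network.
fake_probe = lambda url: url != "owner/private-fixtures"
cases = {
    "public-case": {"repos": ["owner/public"]},
    "private-case": {"repos": ["owner/private-fixtures"]},
}
result = plan(cases, on_unavailable="skip", reachable=fake_probe)
print(result)  # (['public-case'], ['private-case'])
```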

A fork can repoint the default hidden-fixtures repo to its own private one without editing every case by setting TOUCHSTONE_FIXTURES_REPO=owner/my-fixtures. Your fully-private held-out suite lives in evals-private/ (gitignored) and runs with --evals-dir evals-private — see its README. Design: docs/adr/0008-reachability-and-availability-policy.md.

Layout

evals/<case>/        the benchmark suite (one dir per case)
src/touchstone/     the framework (config, harness/, grader/, runner, report, cli)
runs/<run_id>/       results (gitignored): manifest.json + cells/ + report.md
