Skip to main content

Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.

Project description

touchstone

touchstone

A touchstone is the dark stone jewelers rub gold against to read its purity from the streak it leaves — telling true gold from convincing fakes. That is this benchmark's whole job: telling apart models that look identical on paper, by the marks they leave on real work.

A personal eval benchmark for answering one question: for my usecases, which model works best?

📖 Documentation: https://krimvp.github.io/touchstone/ — quickstart, the any-provider OpenAI-compatible harness, case authoring, graders, observation & interaction, and the architecture/ADRs.

Each eval (a case) bundles its own task, its own input source files, its own AI artifacts (skills / commands / plugins / MCP), and its own definition of a correct outcome. A run executes a matrix of cells — one cell per (case × harness × model × trial) — fully isolated and persisted independently, then aggregates everything into a single report.

Core model

Case (one eval)            Matrix axes            Cell (unit of work + persistence)
  task / prompt      ×   harnesses[]        =     sandbox + transcript + output
  source/ files          models[]                 + grader scores + metrics + status
  artifacts/             trials (k)
  graders[]
  • Harness — the swappable thing that turns a task into an output, behind one interface (harness/base.py). echo (fake) and claude-code are output-only. For rich runs (a Trace of tool calls / tokens / cost) there are three paths: claude-code-stream drives Claude natively over --output-format stream-json (no ACP, no Node; Tracing-only, autonomous via skip-permissions — see docs/adr/0006); the ACP adapter aims to drive any Agent Client Protocol agent (droid, gemini, codex, claude-acp, devin-cli) through one contract with observation and bidirectional interaction — today partial and uneven across providers (not every capability is reachable from every agent), kept as the future unifier (ADR 0010); and the generic openai adapter runs an in-process agentic loop against any OpenAI-compatible endpoint — no vendor CLI, no Claude, just OPENAI_BASE_URL + model (Ollama, vLLM, LM Studio, a LiteLLM proxy, OpenRouter, real OpenAI, …; Tracing and Interaction capable — it mediates every tool call through the Case's policy; see docs/adr/0009). ACP is one rich path, not the only one — the Trace is the contract.
  • Graderscommand (run tests/build), files (expected files / grep patterns), model_judge (LLM-as-judge), and trace (assert over observed tool usage / token & cost budgets). All run; combined per the case's expect.pass_threshold.
  • Observation & interaction (opt-in per case via observe:) — capture a normalized Trace (tool calls, tokens, cost, permission events) and answer the agent's mid-run requests with an Interaction Policy (auto-approve/auto-deny/scripted/ llm-based/manual). See CONTEXT.md + docs/adr/.
  • Resumability & parallelism — each cell's result.json is the source of truth (the manifest is a derived index), so cells run in parallel (--workers) without contention and run --resume <id> continues after a crash.

Install

The published package is touchstone-eval; the command it installs is touchstone (the bare touchstone name on PyPI belongs to an unrelated, abandoned project).

uvx touchstone-eval --help          # run without installing (recommended)
pipx install touchstone-eval        # or install as an isolated tool
pip install touchstone-eval         # or into the current environment

Add the optional extras when you need them: touchstone-eval[judge] (Anthropic SDK for model_judge), [langfuse] (export), [dev] (pytest).

For local development from a checkout:

pip install -e ".[judge,dev]"   # judge = Anthropic SDK for model_judge; dev = pytest

Usage

touchstone validate                 # schema-check every evals/<case>/case.yaml
touchstone list                     # list cases and past runs
touchstone run                      # run the whole evals/ suite
touchstone run --eval example-case --harness echo --trials 2
touchstone run --harness droid --with-model A --with-model B  # compare models, same harness
touchstone run --workers 4          # run cells in parallel
touchstone run --resume <run_id>    # continue an interrupted run
touchstone report <run_id>          # (re)generate runs/<run_id>/report.md
touchstone export <run_id> [--push] # write runs/<id>/langfuse.json (and optionally push)

Comparing models on the same harness

The matrix is what answers "which model for my usecases?" — distinct models become distinct cells, and the report ranks them in a per-case matrix + a leaderboard (score, cost, time, tools, tokens). A case can declare the models inline (matrix.models / matrix.entries[].models), or you can hold a harness fixed and push models through it at run time without editing the case:

# Run these models on droid even if the cases declared only one — they replace the
# case's models for that harness. Each becomes its own row in the comparison.
touchstone run --harness droid \
  --with-model custom:glm-5.1:cloud-0 --with-model custom:glm-4.6:cloud

--with-model replaces the declared models (so you can introduce new ones); --model only filters the models a case already declares. Models are agent-specific opaque strings, so prefix HARNESS= (--with-model droid=A) to scope an override to one harness when a run spans several.

ACP agents are configured in acp_agents.yaml (see acp_agents.yaml.example); the built-in profiles (droid, gemini, codex, claude-acp, devin-cli) work out of the box once the agent's CLI is on PATH. evals/observed-droid/ is a worked example of a fully observed, interactive, multi-turn case.

To run without any vendor CLI — any model from any OpenAI-compatible endpoint — use the generic openai adapter (pip install touchstone-eval[openai]). It's not provider-specific: point it with the OpenAI SDK's standard env vars (OPENAI_BASE_URL + OPENAI_API_KEY, the convention Ollama/vLLM/LM Studio/LiteLLM/OpenRouter all share) and the model comes from the matrix. Swap the endpoint and the same harness runs anywhere — you're never tied to one provider:

export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama; swap for any other endpoint
# export OPENAI_API_KEY=sk-...                      # omit for keyless local servers
touchstone run --harness openai --with-model llama3.1

That's the whole setup. openai_agents.yaml is optional — use it only to pin named endpoints (so a single run can compare several at once) or attach a price table; see openai_agents.yaml.example.

Real harnesses (e.g. claude-code) cost money and require their CLI on PATH. The built-in echo harness runs the full loop with no network/API spend — use it for testing the framework itself.

Defining a case

See evals/example-case/case.yaml for a worked example. Schema:

id: my-case
description: ...
task:
  prompt: |
    What the model/agent must accomplish.
source:                  # optional; copied fresh into every cell sandbox
  path: ./source         # ...or  {repo: owner/name, commit: <sha>}  (pinned clone)
  # repo form may add `subdir: <dir>` to use just one sub-directory of the clone as the
  # sandbox — lets one fixtures repo hold many cases (see "Source fixtures repo" below).
artifacts:               # optional AI artifacts injected into the harness
  skills:   [./artifacts/skills/foo]
  commands: [./artifacts/commands/bar.md]
  mcp:      ./artifacts/.mcp.json
environment:             # optional per-cell dependency setup (the "broader sandbox")
  kind: pip-venv               # pip-venv (default) | uv | command  — how deps are provisioned
  requirements: [markupsafe]   # (pip-venv/uv) installed into an isolated venv per cell
  install: editable            # (pip-venv/uv) `pip install -e .` (src-layout pkg + its deps)
  # kind: command → run shell installs for project-local ecosystems, e.g.
  #   commands: ["npm ci"]      # node_modules / target/ etc. live in the sandbox
setup:                   # optional; introduce the task state after clone, before the agent
  stub: [{file: pkg/mod.py, function: target}]   # blank a fn body -> NotImplementedError
  run:  ["rm -rf .git"]                           # shell commands in the sandbox
matrix:
  harnesses: [claude-code]
  models:    [opus, sonnet, haiku]
  trials:    3
graders:
  - {type: command, cmd: "pytest -q", weight: 1.0}
  - {type: files, patterns: ["retry", "backoff"]}
  - {type: model_judge, rubric: ./graders/rubric.md, model: opus, pass_threshold: 0.8}
expect:
  pass_threshold: 1.0

Source fixtures repo

A case's bulky, hand-written assets — synthetic codebases to debug and the hidden/ oracle test suites — live out of this repo, in a separate fixtures repo (krimvp/touchstone-eval-fixtures), so they don't pollute the runner/eval tree. The eval repo keeps only the contract (task, graders, expectations); the fixtures repo holds the code. Each case has one directory there, split by visibility:

<case-id>/
  source/   # agent-VISIBLE input  → promoted to the sandbox before the agent runs
  hidden/   # grader ORACLE         → injected at grade time only; the agent never sees it

A case wires the two halves with two independent pins (both default-pinned by commit):

source:    {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
fixtures:  {repo: krimvp/touchstone-eval-fixtures, commit: <sha>}   # subdir defaults to <case-id>
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"]}   # resolved under <case-id>/hidden/
  • source clones the repo, checks out the commit, and promotes <case-id>/source/ into the sandbox (no .git, like copy). SWE-bench-style cases point source at the real upstream repo instead, so they have only a hidden/ in the fixtures repo (no source/).
  • fixtures names the repo that graders resolve inject: paths against — Case.asset() pulls each hidden file from a host-cached clone (src/touchstone/fixtures.py), at grade time, after the agent has stopped. Because source/ and hidden/ are sibling directories and only source/ is promoted, the oracle can never leak into the agent's sandbox.

Keep the fixtures repo private for the anti-memorization cases. evals/example-case/ stays local (source: path) as the offline worked example / integration fixture.

Real-repo (SWE-bench-style) cases

A case can pin a real GitHub repo at a commit (source: {repo, commit}), setup.stub a function to blank its body, and inject hidden tests (oracle = the real function) only at grade time — so the agent reimplements real library code and the pytest grader scores the fraction of FAIL→PASS tests. See evals/repo-*-droid/.

When a repo needs third-party dependencies or isn't importable from its root (a src/ layout), declare an environment: each cell gets its own throwaway virtualenv, into which requirements are pip-installed and — with install: editable — the repo itself (pip install -e ., which resolves a src-layout package and pulls its deps). Every subprocess the cell spawns (harness, setup, and the command/pytest graders) runs under that venv via an explicit env, so dependency-bearing cases stay reproducible and parallel-safe (no shared site-packages). Worked examples: repo-smarttruncate-droid (a requirements dep) and repo-securefilename-droid (install: editable, src-layout).

Non-Python projects

Cases aren't Python-specific. The command, files, model_judge, and trace graders are language-agnostic, and the tests grader gives the same partial-credit scoring as pytest for any runner whose results it can read. Two substrates, XML primary with a console fallback:

  • JUnit XML (junit_xml: <glob>) — the universal report format every framework/build tool can emit (Maven Surefire, Gradle, pytest --junitxml, vitest/jest/mocha reporters, go-junit-report, cargo2junit). Deterministic, exact per-test counts, framework-agnostic.
  • Console summary (_parse_counts) — scraped when no XML report is produced: pytest/ unittest, node --test/TAP, Maven Surefire, go test -v (--- PASS:/--- FAIL:), and cargo test (test result: … N passed; M failed).

A tests grader with gate: true is a validity gate (never adds credit; disqualifies the cell to 0 on failure) — use it to mirror SWE-bench's PASS_TO_PASS regression gate in any language. inject takes either a bare filename (dropped at the sandbox root) or {src, dest} to place a hidden test at a runner-specific path (e.g. Maven's src/test/java/...). Use setup.run to blank the function (the AST-based setup.stub is Python-only); the implemented gate works on any language when pointed at explicit files. Worked examples: repo-js-wordwrap-droid (CommonJS, node --test), repo-java-camelcase-droid (Maven, Surefire), and the repo-swebench-* battery — real recent GitHub issues across Python, Go (go test), Java (Surefire + JUnit XML), JS/TS (mocha/ava/TAP), and Rust (cargo test).

Dependencies aren't special — how they're isolated is. Real projects have dependencies; the question is only whether installing them safely needs the environment venv. It depends on where the ecosystem puts deps:

Ecosystem Where deps go Isolation How to declare
Python shared site-packages (mutable) needs the per-cell venv environment: kind: pip-venv (or uv) + requirements / install: editable
Node / Rust / Go project-local (node_modules, target/, build cache) per-cell for free environment: kind: command + commands: ["npm ci"] etc.
Java / Maven shared ~/.m2 (versioned, immutable artifacts) safe to share across cells resolved by the build (mvn test)

The environment.kind is the one declarative knob (mirroring the Sandbox's Isolation Mode): pip-venv and uv build an isolated venv and install into it; command runs your install commands for ecosystems whose deps are project-local.

OS-level isolation + OS packages (containers)

For cases that need OS packages or a pinned, reproducible build/grade environment, declare a container: provisioning, setup.run, and the command/tests/pytest graders then run inside it (via docker exec), with the cell bind-mounted at its same path.

container:
  image: python:3.12-slim          # pin by digest (…@sha256:…) for full reproducibility
  setup: ["apt-get update -qq", "apt-get install -y -qq libxml2"]   # OS packages, once at start
  caches: [".cache/pip"]           # share the host's cache so cells don't re-download deps
environment:
  kind: pip-venv                   # the venv is now built *inside* the container
  requirements: [lxml, pytest]
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"], weight: 4.0}     # runs in the container

caches mounts a home-relative dir (e.g. .cache/pip, .m2) shared with the host and across cells, so a fresh container per cell reuses already-downloaded dependencies instead of re-fetching them — the same shared-cache benefit the host's ~/.m2 gives today. The suite uses this on its dependency-bearing cases: repo-js-wordwrap (node:20-slim, zero-dep), repo-smarttruncate / repo-securefilename (python:3.12-slim + pip cache), and repo-java-camelcase (maven:3.9-eclipse-temurin-21 + shared ~/.m2).

Every provisioner and grader runs through the Cell's ExecutorLocalExecutor (host subprocess) by default, ContainerExecutor when a container is declared — so the same recipe runs under either backend (needs the docker daemon running). The Harness (the agent under test) still runs on the host against the bind-mounted Sandbox; running the agent itself in-container is future work. See docs/adr/0005.

So the earlier zero-dep examples were picked to keep the demo offline, not because deps are rare. repo-java-camelcase-droid is a genuinely dependency-bearing non-Python case: commons-text's source needs commons-lang3, which Maven resolves from Maven Central.

Bring your own private repos (reachability & fallback)

touchstone is an engine + a public sample battery. The verdict you can actually trust for "which model is best for me" comes from your own tasks, so the design is built to pull case material from external git repos you own — both the agent-visible source: {repo, commit} and the hidden oracle in fixtures: {repo, commit} — some of them private. Auth is just your normal git credentials (SSH agent / gh / a credential helper); nothing extra to configure.

Because a given host may not have access to every referenced repo (a teammate's private fixtures, a CI box without keys, an offline laptop), a run probes each case's external repos before doing any work (git ls-remote, cached per URL) and applies a policy:

touchstone run                                  # default: FAIL FAST if any required repo is unreachable
touchstone run --on-unavailable skip            # degrade: skip unreachable cases, run the rest
touchstone validate --check-access              # preflight only: report what a run would skip/fail on
  • Fail by default. A missing repo on a host you expected to be complete is a loud, early error — never a silently smaller benchmark (which would corrupt cross-model comparisons).
  • --on-unavailable skip degrades the unreachable cases to a skipped status: excluded from every score and the leaderboard, surfaced in a "Skipped (unavailable)" report section, and not counted as failures. Resume re-probes, so a transient outage is retried.
  • Per-case availability: optional marks a case that may reference a repo you might not have — it degrades to skipped even under the default fail mode.
  • Only access failures (no auth / no network / not found) are degradable; a bad commit or schema error is a defect and still fails loudly.

A fork can repoint the default hidden-fixtures repo to its own private one without editing every case by setting TOUCHSTONE_FIXTURES_REPO=owner/my-fixtures. Your fully-private held-out suite lives in evals-private/ (gitignored) and runs with --evals-dir evals-private — see its README. Design: docs/adr/0008-reachability-and-availability-policy.md.

Layout

evals/<case>/        the benchmark suite (one dir per case)
src/touchstone/     the framework (config, harness/, grader/, runner, report, cli)
runs/<run_id>/       results (gitignored): manifest.json + cells/ + report.md

Contributing

See AGENTS.md for working instructions (dev setup, commands, conventions, and the required docs/landing-page step before a change is done) and CONTEXT.md for the project glossary. Keeping the landing page (README.md, docs/index.md) and the docs site in sync with behavior is a required part of every change.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

touchstone_eval-0.1.1.tar.gz (121.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

touchstone_eval-0.1.1-py3-none-any.whl (142.0 kB view details)

Uploaded Python 3

File details

Details for the file touchstone_eval-0.1.1.tar.gz.

File metadata

  • Download URL: touchstone_eval-0.1.1.tar.gz
  • Upload date:
  • Size: 121.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for touchstone_eval-0.1.1.tar.gz
Algorithm Hash digest
SHA256 338bf28aa98004d1f1e77889410ecf0f95e8713c9a2886ef280a93d6944c6778
MD5 6ebe37e28a57fb10341e00942c61aaa9
BLAKE2b-256 e6fd7909c6e2b6314c02a0e8c206263c05192295cb4065ce08e848287b04f00c

See more details on using hashes here.

Provenance

The following attestation bundles were made for touchstone_eval-0.1.1.tar.gz:

Publisher: publish.yml on krimvp/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file touchstone_eval-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: touchstone_eval-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 142.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for touchstone_eval-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 338875d8bc83cdafa0dceeb0af4baf7c222af6d99250f9e33814927ddc5ff2f8
MD5 28e9320f71c1390d3060c9f99ef0cc31
BLAKE2b-256 f7bc4198497f27ca689d5649eca7ab87c36636174c1b4f02ba81117cd156d390

See more details on using hashes here.

Provenance

The following attestation bundles were made for touchstone_eval-0.1.1-py3-none-any.whl:

Publisher: publish.yml on krimvp/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page