Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.
touchstone
A touchstone is the dark stone jewelers rub gold against to read its purity from the streak it leaves — telling true gold from convincing fakes. That is this benchmark's whole job: telling apart models that look identical on paper, by the marks they leave on real work.
A personal eval benchmark for answering one question: for my use cases, which model works best?
📖 Documentation: https://krimvp.github.io/touchstone/ — quickstart, the any-provider OpenAI-compatible harness, case authoring, graders, observation & interaction, and the architecture/ADRs.
Each eval (a case) bundles its own task, its own input source files, its own AI
artifacts (skills / commands / plugins / MCP), and its own definition of a correct
outcome. A run executes a matrix of cells — one cell per
(case × harness × model × trial) — fully isolated and persisted independently, then
aggregates everything into a single report.
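The matrix expansion described above can be sketched in a few lines. This is a minimal illustration, not touchstone's implementation: the `Cell` dataclass and `expand_matrix` function are hypothetical names, and the sketch assumes each axis is a plain list and trials are numbered from 1.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one unit of work: (case, harness, model, trial)."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(case: str, harnesses: list[str], models: list[str], trials: int) -> list[Cell]:
    """Enumerate one cell per (harness x model x trial) for a single case."""
    return [
        Cell(case, harness, model, trial)
        for harness, model, trial in product(harnesses, models, range(1, trials + 1))
    ]


cells = expand_matrix("my-case", ["claude-code"], ["opus", "sonnet", "haiku"], trials=3)
print(len(cells))  # 1 harness x 3 models x 3 trials = 9 cells
```

Because every cell is an independent value, adding a model or a trial only grows the list; nothing about the other cells changes.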
Core model
Case (one eval)        Matrix axes          Cell (unit of work + persistence)
  task / prompt    ×   harnesses[]     =    sandbox + transcript + output
  source/ files        models[]             + grader scores + metrics + status
  artifacts/           trials (k)
  graders[]
- Harness: the swappable thing that turns a task into an output, behind one interface (harness/base.py). echo (fake) and claude-code are output-only. For rich runs (a Trace of tool calls / tokens / cost) there are three paths:
  - claude-code-stream drives Claude natively over --output-format stream-json (no ACP, no Node; Tracing-only, autonomous via skip-permissions; see docs/adr/0006).
  - The ACP adapter aims to drive any Agent Client Protocol agent (droid, gemini, codex, claude-acp, devin-cli) through one contract with observation and bidirectional interaction. Today it is partial and uneven across providers (not every capability is reachable from every agent), and it is kept as the future unifier (ADR 0010).
  - The generic openai adapter runs an in-process agentic loop against any OpenAI-compatible endpoint: no vendor CLI, no Claude, just OPENAI_BASE_URL + model (Ollama, vLLM, LM Studio, a LiteLLM proxy, OpenRouter, real OpenAI, …). It is Tracing and Interaction capable, because it mediates every tool call through the Case's policy (see docs/adr/0009).

  ACP is one rich path, not the only one; the Trace is the contract.
- Graders: command (run tests/build), files (expected files / grep patterns), model_judge (LLM-as-judge), and trace (assert over observed tool usage / token & cost budgets). All run; results are combined per the case's expect.pass_threshold.
- Observation & interaction (opt-in per case via observe:): capture a normalized Trace (tool calls, tokens, cost, permission events) and answer the agent's mid-run requests with an Interaction Policy (auto-approve / auto-deny / scripted / llm-based / manual). See CONTEXT.md + docs/adr/.
- Resumability & parallelism: each cell's result.json is the source of truth (the manifest is a derived index), so cells run in parallel (--workers) without contention and run --resume <id> continues after a crash.
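Since each cell's result.json is the source of truth, resuming a run reduces to finding the planned cells that do not have one yet. The sketch below illustrates that idea only. The `pending_cells` helper, the per-cell directory names, and the `status` field are assumptions made for the example, not touchstone's actual layout or API.

```python
import json
import tempfile
from pathlib import Path


def pending_cells(run_dir: Path, planned: list[str]) -> list[str]:
    """Return planned cell ids that have no readable result.json yet.

    Assumed (hypothetical) layout: <run_dir>/cells/<cell_id>/result.json.
    """
    pending = []
    for cell_id in planned:
        result = run_dir / "cells" / cell_id / "result.json"
        try:
            json.loads(result.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            # Missing or half-written result: the cell has to run (again).
            pending.append(cell_id)
    return pending


# Usage: simulate a crashed run in which only the first cell finished.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    done = run_dir / "cells" / "case-a__echo__m1__t1"
    done.mkdir(parents=True)
    (done / "result.json").write_text(json.dumps({"status": "passed"}))
    remaining = pending_cells(run_dir, ["case-a__echo__m1__t1", "case-a__echo__m1__t2"])
    print(remaining)
```

No shared file is written while cells run, which is why parallel workers do not contend: any index can be rebuilt afterwards by scanning the per-cell results.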
Install
The published package is touchstone-eval; the command it installs is touchstone
(the bare touchstone name on PyPI belongs to an unrelated, abandoned project).
uvx touchstone-eval --help # run without installing (recommended)
pipx install touchstone-eval # or install as an isolated tool
pip install touchstone-eval # or into the current environment
Add the optional extras when you need them: touchstone-eval[judge] (Anthropic SDK for
model_judge), [langfuse] (export), [dev] (pytest).
For local development from a checkout:
pip install -e ".[judge,dev]" # judge = Anthropic SDK for model_judge; dev = pytest
Usage
touchstone validate # schema-check every evals/<case>/case.yaml
touchstone list # list cases and past runs
touchstone run # run the whole evals/ suite
touchstone run --eval example-case --harness echo --trials 2
touchstone run --harness droid --with-model A --with-model B # compare models, same harness
touchstone run --workers 4 # run cells in parallel
touchstone run --resume <run_id> # continue an interrupted run
touchstone report <run_id> # (re)generate runs/<run_id>/report.md
touchstone export <run_id> [--push] # write runs/<id>/langfuse.json (and optionally push)
Comparing models on the same harness
The matrix is what answers "which model for my use cases?": distinct models become
distinct cells, and the report ranks them in a per-case matrix + a leaderboard (score,
cost, time, tools, tokens). A case can declare the models inline
(matrix.models / matrix.entries[].models), or you can hold a harness fixed and push
models through it at run time without editing the case:
# Run these models on droid even if the cases declared only one — they replace the
# case's models for that harness. Each becomes its own row in the comparison.
touchstone run --harness droid \
--with-model custom:glm-5.1:cloud-0 --with-model custom:glm-4.6:cloud
--with-model replaces the declared models (so you can introduce new ones); --model
only filters the models a case already declares. Models are agent-specific opaque
strings, so prefix HARNESS= (--with-model droid=A) to scope an override to one harness
when a run spans several.
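The scoping rule for --with-model can be illustrated with a small parser. This is a hedged sketch: `parse_with_model` is a hypothetical helper, not touchstone's own code, and it assumes that an unprefixed model applies to every selected harness and that a prefix counts only when it names a selected harness.

```python
def parse_with_model(values: list[str], harnesses: list[str]) -> dict[str, list[str]]:
    """Map each harness to the models that replace its declared ones.

    A value of the form HARNESS=MODEL applies to that harness only; a bare
    MODEL applies to every selected harness (assumption for this sketch).
    """
    overrides: dict[str, list[str]] = {h: [] for h in harnesses}
    for value in values:
        prefix, sep, model = value.partition("=")
        if sep and prefix in overrides:
            overrides[prefix].append(model)
        else:
            # No recognised HARNESS= prefix: keep the whole string as an opaque model id.
            for h in harnesses:
                overrides[h].append(value)
    return overrides


result = parse_with_model(
    ["droid=custom:glm-5.1:cloud-0", "llama3.1"], harnesses=["droid", "openai"]
)
print(result)
```

Splitting on the first `=` only keeps model strings that contain colons, such as custom:glm-5.1:cloud-0, intact.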
ACP agents are configured in acp_agents.yaml (see acp_agents.yaml.example); the
built-in profiles (droid, gemini, codex, claude-acp, devin-cli) work out of the
box once the agent's CLI is on PATH. evals/observed-droid/ is a worked example of a
fully observed, interactive, multi-turn case.
To run without any vendor CLI (any model from any OpenAI-compatible endpoint), use the
generic openai adapter (pip install touchstone-eval[openai]). It's not provider-specific:
point it at an endpoint with the OpenAI SDK's standard env vars (OPENAI_BASE_URL + OPENAI_API_KEY, the
convention Ollama/vLLM/LM Studio/LiteLLM/OpenRouter all share), and the model comes from the
matrix. Swap the endpoint and the same harness runs anywhere, so you're never tied to one
provider:
export OPENAI_BASE_URL=http://localhost:11434/v1 # Ollama; swap for any other endpoint
# export OPENAI_API_KEY=sk-... # omit for keyless local servers
touchstone run --harness openai --with-model llama3.1
That's the whole setup. openai_agents.yaml is optional — use it only to pin named
endpoints (so a single run can compare several at once) or attach a price table; see
openai_agents.yaml.example.
Real harnesses (e.g. claude-code) cost money and require their CLI on PATH.
The built-in echo harness runs the full loop with no network/API spend — use it for
testing the framework itself.
Defining a case
See evals/example-case/case.yaml for a worked example. Schema:
id: my-case
description: ...
task:
  prompt: |
    What the model/agent must accomplish.
source:                         # optional; copied fresh into every cell sandbox
  path: ./source                # ...or {repo: owner/name, commit: <sha>} (pinned clone)
  # repo form may add `subdir: <dir>` to use just one sub-directory of the clone as the
  # sandbox; lets one fixtures repo hold many cases (see "Source fixtures repo" below).
artifacts:                      # optional AI artifacts injected into the harness
  skills: [./artifacts/skills/foo]
  commands: [./artifacts/commands/bar.md]
  mcp: ./artifacts/.mcp.json
environment:                    # optional per-cell dependency setup (the "broader sandbox")
  kind: pip-venv                # pip-venv (default) | uv | command: how deps are provisioned
  requirements: [markupsafe]    # (pip-venv/uv) installed into an isolated venv per cell
  install: editable             # (pip-venv/uv) `pip install -e .` (src-layout pkg + its deps)
  # kind: command → run shell installs for project-local ecosystems, e.g.
  # commands: ["npm ci"]        # node_modules / target/ etc. live in the sandbox
setup:                          # optional; introduce the task state after clone, before the agent
  stub: [{file: pkg/mod.py, function: target}]   # blank a fn body -> NotImplementedError
  run: ["rm -rf .git"]          # shell commands in the sandbox
matrix:
  harnesses: [claude-code]
  models: [opus, sonnet, haiku]
  trials: 3
graders:
  - {type: command, cmd: "pytest -q", weight: 1.0}
  - {type: files, patterns: ["retry", "backoff"]}
  - {type: model_judge, rubric: ./graders/rubric.md, model: opus, pass_threshold: 0.8}
expect:
  pass_threshold: 1.0
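To make the interplay of grader weight and expect.pass_threshold concrete, here is one plausible way to combine grader results. The source only states that graders carry a weight and that results are combined per the case's threshold; the weighted mean used below, and the `combine` function itself, are assumptions made for illustration.

```python
def combine(scores: list[tuple[float, float]], pass_threshold: float) -> tuple[float, bool]:
    """Combine (score, weight) pairs into a weighted mean and a pass/fail verdict.

    Weighted mean is an assumption of this sketch, not a documented rule.
    """
    total_weight = sum(weight for _, weight in scores)
    if total_weight <= 0:
        raise ValueError("at least one grader needs a positive weight")
    combined = sum(score * weight for score, weight in scores) / total_weight
    return combined, combined >= pass_threshold


# Three graders with weights 1.0, 1.0, 2.0 that scored 1.0, 0.5, 1.0.
score, passed = combine([(1.0, 1.0), (0.5, 1.0), (1.0, 2.0)], pass_threshold=1.0)
print(score, passed)  # 0.875 False
```

Under this reading, a threshold of 1.0 means every weighted grader must score fully for the cell to pass.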
Source fixtures repo
A case's bulky, hand-written assets (the synthetic codebases to debug and the hidden/oracle
test suites) live outside this repo, in a separate fixtures repo
(krimvp/touchstone-eval-fixtures), so they don't pollute the runner/eval tree. The eval repo
keeps only the contract (task, graders, expectations); the fixtures repo holds the code. Each
case has one directory there, split by visibility:
<case-id>/
  source/   # agent-VISIBLE input → promoted to the sandbox before the agent runs
  hidden/   # grader ORACLE → injected at grade time only; the agent never sees it
A case wires the two halves with two independent pins (both default-pinned by commit):
source: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
fixtures: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>}   # subdir defaults to <case-id>
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"]}   # resolved under <case-id>/hidden/
- source clones the repo, checks out the commit, and promotes <case-id>/source/ into the sandbox (no .git, like copy). SWE-bench-style cases point source at the real upstream repo instead, so they have only a hidden/ in the fixtures repo (no source/).
- fixtures names the repo that graders resolve inject: paths against. Case.asset() pulls each hidden file from a host-cached clone (src/touchstone/fixtures.py), at grade time, after the agent has stopped.

Because source/ and hidden/ are sibling directories and only source/ is promoted, the oracle can never leak into the agent's sandbox.
Keep the fixtures repo private for the anti-memorization cases. evals/example-case/
stays local (source: path) as the offline worked example / integration fixture.
Real-repo (SWE-bench-style) cases
A case can pin a real GitHub repo at a commit (source: {repo, commit}), setup.stub a
function to blank its body, and inject hidden tests (oracle = the real function) only
at grade time — so the agent reimplements real library code and the pytest grader scores
the fraction of FAIL→PASS tests. See evals/repo-*-droid/.
When a repo needs third-party dependencies or isn't importable from its root (a src/
layout), declare an environment: each cell gets its own throwaway virtualenv, into
which requirements are pip-installed and — with install: editable — the repo itself
(pip install -e ., which resolves a src-layout package and pulls its deps). Every
subprocess the cell spawns (harness, setup, and the command/pytest graders) runs under
that venv via an explicit env, so dependency-bearing cases stay reproducible and
parallel-safe (no shared site-packages). Worked examples: repo-smarttruncate-droid
(a requirements dep) and repo-securefilename-droid (install: editable, src-layout).
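Running a subprocess "under that venv via an explicit env" usually comes down to the same three changes a venv's activate script makes. The sketch below shows that general technique under stated assumptions: `venv_env` is a hypothetical helper, and the paths are made up for the example; it is not touchstone's provisioning code.

```python
import os
from pathlib import Path


def venv_env(venv_dir: Path, base_env: dict[str, str]) -> dict[str, str]:
    """Build an explicit environment that makes subprocesses use venv_dir."""
    bin_dir = venv_dir / ("Scripts" if os.name == "nt" else "bin")
    env = dict(base_env)  # copy, so the caller's environment is never mutated
    env["VIRTUAL_ENV"] = str(venv_dir)
    env["PATH"] = str(bin_dir) + os.pathsep + base_env.get("PATH", "")
    # A leftover PYTHONHOME would override the venv's interpreter lookup.
    env.pop("PYTHONHOME", None)
    return env


env = venv_env(Path("/tmp/cell-1/.venv"), {"PATH": "/usr/bin", "PYTHONHOME": "/opt/py"})
print(env["PATH"].split(os.pathsep)[0])
```

Passing the result as `subprocess.run(cmd, env=env, cwd=sandbox)` keeps each cell's interpreter and packages separate without touching the parent process, which is what makes parallel cells safe.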
Non-Python projects
Cases aren't Python-specific. The command, files, model_judge, and trace graders
are language-agnostic, and the tests grader gives the same partial-credit scoring as
pytest for any runner whose results it can read. Two substrates, XML primary with a
console fallback:
- JUnit XML (junit_xml: <glob>): the report format that most frameworks and build tools can emit (Maven Surefire, Gradle, pytest --junitxml, vitest/jest/mocha reporters, go-junit-report, cargo2junit). Deterministic, exact per-test counts, framework-agnostic.
- Console summary (_parse_counts): scraped when no XML report is produced. Covers pytest/unittest, node --test/TAP, Maven Surefire, go test -v (--- PASS: / --- FAIL:), and cargo test (test result: … N passed; M failed).
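Reading per-test counts from a JUnit XML report needs only the standard library. The following is a minimal sketch of the idea, not the tests grader itself: `junit_counts` is a hypothetical function, the sample report is invented, and how touchstone treats skipped tests when computing partial credit is not stated in this document.

```python
import xml.etree.ElementTree as ET

REPORT = """<testsuites>
  <testsuite name="suite" tests="4">
    <testcase classname="t" name="a"/>
    <testcase classname="t" name="b"><failure message="boom"/></testcase>
    <testcase classname="t" name="c"><error message="crash"/></testcase>
    <testcase classname="t" name="d"><skipped/></testcase>
  </testsuite>
</testsuites>"""


def junit_counts(xml_text: str) -> dict[str, int]:
    """Count passed / failed / skipped test cases in a JUnit XML report."""
    root = ET.fromstring(xml_text)
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    # iter() walks every descendant, so a <testsuites> root and a bare
    # <testsuite> root are handled the same way.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


counts = junit_counts(REPORT)
print(counts)  # {'passed': 1, 'failed': 2, 'skipped': 1}
```

Counting individual testcase elements, rather than trusting the suite-level attributes, gives the same result across frameworks that fill those attributes differently.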
A tests grader with gate: true is a validity gate (never adds credit; disqualifies the
cell to 0 on failure) — use it to mirror SWE-bench's PASS_TO_PASS regression gate in any
language. inject takes either a bare filename (dropped at the sandbox root) or {src, dest}
to place a hidden test at a runner-specific path (e.g. Maven's src/test/java/...). Use
setup.run to blank the function (the AST-based setup.stub is Python-only); the
implemented gate works on any language when pointed at explicit files. Worked examples:
repo-js-wordwrap-droid (CommonJS, node --test), repo-java-camelcase-droid (Maven,
Surefire), and the repo-swebench-* battery — real recent GitHub issues across Python, Go
(go test), Java (Surefire + JUnit XML), JS/TS (mocha/ava/TAP), and Rust (cargo test).
Dependencies aren't special — how they're isolated is. Real projects have
dependencies; the question is only whether installing them safely needs the environment
venv. It depends on where the ecosystem puts deps:
| Ecosystem | Where deps go | Isolation | How to declare |
|---|---|---|---|
| Python | shared site-packages (mutable) | needs the per-cell venv | environment: kind: pip-venv (or uv) + requirements / install: editable |
| Node / Rust / Go | project-local (node_modules, target/, build cache) | per-cell for free | environment: kind: command + commands: ["npm ci"] etc. |
| Java / Maven | shared ~/.m2 (versioned, immutable artifacts) | safe to share across cells | resolved by the build (mvn test) |
The environment.kind is the one declarative knob (mirroring the Sandbox's Isolation Mode):
pip-venv and uv build an isolated venv and install into it; command runs your install
commands for ecosystems whose deps are project-local.
OS-level isolation + OS packages (containers)
For cases that need OS packages or a pinned, reproducible build/grade environment, declare a
container: provisioning, setup.run, and the command/tests/pytest graders then run
inside it (via docker exec), with the cell bind-mounted at its same path.
container:
  image: python:3.12-slim        # pin by digest (…@sha256:…) for full reproducibility
  setup: ["apt-get update -qq", "apt-get install -y -qq libxml2"]   # OS packages, once at start
  caches: [".cache/pip"]         # share the host's cache so cells don't re-download deps
environment:
  kind: pip-venv                 # the venv is now built *inside* the container
  requirements: [lxml, pytest]
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"], weight: 4.0}   # runs in the container
caches mounts a home-relative dir (e.g. .cache/pip, .m2) shared with the host and
across cells, so a fresh container per cell reuses already-downloaded dependencies instead
of re-fetching them — the same shared-cache benefit the host's ~/.m2 gives today. The
suite uses this on its dependency-bearing cases: repo-js-wordwrap (node:20-slim,
zero-dep), repo-smarttruncate / repo-securefilename (python:3.12-slim + pip cache),
and repo-java-camelcase (maven:3.9-eclipse-temurin-21 + shared ~/.m2).
Every provisioner and grader runs through the Cell's Executor: LocalExecutor (host
subprocess) by default, ContainerExecutor when a container is declared (this backend needs
the docker daemon running). The same recipe therefore runs under either backend. The Harness
(the agent under test) still runs on the host against the bind-mounted Sandbox; running the
agent itself in-container is future work. See docs/adr/0005.
So the earlier zero-dep examples were picked to keep the demo offline, not because deps
are rare. repo-java-camelcase-droid is a genuinely dependency-bearing non-Python case:
commons-text's source needs commons-lang3, which Maven resolves from Maven Central.
Bring your own private repos (reachability & fallback)
touchstone is an engine + a public sample battery. The verdict you can actually trust for
"which model is best for me" comes from your own tasks, so the design is built to pull
case material from external git repos you own — both the agent-visible source: {repo, commit}
and the hidden oracle in fixtures: {repo, commit} — some of them private. Auth is just your
normal git credentials (SSH agent / gh / a credential helper); nothing extra to configure.
Because a given host may not have access to every referenced repo (a teammate's private
fixtures, a CI box without keys, an offline laptop), a run probes each case's external repos
before doing any work (git ls-remote, cached per URL) and applies a policy:
touchstone run # default: FAIL FAST if any required repo is unreachable
touchstone run --on-unavailable skip # degrade: skip unreachable cases, run the rest
touchstone validate --check-access # preflight only: report what a run would skip/fail on
- Fail by default. A missing repo on a host you expected to be complete is a loud, early error, never a silently smaller benchmark (which would corrupt cross-model comparisons).
- --on-unavailable skip degrades the unreachable cases to a skipped status: excluded from every score and the leaderboard, surfaced in a "Skipped (unavailable)" report section, and not counted as failures. Resume re-probes, so a transient outage is retried.
- Per-case availability: optional marks a case that may reference a repo you might not have; it degrades to skipped even under the default fail mode.
- Only access failures (no auth / no network / not found) are degradable; a bad commit or schema error is a defect and still fails loudly.
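The probe-then-decide flow above can be sketched as follows. This is an illustration under assumptions, not touchstone's code: `reachable` and `plan` are hypothetical names, the 30-second timeout is arbitrary, and the sketch models access failures only (bad commits and schema errors are out of scope here).

```python
import os
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def reachable(url: str) -> bool:
    """Probe a repo with `git ls-remote`; lru_cache makes it once per URL."""
    try:
        proc = subprocess.run(
            ["git", "ls-remote", url],
            capture_output=True,
            timeout=30,
            # Never block on an interactive credential prompt.
            env={**os.environ, "GIT_TERMINAL_PROMPT": "0"},
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def plan(cases, on_unavailable="fail", optional=frozenset(), probe=reachable):
    """Return {case_id: "run" | "skipped"}; raise if a required repo is unreachable."""
    decisions = {}
    for case_id, urls in cases.items():
        if all(probe(url) for url in urls):
            decisions[case_id] = "run"
        elif on_unavailable == "skip" or case_id in optional:
            decisions[case_id] = "skipped"
        else:
            raise RuntimeError(f"{case_id}: required repo unreachable")
    return decisions


# Usage with a fake probe, so the example needs no network access.
def fake_probe(url: str) -> bool:
    return "private" not in url


cases = {
    "public-case": ["https://example.com/a.git"],
    "private-case": ["git@example.com:private/fixtures.git"],
}
decisions = plan(cases, on_unavailable="skip", probe=fake_probe)
print(decisions)  # {'public-case': 'run', 'private-case': 'skipped'}
```

Deciding everything before any cell starts is what keeps the default mode loud: the run either covers the full suite or stops before spending anything.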
A fork can repoint the default hidden-fixtures repo to its own private one without editing every
case by setting TOUCHSTONE_FIXTURES_REPO=owner/my-fixtures. Your fully-private held-out suite
lives in evals-private/ (gitignored) and runs with --evals-dir evals-private — see its
README. Design: docs/adr/0008-reachability-and-availability-policy.md.
Layout
evals/<case>/ the benchmark suite (one dir per case)
src/touchstone/ the framework (config, harness/, grader/, runner, report, cli)
runs/<run_id>/ results (gitignored): manifest.json + cells/ + report.md
Contributing
See AGENTS.md for working instructions (dev setup, commands, conventions, and
the required docs/landing-page step before a change is done) and CONTEXT.md
for the project glossary. Keeping the landing page (README.md, docs/index.md) and the
docs site in sync with behavior is a required part of every change.