HarnessBench: compare agentic harnesses on everyday online tasks (sister project to ClawBench).
The Benchmark for Comparing Agent Harnesses on Everyday Online Tasks
Read the Docs · Harness Comparison · Cloud Setup
If you want to compare base models on a fixed harness, check out our sister project ClawBench — same pipeline, orthogonal axis.
uv tool install harness-bench && harness-bench
Install → List → Run. Reuses ClawBench's pipeline. Cloud harnesses opt-in via env vars.
Which Harness Wins on the Same Task?
Given one task (order food, book travel, apply for a job) and one fixed base model --
which agentic harness actually gets it done?
Six named harnesses, four runtimes, one pipeline, one leaderboard.
6 harnesses · 4 runtimes (Python / Node / Rust / Web) · 153 shared tasks · 15 categories
Plugin Entry-Points · One Container per Harness · Cloud Opt-in · Same Pipeline as ClawBench
How It Works
You pick a task from ClawBench's shared 153-task pool → HarnessBench spins up one container per harness (Python / Node / Rust / Web) → each harness drives the browser its own way on the same task → same 5-layer recording + DOM-match + LLM judge, partitioned by harness.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ "Book a │ ──► │ 6 containers│ ──► │ 6 different │ ──► │ Per-harness │
│ flight on │ │ (one per │ │ agent loops │ │ leaderboard │
│ Expedia" │ │ harness) │ │ same task │ │ by category │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
LLM Quick Start
Point your coding agent (Claude Code, Cursor, Copilot, etc.) at AGENTS.md and prompt away. HarnessBench shares ClawBench's test cases, container base image, and 5-layer recording stack -- if your agent already knows ClawBench, there is nothing new to learn about the pipeline, only a new harness axis.
Human Quick Start
# Option A -- PyPI install (recommended)
uv tool install harness-bench && harness-bench
# Option B -- Clone the repo (for contributors / adding a harness)
git clone https://github.com/reacher-z/HarnessBench.git && cd HarnessBench && uv run harness-bench
Prerequisites: Python 3.10+, uv, and a container engine -- Docker or Podman. Same engine detection as ClawBench; force one with export CONTAINER_ENGINE=docker.
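If you're curious what that detection amounts to, here is a minimal sketch in Python -- the docker-before-podman preference is our assumption for illustration, not a quote of ClawBench's actual logic:

```python
import os
import shutil

# Sketch of container-engine detection: an explicit CONTAINER_ENGINE always
# wins; otherwise take the first engine found on PATH. The docker-first
# ordering here is an assumption, not ClawBench source.
engine = os.environ.get("CONTAINER_ENGINE") or next(
    (e for e in ("docker", "podman") if shutil.which(e)), None
)
if engine is None:
    raise SystemExit("No container engine found: install Docker or Podman.")
print(f"Using container engine: {engine}")
```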
1. List registered harnesses:
harness-bench harnesses
# openclaw ready
# hermes ready
# claw-code ready
# browser-use ready
# stagehand skipped: set BROWSERBASE_API_KEY
# coze-studio skipped: set COZE_INSTANCE_URL, COZE_API_TOKEN
2. Preview a matrix (no side effects):
harness-bench matrix \
--harness openclaw --harness hermes --harness browser-use \
--model claude-sonnet-4-6 \
--case 001-daily-life-food-uber-eats \
--case 007-daily-life-travel-expedia
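Conceptually, matrix expansion is just a cross-product of the flags you pass -- a sketch of the idea (eligibility and credential gating are layered on top of this):

```python
from itertools import product

# Cross every --harness with every --model and --case to form candidate
# (harness, model, case) triples, matching the CLI invocation above.
harnesses = ["openclaw", "hermes", "browser-use"]
models = ["claude-sonnet-4-6"]
cases = ["001-daily-life-food-uber-eats", "007-daily-life-travel-expedia"]

triples = list(product(harnesses, models, cases))
print(f"{len(triples)} candidate triples")  # 3 x 1 x 2 = 6
```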
3. Run one triple end-to-end:
harness-bench run \
--harness openclaw \
--model claude-sonnet-4-6 \
--case 001-daily-life-food-uber-eats
Results land in ./harness-output/<harness>/<model>/<case>-<timestamp>/ with the full five-layer recording -- identical layout to ClawBench so a single analysis script handles both.
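Because the layout is deterministic, pointing a script at the newest run is a one-liner -- a minimal sketch, assuming timestamps that sort lexically (e.g. ISO-8601):

```python
from pathlib import Path

# Newest <case>-<timestamp> bundle for one (harness, model) pair under
# ./harness-output/<harness>/<model>/ -- raises if no runs exist yet.
run_root = Path("harness-output/openclaw/claude-sonnet-4-6")
latest = max(run_root.iterdir(), key=lambda p: p.name)
print(f"Newest run bundle: {latest}")
```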
4. Matrix batch (all eligible triples):
harness-bench batch \
--harness openclaw --harness hermes --harness browser-use --harness claw-code \
--model claude-sonnet-4-6 \
--case $(cat fixtures/lite.txt)
5. Render the leaderboard:
harness-bench leaderboard --results-dir ./harness-output/
HarnessBench-Lite
New here? Run this first. fixtures/lite.txt is a curated 20-task subset of ClawBench's 153, reused verbatim so HarnessBench-Lite and ClawBench-Lite are comparable row-for-row. It matches the 20-tasks-per-source convention of browser-use/benchmark and gives you a credible harness-vs-harness signal at a fraction of the full-matrix cost.
For six harnesses on Lite you're looking at roughly 120 triples (6 harnesses x 20 tasks); cloud-opt-in harnesses auto-skip if credentials are absent, so the local-only cost is 80 triples (4 local harnesses x 20 tasks).
harness-bench batch \
--harness openclaw -h hermes -h browser-use -h claw-code \
--model claude-sonnet-4-6 \
--case $(cat fixtures/lite.txt)
Tutorial
Demos
- openclaw on Uber Eats: https://github.com/user-attachments/assets/placeholder-openclaw-ubereats
- browser-use on Uber Eats: https://github.com/user-attachments/assets/placeholder-browseruse-ubereats
Each HarnessBench run produces the same MP4 session recording ClawBench does. Pair-watching the same task across two harnesses is the fastest way to see where their behavior diverges.
The Six Named Harnesses
| Harness | Upstream | Runtime | Cloud? | What it is |
|---|---|---|---|---|
| openclaw | reacher-z/ClawBench | Python | — | Reference harness, shared with ClawBench. The baseline everyone gets compared against. |
| hermes | nousresearch/hermes-agent | Python | — | Hermes-style tool-use loop with explicit plan/act steps. |
| claw-code | ultraworkers/claw-code | Rust | — | Rust-native agent loop, zero-GIL concurrency. |
| browser-use | browser-use/browser-use | Python | — | Community-favorite Playwright-based harness. |
| stagehand | browserbase/stagehand | Node/TS | Yes | BrowserBase's Stagehand -- requires BROWSERBASE_API_KEY. |
| coze-studio | coze-dev/coze-studio | Web | Yes | Coze Studio flow runner -- requires COZE_INSTANCE_URL + COZE_API_TOKEN. |
Cloud harnesses are opt-in: without credentials they appear in the matrix as skipped:missing_credential:<VAR> -- never silently zeroed. More harnesses land as follow-up PRs; see docs/scout-2026-04-16.md for the global framework sweep.
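The gate is simple enough to reproduce in your own tooling. A minimal sketch of the behavior described above -- the variable names come from the table, but the function itself is illustrative, not HarnessBench internals:

```python
import os

# Cloud harnesses and the credentials they require (from the table above).
REQUIRED_CREDENTIALS = {
    "stagehand": ["BROWSERBASE_API_KEY"],
    "coze-studio": ["COZE_INSTANCE_URL", "COZE_API_TOKEN"],
}

def gate(harness: str) -> str:
    """Return 'ready' or the skip marker shown in the matrix."""
    for var in REQUIRED_CREDENTIALS.get(harness, []):
        if not os.environ.get(var):
            return f"skipped:missing_credential:{var}"
    return "ready"

for name in ("openclaw", "stagehand", "coze-studio"):
    print(name, gate(name))
```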
Preview Leaderboard
Work in progress. Initial runs on claude-sonnet-4-6 across the six harnesses are in flight -- numbers below are placeholders illustrating the leaderboard shape.
| Rank | Harness | Overall | Daily | Travel | Work | Dev | Notes |
|---|---|---|---|---|---|---|---|
| — | openclaw | TBD | TBD | TBD | TBD | TBD | reference harness (ClawBench) |
| — | hermes | TBD | TBD | TBD | TBD | TBD | Python tool-use loop |
| — | claw-code | TBD | TBD | TBD | TBD | TBD | Rust agent loop |
| — | browser-use | TBD | TBD | TBD | TBD | TBD | Playwright-based |
| — | stagehand | TBD | TBD | TBD | TBD | TBD | cloud-opt-in |
| — | coze-studio | TBD | TBD | TBD | TBD | TBD | cloud-opt-in |
Partitioning: (harness, model, category). Run harness-bench leaderboard locally to render your own.
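As a concrete illustration of that partitioning -- the record fields below are hypothetical, not HarnessBench's actual result schema:

```python
from collections import defaultdict

# Hypothetical result records; real ones live under harness-output/.
records = [
    {"harness": "openclaw", "model": "claude-sonnet-4-6", "category": "daily", "passed": True},
    {"harness": "hermes", "model": "claude-sonnet-4-6", "category": "daily", "passed": False},
]

# Group outcomes by the (harness, model, category) triple named above.
cells = defaultdict(list)
for r in records:
    cells[(r["harness"], r["model"], r["category"])].append(r["passed"])

for (harness, model, category), outcomes in cells.items():
    print(f"{harness} / {model} / {category}: {sum(outcomes)}/{len(outcomes)} passed")
```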
Example Walkthrough
Curious what one triple actually looks like? Here's task 001 run through three different harnesses, same base model:
task = 001-daily-life-food-uber-eats
model = claude-sonnet-4-6
harness = openclaw ──► ./harness-output/openclaw/claude-sonnet-4-6/001-.../
(Python loop driving Chrome via the ClawBench extension)
harness = hermes ──► ./harness-output/hermes/claude-sonnet-4-6/001-.../
(Python loop, Hermes tool-use convention)
harness = browser-use ──► ./harness-output/browser-use/claude-sonnet-4-6/001-.../
(Playwright driver + atomic action primitives)
All three land the same five-layer bundle (recording.mp4, screenshots, actions.jsonl, requests.jsonl, agent-messages.jsonl) plus interception.json from ClawBench's CDP-level fetch interceptor. That uniformity is what makes cross-harness comparison meaningful: identical inputs, identical judge, identical rubric -- the only thing that moves between runs is the agent loop itself.
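That shared layout is also what lets one analysis script sweep every harness at once -- a minimal sketch, assuming only the directory structure and file names above:

```python
import json
from pathlib import Path

# Tally recorded actions per run across all harnesses, relying only on the
# shared layout ./harness-output/<harness>/<model>/<case>-<timestamp>/.
root = Path("harness-output")
for actions_file in sorted(root.glob("*/*/*/actions.jsonl")):
    harness, model, run = actions_file.relative_to(root).parts[:3]
    with actions_file.open() as f:
        actions = [json.loads(line) for line in f if line.strip()]
    print(f"{harness:12s} {run}: {len(actions)} actions")
```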
Architecture
How HarnessBench stacks on top of ClawBench
┌─────────────────────────────────────────────────────────┐
│ harness-bench CLI │
│ (matrix expansion, credential gating, leaderboard) │
└───────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ clawbench.harnesses (plugin entry-point group) │
│ discovered at runtime via importlib.metadata │
└───────────────────────┬─────────────────────────────────┘
│
     ┌──────────────┬──────────────┼──────────────┬──────────────┬──────────────┐
     ▼              ▼              ▼              ▼              ▼              ▼
 openclaw        hermes        claw-code     browser-use     stagehand     coze-studio
 (Python)      (Python)      (Rust)        (Python)      (Node/TS)       (Web)
 dedicated      dedicated      dedicated      dedicated      dedicated      dedicated
 container      container      container      container      container      container
     │              │              │              │              │              │
     └──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ clawbench/base:<version> │
│ (Chrome + Xvfb + FFmpeg + extension-server + CDP wire) │
│ Same image ClawBench uses -- zero drift. │
└─────────────────────────────────────────────────────────┘
Each harness ships its own 3-file adapter (Dockerfile + setup.sh + run.sh) whose Dockerfile builds FROM clawbench/base:<version>, so the shared stack is byte-for-byte identical across harnesses. See docs/adding-a-harness.md for the walkthrough.
CLI
# List and gate
harness-bench harnesses
# Matrix preview (no side effects)
harness-bench matrix --harness openclaw -h hermes -m claude-sonnet-4-6 -c 001-daily-life-food-uber-eats
# Single run
harness-bench run --harness openclaw --model claude-sonnet-4-6 --case 001-daily-life-food-uber-eats
# Batch (matrix-expand, skip ineligible, run the rest)
harness-bench batch -h openclaw -h hermes -h browser-use -m claude-sonnet-4-6 -c 001 -c 007
# Render leaderboard markdown
harness-bench leaderboard --results-dir ./harness-output/
Evaluation
Evaluation is inherited verbatim from ClawBench -- post-session judge comparing agent trajectories against human reference runs under eval/agentic_eval.md.
1. Run harnesses (batch): harness-bench batch ... produces harness-output/ with 5-layer recordings.
2. Evaluate (clawbench eval): DOM-match + LLM judge, reused exactly as ClawBench does it (same rubric, same prompt).
See ClawBench's eval guide -- since the recording format is identical, every tool in ClawBench's eval/ works unchanged on HarnessBench output.
FAQ
Why two repos instead of one tool with a --harness flag?
Runtime incompatibility. ClawBench's shared openclaw-bench container runs three Python harnesses side-by-side because they share a virtualenv. HarnessBench's six harnesses live in Python, Node/TS, Rust, and Web -- not co-installable in one image. Each gets its own container built on the shared clawbench/base:<version> image.
Orthogonal axis. ClawBench holds the harness fixed and sweeps models. HarnessBench holds the model fixed and sweeps harnesses. Same pipeline, different axis of interest -- keeping them as separate repos avoids overloading either CLI's flag surface.
Do I have to run cloud harnesses?
No. stagehand and coze-studio auto-skip without credentials and appear in the matrix as skipped:missing_credential:<VAR>. The four local-first harnesses (openclaw, hermes, claw-code, browser-use) are enough to produce a meaningful leaderboard on any workstation with Docker.
Can I add my own harness?
Yes -- three files (Dockerfile + setup.sh + run.sh) plus one pyproject.toml stanza. See docs/adding-a-harness.md. The plugin loads via the clawbench.harnesses entry-point group, so external packages can register without forking either repo.
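Discovery itself is plain importlib.metadata -- a sketch of how any registered harness becomes visible, whether bundled or external (the shape of the loaded adapter is defined by HarnessBench, so treat the load() line as illustrative):

```python
from importlib.metadata import entry_points

# Enumerate every harness registered under the clawbench.harnesses group.
for ep in entry_points(group="clawbench.harnesses"):
    adapter = ep.load()  # illustrative: the adapter interface is HarnessBench's
    print(f"{ep.name}: {ep.value} -> {adapter!r}")
```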
How is this different from ClawBench?
- Axis. ClawBench: one harness, many models. HarnessBench: many harnesses, one (or many) models.
- Runtime. ClawBench bundles three Python harnesses in one container. HarnessBench gives each harness its own container (Python / Node / Rust / Web are not co-installable).
- Cloud. ClawBench is fully local-first. HarnessBench supports local-first and cloud-opt-in harnesses in the same matrix.
- Code reuse. 100% -- HarnessBench imports clawbench-eval rather than forking it.
Which base model should I start with?
Whatever you already trust. The point of HarnessBench is that you pick one model and observe how different harnesses wrap it. For the published numbers we use claude-sonnet-4-6 (ClawBench's top scorer at 33.3% overall), which gives every harness a known-competitive model to wrap. Your own runs can use anything in your models.yaml.
Contributing
We welcome adapters for new harnesses, especially ones that survive the 30-agent global sweep. Most harness adapters are a single directory under src/harnessbench/harnesses/ with three files; see docs/adding-a-harness.md for the walkthrough.
Quick wins:
- Add a new harness adapter (~1-2 hours if upstream ships a CLI, ~1 day if you're writing one from scratch)
- Submit a leaderboard entry for a harness + model pair we haven't scored
- File a good first issue
Community
- English community -- shared with ClawBench
- Chinese community (中文社区) -- for researchers, developers, and contributors
- Async Q&A -- searchable, long-form, permanent
License
Apache-2.0 for the repository. Each bundled harness adapter links to upstream code under the upstream's own license; nothing from an incompatible license is vendored.
Citation
If you use HarnessBench in your research, please cite:
@misc{zhang2026harnessbench,
title = {HarnessBench: Comparing Agentic Harnesses on Everyday Online Tasks},
author = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
year = {2026},
note = {Preprint in preparation},
howpublished = {\url{https://github.com/reacher-z/HarnessBench}}
}
Core Contributors
Yuxuan Zhang · Yubo Wang · Perry Zhu · Penghui Du · Junwen Miao
Advisors
Kelsey R. Allen · Wenhu Chen · Dongfu Jiang · Liang Chen
Support HarnessBench
If HarnessBench is useful for your research or tool selection, the single most helpful thing you can do is star the repo -- it surfaces the harness-comparison axis to other agent researchers and helps us justify continued adapter work.
Open to contributions -- new harness adapters, leaderboard submissions, or evaluation bug fixes. See CONTRIBUTING.md.
Download files
Source Distribution
Built Distribution
File details
Details for the file harness_bench-0.1.6.tar.gz.
File metadata
- Download URL: harness_bench-0.1.6.tar.gz
- Upload date:
- Size: 46.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf9f53089c5d363ac24932530d1abf86f77916c7d5d10a3014feb831bb47619e |
| MD5 | 597ddb78d381c36626c62129c872296f |
| BLAKE2b-256 | 8fdc543f8666d60bd3fdb08185fd8dfdabd8bc74c2f25cd3ba67fab940583e6f |
|
File details
Details for the file harness_bench-0.1.6-py3-none-any.whl.
File metadata
- Download URL: harness_bench-0.1.6-py3-none-any.whl
- Upload date:
- Size: 35.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b39e1b366d4da2c195787728e7af4fc84fdcadc14f86411d5c8ae03bb787ef3 |
| MD5 | 9a43bc25ba9395ca28b3255013b96342 |
| BLAKE2b-256 | d4efebb255ebee004f7647a498faad3258cb8c8906d6394818318954a8c4e434 |