
HarnessBench

The Benchmark for Comparing Agent Harnesses on Everyday Online Tasks
Read the Docs  ·  Harness Comparison  ·  Cloud Setup


If you want to compare base models on a fixed harness, check out our sister project ClawBench  —  same pipeline, orthogonal axis.

Run in one line of code

uv tool install harness-bench && harness-bench

Install → List → Run · Reuses ClawBench's pipeline · Cloud harnesses opt in via env vars.

Which Harness Wins on the Same Task?

Given one task (order food, book travel, apply for a job) and one fixed base model --
which agentic harness actually gets it done?
Six named harnesses, four runtimes, one pipeline, one leaderboard.


6 harnesses  ·  4 runtimes (Python / Node / Rust / Web)  ·  153 shared tasks  ·  15 categories



Plugin Entry-Points · One Container per Harness · Cloud Opt-in · Same Pipeline as ClawBench


How It Works

   You pick a task            HarnessBench spins up        Each harness drives       Same 5-layer recording
   from ClawBench's           one container per            the browser its own       + DOM-match + LLM judge
   shared 153-task pool       harness (Python / Node       way on the same task      partitioned by harness
                              / Rust / Web)

   ┌──────────────┐           ┌──────────────┐           ┌──────────────┐           ┌──────────────┐
   │  "Book a     │    ──►    │  6 containers│    ──►    │  6 different │    ──►    │  Per-harness │
   │   flight on  │           │  (one per    │           │  agent loops │           │  leaderboard │
   │   Expedia"   │           │   harness)   │           │  same task   │           │  by category │
   └──────────────┘           └──────────────┘           └──────────────┘           └──────────────┘

LLM Quick Start

Point your coding agent (Claude Code, Cursor, Copilot, etc.) at AGENTS.md and prompt away. HarnessBench shares ClawBench's test cases, container base image, and 5-layer recording stack -- if your agent already knows ClawBench, there is nothing new to learn about the pipeline, only a new harness axis.


Human Quick Start

# Option A -- PyPI install (recommended)
uv tool install harness-bench && harness-bench
# Option B -- Clone the repo (for contributors / adding a harness)
git clone https://github.com/reacher-z/HarnessBench.git && cd HarnessBench && uv run harness-bench

Prerequisites: Python 3.10+, uv, and a container engine -- Docker or Podman. Same engine detection as ClawBench; force one with export CONTAINER_ENGINE=docker.
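
Under the hood the engine pick is easy to reason about. A minimal Python sketch of the detection behavior described above (the real logic lives in ClawBench; the function name here is illustrative):

# Sketch of the engine detection described above -- honor CONTAINER_ENGINE if
# set, otherwise take whichever of docker/podman is first on PATH. Assumed
# behavior, not the actual harness-bench source.
import os
import shutil

def detect_container_engine() -> str:
    forced = os.environ.get("CONTAINER_ENGINE")
    if forced:
        return forced
    for engine in ("docker", "podman"):
        if shutil.which(engine):
            return engine
    raise RuntimeError("no container engine found; install Docker or Podman")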

1. List registered harnesses:

harness-bench harnesses
# openclaw       ready
# hermes         ready
# claw-code      ready
# browser-use    ready
# stagehand      skipped: set BROWSERBASE_API_KEY
# coze-studio    skipped: set COZE_INSTANCE_URL, COZE_API_TOKEN
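
The ready/skipped column is plain credential gating. A Python sketch that reproduces the listing above (the dict and function name are illustrative, not harness-bench's actual API; the env-var requirements come from the harness table below):

# Reproduce the ready/skipped gating shown above. Requirements per harness
# are taken from this README; names here are illustrative.
import os

REQUIRED_CREDENTIALS = {
    "openclaw": [],
    "hermes": [],
    "claw-code": [],
    "browser-use": [],
    "stagehand": ["BROWSERBASE_API_KEY"],
    "coze-studio": ["COZE_INSTANCE_URL", "COZE_API_TOKEN"],
}

def gate(harness: str) -> str:
    missing = [v for v in REQUIRED_CREDENTIALS[harness] if not os.environ.get(v)]
    return f"skipped: set {', '.join(missing)}" if missing else "ready"

for name in REQUIRED_CREDENTIALS:
    print(f"{name:<14} {gate(name)}")

Export the listed variables and the corresponding rows flip to ready on the next harness-bench harnesses call.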

2. Preview a matrix (no side effects):

harness-bench matrix \
    --harness openclaw --harness hermes --harness browser-use \
    --model   claude-sonnet-4-6 \
    --case    001-daily-life-food-uber-eats \
    --case    007-daily-life-travel-expedia

3. Run one triple end-to-end:

harness-bench run \
    --harness openclaw \
    --model   claude-sonnet-4-6 \
    --case    001-daily-life-food-uber-eats

Results land in ./harness-output/<harness>/<model>/<case>-<timestamp>/ with the full five-layer recording -- identical layout to ClawBench so a single analysis script handles both.
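
Because the layout is fixed, enumerating finished runs is a single glob. A sketch assuming the default ./harness-output root:

# Walk <harness>/<model>/<case>-<timestamp>/ run directories (illustrative).
from pathlib import Path

results = Path("./harness-output")
for run_dir in sorted(p for p in results.glob("*/*/*") if p.is_dir()):
    harness, model, case_ts = run_dir.parts[-3:]
    print(f"{harness:<14} {model:<22} {case_ts}")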

4. Matrix batch (all eligible triples):

harness-bench batch \
    --harness openclaw --harness hermes --harness browser-use --harness claw-code \
    --model   claude-sonnet-4-6 \
    --case    $(cat fixtures/lite.txt)

5. Render the leaderboard:

harness-bench leaderboard --results-dir ./harness-output/

HarnessBench-Lite

New here? Run this first. fixtures/lite.txt is a 20-task curated subset of ClawBench's 153, reused verbatim so HarnessBench-Lite and ClawBench-Lite are comparable row-for-row. It matches the 20-tasks-per-source convention of browser-use/benchmark and gives you a credible harness-vs-harness signal at a fraction of the full-matrix cost.

For six harnesses on Lite that is 120 triples (6 harnesses × 20 tasks); cloud opt-in harnesses auto-skip when credentials are absent, so the local-only cost is 80 triples.

harness-bench batch \
    --harness openclaw -h hermes -h browser-use -h claw-code \
    --model   claude-sonnet-4-6 \
    --case    $(cat fixtures/lite.txt)
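
The triple count is a plain cross product with credential gating on top. A Python sketch of the arithmetic, with harness lists taken from this README and case IDs stubbed for illustration:

# 6 harnesses x 20 Lite tasks = 120 triples; without cloud credentials the
# two gated harnesses drop out and 4 x 20 = 80 remain. Illustrative only.
import itertools
import os

local = ["openclaw", "hermes", "claw-code", "browser-use"]
cloud = {"stagehand": ["BROWSERBASE_API_KEY"],
         "coze-studio": ["COZE_INSTANCE_URL", "COZE_API_TOKEN"]}
eligible = local + [h for h, need in cloud.items()
                    if all(os.environ.get(v) for v in need)]

models = ["claude-sonnet-4-6"]
cases = [f"case-{i:03d}" for i in range(20)]  # stand-in for fixtures/lite.txt

triples = list(itertools.product(eligible, models, cases))
print(len(triples))  # 80 local-only, 120 with both credential sets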

Tutorial

Watch on YouTube    Watch on Bilibili


Demos

openclaw on Uber Eats

https://github.com/user-attachments/assets/placeholder-openclaw-ubereats

browser-use on the same Uber Eats task

https://github.com/user-attachments/assets/placeholder-browseruse-ubereats

Each HarnessBench run produces the same MP4 session recording ClawBench does. Pair-watching the same task across two harnesses is the fastest way to see where their behavior diverges.


The Six Named Harnesses

Harness       Upstream                    Runtime   Cloud?   What it is
openclaw      reacher-z/ClawBench         Python    No       Reference harness, shared with ClawBench. The baseline everyone gets compared against.
hermes        nousresearch/hermes-agent   Python    No       Hermes-style tool-use loop with explicit plan/act steps.
claw-code     ultraworkers/claw-code      Rust      No       Rust-native agent loop, zero-GIL concurrency.
browser-use   browser-use/browser-use     Python    No       Community-favorite Playwright-based harness.
stagehand     browserbase/stagehand       Node/TS   Yes      BrowserBase's Stagehand -- requires BROWSERBASE_API_KEY.
coze-studio   coze-dev/coze-studio        Web       Yes      Coze Studio flow runner -- requires COZE_INSTANCE_URL + COZE_API_TOKEN.

Cloud harnesses are opt-in: without credentials they appear in the matrix as skipped:missing_credential:<VAR> -- never silently zeroed. More harnesses land as follow-up PRs; see docs/scout-2026-04-16.md for the global framework sweep.


Preview Leaderboard

Work in progress. Initial runs on claude-sonnet-4-6 across the six harnesses are in flight -- numbers below are placeholders illustrating the leaderboard shape.

Rank   Harness       Overall   Daily   Travel   Work   Dev   Notes
TBD    openclaw      TBD       TBD     TBD      TBD    TBD   reference harness (ClawBench)
TBD    hermes        TBD       TBD     TBD      TBD    TBD   Python tool-use loop
TBD    claw-code     TBD       TBD     TBD      TBD    TBD   Rust agent loop
TBD    browser-use   TBD       TBD     TBD      TBD    TBD   Playwright-based
TBD    stagehand     TBD       TBD     TBD      TBD    TBD   cloud opt-in
TBD    coze-studio   TBD       TBD     TBD      TBD    TBD   cloud opt-in

Partitioning: (harness, model, category). Run harness-bench leaderboard locally to render your own.
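
In code terms the partitioning is a group-by on that key. A minimal sketch with made-up record fields (the real scores come out of the five-layer bundles and the judge):

# Group per-run scores by (harness, model, category) and average per bucket.
# Record fields below are assumptions for illustration.
from collections import defaultdict
from statistics import mean

runs = [
    {"harness": "openclaw", "model": "claude-sonnet-4-6",
     "category": "daily-life", "score": 1.0},
    {"harness": "browser-use", "model": "claude-sonnet-4-6",
     "category": "daily-life", "score": 0.0},
]

buckets = defaultdict(list)
for run in runs:
    buckets[(run["harness"], run["model"], run["category"])].append(run["score"])

for key, scores in sorted(buckets.items()):
    print(*key, f"{mean(scores):.2f}")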


Example Walkthrough

Curious what one triple actually looks like? Here's task 001 run through three different harnesses, same base model:

task    = 001-daily-life-food-uber-eats
model   = claude-sonnet-4-6

harness = openclaw      ──►  ./harness-output/openclaw/claude-sonnet-4-6/001-.../
                             (Python loop driving Chrome via the ClawBench extension)

harness = hermes        ──►  ./harness-output/hermes/claude-sonnet-4-6/001-.../
                             (Python loop, Hermes tool-use convention)

harness = browser-use   ──►  ./harness-output/browser-use/claude-sonnet-4-6/001-.../
                             (Playwright driver + atomic action primitives)

All three land the same five-layer bundle (recording.mp4, screenshots, actions.jsonl, requests.jsonl, agent-messages.jsonl) plus interception.json from ClawBench's CDP-level fetch interceptor. That uniformity is what makes cross-harness comparison meaningful: identical inputs, identical judge, identical rubric -- the only thing that moves between runs is the agent loop itself.
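
That uniformity also makes integrity checks trivial. A sketch that flags incomplete bundles, using the file names listed above (the on-disk name of the screenshots layer is an assumption):

# Verify a run directory contains the five layers plus interception.json.
from pathlib import Path

EXPECTED = ["recording.mp4", "actions.jsonl", "requests.jsonl",
            "agent-messages.jsonl", "interception.json"]

def missing_layers(run_dir: Path) -> list[str]:
    gaps = [name for name in EXPECTED if not (run_dir / name).exists()]
    if not any(run_dir.glob("screenshots*")):  # layer's dir name assumed
        gaps.append("screenshots")
    return gaps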


Architecture

How HarnessBench stacks on top of ClawBench
 ┌─────────────────────────────────────────────────────────┐
 │  harness-bench CLI                                      │
 │  (matrix expansion, credential gating, leaderboard)     │
 └───────────────────────┬─────────────────────────────────┘
                         │
                         ▼
 ┌─────────────────────────────────────────────────────────┐
 │  clawbench.harnesses  (plugin entry-point group)        │
 │  discovered at runtime via importlib.metadata           │
 └───────────────────────┬─────────────────────────────────┘
                         │
             ┌───────────┼───────────┬───────────┬───────────┬───────────┐
             ▼           ▼           ▼           ▼           ▼           ▼
         openclaw     hermes     claw-code  browser-use  stagehand  coze-studio
         (Python)    (Python)     (Rust)     (Python)    (Node/TS)     (Web)
         dedicated   dedicated   dedicated   dedicated   dedicated   dedicated
         container   container   container   container   container   container
             │           │           │           │           │           │
             └───────────┴───────────┴───────────┼───────────┴───────────┘
                                                 │
                                                 ▼
 ┌─────────────────────────────────────────────────────────┐
 │  clawbench/base:<version>                               │
 │  (Chrome + Xvfb + FFmpeg + extension-server + CDP wire) │
 │  Same image ClawBench uses -- zero drift.               │
 └─────────────────────────────────────────────────────────┘

Each harness ships its own three-file adapter (Dockerfile + setup.sh + run.sh) whose Dockerfile starts FROM clawbench/base:<version>, so the shared stack is byte-for-byte identical across harnesses. See docs/adding-a-harness.md for the walkthrough.
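
The clawbench.harnesses group in the diagram is a standard Python entry-point group, so discovery is a few lines of stdlib. A sketch (what each entry point loads to -- class, factory, or module -- is an assumption here):

# Discover registered harness plugins at runtime (Python 3.10+ API).
from importlib.metadata import entry_points

for ep in entry_points(group="clawbench.harnesses"):
    harness = ep.load()  # adapter object; exact shape is an assumption
    print(ep.name, harness)

This is also why external packages can register harnesses without forking either repo: publishing a distribution with an entry point in that group is enough for the CLI to see it.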


CLI

# List and gate
harness-bench harnesses

# Matrix preview (no side effects)
harness-bench matrix --harness openclaw -h hermes -m claude-sonnet-4-6 -c 001-daily-life-food-uber-eats

# Single run
harness-bench run --harness openclaw --model claude-sonnet-4-6 --case 001-daily-life-food-uber-eats

# Batch (matrix-expand, skip ineligible, run the rest)
harness-bench batch -h openclaw -h hermes -h browser-use -m claude-sonnet-4-6 -c 001 -c 007

# Render leaderboard markdown
harness-bench leaderboard --results-dir ./harness-output/

Evaluation

Evaluation is inherited verbatim from ClawBench -- post-session judge comparing agent trajectories against human reference runs under eval/agentic_eval.md.

 1. Run harnesses (batch)          2. Evaluate (clawbench eval)
 ─────────────────────────         ────────────────────────────────
 harness-bench batch ...    ──►    DOM-match + LLM judge re-used
 produces harness-output/          exactly as ClawBench does it
   with 5-layer recordings         (same rubric, same prompt)

See ClawBench's eval guide -- since the recording format is identical, every tool in ClawBench's eval/ works unchanged on HarnessBench output.
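
In practice that means you can loop ClawBench's judge over HarnessBench output directly. A sketch -- note the --run-dir flag here is an assumption; check ClawBench's eval guide for the real interface:

# Drive `clawbench eval` over every run directory (flag name assumed).
import subprocess
from pathlib import Path

for run_dir in sorted(p for p in Path("./harness-output").glob("*/*/*")
                      if p.is_dir()):
    subprocess.run(["clawbench", "eval", "--run-dir", str(run_dir)], check=True)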


FAQ

Why two repos instead of one tool with a --harness flag?

Runtime incompatibility. ClawBench's shared openclaw-bench container runs three Python harnesses side-by-side because they share a virtualenv. HarnessBench's six harnesses live in Python, Node/TS, Rust, and Web -- not co-installable in one image. Each gets its own container built on the shared clawbench/base:<version> image.

Orthogonal axis. ClawBench holds the harness fixed and sweeps models. HarnessBench holds the model fixed and sweeps harnesses. Same pipeline, different axis of interest -- keeping them as separate repos avoids overloading either CLI's flag surface.

Do I have to run cloud harnesses?

No. stagehand and coze-studio auto-skip without credentials and appear in the matrix as skipped:missing_credential:<VAR>. The four local-first harnesses (openclaw, hermes, claw-code, browser-use) are enough to produce a meaningful leaderboard on any workstation with Docker.

Can I add my own harness?

Yes -- three files (Dockerfile + setup.sh + run.sh) plus one pyproject.toml stanza. See docs/adding-a-harness.md. The plugin loads via the clawbench.harnesses entry-point group, so external packages can register without forking either repo.

How is this different from ClawBench?
  • Axis. ClawBench: one harness, many models. HarnessBench: many harnesses, one (or many) models.
  • Runtime. ClawBench bundles three Python harnesses in one container. HarnessBench gives each harness its own container (Python / Node / Rust / Web are not co-installable).
  • Cloud. ClawBench is fully local-first. HarnessBench supports local-first and cloud-opt-in harnesses in the same matrix.
  • Code reuse. 100% -- HarnessBench imports clawbench-eval rather than forking it.

Which base model should I start with?

Whatever you already trust. The point of HarnessBench is that you pick one model and observe how different harnesses wrap it. For the published numbers we use claude-sonnet-4-6 (ClawBench's top scorer at 33.3% overall), which gives every harness a known-competitive model to wrap. Your own runs can use anything in your models.yaml.


Contributing

We welcome adapters for new harnesses, especially ones that survive the 30-agent global sweep. Most harness adapters are a single directory under src/harnessbench/harnesses/ with three files; see docs/adding-a-harness.md for the walkthrough.


Community

  • Discord -- English community, shared with ClawBench.
  • WeChat group (微信群) -- Chinese community for researchers, developers, and contributors to connect.
  • GitHub Discussions -- async Q&A: searchable, long-form, permanent.

License

Apache-2.0 for the repository. Each bundled harness adapter links to upstream code under the upstream's own license; nothing under an incompatible license is vendored.

Citation

If you use HarnessBench in your research, please cite:

@misc{zhang2026harnessbench,
  title        = {HarnessBench: Comparing Agentic Harnesses on Everyday Online Tasks},
  author       = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year         = {2026},
  note         = {Preprint in preparation},
  howpublished = {\url{https://github.com/reacher-z/HarnessBench}}
}

Core Contributors


Yuxuan Zhang

Yubo Wang

Perry Zhu

Penghui Du

Junwen Miao

Advisors


Kelsey R. Allen

Wenhu Chen

Dongfu Jiang

Liang Chen

Support HarnessBench

If HarnessBench is useful for your research or tool selection, the single most helpful thing you can do is star the repo -- it surfaces the harness-comparison axis to other agent researchers and helps us justify continued adapter work.


Open to contributions -- new harness adapters, leaderboard submissions, or evaluation bug fixes. See CONTRIBUTING.md.

