
HarnessBench

The Benchmark for Comparing Agent Harnesses on Everyday Online Tasks
Read the Docs  ·  Harness Comparison  ·  Cloud Setup


If you want to compare base models on a fixed harness, check out our sister project ClawBench  —  same pipeline, orthogonal axis.

Run in one line of code

uv tool install harness-bench && harness-bench

Install → List → Run · Reuses ClawBench's pipeline · Cloud harnesses opt in via env vars.

Which Harness Wins on the Same Task?

Given one task (order food, book travel, apply for a job) and one fixed base model --
which agentic harness actually gets it done?
Six named harnesses, four runtimes, one pipeline, one leaderboard.


6 harnesses  ·  4 runtimes (Python / Node / Rust / Web)  ·  153 shared tasks  ·  15 categories



Plugin Entry-Points · One Container per Harness · Cloud Opt-in · Same Pipeline as ClawBench


How It Works

   You pick a task            HarnessBench spins up        Each harness drives       Same 5-layer recording
   from ClawBench's           one container per            the browser its own       + DOM-match + LLM judge
   shared 153-task pool       harness (Python / Node       way on the same task      partitioned by harness
                              / Rust / Web)

   ┌──────────────┐           ┌──────────────┐           ┌──────────────┐           ┌──────────────┐
   │  "Book a     │    ──►    │  6 containers│    ──►    │  6 different │    ──►    │  Per-harness │
   │   flight on  │           │  (one per    │           │  agent loops │           │  leaderboard │
   │   Expedia"   │           │   harness)   │           │  same task   │           │  by category │
   └──────────────┘           └──────────────┘           └──────────────┘           └──────────────┘

LLM Quick Start

Point your coding agent (Claude Code, Cursor, Copilot, etc.) at AGENTS.md and prompt away. HarnessBench shares ClawBench's test cases, container base image, and 5-layer recording stack -- if your agent already knows ClawBench, there is nothing new to learn about the pipeline, only a new harness axis.


Human Quick Start

# Option A -- PyPI install (recommended)
uv tool install harness-bench && harness-bench
# Option B -- Clone the repo (for contributors / adding a harness)
git clone https://github.com/reacher-z/HarnessBench.git && cd HarnessBench && uv run harness-bench

Prerequisites: Python 3.10+, uv, and a container engine -- Docker or Podman. Same engine detection as ClawBench; force one with export CONTAINER_ENGINE=docker.
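
Under the hood the engine pick is easy to reason about. A minimal Python sketch of the detection behavior described above (the real logic lives in ClawBench; the function name here is illustrative):

# Sketch of the engine detection described above -- honor CONTAINER_ENGINE if
# set, otherwise take whichever of docker/podman is first on PATH. Assumed
# behavior, not the actual harness-bench source.
import os
import shutil

def detect_container_engine() -> str:
    forced = os.environ.get("CONTAINER_ENGINE")
    if forced:
        return forced
    for engine in ("docker", "podman"):
        if shutil.which(engine):
            return engine
    raise RuntimeError("no container engine found; install Docker or Podman")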

1. List registered harnesses:

harness-bench harnesses
# openclaw       ready
# hermes         ready
# claw-code      ready
# browser-use    ready
# stagehand      skipped: set BROWSERBASE_API_KEY
# coze-studio    skipped: set COZE_INSTANCE_URL, COZE_API_TOKEN
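
The ready/skipped column is plain credential gating. A Python sketch that reproduces the listing above (the dict and function name are illustrative, not harness-bench's actual API; the env-var requirements come from the harness table below):

# Reproduce the ready/skipped gating shown above. Requirements per harness
# are taken from this README; names here are illustrative.
import os

REQUIRED_CREDENTIALS = {
    "openclaw": [],
    "hermes": [],
    "claw-code": [],
    "browser-use": [],
    "stagehand": ["BROWSERBASE_API_KEY"],
    "coze-studio": ["COZE_INSTANCE_URL", "COZE_API_TOKEN"],
}

def gate(harness: str) -> str:
    missing = [v for v in REQUIRED_CREDENTIALS[harness] if not os.environ.get(v)]
    return f"skipped: set {', '.join(missing)}" if missing else "ready"

for name in REQUIRED_CREDENTIALS:
    print(f"{name:<14} {gate(name)}")

Export the listed variables and the corresponding rows flip to ready on the next harness-bench harnesses call.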

2. Preview a matrix (no side effects):

harness-bench matrix \
    --harness openclaw --harness hermes --harness browser-use \
    --model   claude-sonnet-4-6 \
    --case    001-daily-life-food-uber-eats \
    --case    007-daily-life-travel-expedia

3. Run one triple end-to-end:

harness-bench run \
    --harness openclaw \
    --model   claude-sonnet-4-6 \
    --case    001-daily-life-food-uber-eats

Results land in ./harness-output/<harness>/<model>/<case>-<timestamp>/ with the full five-layer recording -- identical layout to ClawBench so a single analysis script handles both.
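
Because the layout is fixed, enumerating finished runs is a single glob. A sketch assuming the default ./harness-output root:

# Walk <harness>/<model>/<case>-<timestamp>/ run directories (illustrative).
from pathlib import Path

results = Path("./harness-output")
for run_dir in sorted(p for p in results.glob("*/*/*") if p.is_dir()):
    harness, model, case_ts = run_dir.parts[-3:]
    print(f"{harness:<14} {model:<22} {case_ts}")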

4. Matrix batch (all eligible triples):

harness-bench batch \
    --harness openclaw --harness hermes --harness browser-use --harness claw-code \
    --model   claude-sonnet-4-6 \
    --case    $(cat fixtures/lite.txt)

5. Render the leaderboard:

harness-bench leaderboard --results-dir ./harness-output/

HarnessBench-Lite

New here? Run this first. fixtures/lite.txt is a 20-task curated subset of ClawBench's 153, reused verbatim so HarnessBench-Lite and ClawBench-Lite are comparable row-for-row. It matches the 20-tasks-per-source convention of browser-use/benchmark and gives you a credible harness-vs-harness signal at a fraction of the full-matrix cost.

For six harnesses on Lite that is 120 triples (6 harnesses × 20 tasks); cloud opt-in harnesses auto-skip when credentials are absent, so the local-only cost is 80 triples.

harness-bench batch \
    --harness openclaw -h hermes -h browser-use -h claw-code \
    --model   claude-sonnet-4-6 \
    --case    $(cat fixtures/lite.txt)
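
The triple count is a plain cross product with credential gating on top. A Python sketch of the arithmetic, with harness lists taken from this README and case IDs stubbed for illustration:

# 6 harnesses x 20 Lite tasks = 120 triples; without cloud credentials the
# two gated harnesses drop out and 4 x 20 = 80 remain. Illustrative only.
import itertools
import os

local = ["openclaw", "hermes", "claw-code", "browser-use"]
cloud = {"stagehand": ["BROWSERBASE_API_KEY"],
         "coze-studio": ["COZE_INSTANCE_URL", "COZE_API_TOKEN"]}
eligible = local + [h for h, need in cloud.items()
                    if all(os.environ.get(v) for v in need)]

models = ["claude-sonnet-4-6"]
cases = [f"case-{i:03d}" for i in range(20)]  # stand-in for fixtures/lite.txt

triples = list(itertools.product(eligible, models, cases))
print(len(triples))  # 80 local-only, 120 with both credential sets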

Tutorial

Watch on YouTube    Watch on Bilibili


Demos

openclaw on Uber Eats

https://github.com/user-attachments/assets/placeholder-openclaw-ubereats

browser-use on the same Uber Eats task

https://github.com/user-attachments/assets/placeholder-browseruse-ubereats

Each HarnessBench run produces the same MP4 session recording ClawBench does. Pair-watching the same task across two harnesses is the fastest way to see where their behavior diverges.


The Six Named Harnesses

Harness       Upstream                    Runtime   Cloud?   What it is
openclaw      reacher-z/ClawBench         Python    No       Reference harness, shared with ClawBench. The baseline everyone gets compared against.
hermes        nousresearch/hermes-agent   Python    No       Hermes-style tool-use loop with explicit plan/act steps.
claw-code     ultraworkers/claw-code      Rust      No       Rust-native agent loop, zero-GIL concurrency.
browser-use   browser-use/browser-use     Python    No       Community-favorite Playwright-based harness.
stagehand     browserbase/stagehand       Node/TS   Yes      BrowserBase's Stagehand -- requires BROWSERBASE_API_KEY.
coze-studio   coze-dev/coze-studio        Web       Yes      Coze Studio flow runner -- requires COZE_INSTANCE_URL + COZE_API_TOKEN.

Cloud harnesses are opt-in: without credentials they appear in the matrix as skipped:missing_credential:<VAR> -- never silently zeroed. More harnesses land as follow-up PRs; see docs/scout-2026-04-16.md for the global framework sweep.


Preview Leaderboard

Work in progress. Initial runs on claude-sonnet-4-6 across the six harnesses are in flight -- numbers below are placeholders illustrating the leaderboard shape.

Rank   Harness       Overall   Daily   Travel   Work   Dev   Notes
TBD    openclaw      TBD       TBD     TBD      TBD    TBD   reference harness (ClawBench)
TBD    hermes        TBD       TBD     TBD      TBD    TBD   Python tool-use loop
TBD    claw-code     TBD       TBD     TBD      TBD    TBD   Rust agent loop
TBD    browser-use   TBD       TBD     TBD      TBD    TBD   Playwright-based
TBD    stagehand     TBD       TBD     TBD      TBD    TBD   cloud opt-in
TBD    coze-studio   TBD       TBD     TBD      TBD    TBD   cloud opt-in

Partitioning: (harness, model, category). Run harness-bench leaderboard locally to render your own.
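
In code terms the partitioning is a group-by on that key. A minimal sketch with made-up record fields (the real scores come out of the five-layer bundles and the judge):

# Group per-run scores by (harness, model, category) and average per bucket.
# Record fields below are assumptions for illustration.
from collections import defaultdict
from statistics import mean

runs = [
    {"harness": "openclaw", "model": "claude-sonnet-4-6",
     "category": "daily-life", "score": 1.0},
    {"harness": "browser-use", "model": "claude-sonnet-4-6",
     "category": "daily-life", "score": 0.0},
]

buckets = defaultdict(list)
for run in runs:
    buckets[(run["harness"], run["model"], run["category"])].append(run["score"])

for key, scores in sorted(buckets.items()):
    print(*key, f"{mean(scores):.2f}")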


Example Walkthrough

Curious what one triple actually looks like? Here's task 001 run through three different harnesses, same base model:

task    = 001-daily-life-food-uber-eats
model   = claude-sonnet-4-6

harness = openclaw      ──►  ./harness-output/openclaw/claude-sonnet-4-6/001-.../
                             (Python loop driving Chrome via the ClawBench extension)

harness = hermes        ──►  ./harness-output/hermes/claude-sonnet-4-6/001-.../
                             (Python loop, Hermes tool-use convention)

harness = browser-use   ──►  ./harness-output/browser-use/claude-sonnet-4-6/001-.../
                             (Playwright driver + atomic action primitives)

All three land the same five-layer bundle (recording.mp4, screenshots, actions.jsonl, requests.jsonl, agent-messages.jsonl) plus interception.json from ClawBench's CDP-level fetch interceptor. That uniformity is what makes cross-harness comparison meaningful: identical inputs, identical judge, identical rubric -- the only thing that moves between runs is the agent loop itself.
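
That uniformity also makes integrity checks trivial. A sketch that flags incomplete bundles, using the file names listed above (the on-disk name of the screenshots layer is an assumption):

# Verify a run directory contains the five layers plus interception.json.
from pathlib import Path

EXPECTED = ["recording.mp4", "actions.jsonl", "requests.jsonl",
            "agent-messages.jsonl", "interception.json"]

def missing_layers(run_dir: Path) -> list[str]:
    gaps = [name for name in EXPECTED if not (run_dir / name).exists()]
    if not any(run_dir.glob("screenshots*")):  # layer's dir name assumed
        gaps.append("screenshots")
    return gaps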


Architecture

How HarnessBench stacks on top of ClawBench
 ┌─────────────────────────────────────────────────────────┐
 │  harness-bench CLI                                      │
 │  (matrix expansion, credential gating, leaderboard)     │
 └───────────────────────┬─────────────────────────────────┘
                         │
                         ▼
 ┌─────────────────────────────────────────────────────────┐
 │  clawbench.harnesses  (plugin entry-point group)        │
 │  discovered at runtime via importlib.metadata           │
 └───────────────────────┬─────────────────────────────────┘
                         │
             ┌───────────┼───────────┬───────────┬───────────┬───────────┐
             ▼           ▼           ▼           ▼           ▼           ▼
         openclaw     hermes     claw-code  browser-use  stagehand  coze-studio
         (Python)    (Python)     (Rust)     (Python)    (Node/TS)     (Web)
         dedicated   dedicated   dedicated   dedicated   dedicated   dedicated
         container   container   container   container   container   container
             │           │           │           │           │           │
             └───────────┴───────────┴───────────┼───────────┴───────────┘
                                                 │
                                                 ▼
 ┌─────────────────────────────────────────────────────────┐
 │  clawbench/base:<version>                               │
 │  (Chrome + Xvfb + FFmpeg + extension-server + CDP wire) │
 │  Same image ClawBench uses -- zero drift.               │
 └─────────────────────────────────────────────────────────┘

Each harness ships its own three-file adapter (Dockerfile + setup.sh + run.sh) whose Dockerfile starts FROM clawbench/base:<version>, so the shared stack is byte-for-byte identical across harnesses. See docs/adding-a-harness.md for the walkthrough.
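
The clawbench.harnesses group in the diagram is a standard Python entry-point group, so discovery is a few lines of stdlib. A sketch (what each entry point loads to -- class, factory, or module -- is an assumption here):

# Discover registered harness plugins at runtime (Python 3.10+ API).
from importlib.metadata import entry_points

for ep in entry_points(group="clawbench.harnesses"):
    harness = ep.load()  # adapter object; exact shape is an assumption
    print(ep.name, harness)

This is also why external packages can register harnesses without forking either repo: publishing a distribution with an entry point in that group is enough for the CLI to see it.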


CLI

# List and gate
harness-bench harnesses

# Matrix preview (no side effects)
harness-bench matrix --harness openclaw -h hermes -m claude-sonnet-4-6 -c 001-daily-life-food-uber-eats

# Single run
harness-bench run --harness openclaw --model claude-sonnet-4-6 --case 001-daily-life-food-uber-eats

# Batch (matrix-expand, skip ineligible, run the rest)
harness-bench batch -h openclaw -h hermes -h browser-use -m claude-sonnet-4-6 -c 001 -c 007

# Render leaderboard markdown
harness-bench leaderboard --results-dir ./harness-output/

Evaluation

Evaluation is inherited verbatim from ClawBench -- post-session judge comparing agent trajectories against human reference runs under eval/agentic_eval.md.

 1. Run harnesses (batch)          2. Evaluate (clawbench eval)
 ─────────────────────────         ────────────────────────────────
 harness-bench batch ...    ──►    DOM-match + LLM judge re-used
 produces harness-output/          exactly as ClawBench does it
   with 5-layer recordings         (same rubric, same prompt)

See ClawBench's eval guide -- since the recording format is identical, every tool in ClawBench's eval/ works unchanged on HarnessBench output.
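
In practice that means you can loop ClawBench's judge over HarnessBench output directly. A sketch -- note the --run-dir flag here is an assumption; check ClawBench's eval guide for the real interface:

# Drive `clawbench eval` over every run directory (flag name assumed).
import subprocess
from pathlib import Path

for run_dir in sorted(p for p in Path("./harness-output").glob("*/*/*")
                      if p.is_dir()):
    subprocess.run(["clawbench", "eval", "--run-dir", str(run_dir)], check=True)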


FAQ

Why two repos instead of one tool with a --harness flag?

Runtime incompatibility. ClawBench's shared openclaw-bench container runs three Python harnesses side-by-side because they share a virtualenv. HarnessBench's six harnesses live in Python, Node/TS, Rust, and Web -- not co-installable in one image. Each gets its own container built on the shared clawbench/base:<version> image.

Orthogonal axis. ClawBench holds the harness fixed and sweeps models. HarnessBench holds the model fixed and sweeps harnesses. Same pipeline, different axis of interest -- keeping them as separate repos avoids overloading either CLI's flag surface.

Do I have to run cloud harnesses?

No. stagehand and coze-studio auto-skip without credentials and appear in the matrix as skipped:missing_credential:<VAR>. The four local-first harnesses (openclaw, hermes, claw-code, browser-use) are enough to produce a meaningful leaderboard on any workstation with Docker.

Can I add my own harness?

Yes -- three files (Dockerfile + setup.sh + run.sh) plus one pyproject.toml stanza. See docs/adding-a-harness.md. The plugin loads via the clawbench.harnesses entry-point group, so external packages can register without forking either repo.

How is this different from ClawBench?
  • Axis. ClawBench: one harness, many models. HarnessBench: many harnesses, one (or many) models.
  • Runtime. ClawBench bundles three Python harnesses in one container. HarnessBench gives each harness its own container (Python / Node / Rust / Web are not co-installable).
  • Cloud. ClawBench is fully local-first. HarnessBench supports local-first and cloud-opt-in harnesses in the same matrix.
  • Code reuse. 100% -- HarnessBench imports clawbench-eval rather than forking it.

Which base model should I start with?

Whatever you already trust. The point of HarnessBench is that you pick one model and observe how different harnesses wrap it. For the published numbers we use claude-sonnet-4-6 (ClawBench's top scorer at 33.3% overall), which gives every harness a known-competitive model to wrap. Your own runs can use anything in your models.yaml.


Contributing

We welcome adapters for new harnesses, especially ones that survive the 30-agent global sweep. Most harness adapters are a single directory under src/harnessbench/harnesses/ with three files; see docs/adding-a-harness.md for the walkthrough.


Community

  • Discord -- English community, shared with ClawBench.
  • WeChat group (微信群) -- Chinese community for researchers, developers, and contributors to connect.
  • GitHub Discussions -- async Q&A: searchable, long-form, permanent.

License

Apache-2.0 for the repository. Each bundled harness adapter links to upstream code under the upstream's own license; nothing under an incompatible license is vendored.

Citation

If you use HarnessBench in your research, please cite:

@misc{zhang2026harnessbench,
  title        = {HarnessBench: Comparing Agentic Harnesses on Everyday Online Tasks},
  author       = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year         = {2026},
  note         = {Preprint in preparation},
  howpublished = {\url{https://github.com/reacher-z/HarnessBench}}
}

Core Contributors


Yuxuan Zhang

Yubo Wang

Perry Zhu

Penghui Du

Junwen Miao

Advisors


Kelsey R. Allen

Wenhu Chen

Dongfu Jiang

Liang Chen

Support HarnessBench

If HarnessBench is useful for your research or tool selection, the single most helpful thing you can do is star the repo -- it surfaces the harness-comparison axis to other agent researchers and helps us justify continued adapter work.


Open to contributions -- new harness adapters, leaderboard submissions, or evaluation bug fixes. See CONTRIBUTING.md.

