
ClawBench


Run in one line of code

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh

Clone → Run → Done.   No API keys.   No dataset download.   No manual setup.

Can AI Agents Complete Everyday Online Tasks?

We asked 6 frontier AI agents to do what people do every day --
order food, book travel, apply for jobs, write reviews, manage projects.
The best model completed only 33.3% of tasks.


153 everyday tasks  ·  144 live websites  ·  15 life categories

中文 (Chinese version)


 Live Websites         Isolated Containers         Request Interceptor         Five-Layer Recording


How It Works

   You pick a task            ClawBench spins up           Agent drives the         Interceptor captures
   from 153 real-world        an isolated Docker           browser: navigates,      every action across
   everyday scenarios         container + Chromium         fills forms, clicks      all 5 layers of data
                                                                                    
   ┌──────────────┐           ┌──────────────┐           ┌──────────────┐           ┌──────────────┐
   │  "Book a pet │    ──►    │   Container  │    ──►    │   AI Agent   │    ──►    │   5 layers   │
   │   sitter on  │           │  + Chromium  │           │  browses the │           │  intercepted │
   │   Rover"     │           │  + Agent     │           │   live site  │           │  & recorded  │
   └──────────────┘           └──────────────┘           └──────────────┘           └──────────────┘

LLM Quick Start

Point your coding agent (Claude Code, Cursor, Copilot, etc.) at AGENTS.md and prompt away.


Human Quick Start

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh

Prerequisites: Python 3.11+, uv, and a container engine — Docker or Podman. ClawBench auto-detects whichever is installed; force one with export CONTAINER_ENGINE=docker or export CONTAINER_ENGINE=podman.
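The auto-detection described above can be sketched in a few lines of Python -- honor `CONTAINER_ENGINE` if set, otherwise take whichever engine is first on PATH. This is a sketch of the documented behavior, not ClawBench's actual driver code:

```python
import os
import shutil

def pick_container_engine() -> str:
    """Return "docker" or "podman", honoring the CONTAINER_ENGINE override."""
    forced = os.environ.get("CONTAINER_ENGINE")
    if forced:
        if forced not in ("docker", "podman"):
            raise ValueError(f"unsupported CONTAINER_ENGINE: {forced}")
        return forced
    # No override set: pick the first engine found on PATH.
    for engine in ("docker", "podman"):
        if shutil.which(engine):
            return engine
    raise RuntimeError("no container engine found; install Docker or Podman")
```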

Install Docker or Podman (macOS / Linux / Windows)

macOS

# Option A — Docker Desktop (easiest, includes GUI)
brew install --cask docker
open -a Docker                 # launch and wait for the whale icon to settle

# Option B — Podman (rootless, no daemon, CLI only)
brew install podman
podman machine init            # one-time: downloads the Linux VM image
podman machine start           # must be running before any podman command

macOS Podman needs a VM. brew install podman alone is not enough: Podman on macOS runs containers inside a small Linux VM, so you must run podman machine init && podman machine start once after install, or podman info will fail with "Cannot connect to Podman".

Linux (Ubuntu / Debian)

# Option A — Podman (rootless by default, recommended)
sudo apt update && sudo apt install -y podman

# Option B — Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER  # log out / back in so your shell picks up the group

Rootful Docker ownership note: with classic sudo-docker, files extracted from containers land owned by root on the host. ClawBench's driver detects this after each run and chowns test-output/ back to your user automatically — but if you run other container tooling alongside, rootless Podman (or rootless Docker) avoids the issue entirely.

Windows

# Option A — Docker Desktop (WSL2 backend)
winget install Docker.DockerDesktop
# then launch Docker Desktop from the Start menu and wait for it to be ready

# Option B — Podman
winget install RedHat.Podman
podman machine init
podman machine start

Run the uv run … commands below from PowerShell, WSL2, or Git Bash. Like macOS, Windows Podman requires podman machine init && podman machine start before its first use.

1. Clone and configure:

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench
cp models/models.example.yaml models/models.yaml   # edit: add your model API keys
# `.env` (PurelyMail creds for disposable-email signups) is already committed
# and works out of the box. Edit it only to override defaults or add HF_TOKEN.

[!NOTE] First run builds a container image (chromium + ffmpeg + noVNC + Node + openclaw, roughly 2 GB download, 5–10 min on a decent connection). You'll see a live progress spinner with the current build step. Subsequent runs reuse the cached layers and finish in seconds.

2. Run your first task (pick one):

[!TIP] Recommended → Interactive TUI   guided model + test case selection

./run.sh

Needs an interactive terminal. For pipes / CI / non-TTY, call test-driver/run.py or test-driver/batch.py directly.

(b) Run one specific task against a specific model:

uv run --project test-driver test-driver/run.py \
  test-cases/001-daily-life-food-uber-eats claude-sonnet-4-6

Once the container starts, the script prints a noVNC URL (e.g. http://localhost:6080/vnc.html) — open it in your browser to watch the agent operate in real-time. If port 6080 is already in use, an alternative port is chosen automatically.

Results land in test-output/<model>/<timestamp>-001-.../ with the full five-layer recording.

(c) Drive the browser yourself via noVNC — produces a human reference run:

uv run --project test-driver test-driver/run.py \
  test-cases/001-daily-life-food-uber-eats --human

Open the noVNC URL the script prints, complete the task by hand, then close the tab. Port is auto-assigned if 6080 is busy.


ClawBench-Lite

New here? Run this first. test-cases/lite.json is a 20-task curated subset of the full 153, selected for household-name sites, real-world relevance, difficulty, and category diversity. It matches the 20-tasks-per-source convention of browser-use/benchmark and gives you a credible signal at a fraction of the full-benchmark cost.

Tier distribution: flagship 9 / core 8 / wildcard 3 — spanning daily life (OpenTable, DoorDash, Instacart, TaskRabbit), entertainment (Eventbrite, Goodreads, Fandango), creation (Asana, Mailchimp, Squarespace), travel (Airbnb), education (LeetCode), dev-tech (GitHub), academia (Overleaf), personal management (1Password), and more. All Lite tasks are judged by eval/agentic_eval.md regardless of url_pattern shape.

See test-cases/lite.schema.json for the manifest shape and the notes field in lite.json for the 4-axis selection rubric + full swap history.


Tutorial

Watch on YouTube    Watch on Bilibili


Demos

Ordering food on Uber Eats

https://github.com/user-attachments/assets/placeholder-uber-eats

Submitting a job application

https://github.com/user-attachments/assets/placeholder-greenhouse

Each ClawBench run produces a full MP4 session recording. See the project page for all 153 task recordings.


Example Walkthrough

Curious what one task actually looks like? Here's task 001, end to end.

The task — from test-cases/001-daily-life-food-uber-eats/task.json:

{
  "instruction": "On Uber Eats, order delivery: one Pad Thai, deliver to home address, note \"no peanuts\"",
  "time_limit": 30,
  "eval_schema": {
    "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
    "method": "POST"
  }
}
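Programmatically, a manifest like this is easy to load and sanity-check. The field names below are taken from the example above; the authoritative schema lives in test-cases/task.schema.json, so treat this as a minimal sketch rather than full validation:

```python
import json

REQUIRED_FIELDS = ("instruction", "time_limit", "eval_schema")

def load_task(raw: str) -> dict:
    """Parse a task.json string and check the fields shown above."""
    task = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if f not in task]
    if missing:
        raise ValueError(f"task.json missing fields: {missing}")
    if not isinstance(task["time_limit"], int) or task["time_limit"] <= 0:
        raise ValueError("time_limit must be a positive number of minutes")
    return task
```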

The agent gets this instruction verbatim, plus read-only access to /my-info/alex_green_personal_info.json (the dummy user's name, home address, phone, date of birth) and a disposable email account for any sign-in prompt. It has 30 minutes to finish -- when the limit expires, the container is killed.

What the agent does (the happy path):

  1. Navigates to ubereats.com
  2. Reads the dummy user's home address from /my-info/alex_green_personal_info.json and enters it in the delivery-address box
  3. Searches for "Pad Thai" in the food search
  4. Picks a restaurant that has Pad Thai available for delivery to that address
  5. Opens the item detail page, finds the customization or special-instructions field, enters "no peanuts"
  6. Adds one to cart, opens the cart, and handles any sign-in prompt using the disposable email credentials
  7. Reaches checkout, taps Place Order

What the interceptor catches — that final Place Order tap fires a POST request. ClawBench's request interceptor sits in front of the browser and captures the outbound request before it reaches Uber Eats's servers, so the dummy user is never actually charged. At the exact moment of interception, all five recording layers (MP4 video, PNG screenshots, HTTP traffic, browser actions, agent messages) are frozen into /data/.

How the judge decides PASS / FAIL — task 001's url_pattern is the intentional sentinel __PLACEHOLDER_WILL_NOT_MATCH__, which means no request path can mechanically match. The verdict comes from the agentic judge in eval/agentic_eval.md, which replays the five-layer recording against a human reference run and checks four things:

  • Did the agent actually reach the final checkout step?
  • Is the cart exactly one Pad Thai (not two, not a combo)?
  • Is the delivery address the user's home address from alex_green_personal_info.json?
  • Does the order carry the "no peanuts" note in the instructions field?

All four must hold for a PASS. Miss any one and it's a FAIL with evidence from the recording pinned to the failing criterion. This per-task rubric is what makes ClawBench judge-sensitive rather than URL-regex-sensitive — see eval/README.md for the full rubric format and eval/agentic_eval.md for the judge prompt.
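The all-four-must-hold aggregation can be expressed mechanically. Note this sketches only the final verdict step -- the individual judgments themselves come from the LLM judge driven by eval/agentic_eval.md, not from code:

```python
def verdict(criteria: dict[str, bool]) -> str:
    """PASS only if every rubric criterion holds; otherwise FAIL,
    naming the first failing criterion as the pinned evidence."""
    failed = [name for name, ok in criteria.items() if not ok]
    return "PASS" if not failed else f"FAIL: {failed[0]}"
```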


Results

Success rate (%) of 6 frontier AI agents on ClawBench

Rank  Model                   Overall  Daily  Finance  Work  Dev   Academic  Travel  Social  Pets
1     Claude Sonnet 4.6        33.3    44.2    50.0    19.0  11.1    50.0     23.1    38.9   18.2
2     GLM-5                    24.2    30.8    16.7    38.1  16.7    28.6      0.0    16.7   18.2
3     Gemini 3 Flash           19.0    15.4    33.3    23.8  22.2    28.6     30.8    11.1    0.0
4     Claude Haiku 4.5         18.3    15.4    22.2    19.0  27.8    21.4      7.7    16.7   18.2
5     GPT-5.4                   6.5     9.6     0.0     0.0  11.1     7.1      7.7     0.0    9.1
6     Gemini 3.1 Flash Lite     3.3     1.9     0.0     0.0   5.6    14.3      0.0     0.0    9.1
Task Categories (15 categories, 153 tasks)
Category                   Tasks  Example Platforms
Daily Life                   21   Uber Eats, DoorDash, Instacart, Zillow, Craigslist
Entertainment & Hobbies      15   Ticketmaster, AMC Theatres, Topgolf, Crunchyroll
Creation & Initialization    13   Squarespace, Wix, Webflow, Ghost, Substack
Rating & Voting              10   Trustpilot, G2, Goodreads, RateMyProfessors
Travel                        9   Booking.com, Expedia, Airbnb, TripAdvisor
Education & Learning          9   Coursera, Udemy, Khan Academy, Duolingo
Office & Secretary            9   Google Calendar, Slack, Notion, Trello
Beauty & Personal Care        9   Sephora, Ulta, Glossier
Job Search & HR               8   LinkedIn, Greenhouse, Lever, Workday
Pet & Animal Care             8   Chewy, Petco, Rover
Personal Management           6   Mint, YNAB, Todoist
Shopping & Commerce           6   Amazon, eBay, Etsy, Target
Nonprofit & Charity           6   GoFundMe, DonorsChoose
Academia & Research           5   Google Scholar, Semantic Scholar, OpenReview
Finance & Investment          4   Robinhood, Fidelity, Coinbase
Others                       15   Automation, Dev & Tech, Government, Home Services, Automotive

Architecture

Container internals
┌─────────────────────────────────────────────────┐
│  Container (Docker / Podman)                    │
│                                                 │
│  ┌───────────┐   DOM events  ┌──────────────┐   │
│  │ content.js├──────────────►│ background.js│   │
│  │ (per tab) │               │  (service    │   │
│  └───────────┘               │   worker)    │   │
│                              └──┬──────┬────┘   │
│                                 │      │        │
│                         actions │      │ screenshots
│                                 │      │        │
│  ┌──────────┐            ┌──────▼──────▼────┐   │
│  │  Xvfb    │◄──ffmpeg──►│  FastAPI Server  │   │
│  │ :99      │  x11grab   │  :7878           │   │
│  └──────────┘            └──────────────────┘   │
│                                  │              │
│  ┌──────────┐            ┌───────▼─────────┐    │
│  │ Chromium │            │     /data       │    │
│  │ :9222 CDP│            │  actions.jsonl  │    │
│  └──────────┘            │  requests.jsonl │    │
│                          │  screenshots/   │    │
│                          │  recording.mp4  │    │
│                          └─────────────────┘    │
└─────────────────────────────────────────────────┘

CLI

# Interactive TUI (recommended):
./run.sh

# Single run:
uv run --project test-driver test-driver/run.py test-cases/001-daily-life-food-uber-eats claude-sonnet-4-6

# Human mode (you control the browser via noVNC):
uv run --project test-driver test-driver/run.py test-cases/001-daily-life-food-uber-eats --human

# Batch (all models x cases 1-50, 3 concurrent):
uv run --project test-driver test-driver/batch.py --all-models --case-range 1-50 --max-concurrent 3

See test-driver/README.md for full CLI documentation, batch runner flags, test case format, and output structure.


Evaluation

Evaluation is a post-session step -- first run agents to collect trajectories, then evaluate them against human reference runs.

 1. Run agents (test-driver)       2. Evaluate (eval/)
 ─────────────────────────         ────────────────────────────────
 ./run.sh  or  batch.py     ──►    Claude Code subagents compare
 produces test-output/             agent vs human trajectories
   with 5-layer recordings         under eval/agentic_eval.md rubric

The evaluator compares each agent trajectory against a human reference trajectory across all five recording layers (video, screenshots, HTTP traffic, browser actions, agent messages), then outputs PASS/FAIL with evidence-backed justification.

See eval/README.md for the full evaluation guide and Claude Code prompt template.


FAQ

What data does each run produce?

Each session records five layers of synchronized data under /data/:

Layer               File                   Description
Session replay      recording.mp4          Full session video (H.264, 15fps)
Action screenshots  screenshots/*.png      Timestamped PNG per browser action
Browser actions     actions.jsonl          Every DOM event (click, keydown, input, pageLoad, scroll, etc.)
HTTP traffic        requests.jsonl         Every HTTP request with headers, body, and query params
Agent messages      agent-messages.jsonl   Full agent conversation transcript (thinking, text, tool calls)

The interceptor result is saved to interception.json.
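Both .jsonl layers are plain JSON Lines, so post-hoc analysis needs no special tooling. As a sketch, here's how you might histogram browser actions from an actions.jsonl -- note the "type" field name is an assumption about the exact record schema, not something this README specifies:

```python
import json
from collections import Counter
from pathlib import Path

def action_histogram(path: Path) -> Counter:
    """Count browser actions by their "type" field in a JSON Lines file."""
    counts = Counter()
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line).get("type", "unknown")] += 1
    return counts
```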

How does the request interceptor work?

The interceptor blocks critical, irreversible HTTP requests (checkout, form submit, email send) to prevent real-world side effects. It connects to Chrome via CDP's Fetch domain and matches requests against the eval schema (url_pattern regex + method + optional body/params). When triggered, it saves the blocked request to interception.json, kills the agent, and stops recording.

The interceptor does not validate task completion -- evaluation is handled separately by evaluators post-session.

For tasks behind payment walls (agent has no valid credit card), the eval schema uses a placeholder pattern that never matches, so the session runs until timeout.
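The matching rule described above (url_pattern regex plus HTTP method, ignoring the optional body/params checks) can be sketched as follows; note how task 001's sentinel pattern can never fire:

```python
import re

def should_intercept(schema: dict, url: str, method: str) -> bool:
    """Match an outbound request against an eval schema:
    the method must match and url_pattern must hit as a regex."""
    if method.upper() != schema.get("method", "POST").upper():
        return False
    return re.search(schema["url_pattern"], url) is not None

# Task 001's sentinel pattern matches no real request path,
# so the session always runs until timeout:
sentinel = {"url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__", "method": "POST"}
```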

What is the synthetic user profile?

Each container gets a /my-info/ directory with a dummy user identity (Alex Green): personal info JSON, email credentials, and a resume PDF. The email is a fresh disposable PurelyMail address generated per run. The agent reads these files when it needs to fill forms, register accounts, etc.

Source templates: shared/alex_green_personal_info.json (profile) and test-driver/resume_template.json (resume).

Can I use Podman instead of Docker?

Yes. Set export CONTAINER_ENGINE=podman. The framework auto-detects whichever is available. Podman works without root privileges.

What tools can the agent use?

The OpenClaw agent can only use the browser tool and a restricted set of read-only shell commands (ls, cat, find, grep, head, tail, jq, wc, etc.). Commands that could bypass the browser (curl, python, node, wget) are blocked. The agent instruction also explicitly requires browser-only task completion.
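The shell restriction amounts to a command allowlist. A sketch of that check, with the command sets taken from the text above (the real enforcement lives in the agent harness, not in this snippet):

```python
# Read-only commands the agent may run (from the list above).
READ_ONLY_ALLOWED = {"ls", "cat", "find", "grep", "head", "tail", "jq", "wc"}
# Commands blocked because they could bypass the browser.
BLOCKED = {"curl", "python", "node", "wget"}

def is_command_allowed(cmdline: str) -> bool:
    """Allow only the read-only commands; reject blocked or unknown ones."""
    tokens = cmdline.strip().split()
    if not tokens:
        return False
    cmd = tokens[0]
    if cmd in BLOCKED:
        return False
    return cmd in READ_ONLY_ALLOWED
```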

How do I add a new test case?

See CONTRIBUTING.md. In short: create a directory under test-cases/ with a task.json conforming to test-cases/task.schema.json, define the eval schema, test with human mode, and submit a PR.
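A minimal scaffold for a new case can be generated in a few lines. The field values below follow this README's task 001 example; validate the result against test-cases/task.schema.json and replace the sentinel url_pattern before submitting:

```python
import json
from pathlib import Path

def scaffold_case(root: Path, name: str, instruction: str) -> Path:
    """Create <root>/<name>/task.json with the minimal fields
    shown in the task 001 example. Returns the task.json path."""
    case_dir = root / name
    case_dir.mkdir(parents=True, exist_ok=True)
    task = {
        "instruction": instruction,
        "time_limit": 30,
        "eval_schema": {
            "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
            "method": "POST",
        },
    }
    path = case_dir / "task.json"
    path.write_text(json.dumps(task, indent=2) + "\n")
    return path
```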


Contributing

We welcome contributions -- especially new test cases. See CONTRIBUTING.md.

Citation

If you use ClawBench in your research, please cite:

@misc{zhang2026clawbench,
  title         = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author        = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year          = {2026},
  eprint        = {2604.08523},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2604.08523}
}

Core Contributors


Yuxuan Zhang

Yubo Wang

Perry Zhu

Penghui Du

Junwen Miao

Advisors


Kelsey R. Allen

Wenhu Chen

Dongfu Jiang

Liang Chen

Star History

ClawBench Star History

License & Acknowledgments

Apache 2.0 -- see LICENSE.

Built with OpenClaw, noVNC (MPL 2.0), and websockify (LGPL 3.0).
