ClawBench
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh
Clone → Run → Done. No API keys. No dataset download. No manual setup.
Can AI Agents Complete Everyday Online Tasks?
We asked 6 frontier AI agents to do what people do every day --
order food, book travel, apply for jobs, write reviews, manage projects.
The best model completed only 33.3% of tasks.
153 everyday tasks · 144 live websites · 15 life categories
Live Websites
Isolated Containers
Request Interceptor
Five-Layer Recording
How It Works
1. You pick a task from 153 real-world everyday scenarios
2. ClawBench spins up an isolated Docker container + Chromium
3. Agent drives the browser: navigates, fills forms, clicks
4. Interceptor captures every action across all 5 layers of data
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ "Book a pet │ ──► │ Container │ ──► │ AI Agent │ ──► │ 5 layers │
│ sitter on │ │ + Chromium │ │ browses the │ │ intercepted │
│ Rover" │ │ + Agent │ │ live site │ │ & recorded │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
LLM Quick Start
Point your coding agent (Claude Code, Cursor, Copilot, etc.) at AGENTS.md and prompt away.
Human Quick Start
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh
Prerequisites: Python 3.11+, uv, and a container engine — Docker or Podman. ClawBench auto-detects whichever is installed; force one with export CONTAINER_ENGINE=docker or export CONTAINER_ENGINE=podman.
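The auto-detection above can be sketched in a few lines. This is a hypothetical illustration (`choose_engine` is not a real ClawBench function): honor `CONTAINER_ENGINE` if set, otherwise take the first engine found on `PATH`.

```python
import os
import shutil

def choose_engine() -> str:
    """Pick a container engine: honor CONTAINER_ENGINE, else auto-detect."""
    forced = os.environ.get("CONTAINER_ENGINE")
    if forced:
        if shutil.which(forced) is None:
            raise RuntimeError(f"CONTAINER_ENGINE={forced!r} is not on PATH")
        return forced
    # No override: take whichever engine is installed.
    for engine in ("docker", "podman"):
        if shutil.which(engine):
            return engine
    raise RuntimeError("Neither docker nor podman found on PATH")
```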
Install Docker or Podman (macOS / Linux / Windows)
macOS
# Option A — Docker Desktop (easiest, includes GUI)
brew install --cask docker
open -a Docker # launch and wait for the whale icon to settle
# Option B — Podman (rootless, no daemon, CLI only)
brew install podman
podman machine init # one-time: downloads the Linux VM image
podman machine start # must be running before any podman command
macOS Podman needs a VM. `brew install podman` alone is not enough — Podman on macOS runs containers inside a small Linux VM, so you must run `podman machine init && podman machine start` once after install, or `podman info` will fail with `Cannot connect to Podman`.
Linux (Ubuntu / Debian)
# Option A — Podman (rootless by default, recommended)
sudo apt update && sudo apt install -y podman
# Option B — Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER # log out / back in so your shell picks up the group
Rootful Docker ownership note: with classic sudo Docker, files extracted from containers land owned by `root` on the host. ClawBench's driver detects this after each run and chowns `test-output/` back to your user automatically — but if you run other container tooling alongside, rootless Podman (or rootless Docker) avoids the issue entirely.
Windows
# Option A — Docker Desktop (WSL2 backend)
winget install Docker.DockerDesktop
# then launch Docker Desktop from the Start menu and wait for it to be ready
# Option B — Podman
winget install RedHat.Podman
podman machine init
podman machine start
Run the `uv run …` commands below from PowerShell, WSL2, or Git Bash. Like macOS, Windows Podman requires `podman machine init && podman machine start` before first use.
1. Clone and configure:
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench
cp models/models.example.yaml models/models.yaml # edit: add your model API keys
# `.env` (PurelyMail creds for disposable-email signups) is already committed
# and works out of the box. Edit it only to override defaults or add HF_TOKEN.
[!NOTE] First run builds a container image (chromium + ffmpeg + noVNC + Node + openclaw, roughly 2 GB download, 5–10 min on a decent connection). You'll see a live progress spinner with the current build step. Subsequent runs reuse the cached layers and finish in seconds.
2. Run your first task (pick one):
(a) Recommended → interactive TUI with guided model + test case selection:
./run.sh
[!TIP] `./run.sh` needs an interactive terminal. For pipes / CI / non-TTY, call `test-driver/run.py` or `test-driver/batch.py` directly.
(b) Run one specific task against a specific model:
uv run --project test-driver test-driver/run.py \
test-cases/001-daily-life-food-uber-eats claude-sonnet-4-6
Once the container starts, the script prints a noVNC URL (e.g. http://localhost:6080/vnc.html) — open it in your browser to watch the agent operate in real-time. If port 6080 is already in use, an alternative port is chosen automatically.
Results land in test-output/<model>/<timestamp>-001-.../ with the full five-layer recording.
(c) Drive the browser yourself via noVNC — produces a human reference run:
uv run --project test-driver test-driver/run.py \
test-cases/001-daily-life-food-uber-eats --human
Open the noVNC URL the script prints, complete the task by hand, then close the tab. Port is auto-assigned if 6080 is busy.
ClawBench-Lite
New here? Run this first. test-cases/lite.json is a 20-task curated subset of the full 153, selected for household-name sites, real-world relevance, difficulty, and category diversity. It matches the 20-tasks-per-source convention of browser-use/benchmark and gives you a credible signal at a fraction of the full-benchmark cost.
Tier distribution: flagship 9 / core 8 / wildcard 3 — spanning daily life (OpenTable, DoorDash, Instacart, TaskRabbit), entertainment (Eventbrite, Goodreads, Fandango), creation (Asana, Mailchimp, Squarespace), travel (Airbnb), education (LeetCode), dev-tech (GitHub), academia (Overleaf), personal management (1Password), and more. All Lite tasks are judged by eval/agentic_eval.md regardless of url_pattern shape.
See test-cases/lite.schema.json for the manifest shape and the notes field in lite.json for the 4-axis selection rubric + full swap history.
Tutorial
Demos
| Demo | Recording |
|---|---|
| Ordering food on Uber Eats | https://github.com/user-attachments/assets/placeholder-uber-eats |
| Submitting a job application | https://github.com/user-attachments/assets/placeholder-greenhouse |
Each ClawBench run produces a full MP4 session recording. See the project page for all 153 task recordings.
Example Walkthrough
Curious what one task actually looks like, start to finish? Here's task 001 end to end.
The task — from test-cases/001-daily-life-food-uber-eats/task.json:
{
"instruction": "On Uber Eats, order delivery: one Pad Thai, deliver to home address, note \"no peanuts\"",
"time_limit": 30,
"eval_schema": {
"url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
"method": "POST"
}
}
The agent gets this instruction verbatim, plus read-only access to /my-info/alex_green_personal_info.json (the dummy user's name, home address, phone, date of birth) and a disposable email account for any sign-in prompt. It has 30 minutes to reach a POST request — any longer and the container is killed.
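The fields above map directly onto the run. A minimal sketch of how a driver could load `task.json` and derive the kill deadline and the interception regex (`load_task` is a hypothetical helper name, not ClawBench's actual API):

```python
import json
import re
import time
from pathlib import Path

def load_task(case_dir: str) -> dict:
    # task.json carries the instruction, time limit (minutes), and eval schema.
    task = json.loads(Path(case_dir, "task.json").read_text())
    # Container is killed once this wall-clock deadline passes.
    task["deadline"] = time.time() + task["time_limit"] * 60
    # Compile the interception pattern; the sentinel placeholder never matches.
    task["url_re"] = re.compile(task["eval_schema"]["url_pattern"])
    return task
```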
What the agent does (the happy path):
- Navigates to ubereats.com
- Reads the dummy user's home address from /my-info/alex_green_personal_info.json and enters it in the delivery-address box
- Searches for "Pad Thai" in the food search
- Picks a restaurant that has Pad Thai available for delivery to that address
- Opens the item detail page, finds the customization or special-instructions field, enters "no peanuts"
- Adds one to cart, opens the cart, and handles any sign-in prompt using the disposable email credentials
- Reaches checkout, taps Place Order
What the interceptor catches — that final Place Order tap fires a POST request. ClawBench's request interceptor sits in front of the browser and captures the outbound request before it reaches Uber Eats's servers, so the dummy user is never actually charged. At the exact moment of interception, all five recording layers (MP4 video, PNG screenshots, HTTP traffic, browser actions, agent messages) are frozen into /data/.
How the judge decides PASS / FAIL — task 001's url_pattern is the intentional sentinel __PLACEHOLDER_WILL_NOT_MATCH__, which means no request path can mechanically match. The verdict comes from the agentic judge in eval/agentic_eval.md, which replays the five-layer recording against a human reference run and checks four things:
- Did the agent actually reach the final checkout step?
- Is the cart exactly one Pad Thai (not two, not a combo)?
- Is the delivery address the user's home address from alex_green_personal_info.json?
- Does the order carry the "no peanuts" note in the instructions field?
All four must hold for a PASS. Miss any one and it's a FAIL with evidence from the recording pinned to the failing criterion. This per-task rubric is what makes ClawBench judge-sensitive rather than URL-regex-sensitive — see eval/README.md for the full rubric format and eval/agentic_eval.md for the judge prompt.
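The all-or-nothing rule can be sketched as follows. This is only the shape of the verdict (the real judge is an LLM following eval/agentic_eval.md, and `Criterion`/`verdict` are hypothetical names):

```python
from typing import NamedTuple

class Criterion(NamedTuple):
    name: str
    holds: bool
    evidence: str  # e.g. a screenshot path or a requests.jsonl line

def verdict(criteria: list[Criterion]) -> dict:
    # PASS only if every rubric criterion holds; otherwise FAIL,
    # pinned to the first failing criterion with its evidence.
    for c in criteria:
        if not c.holds:
            return {"verdict": "FAIL", "failed": c.name, "evidence": c.evidence}
    return {"verdict": "PASS"}
```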
Results
Success rate (%) of 6 frontier AI agents on ClawBench
| Rank | Model | Overall | Daily | Finance | Work | Dev | Academic | Travel | Social | Pets |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 33.3 | 44.2 | 50.0 | 19.0 | 11.1 | 50.0 | 23.1 | 38.9 | 18.2 |
| 2 | GLM-5 | 24.2 | 30.8 | 16.7 | 38.1 | 16.7 | 28.6 | 0.0 | 16.7 | 18.2 |
| 3 | Gemini 3 Flash | 19.0 | 15.4 | 33.3 | 23.8 | 22.2 | 28.6 | 30.8 | 11.1 | 0.0 |
| 4 | Claude Haiku 4.5 | 18.3 | 15.4 | 22.2 | 19.0 | 27.8 | 21.4 | 7.7 | 16.7 | 18.2 |
| 5 | GPT-5.4 | 6.5 | 9.6 | 0.0 | 0.0 | 11.1 | 7.1 | 7.7 | 0.0 | 9.1 |
| 6 | Gemini 3.1 Flash Lite | 3.3 | 1.9 | 0.0 | 0.0 | 5.6 | 14.3 | 0.0 | 0.0 | 9.1 |
Task Categories (15 categories, 153 tasks)
| Category | Tasks | Example Platforms |
|---|---|---|
| Daily Life | 21 | Uber Eats, DoorDash, Instacart, Zillow, Craigslist |
| Entertainment & Hobbies | 15 | Ticketmaster, AMC Theatres, Topgolf, Crunchyroll |
| Creation & Initialization | 13 | Squarespace, Wix, Webflow, Ghost, Substack |
| Rating & Voting | 10 | Trustpilot, G2, Goodreads, RateMyProfessors |
| Travel | 9 | Booking.com, Expedia, Airbnb, TripAdvisor |
| Education & Learning | 9 | Coursera, Udemy, Khan Academy, Duolingo |
| Office & Secretary | 9 | Google Calendar, Slack, Notion, Trello |
| Beauty & Personal Care | 9 | Sephora, Ulta, Glossier |
| Job Search & HR | 8 | LinkedIn, Greenhouse, Lever, Workday |
| Pet & Animal Care | 8 | Chewy, Petco, Rover |
| Personal Management | 6 | Mint, YNAB, Todoist |
| Shopping & Commerce | 6 | Amazon, eBay, Etsy, Target |
| Nonprofit & Charity | 6 | GoFundMe, DonorsChoose |
| Academia & Research | 5 | Google Scholar, Semantic Scholar, OpenReview |
| Finance & Investment | 4 | Robinhood, Fidelity, Coinbase |
| Others | 15 | Automation, Dev & Tech, Government, Home Services, Automotive |
Architecture
Container internals
┌─────────────────────────────────────────────────┐
│ Container (Docker / Podman) │
│ │
│ ┌───────────┐ DOM events ┌──────────────┐ │
│ │ content.js├──────────────►│ background.js│ │
│ │ (per tab) │ │ (service │ │
│ └───────────┘ │ worker) │ │
│ └──┬──────┬────┘ │
│ │ │ │
│ actions │ │ screenshots
│ │ │ │
│ ┌──────────┐ ┌──────▼──────▼────┐ │
│ │ Xvfb │◄──ffmpeg──►│ FastAPI Server │ │
│ │ :99 │ x11grab │ :7878 │ │
│ └──────────┘ └──────────────────┘ │
│ │ │
│ ┌──────────┐ ┌───────▼─────────┐ │
│ │ Chromium │ │ /data │ │
│ │ :9222 CDP│ │ actions.jsonl │ │
│ └──────────┘ │ requests.jsonl │ │
│ │ screenshots/ │ │
│ │ recording.mp4 │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────┘
CLI
# Interactive TUI (recommended):
./run.sh
# Single run:
uv run --project test-driver test-driver/run.py test-cases/001-daily-life-food-uber-eats claude-sonnet-4-6
# Human mode (you control the browser via noVNC):
uv run --project test-driver test-driver/run.py test-cases/001-daily-life-food-uber-eats --human
# Batch (all models x cases 1-50, 3 concurrent):
uv run --project test-driver test-driver/batch.py --all-models --case-range 1-50 --max-concurrent 3
See test-driver/README.md for full CLI documentation, batch runner flags, test case format, and output structure.
Evaluation
Evaluation is a post-session step -- first run agents to collect trajectories, then evaluate them against human reference runs.
1. Run agents (test-driver) 2. Evaluate (eval/)
───────────────────────── ────────────────────────────────
./run.sh or batch.py ──► Claude Code subagents compare
produces test-output/ agent vs human trajectories
with 5-layer recordings under eval/agentic_eval.md rubric
The evaluator compares each agent trajectory against a human reference trajectory across all five recording layers (video, screenshots, HTTP traffic, browser actions, agent messages), then outputs PASS/FAIL with evidence-backed justification.
See eval/README.md for the full evaluation guide and Claude Code prompt template.
FAQ
What data does each run produce?
Each session records five layers of synchronized data under /data/:
| Layer | File | Description |
|---|---|---|
| Session replay | recording.mp4 | Full session video (H.264, 15fps) |
| Action screenshots | screenshots/*.png | Timestamped PNG per browser action |
| Browser actions | actions.jsonl | Every DOM event (click, keydown, input, pageLoad, scroll, etc.) |
| HTTP traffic | requests.jsonl | Every HTTP request with headers, body, and query params |
| Agent messages | agent-messages.jsonl | Full agent conversation transcript (thinking, text, tool calls) |
The interceptor result is saved to interception.json.
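The three `.jsonl` layers share one format: one JSON object per line. A minimal reader sketch (`read_jsonl` is a hypothetical helper, not part of ClawBench's API):

```python
import json
from pathlib import Path

def read_jsonl(path: str) -> list[dict]:
    # Each layer file (actions.jsonl, requests.jsonl, agent-messages.jsonl)
    # holds one JSON object per line; blank lines are skipped.
    return [json.loads(line)
            for line in Path(path).read_text().splitlines()
            if line.strip()]
```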
How does the request interceptor work?
The interceptor blocks critical, irreversible HTTP requests (checkout, form submit, email send) to prevent real-world side effects. It connects to Chrome via CDP's Fetch domain and matches requests against the eval schema (url_pattern regex + method + optional body/params). When triggered, it saves the blocked request to interception.json, kills the agent, and stops recording.
The interceptor does not validate task completion -- evaluation is handled separately by evaluators post-session.
For tasks behind payment walls (agent has no valid credit card), the eval schema uses a placeholder pattern that never matches, so the session runs until timeout.
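Conceptually, the match against the eval schema is a regex-plus-method check. A minimal sketch of that decision (hypothetical function; the real interceptor also handles optional body/param matching and the CDP Fetch wiring):

```python
import re

def should_intercept(schema: dict, method: str, url: str) -> bool:
    # Block a request when its HTTP method matches the schema and its URL
    # matches the schema's url_pattern regex. The sentinel pattern used for
    # payment-walled tasks never matches, so those sessions run to timeout.
    return (method == schema["method"]
            and re.search(schema["url_pattern"], url) is not None)
```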
What is the synthetic user profile?
Each container gets a /my-info/ directory with a dummy user identity (Alex Green): personal info JSON, email credentials, and a resume PDF. The email is a fresh disposable PurelyMail address generated per run. The agent reads these files when it needs to fill forms, register accounts, etc.
Source templates: shared/alex_green_personal_info.json (profile) and test-driver/resume_template.json (resume).
Can I use Podman instead of Docker?
Yes. Set export CONTAINER_ENGINE=podman. The framework auto-detects whichever is available. Podman works without root privileges.
What tools can the agent use?
The OpenClaw agent can only use the browser tool and a restricted set of read-only shell commands (ls, cat, find, grep, head, tail, jq, wc, etc.). Commands that could bypass the browser (curl, python, node, wget) are blocked. The agent instruction also explicitly requires browser-only task completion.
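The restriction amounts to an allowlist check on the command's first token. A simplified sketch (hypothetical names; a real sandbox must also handle pipes, subshells, and argument-level escapes):

```python
import shlex

ALLOWED = {"ls", "cat", "find", "grep", "head", "tail", "jq", "wc"}

def is_allowed(command: str) -> bool:
    # Gate on the first token only; curl/python/node/wget are simply
    # absent from the allowlist, so browser bypasses are rejected.
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in ALLOWED
```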
How do I add a new test case?
See CONTRIBUTING.md. In short: create a directory under test-cases/ with a task.json conforming to test-cases/task.schema.json, define the eval schema, test with human mode, and submit a PR.
Contributing
We welcome contributions -- especially new test cases. See CONTRIBUTING.md.
Citation
If you use ClawBench in your research, please cite:
@misc{zhang2026clawbench,
title = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
author = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
year = {2026},
eprint = {2604.08523},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2604.08523}
}
Core Contributors
Yuxuan Zhang · Yubo Wang · Perry Zhu · Penghui Du · Junwen Miao
Advisors
Kelsey R. Allen · Wenhu Chen · Dongfu Jiang · Liang Chen
License & Acknowledgments
Apache 2.0 -- see LICENSE.
Built with OpenClaw, noVNC (MPL 2.0), and websockify (LGPL 3.0).