Claw Bench
A standardized evaluation benchmark for the Claw ecosystem.
Claw Bench provides a reproducible, container-isolated harness for measuring how well AI agent frameworks perform across real-world desktop and application tasks.
Quick Start
```shell
# 1. Install
pip install claw-bench

# 2. Run the benchmark
claw-bench run --adapter openclaw --tasks all

# 3. Submit results to the leaderboard
claw-bench submit results/<run-id>.json
```
Features
- Reproducible evaluation -- every task runs in a Docker container with a deterministic initial state.
- Multi-framework support -- pluggable adapter system lets you benchmark any Claw-compatible agent framework.
- Rich task library -- curated tasks spanning productivity apps, coding, web browsing, system administration, and more.
- Automated scoring -- objective rubrics with both binary and partial-credit metrics.
- CLI-first workflow -- validate tasks, run suites, and submit results from the command line.
- Encrypted ground truth -- answer keys are age-encrypted so agents cannot peek at solutions.
Supported Frameworks
| Framework | Adapter Name | Status | Language |
|---|---|---|---|
| OpenClaw | openclaw | Supported | TypeScript |
| IronClaw | ironclaw | Supported | Rust |
| ZeroClaw | zeroclaw | Supported | Rust |
| QClaw | qclaw | Supported | TypeScript |
| NullClaw | nullclaw | Supported | Zig |
| PicoClaw | picoclaw | Supported | Go |
| NanoBot | nanobot | Supported | Python |
| DryRun | dryrun | Built-in | Python (oracle) |
The dryrun adapter runs oracle solutions directly for infrastructure validation. Register additional frameworks by implementing the ClawAdapter interface and adding an entry point. See CONTRIBUTING.md for details.
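As a rough illustration of what a pluggable adapter can look like, here is a minimal sketch. It does not reproduce the real ClawAdapter interface, which is defined in the package and described in CONTRIBUTING.md. The names `AdapterSketch`, `TaskResult`, `run_task`, `EchoAdapter`, and `run_all` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class TaskResult:
    """Hypothetical result container; not the package's actual type."""
    task_id: str
    success: bool
    output: str = ""


class AdapterSketch(Protocol):
    """Hypothetical adapter shape; the real ClawAdapter interface may differ."""
    name: str

    def run_task(self, task_id: str, instruction: str) -> TaskResult:
        ...


class EchoAdapter:
    """Toy adapter that 'solves' every task by echoing the instruction back."""
    name = "echo"

    def run_task(self, task_id: str, instruction: str) -> TaskResult:
        return TaskResult(task_id=task_id, success=True, output=instruction)


def run_all(adapter: AdapterSketch, tasks: dict[str, str]) -> list[TaskResult]:
    """Run every task through the adapter, preserving task order."""
    return [adapter.run_task(task_id, text) for task_id, text in tasks.items()]


if __name__ == "__main__":
    for result in run_all(EchoAdapter(), {"t1": "hello", "t2": "world"}):
        print(result.task_id, result.success, result.output)
```

The point of the structural `Protocol` is that a harness can drive any framework that exposes the same small surface, without the framework inheriting from a harness base class.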
Task Library
210 tasks across 14 domains and 4 difficulty levels (L1–L4):
| Domain | Tasks | L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|
| Calendar | 15 | 5 | 5 | 3 | 2 |
| Code Assistance | 15 | 3 | 6 | 4 | 2 |
| Communication | 15 | 3 | 5 | 6 | 1 |
| Cross-Domain | 15 | 0 | 0 | 8 | 7 |
| Data Analysis | 15 | 3 | 4 | 6 | 2 |
| Document Editing | 15 | 4 | 6 | 4 | 1 |
| | 15 | 3 | 6 | 5 | 1 |
| File Operations | 15 | 6 | 5 | 3 | 1 |
| Memory | 15 | 1 | 6 | 7 | 1 |
| Multimodal | 15 | 1 | 6 | 7 | 1 |
| Security | 15 | 3 | 5 | 4 | 3 |
| System Admin | 15 | 3 | 6 | 5 | 1 |
| Web Browsing | 15 | 3 | 6 | 5 | 1 |
| Workflow Automation | 15 | 2 | 6 | 6 | 1 |
| Total | 210 | 40 | 72 | 73 | 25 |
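The totals in the table can be checked mechanically. The sketch below copies the per-domain counts into a dictionary and sums them. The key `"(unnamed)"` is a placeholder for the one row whose domain name is missing from the table, and the helper names are chosen for this example only.

```python
# Per-domain task counts by difficulty (L1, L2, L3, L4), copied from the table.
# "(unnamed)" is a placeholder for the row whose domain name is missing.
TASKS_BY_DOMAIN = {
    "Calendar": (5, 5, 3, 2),
    "Code Assistance": (3, 6, 4, 2),
    "Communication": (3, 5, 6, 1),
    "Cross-Domain": (0, 0, 8, 7),
    "Data Analysis": (3, 4, 6, 2),
    "Document Editing": (4, 6, 4, 1),
    "(unnamed)": (3, 6, 5, 1),
    "File Operations": (6, 5, 3, 1),
    "Memory": (1, 6, 7, 1),
    "Multimodal": (1, 6, 7, 1),
    "Security": (3, 5, 4, 3),
    "System Admin": (3, 6, 5, 1),
    "Web Browsing": (3, 6, 5, 1),
    "Workflow Automation": (2, 6, 6, 1),
}


def level_totals(table):
    """Sum each difficulty column (L1..L4) across all domains."""
    return tuple(sum(column) for column in zip(*table.values()))


def domain_totals(table):
    """Sum the four difficulty counts for each domain."""
    return {name: sum(counts) for name, counts in table.items()}


if __name__ == "__main__":
    print(level_totals(TASKS_BY_DOMAIN))       # (40, 72, 73, 25)
    print(sum(level_totals(TASKS_BY_DOMAIN)))  # 210
```

Every domain sums to 15 tasks, and the four column totals add up to 210, which matches the Total row.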
Fair Evaluation Design
Claw Bench addresses the key challenge of comparing frameworks with different Skills ecosystems and model preferences:
- Skills 3-Condition Comparison (SkillsBench methodology): Each task is tested in vanilla (no skills), curated (Claw Bench standard skills), and native (framework's own skills) modes to isolate framework capability from ecosystem size.
- Model Standardization: Canonical model tiers (flagship/standard/economy/opensource) ensure fair cross-framework comparison. Frameworks are also tested with their best model configuration.
- Cost-Performance Pareto Frontier: Visualize optimal framework choices at any budget constraint.
- Multi-Dimensional Scoring: Task completion (40%), efficiency (20%), security (15%), skills efficacy (15%), UX (10%) with switchable weight profiles.
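To make the weighting and the Pareto idea concrete, here is a minimal sketch. It assumes each dimension is already normalized to the range 0 to 1 and that the composite is a plain weighted sum; the benchmark's actual scorer may combine dimensions differently. The function names, dictionary keys, and example runs are illustrative, not taken from the package.

```python
# Default weights from the list above; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "task_completion": 0.40,
    "efficiency": 0.20,
    "security": 0.15,
    "skills_efficacy": 0.15,
    "ux": 0.10,
}


def composite_score(dimensions, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-dimension scores (assumed to lie in [0, 1])."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * dimensions[name] for name in weights)


def pareto_frontier(runs):
    """Keep runs that no other run beats on both cost (lower) and score (higher).

    `runs` is a list of (label, cost, score) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, best score first.
    for label, cost, score in sorted(runs, key=lambda r: (r[1], -r[2])):
        if score > best_score:
            frontier.append((label, cost, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    dims = {"task_completion": 0.8, "efficiency": 0.5, "security": 1.0,
            "skills_efficacy": 0.6, "ux": 0.7}
    print(round(composite_score(dims), 2))  # 0.73

    # Made-up (label, cost, score) runs, for illustration only.
    runs = [("a", 1.0, 0.50), ("b", 2.0, 0.45), ("c", 3.0, 0.70), ("d", 3.0, 0.65)]
    print([label for label, _, _ in pareto_frontier(runs)])  # ['a', 'c']
```

A switchable weight profile is then just a different `weights` dictionary passed to `composite_score`, as long as its values still sum to 1.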
Project Structure
```
claw_bench/
  src/claw_bench/     # Core library and CLI
    adapters/         # Framework adapters (openclaw, ironclaw, zeroclaw)
    core/             # Runner, verifier, scorer, metrics
    cli/              # Command-line interface
  tasks/              # 210 task definitions across 14 domains
    _schema/          # JSON Schema for task validation
  skills/curated/     # Curated skills for fair cross-framework testing
  config/             # Model tiers and skills profile config
  tests/              # Test suite (781 tests, 98% coverage)
  leaderboard/        # Next.js leaderboard frontend
  docs/               # Documentation
  docker/             # Container images
```
Development
```shell
git clone https://github.com/claw-bench/claw-bench.git
cd claw-bench
pip install -e ".[dev]"
pytest
```
See CONTRIBUTING.md for the full contribution guide.
License
Apache-2.0. See LICENSE for details.