Claw Bench

A standardized evaluation benchmark for the Claw ecosystem.

Claw Bench provides a reproducible, container-isolated harness for measuring how well AI agent frameworks perform across real-world desktop and application tasks.


Quick Start

# 1. Install
pip install claw-bench

# 2. Run the benchmark
claw-bench run --adapter openclaw --tasks all

# 3. Submit results to the leaderboard
claw-bench submit results/<run-id>.json

Features

  • Reproducible evaluation -- every task runs in a Docker container with a deterministic initial state.
  • Multi-framework support -- pluggable adapter system lets you benchmark any Claw-compatible agent framework.
  • Rich task library -- curated tasks spanning productivity apps, coding, web browsing, system administration, and more.
  • Automated scoring -- objective rubrics with both binary and partial-credit metrics.
  • CLI-first workflow -- validate tasks, run suites, and submit results from the command line.
  • Encrypted ground truth -- answer keys are age-encrypted so agents cannot peek at solutions.

Supported Frameworks

Framework   Adapter Name   Status      Language
OpenClaw    openclaw       Supported   TypeScript
IronClaw    ironclaw       Supported   Rust
ZeroClaw    zeroclaw       Supported   Rust
QClaw       qclaw          Supported   TypeScript
NullClaw    nullclaw       Supported   Zig
PicoClaw    picoclaw       Supported   Go
NanoBot     nanobot        Supported   Python
DryRun      dryrun         Built-in    Python (oracle)

The dryrun adapter runs oracle solutions directly for infrastructure validation. Register additional frameworks by implementing the ClawAdapter interface and adding an entry point. See CONTRIBUTING.md for details.
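
As a rough illustration of the adapter pattern described above, the sketch below defines a stand-in ClawAdapter base class and one trivial implementation. The real interface is defined in the claw-bench source and documented in CONTRIBUTING.md; the method name run_task, the TaskResult record, the EchoAdapter class, and the in-memory REGISTRY (used here in place of a packaging entry point) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical result record; not the real claw-bench type."""
    task_id: str
    output: str
    success: bool


class ClawAdapter(ABC):
    """Stand-in for the real interface; the method name is assumed."""

    name: str  # adapter name as it would be passed to --adapter

    @abstractmethod
    def run_task(self, task_id: str, prompt: str) -> TaskResult:
        """Run one task and report what the framework produced."""


# Hypothetical in-memory registry standing in for entry-point discovery.
REGISTRY: dict[str, type[ClawAdapter]] = {}


def register(adapter_cls: type[ClawAdapter]) -> type[ClawAdapter]:
    """Class decorator that records an adapter under its name."""
    REGISTRY[adapter_cls.name] = adapter_cls
    return adapter_cls


@register
class EchoAdapter(ClawAdapter):
    """Toy adapter that returns the prompt unchanged."""

    name = "echo"

    def run_task(self, task_id: str, prompt: str) -> TaskResult:
        return TaskResult(task_id=task_id, output=prompt, success=True)


if __name__ == "__main__":
    adapter = REGISTRY["echo"]()
    print(adapter.run_task("demo-001", "list files in /tmp"))
```

In a real package the registry lookup would be replaced by whatever entry-point mechanism CONTRIBUTING.md specifies; the abstract base class only shows how a harness can call every framework through one uniform method.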

Task Library

210 tasks across 14 domains and 4 difficulty levels (L1–L4):

Domain                Tasks   L1   L2   L3   L4
Calendar                 15    5    5    3    2
Code Assistance          15    3    6    4    2
Communication            15    3    5    6    1
Cross-Domain             15    0    0    8    7
Data Analysis            15    3    4    6    2
Document Editing         15    4    6    4    1
Email                    15    3    6    5    1
File Operations          15    6    5    3    1
Memory                   15    1    6    7    1
Multimodal               15    1    6    7    1
Security                 15    3    5    4    3
System Admin             15    3    6    5    1
Web Browsing             15    3    6    5    1
Workflow Automation      15    2    6    6    1
Total                   210   40   72   73   25
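
The totals in the table can be re-derived from the per-domain counts. The snippet below is a small consistency check, not part of claw-bench; the TASKS dictionary and the level_totals helper are names introduced here, with the numbers copied from the table above.

```python
# Per-domain task counts as (L1, L2, L3, L4), copied from the table.
TASKS = {
    "Calendar": (5, 5, 3, 2),
    "Code Assistance": (3, 6, 4, 2),
    "Communication": (3, 5, 6, 1),
    "Cross-Domain": (0, 0, 8, 7),
    "Data Analysis": (3, 4, 6, 2),
    "Document Editing": (4, 6, 4, 1),
    "Email": (3, 6, 5, 1),
    "File Operations": (6, 5, 3, 1),
    "Memory": (1, 6, 7, 1),
    "Multimodal": (1, 6, 7, 1),
    "Security": (3, 5, 4, 3),
    "System Admin": (3, 6, 5, 1),
    "Web Browsing": (3, 6, 5, 1),
    "Workflow Automation": (2, 6, 6, 1),
}


def level_totals(tasks):
    """Sum each difficulty column across all domains."""
    return tuple(sum(column) for column in zip(*tasks.values()))


if __name__ == "__main__":
    totals = level_totals(TASKS)
    print("per-level totals:", totals)
    print("grand total:", sum(totals))
```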

Fair Evaluation Design

Claw Bench addresses the key challenge of comparing frameworks with different Skills ecosystems and model preferences:

  • Skills 3-Condition Comparison (SkillsBench methodology): Each task is tested in vanilla (no skills), curated (Claw Bench standard skills), and native (framework's own skills) modes to isolate framework capability from ecosystem size.
  • Model Standardization: Canonical model tiers (flagship/standard/economy/opensource) ensure fair cross-framework comparison. Frameworks are also tested with their best model configuration.
  • Cost-Performance Pareto Frontier: Visualize optimal framework choices at any budget constraint.
  • Multi-Dimensional Scoring: Task completion (40%), efficiency (20%), security (15%), skills efficacy (15%), UX (10%) with switchable weight profiles.
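
To make the scoring and Pareto ideas above concrete, here is a minimal sketch. It assumes each dimension is scored on a 0 to 1 scale and combined as a plain weighted sum using the default weights listed above; the actual aggregation in claw-bench may differ. The function names, dictionary keys, and the example framework labels (fw_a, fw_b, fw_c) with their numbers are hypothetical.

```python
# Default weight profile from the list above (sums to 1.0).
DEFAULT_WEIGHTS = {
    "task_completion": 0.40,
    "efficiency": 0.20,
    "security": 0.15,
    "skills_efficacy": 0.15,
    "ux": 0.10,
}


def composite_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-dimension scores (assumed to lie in [0, 1])."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[dim] * scores[dim] for dim in weights)


def pareto_frontier(points):
    """Return names of non-dominated (name, cost, score) points.

    A point is kept when no other point is at most as expensive and
    scores at least as well. Results are ordered by increasing cost.
    """
    frontier = []
    best_score = float("-inf")
    # Cheapest first; among equal costs, the highest score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    example = {
        "task_completion": 0.8,
        "efficiency": 0.5,
        "security": 1.0,
        "skills_efficacy": 0.6,
        "ux": 0.7,
    }
    print(round(composite_score(example), 2))  # 0.73

    # Made-up (name, cost, score) triples purely for illustration.
    runs = [("fw_a", 1.0, 0.5), ("fw_b", 2.0, 0.4), ("fw_c", 3.0, 0.9)]
    print(pareto_frontier(runs))  # ['fw_a', 'fw_c']
```

Passing a different weights dictionary to composite_score is one simple way to model the switchable weight profiles; fw_b drops off the frontier because fw_a is both cheaper and higher scoring.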

Project Structure

claw_bench/
  src/claw_bench/       # Core library and CLI
    adapters/           # Framework adapters (e.g. openclaw, ironclaw, zeroclaw)
    core/               # Runner, verifier, scorer, metrics
    cli/                # Command-line interface
  tasks/                # 210 task definitions across 14 domains
    _schema/            # JSON Schema for task validation
  skills/curated/       # Curated skills for fair cross-framework testing
  config/               # Model tiers and skills profile config
  tests/                # Test suite (781 tests, 98% coverage)
  leaderboard/          # Next.js leaderboard frontend
  docs/                 # Documentation
  docker/               # Container images

Development

git clone https://github.com/claw-bench/claw-bench.git
cd claw-bench
pip install -e ".[dev]"
pytest

See CONTRIBUTING.md for the full contribution guide.

License

Apache-2.0. See LICENSE for details.
