Standardized evaluation benchmark for the Claw ecosystem

Claw Bench

A standardized evaluation benchmark for the Claw ecosystem.

Claw Bench provides a reproducible, container-isolated harness for measuring how well AI agent frameworks perform across real-world desktop and application tasks.

Documentation | Leaderboard | Chinese / 中文

Quick Start

# 1. Install
pip install claw-bench

# 2. Run the benchmark
claw-bench run --adapter openclaw --tasks all

# 3. Submit results to the leaderboard
claw-bench submit results/<run-id>.json
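
To benchmark several frameworks in one go, the run step can be scripted. The following is a minimal sketch, assuming only the `run`, `--adapter`, and `--tasks` arguments shown above; the helper names `build_run_command` and `run_all` are illustrative and not part of claw-bench. By default it only prints the commands, so it is safe to try without Docker or any framework installed.

```python
import shlex
import subprocess

# Adapter names taken from the Supported Frameworks table in this README.
ADAPTERS = ["openclaw", "ironclaw", "zeroclaw"]


def build_run_command(adapter: str, tasks: str = "all") -> list[str]:
    """Build the argv for one benchmark run (hypothetical helper).

    Uses only the CLI arguments shown in the Quick Start.
    """
    return ["claw-bench", "run", "--adapter", adapter, "--tasks", tasks]


def run_all(adapters: list[str], execute: bool = False) -> list[list[str]]:
    """Print one run command per adapter; execute them only if asked to."""
    commands = [build_run_command(adapter) for adapter in adapters]
    for cmd in commands:
        print(shlex.join(cmd))
        if execute:
            # check=True raises CalledProcessError if a run exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all(ADAPTERS)
```

Pass `execute=True` to actually launch the runs once `claw-bench` is installed.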

Features

Reproducible evaluation -- every task runs in a Docker container with a deterministic initial state.
Multi-framework support -- pluggable adapter system lets you benchmark any Claw-compatible agent framework.
Rich task library -- curated tasks spanning productivity apps, coding, web browsing, system administration, and more.
Automated scoring -- objective rubrics with both binary and partial-credit metrics.
CLI-first workflow -- validate tasks, run suites, and submit results from the command line.
Encrypted ground truth -- answer keys are age-encrypted so agents cannot peek at solutions.

Supported Frameworks

Framework	Adapter Name	Status	Language
OpenClaw	`openclaw`	Supported	TypeScript
IronClaw	`ironclaw`	Supported	Rust
ZeroClaw	`zeroclaw`	Supported	Rust
QClaw	`qclaw`	Supported	TypeScript
NullClaw	`nullclaw`	Supported	Zig
PicoClaw	`picoclaw`	Supported	Go
NanoBot	`nanobot`	Supported	Python
DryRun	`dryrun`	Built-in	Python (oracle)

The dryrun adapter runs oracle solutions directly for infrastructure validation. Register additional frameworks by implementing the ClawAdapter interface and adding an entry point. See CONTRIBUTING.md for details.
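
The adapter pattern described above can be sketched roughly as follows. This is an illustration of the general plug-in idea only: the real `ClawAdapter` interface is defined in the package and documented in CONTRIBUTING.md, and every name here (`HypotheticalClawAdapter`, `TaskResult`, `run_task`, `register`, `EchoAdapter`) is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical result record; the real schema may differ."""
    task_id: str
    success: bool
    output: str


class HypotheticalClawAdapter(ABC):
    """Stand-in for an adapter base class. NOT the actual ClawAdapter API."""

    name: str  # adapter name, e.g. the value passed to --adapter

    @abstractmethod
    def run_task(self, task_id: str, instruction: str) -> TaskResult:
        """Hand one task to the framework and return its result."""


# A simple in-process registry standing in for entry-point discovery.
REGISTRY: dict[str, type[HypotheticalClawAdapter]] = {}


def register(adapter_cls: type[HypotheticalClawAdapter]) -> type[HypotheticalClawAdapter]:
    """Class decorator that records an adapter under its name."""
    REGISTRY[adapter_cls.name] = adapter_cls
    return adapter_cls


@register
class EchoAdapter(HypotheticalClawAdapter):
    """Toy adapter that echoes the instruction back as its output."""

    name = "echo"

    def run_task(self, task_id: str, instruction: str) -> TaskResult:
        return TaskResult(task_id=task_id, success=True, output=instruction)


adapter = REGISTRY["echo"]()
result = adapter.run_task("demo-001", "list files in /tmp")
print(result.success, result.output)
```

In a packaged adapter, the registry lookup would typically be replaced by a Python entry point declared in the adapter's `pyproject.toml`; the entry-point group name to use is not stated here, so take it from CONTRIBUTING.md.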

Task Library

210 tasks across 14 domains and 4 difficulty levels (L1–L4):

Domain	Tasks	L1	L2	L3	L4
Calendar	15	5	5	3	2
Code Assistance	15	3	6	4	2
Communication	15	3	5	6	1
Cross-Domain	15	0	0	8	7
Data Analysis	15	3	4	6	2
Document Editing	15	4	6	4	1
Email	15	3	6	5	1
File Operations	15	6	5	3	1
Memory	15	1	6	7	1
Multimodal	15	1	6	7	1
Security	15	3	5	4	3
System Admin	15	3	6	5	1
Web Browsing	15	3	6	5	1
Workflow Automation	15	2	6	6	1
Total	210	40	72	73	25
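
As a quick consistency check, the per-domain and per-level totals can be recomputed from the rows of the table. The counts below are copied from the table above; the variable names are illustrative.

```python
# Per-domain task counts copied from the table: (L1, L2, L3, L4).
TASKS = {
    "Calendar": (5, 5, 3, 2),
    "Code Assistance": (3, 6, 4, 2),
    "Communication": (3, 5, 6, 1),
    "Cross-Domain": (0, 0, 8, 7),
    "Data Analysis": (3, 4, 6, 2),
    "Document Editing": (4, 6, 4, 1),
    "Email": (3, 6, 5, 1),
    "File Operations": (6, 5, 3, 1),
    "Memory": (1, 6, 7, 1),
    "Multimodal": (1, 6, 7, 1),
    "Security": (3, 5, 4, 3),
    "System Admin": (3, 6, 5, 1),
    "Web Browsing": (3, 6, 5, 1),
    "Workflow Automation": (2, 6, 6, 1),
}

# Row sums: every domain should contribute exactly 15 tasks.
per_domain = {domain: sum(counts) for domain, counts in TASKS.items()}

# Column sums: tasks per difficulty level across all domains.
per_level = [sum(column) for column in zip(*TASKS.values())]

print(len(TASKS))                  # 14
print(set(per_domain.values()))    # {15}
print(per_level)                   # [40, 72, 73, 25]
print(sum(per_level))              # 210
```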

Fair Evaluation Design

Claw Bench addresses the key challenge of comparing frameworks with different Skills ecosystems and model preferences:

Skills 3-Condition Comparison (SkillsBench methodology): Each task is tested in vanilla (no skills), curated (Claw Bench standard skills), and native (framework's own skills) modes to isolate framework capability from ecosystem size.
Model Standardization: Canonical model tiers (flagship/standard/economy/opensource) ensure fair cross-framework comparison. Frameworks are also tested with their best model configuration.
Cost-Performance Pareto Frontier: Visualize optimal framework choices at any budget constraint.
Multi-Dimensional Scoring: Task completion (40%), efficiency (20%), security (15%), skills efficacy (15%), UX (10%) with switchable weight profiles.
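
The weighted scoring and the cost-performance frontier can be sketched as follows. This is a minimal illustration, not the scorer shipped in `claw_bench`: it assumes each dimension is already normalized to [0, 1], that the composite is a plain weighted sum, and that a run is summarized as a (name, cost, score) tuple. The function names, dictionary keys, and sample numbers are all made up for the example.

```python
# Default weights from the Multi-Dimensional Scoring item above.
DEFAULT_WEIGHTS = {
    "task_completion": 0.40,
    "efficiency": 0.20,
    "security": 0.15,
    "skills_efficacy": 0.15,
    "ux": 0.10,
}


def composite_score(scores: dict[str, float],
                    weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of per-dimension scores (assumed to lie in [0, 1])."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = weights.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[dim] * scores[dim] for dim in weights)


def pareto_frontier(runs: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Keep the (name, cost, score) runs that no cheaper-or-equal run beats.

    Runs are scanned in order of increasing cost (higher score first on
    ties); a run stays only if it scores strictly higher than every run
    seen before it.
    """
    frontier = []
    best_score = float("-inf")
    for run in sorted(runs, key=lambda r: (r[1], -r[2])):
        if run[2] > best_score:
            frontier.append(run)
            best_score = run[2]
    return frontier


# Illustrative numbers only, not real benchmark results.
example = {
    "task_completion": 0.8,
    "efficiency": 0.5,
    "security": 1.0,
    "skills_efficacy": 0.6,
    "ux": 0.7,
}
print(round(composite_score(example), 3))

runs = [("a", 1.0, 0.50), ("b", 2.0, 0.40), ("c", 3.0, 0.90)]
print([name for name, _, _ in pareto_frontier(runs)])
```

A switchable weight profile then amounts to passing a different `weights` dictionary, as long as it still sums to 1.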

Project Structure

claw_bench/
  src/claw_bench/       # Core library and CLI
    adapters/           # Framework adapters (e.g. openclaw, ironclaw, zeroclaw)
    core/               # Runner, verifier, scorer, metrics
    cli/                # Command-line interface
  tasks/                # 210 task definitions across 14 domains
    _schema/            # JSON Schema for task validation
  skills/curated/       # Curated skills for fair cross-framework testing
  config/               # Model tiers and skills profile config
  tests/                # Test suite (781 tests, 98% coverage)
  leaderboard/          # Next.js leaderboard frontend
  docs/                 # Documentation
  docker/               # Container images

Development

git clone https://github.com/claw-bench/claw-bench.git
cd claw-bench
pip install -e ".[dev]"
pytest

See CONTRIBUTING.md for the full contribution guide.

License

Apache-2.0. See LICENSE for details.

Release history

This version

0.1.0

Mar 13, 2026

Download files

Source Distribution

claw_bench-0.1.0.tar.gz (724.8 kB)

Uploaded Mar 13, 2026 Source

Built Distribution

claw_bench-0.1.0-py3-none-any.whl (1.5 MB)

Uploaded Mar 13, 2026 Python 3

File details

Details for the file claw_bench-0.1.0.tar.gz.

File metadata

Download URL: claw_bench-0.1.0.tar.gz
Upload date: Mar 13, 2026
Size: 724.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for claw_bench-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`17c0c6f151248c3f43cb84e1db99a6ef7ce7578506a27de30cabc67a8eeb63c5`
MD5	`f19f996e5dcebb4fc0f0aba9d6be4ec4`
BLAKE2b-256	`12dba0426b13986e8932e36f58b840ff644815d550252433f86a338a541826d0`

File details

Details for the file claw_bench-0.1.0-py3-none-any.whl.

File metadata

Download URL: claw_bench-0.1.0-py3-none-any.whl
Upload date: Mar 13, 2026
Size: 1.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for claw_bench-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`53a27f1432443d61de28c4a5811f5bbfd2425f96606353b323592f689474c00c`
MD5	`bc9bfa806459082789e1fadb39ec4176`
BLAKE2b-256	`801a50221497849c589c676817b8ac4ca7a9676b6225e42ddb61118aa9925e49`
