ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
ClawsBench evaluates LLM agents on realistic productivity tasks across 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack), measuring both capability (task success) and safety (harmful action prevention).
- 44 tasks: 30 single-service + 14 cross-service, including 24 safety-critical scenarios
- 6 models: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Gemini 3.1 Flash-Lite, GLM-5
- 4 harnesses: OpenClaw, Claude Code, Codex, Gemini CLI
- 33 conditions, 7,224 trials
Links
- GitHub: https://github.com/benchflow-ai/ClawsBench
- Paper: https://arxiv.org/abs/2604.05172
- Dataset: https://huggingface.co/datasets/benchflow/ClawsBench
- Website: https://clawsbench.benchflow.ai
- Discord: https://discord.gg/mZ9Rc8q8W3
Citation
@misc{li2026clawsbenchevaluatingcapabilitysafety,
  title={ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces},
  author={Xiangyi Li and Kyoung Whan Choe and Yimin Liu and Xiaokun Chen and Chujun Tao and Bingran You and Wenbo Chen and Zonglin Di and Jiankai Sun and Shenghan Zheng and Jiajun Bao and Yuanli Wang and Weixiang Yan and Yiyuan Li and Han-chung Lee},
  year={2026},
  eprint={2604.05172},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2604.05172},
}
File details
Details for the file clawsbench-0.2.0.tar.gz.
File metadata
- Download URL: clawsbench-0.2.0.tar.gz
- Upload date:
- Size: 1.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3630aafb5601667df3320c5ac2e83d9f76d59c9e3e871d4070ab70235940178 |
| MD5 | 9ee0a998b8128ddb5df6c1376b8a86d0 |
| BLAKE2b-256 | 8c66fe7282475fbc838b98afc987ebefa7211e54943c1987a258073a1defcda1 |
File details
Details for the file clawsbench-0.2.0-py3-none-any.whl.
File metadata
- Download URL: clawsbench-0.2.0-py3-none-any.whl
- Upload date:
- Size: 2.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e5ed01eb1f6a24f40c2f496d8faa0edd614acc5e1ee31463dfbbbfb28efe153 |
| MD5 | b5b2e51ba163b0ed00313d04a572614b |
| BLAKE2b-256 | b15d22e68abd716fadb2c72a9de2885e46764ab54ddcdf83020ad05dcd77ad26 |
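Verifying a download
The SHA256 digests listed in the file hash tables can be compared against a locally downloaded copy of either file. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_published_hash` are hypothetical and not part of the clawsbench package, and the sketch assumes the file has been saved under its original file name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file hash tables on this page.
EXPECTED_SHA256 = {
    "clawsbench-0.2.0.tar.gz":
        "b3630aafb5601667df3320c5ac2e83d9f76d59c9e3e871d4070ab70235940178",
    "clawsbench-0.2.0-py3-none-any.whl":
        "2e5ed01eb1f6a24f40c2f496d8faa0edd614acc5e1ee31463dfbbbfb28efe153",
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path):
    """Hypothetical helper: True only if the file name is listed above
    and the file's SHA256 digest equals the published one."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

For example, `matches_published_hash("clawsbench-0.2.0.tar.gz")` would return `True` only for a byte-identical copy of the source distribution described above, and `False` for an altered file or an unlisted file name.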