ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

ClawsBench evaluates LLM agents on realistic productivity tasks across 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack), measuring both capability (task success) and safety (harmful action prevention).

  • 44 tasks: 30 single-service + 14 cross-service, including 24 safety-critical scenarios
  • 6 models: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Gemini 3.1 Flash-Lite, GLM-5
  • 4 harnesses: OpenClaw, Claude Code, Codex, Gemini CLI
  • 33 conditions, 7,224 trials

Citation

@misc{li2026clawsbenchevaluatingcapabilitysafety,
      title={ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces}, 
      author={Xiangyi Li and Kyoung Whan Choe and Yimin Liu and Xiaokun Chen and Chujun Tao and Bingran You and Wenbo Chen and Zonglin Di and Jiankai Sun and Shenghan Zheng and Jiajun Bao and Yuanli Wang and Weixiang Yan and Yiyuan Li and Han-chung Lee},
      year={2026},
      eprint={2604.05172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.05172}, 
}

Download files

Source Distribution

clawsbench-0.2.0.tar.gz (1.8 kB)

Uploaded Source

Built Distribution

clawsbench-0.2.0-py3-none-any.whl (2.3 kB)

Uploaded Python 3

File details

Details for the file clawsbench-0.2.0.tar.gz.

File metadata

  • Download URL: clawsbench-0.2.0.tar.gz
  • Upload date:
  • Size: 1.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for clawsbench-0.2.0.tar.gz

Algorithm    Hash digest
SHA256       b3630aafb5601667df3320c5ac2e83d9f76d59c9e3e871d4070ab70235940178
MD5          9ee0a998b8128ddb5df6c1376b8a86d0
BLAKE2b-256  8c66fe7282475fbc838b98afc987ebefa7211e54943c1987a258073a1defcda1
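
The listed digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module, not part of the clawsbench package. The helper names `sha256_of` and `matches_listed_hash` are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for clawsbench-0.2.0.tar.gz on this page.
EXPECTED_SHA256 = "b3630aafb5601667df3320c5ac2e83d9f76d59c9e3e871d4070ab70235940178"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_hash(path, expected=EXPECTED_SHA256):
    """Hypothetical helper: compare a file's digest with the listed one."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive sits in the current working directory.
    archive = Path("clawsbench-0.2.0.tar.gz")
    if archive.exists():
        print("OK" if matches_listed_hash(archive) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

The same function works for the wheel by passing its SHA256 digest as `expected`.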

File details

Details for the file clawsbench-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: clawsbench-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 2.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for clawsbench-0.2.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       2e5ed01eb1f6a24f40c2f496d8faa0edd614acc5e1ee31463dfbbbfb28efe153
MD5          b5b2e51ba163b0ed00313d04a572614b
BLAKE2b-256  b15d22e68abd716fadb2c72a9de2885e46764ab54ddcdf83020ad05dcd77ad26
