21 projects
gicsbench
GICS benchmarking framework for AI agents
govbench
Governance benchmarking framework for AI agents
benchflow
Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.
benchskills
Agent skills benchmarking framework
neoswe
NeoSWE — next-generation software engineering benchmark
clawsbench
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
onlylabs
OnlyLabs
srbench
SRBench — self-rewarding benchmark
selfrewardbench
SelfRewardBench — benchmark for self-rewarding agents
selfreward
SelfReward — self-rewarding agents
autoreward
AutoReward — automated reward modeling
rsibench
RSI Bench — agent benchmark suite
clawverse
ClawVerse — composable agent task universes
clawuniverse
ClawUniverse — a universe of agent environments
smolclaws
Mock environments for AI agent testing. https://smolclaw.com
skillsbench
Skillsbench - A placeholder package
computer-use-core
A placeholder package for computer-use-core
pokemon-gym
A placeholder package for pokemon-gym
comp-use
A placeholder package for comp-use
computer-gym
A placeholder package for computer-gym
benchmarkthing
Evals as an API - The easiest way to evaluate and benchmark AI models and systems