14 projects
benchflow
Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.
gicsbench
GICS benchmarking framework for AI agents
govbench
Governance benchmarking framework for AI agents
benchskills
Agent skills benchmarking framework
neoswe
NeoSWE — next-generation software engineering benchmark
clawsbench
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
onlylabs
OnlyLabs
srbench
SRBench — self-rewarding benchmark
selfreward
SelfReward — self-rewarding agents
autoreward
AutoReward — automated reward modeling
rsibench
RSI Bench — agent benchmark suite
skillsbench
Skillsbench: a placeholder package
pokemon-gym
A placeholder package for pokemon-gym
benchmarkthing
Evals as an API: the easiest way to evaluate and benchmark AI models and systems