6 projects
harbor
A framework for evaluating and optimizing agents and models using sandboxed environments.
harbor-rewardkit
Lightweight grading toolkit for environment-based tasks.
terminus-ai
Terminus CLI: An autonomous AI agent for terminal-based task execution.
terminal-bench
Terminal-bench is a collection of tasks and an evaluation harness for measuring AI agents' ability to complete complex tasks in terminal environments.
sandboxes
Add your description here
benchmarks
A library for building agentic benchmarks.