Last released Jun 28, 2026
Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.