Last released May 22, 2026
Convert academic papers into benchmark tasks for evaluating AI agents.