7 projects
weblinx-browsergym
BrowserGym integration for the WebLINX benchmark
agent-reward-bench
Official library for AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
webllama
Llama-powered agents for automatic web browsing
instruct-qa
Empirical evaluation of retrieval-augmented instruction-following models
weblinx
The official WebLINX library
statcan-dialogue-dataset
The Statcan Dialogue Dataset
safearena
SafeArena is a benchmark for agent safety