9 projects
weblinx-browsergym
BrowserGym integration for the WebLINX benchmark
agent-reward-bench
Official library for AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
agent-as-annotators
Agent-as-Annotators: Structured Distillation of Web Agent Capabilities
llm2vec-gen
LLM2Vec-Gen: Generative Embeddings from Large Language Models
webllama
Llama-powered agents for automatic web browsing
instruct-qa
Empirical evaluation of retrieval-augmented instruction-following models
weblinx
The official WebLINX library
statcan-dialogue-dataset
The Statcan Dialogue Dataset
safearena
SafeArena is a benchmark for agent safety