A verified version of the WebArena Benchmark

WebArena-Verified

WebArena-Verified is the verified release of the WebArena benchmark. It distributes a curated, version-controlled dataset of web tasks together with deterministic evaluators that operate on agent responses and captured network traces. The project is designed for reproducible benchmarking of web agents and provides tooling for both single-task debugging and batch evaluation.

📖 Documentation

📢 Announcements

February 2, 2026: Optimized Docker images for all WebArena environments are now available on Docker Hub! The images are up to 92% smaller than the originals and include auto-login headers, and Map (beta) now runs in a single container instead of 5 separate containers. See the Environments documentation.
February 2, 2026: WebArena-Verified is now available via Docker and uvx! Run uvx webarena-verified --help or docker run ghcr.io/servicenow/webarena-verified:latest --help to get started.
January 7, 2026: WebArena-Verified is now available on PyPI! Install it easily with pip install webarena-verified.
December 2, 2025: We are presenting WebArena-Verified at the Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025 on December 7th in San Diego. Come see us!
November 12, 2025: Began an initial release with collaborators to gather early feedback, catch issues, and clarify the documentation. Public release scheduled for December 4th, 2025.

🎯 Highlights

Fully audited benchmark: Every task, reference answer, and evaluator has been manually reviewed and corrected
Offline evaluation: Evaluate agent runs without requiring live web environments using network trace replay
Deterministic scoring: Removed LLM-as-a-judge evaluation and substring matching in favor of type-aware normalization and structural comparison
WebArena-Verified Hard subset: A difficulty-prioritized 258-task subset for cost-effective evaluation
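
The README does not show how the evaluators normalize values, so the following is only an illustrative sketch of why type-aware comparison is stricter than substring matching. The `normalize` and `structurally_equal` helpers are hypothetical and are not part of the webarena-verified package.

```python
import re
from decimal import Decimal

# Matches plain integers and decimals such as "12", "-3.50", "1299.00".
_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def normalize(value):
    """Hypothetical normalizer: map equivalent surface forms to one value."""
    if isinstance(value, bool) or value is None:
        return value
    if isinstance(value, (int, float)):
        return Decimal(str(value))
    if isinstance(value, str):
        text = value.strip()
        # Treat currency-like strings ("$1,299.00") as numbers.
        candidate = text.replace("$", "").replace(",", "")
        if _NUMBER.fullmatch(candidate):
            return Decimal(candidate)
        return text.casefold()
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(item) for item in value]
    return value


def structurally_equal(expected, actual):
    """Compare two values after normalization, element by element."""
    return normalize(expected) == normalize(actual)


if __name__ == "__main__":
    print(structurally_equal("$1,299.00", 1299))  # same number, different form
    print("12" in "120")                          # substring matching accepts this
    print(structurally_equal("12", "120"))        # typed comparison rejects it
```

Under these assumptions, a substring check would accept `"12"` against `"120"`, while the typed comparison treats them as two different numbers.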

Usage

uvx (no install)

uvx webarena-verified COMMAND [ARGS]

pip / uv (project dependency)

# Setup (choose one)
pip install webarena-verified
# uv add webarena-verified

# Usage
webarena-verified COMMAND [ARGS]
# or (inside uv-managed project)
uv run webarena-verified COMMAND [ARGS]

Docker

# Usage
docker run --rm ghcr.io/servicenow/webarena-verified:latest COMMAND [ARGS]

Example:

uvx webarena-verified eval-tasks --task-ids 108 --output-dir examples/agent_logs/demo

Dataset

WebArena-Verified provides:

Full dataset: the complete benchmark with all 812 verified tasks across supported sites.
Hard subset: a difficulty-prioritized subset of 258 tasks for faster, lower-cost evaluation.

Full dataset

# From the repo
cat assets/dataset/webarena-verified.json > webarena-verified.json

# From the CLI
webarena-verified dataset-get --output webarena-verified.json

# From Docker
docker run --rm \
  -v "$PWD:/data" \
  ghcr.io/servicenow/webarena-verified:latest \
  dataset-get --output /data/webarena-verified.json

From Hugging Face dataset:

from datasets import load_dataset

dataset = load_dataset("AmineHA/WebArena-Verified", split="full")

Hard subset

# From the CLI
webarena-verified subset-export --name webarena-verified-hard --output webarena-verified-hard.json

# From Docker
docker run --rm \
  -v "$PWD:/data" \
  ghcr.io/servicenow/webarena-verified:latest \
  subset-export --name webarena-verified-hard --output /data/webarena-verified-hard.json

From Hugging Face dataset:

from datasets import load_dataset

dataset = load_dataset("AmineHA/WebArena-Verified", split="hard")
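
After exporting either file, a quick sanity check is to load it and count the tasks. This sketch assumes the exported file is a JSON array of task objects and that each task carries a `sites` list; both are assumptions about the schema that this README does not confirm, and `summarize_tasks` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_tasks(path):
    """Return (task count, tasks per site) for an exported dataset file."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(tasks, list):
        raise ValueError("expected a JSON array of task objects")
    per_site = Counter()
    for task in tasks:
        # "sites" is an assumed field name; tasks without it are still counted.
        for site in task.get("sites", []):
            per_site[site] += 1
    return len(tasks), dict(per_site)


if __name__ == "__main__":
    dataset_path = Path("webarena-verified.json")
    if dataset_path.exists():
        total, per_site = summarize_tasks(dataset_path)
        print(f"{total} tasks")
        for site, count in sorted(per_site.items()):
            print(f"  {site}: {count}")
```

The same function can be pointed at `webarena-verified-hard.json` to check the size of the hard subset.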

🌐 Environments

Note: We have fixed multiple known issues in several environments. See the Environments documentation for details on fixes and current behavior.

Start and Stop Sites

Run sites with the built-in CLI, or run site containers directly with Docker.

# CLI
webarena-verified env start --site <site>  # sites: shopping, shopping_admin, reddit, gitlab, wikipedia, map
webarena-verified env setup init --site wikipedia --data-dir ./downloads  # data download required
webarena-verified env start --site wikipedia --data-dir ./downloads
webarena-verified env setup init --site map --data-dir ./downloads  # data download required
webarena-verified env start --site map
webarena-verified env stop --site <site>
webarena-verified env stop-all

# Docker
docker run -d --name webarena-verified-shopping -p 7770:80 -p 7771:8877 am1n3e/webarena-verified-shopping
docker run -d --name webarena-verified-shopping_admin -p 7780:80 -p 7781:8877 am1n3e/webarena-verified-shopping_admin
docker run -d --name webarena-verified-reddit -p 9999:80 -p 9998:8877 am1n3e/webarena-verified-reddit
docker run -d --name webarena-verified-gitlab -p 8023:8023 -p 8024:8877 am1n3e/webarena-verified-gitlab

# Wikipedia: requires --data-dir setup and a mounted data volume
docker run -d --name webarena-verified-wikipedia \
  -p 8888:8080 -p 8889:8877 \
  -v /path/to/downloads:/data:ro \
  am1n3e/webarena-verified-wikipedia

# Map: run setup first (webarena-verified env setup init --site map --data-dir ./downloads)
docker run -d --name webarena-verified-map \
  -p 3030:3000 -p 3031:8877 \
  -v webarena-verified-map-tile-db:/data/database \
  -v webarena-verified-map-routing-car:/data/routing/car \
  -v webarena-verified-map-routing-bike:/data/routing/bike \
  -v webarena-verified-map-routing-foot:/data/routing/foot \
  -v webarena-verified-map-nominatim-db:/data/nominatim/postgres \
  -v webarena-verified-map-nominatim-flatnode:/data/nominatim/flatnode \
  -v webarena-verified-map-website-db:/var/lib/postgresql/14/main \
  -v webarena-verified-map-tiles:/data/tiles \
  -v webarena-verified-map-style:/data/style \
  am1n3e/webarena-verified-map
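
For scripts that need to reach the sites, the host ports from the Docker commands above can be collected in one place. This is a small sketch: `HOST_PORTS` and `site_urls` are hypothetical names, the ports are copied from the `-p` flags above, and reading the second port of each pair as the env-ctrl endpoint is an inference from those mappings rather than something stated here.

```python
# Host ports taken from the `docker run -p HOST:CONTAINER` flags above:
# (site port, second published port, assumed to expose env-ctrl).
HOST_PORTS = {
    "shopping": (7770, 7771),
    "shopping_admin": (7780, 7781),
    "reddit": (9999, 9998),
    "gitlab": (8023, 8024),
    "wikipedia": (8888, 8889),
    "map": (3030, 3031),
}


def site_urls(site, host="localhost"):
    """Return the site URL and the assumed env-ctrl URL for one site."""
    site_port, ctrl_port = HOST_PORTS[site]
    return {
        "site": f"http://{host}:{site_port}",
        "env_ctrl": f"http://{host}:{ctrl_port}",
    }


if __name__ == "__main__":
    for name in HOST_PORTS:
        print(name, site_urls(name))
```

If you start the sites with different `-p` values, update the table to match.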

Environment Control

Check status and initialize environments via env-ctrl, using either the CLI or the HTTP API.

# CLI (inside a running site container)
docker exec <container> env-ctrl status
docker exec <container> env-ctrl init

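# HTTP (use the host port mapped to container port 8877, e.g. 7771 for shopping with the Docker commands above)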
curl http://localhost:8877/status
curl -X POST http://localhost:8877/init
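
The same two calls can be made from Python with the standard library. Only the `/status` and `/init` paths and their HTTP methods come from the curl commands above; the response format is not described here, so this sketch returns the raw body as text. `build_request` and `call_env_ctrl` are hypothetical helpers.

```python
import urllib.request

# HTTP method per env-ctrl action, as used in the curl examples.
_METHODS = {"status": "GET", "init": "POST"}


def build_request(base_url, action):
    """Build the request for an env-ctrl action ("status" or "init")."""
    if action not in _METHODS:
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url.rstrip('/')}/{action}"
    return urllib.request.Request(url, method=_METHODS[action])


def call_env_ctrl(base_url, action, timeout=10.0):
    """Send the request and return (HTTP status code, response body as text)."""
    request = build_request(base_url, action)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

For example, `call_env_ctrl("http://localhost:8877", "status")` mirrors the first curl command and needs a running site container to succeed.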

See the Environments documentation and Environment Control docs for site-specific Docker commands, ports, and credentials.

🧪 Evaluate A Task

Evaluate a task using the CLI or programmatically:

CLI:

webarena-verified eval-tasks \
  --task-ids 108 \
  --output-dir examples/agent_logs/demo \
  --config examples/configs/config.example.json

Library:

Start by creating a WebArenaVerified instance with your environment configuration:

from pathlib import Path
from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with configuration
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)

# Get a single task
task = wa.get_task(44)
print(f"Task intent: {task.intent}")

Once you have your agent's output, evaluate it against the task definition:

With Files:

# Evaluate a task with file paths
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response_44.json"),
    network_trace=Path("output/44/network_44.har")
)

print(f"Score: {result.score}, Status: {result.status}")
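
Before evaluating, it can help to look at what the network trace contains. The sketch below assumes the `.har` file follows the standard HTTP Archive layout (`log.entries`, each with a `request` and a `response`); `list_requests` is a hypothetical helper for inspection only and does not reproduce the evaluator's logic.

```python
import json
from pathlib import Path


def list_requests(har_path):
    """Return (method, url, status) for every entry in a HAR file."""
    har = json.loads(Path(har_path).read_text(encoding="utf-8"))
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        rows.append((request.get("method"), request.get("url"), response.get("status")))
    return rows


if __name__ == "__main__":
    trace = Path("output/44/network_44.har")
    if trace.exists():
        for method, url, status in list_requests(trace):
            print(method, status, url)
```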

With Inline Response:

# Evaluate a task with inline response
result = wa.evaluate_task(
    task_id=44,
    agent_response={
        "task_type": "NAVIGATE",
        "status": "SUCCESS",
        "retrieved_data": None
    },
    network_trace=Path("output/44/network_44.har")
)

print(f"Score: {result.score}, Status: {result.status}")
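
To score several tasks from Python, the call above can be wrapped in a loop. This sketch accepts any callable that takes the same keyword arguments as `wa.evaluate_task` in the examples above and assumes `result.score` is numeric; the `output/<id>/` file layout mirrors the paths used in the file-based example, and `evaluate_many` and `mean_score` are hypothetical helpers.

```python
from pathlib import Path


def evaluate_many(evaluate, task_ids, output_dir):
    """Evaluate each task and return a {task_id: score} mapping."""
    scores = {}
    for task_id in task_ids:
        task_dir = Path(output_dir) / str(task_id)
        result = evaluate(
            task_id=task_id,
            agent_response=task_dir / f"agent_response_{task_id}.json",
            network_trace=task_dir / f"network_{task_id}.har",
        )
        scores[task_id] = result.score
    return scores


def mean_score(scores):
    """Average score over all evaluated tasks (0.0 when there are none)."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

With the instance created earlier, usage would look like `scores = evaluate_many(wa.evaluate_task, [44, 108], "output")`. For larger runs, the `eval-tasks` CLI command is the documented batch entry point.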

See the Quick Start Guide for a complete walkthrough using example task logs.

📊 Dataset

The WebArena-Verified dataset is in assets/dataset/webarena-verified.json.
The original WebArena dataset is in assets/dataset/test.raw.json (kept for reference).
The WebArena-Verified Hard subset task IDs are in assets/dataset/subsets/webarena-verified-hard.json.

To export the hard subset's task data:

webarena-verified subset-export --name webarena-verified-hard --output webarena-verified-hard.json

See the documentation for more info.

🤝 Contributing

We welcome improvements to both the dataset and the evaluation tooling. See the Contributing Guide for guidelines, local development tips, and dataset update workflows.

📄 Citation

If you use WebArena-Verified in your research, please cite our paper:

@inproceedings{
hattami2025webarena,
title={WebArena Verified: Reliable Evaluation for Web Agents},
author={Amine El hattami and Megh Thakkar and Nicolas Chapados and Christopher Pal},
booktitle={Workshop on Scaling Environments for Agents},
year={2025},
url={https://openreview.net/forum?id=94tlGxmqkN}
}

🙏 Acknowledgements

We thank Prof. Shuyan Zhou and Prof. Graham Neubig for their valuable guidance and feedback.

Project details

Release history

1.2.3 (this version): Feb 7, 2026
1.1.2: Feb 6, 2026
1.1.1: Feb 5, 2026
1.0.0: Jan 7, 2026

Download files

Source Distribution

webarena_verified-1.2.3.tar.gz (303.9 kB), uploaded Feb 7, 2026, Source

Built Distribution

webarena_verified-1.2.3-py3-none-any.whl (349.3 kB), uploaded Feb 7, 2026, Python 3

File details

Details for the file webarena_verified-1.2.3.tar.gz.

File metadata

Download URL: webarena_verified-1.2.3.tar.gz
Upload date: Feb 7, 2026
Size: 303.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for webarena_verified-1.2.3.tar.gz
Algorithm	Hash digest
SHA256	`668ff976ec000f60592ca14e3d935df48facb0482bc81d3c077258cb2f139869`
MD5	`e34aeffc16bd18392813f734b2974aad`
BLAKE2b-256	`2994f6188ad5de7d299368c55aa0e57fb0ae07dbfc5065f56fab7b58cc800f93`
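
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch: `sha256_of` and `matches_digest` are hypothetical helper names, and the local file path is assumed to be wherever you saved the download.

```python
import hashlib
from pathlib import Path

# SHA256 digest of webarena_verified-1.2.3.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "668ff976ec000f60592ca14e3d935df48facb0482bc81d3c077258cb2f139869"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    sdist = Path("webarena_verified-1.2.3.tar.gz")
    if sdist.exists():
        print("digest OK" if matches_digest(sdist, EXPECTED_SDIST_SHA256) else "digest MISMATCH")
```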

Provenance

The following attestation bundles were made for webarena_verified-1.2.3.tar.gz:

Publisher: release-publish.yml on ServiceNow/webarena-verified

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: webarena_verified-1.2.3.tar.gz
- Subject digest: 668ff976ec000f60592ca14e3d935df48facb0482bc81d3c077258cb2f139869
- Sigstore transparency entry: 927191066
- Sigstore integration time: Feb 7, 2026
Source repository:
- Permalink: ServiceNow/webarena-verified@6473f72db5dcefc97b5725b59e734504edc28a21
- Branch / Tag: refs/tags/v1.2.3
- Owner: https://github.com/ServiceNow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-publish.yml@6473f72db5dcefc97b5725b59e734504edc28a21
- Trigger Event: release

File details

Details for the file webarena_verified-1.2.3-py3-none-any.whl.

File metadata

Download URL: webarena_verified-1.2.3-py3-none-any.whl
Upload date: Feb 7, 2026
Size: 349.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for webarena_verified-1.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`91a0297a096a343a13ff5e0ebc35c39b3d926b27c4d5a8a57c833681acbc6b7a`
MD5	`6de2c94d617f73f66517b411cf249f5a`
BLAKE2b-256	`d0d6f69571f12c96f7ae74335a0e48e165ed2bcedd981808e3fe3f64c70ac396`

Provenance

The following attestation bundles were made for webarena_verified-1.2.3-py3-none-any.whl:

Publisher: release-publish.yml on ServiceNow/webarena-verified

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: webarena_verified-1.2.3-py3-none-any.whl
- Subject digest: 91a0297a096a343a13ff5e0ebc35c39b3d926b27c4d5a8a57c833681acbc6b7a
- Sigstore transparency entry: 927191083
- Sigstore integration time: Feb 7, 2026
Source repository:
- Permalink: ServiceNow/webarena-verified@6473f72db5dcefc97b5725b59e734504edc28a21
- Branch / Tag: refs/tags/v1.2.3
- Owner: https://github.com/ServiceNow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-publish.yml@6473f72db5dcefc97b5725b59e734504edc28a21
- Trigger Event: release
