WorkArena benchmark for BrowserGym

These details have been verified by PyPI

Project links

homepage

GitHub Statistics

Maintainers

aldro61 gasse

These details have not been verified by PyPI

Development Status
- 2 - Pre-Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

WorkArena: A Benchmark for Evaluating Agents on Knowledge Work Tasks

Explore the BrowserGym Ecosystem

Looking for more tools and resources? Check out these open-source projects:

AgentLab
BrowserGym

Both are part of the broader BrowserGym ecosystem

Papers

[ICML 2024] WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? [Paper]
[NeurIPS 2024] WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [Paper]

WorkArena is a suite of browser-based tasks designed to measure how effectively web agents support the routine tasks of knowledge workers. Built on the widely used ServiceNow platform, the benchmark helps assess the state of such automation in modern knowledge work environments.

The preferred way to evaluate on WorkArena is with AgentLab, which runs parallel experiments through BrowserGym and reports on a unified leaderboard.

https://github.com/ServiceNow/WorkArena/assets/2374980/68640f09-7d6f-4eb1-b556-c294a6afef70

Getting Started

To set up WorkArena, you will need to gain access to ServiceNow instances and install our Python package locally. Follow the steps below.

a) Gain Access to ServiceNow Instances

Navigate to https://huggingface.co/datasets/ServiceNow/WorkArena-Instances.
Fill out the form and accept the terms to request access to the gated repository, then wait for approval.
Ensure that the machine where you will run WorkArena is authenticated with Hugging Face (e.g., via huggingface-cli login or the HUGGING_FACE_HUB_TOKEN environment variable).
Unset any previous WorkArena environment variables if you are upgrading from a previous install (SNOW_INSTANCE_URL, etc.)
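
The checks in the last two steps can be scripted before launching an experiment. The following is a minimal sketch, not part of WorkArena: the helper name check_workarena_env is hypothetical, and treating every variable that starts with SNOW_ as a leftover from an older install is an assumption based on the SNOW_INSTANCE_URL example above. It only detects a token passed through the environment, so a login done with huggingface-cli login will not show up here.

```python
import os
from typing import Mapping, Optional


def check_workarena_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Report whether a Hugging Face token is set and which SNOW_* variables remain."""
    env = os.environ if environ is None else environ
    return {
        # Only the environment variable is checked; a cached CLI login is not detected.
        "has_hf_token": bool(env.get("HUGGING_FACE_HUB_TOKEN")),
        # Assumption: variables from older installs share the SNOW_ prefix.
        "stale_vars": sorted(name for name in env if name.startswith("SNOW_")),
    }


if __name__ == "__main__":
    report = check_workarena_env()
    print("Hugging Face token in environment:", report["has_hf_token"])
    for name in report["stale_vars"]:
        print("Consider unsetting:", name)
```

Passing a plain dictionary instead of the real environment makes the check easy to try without touching the shell.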

b) Install WorkArena

Run the following command to install WorkArena in the BrowserGym environment:

pip install browsergym-workarena

Then, install Playwright:

playwright install

Your installation is now complete! 🎉
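
To confirm that the package is visible to the current interpreter, the installed version can be read from the package metadata. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "browsergym-workarena") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print("browsergym-workarena:", found if found else "not installed")
```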

Benchmark Contents

At the moment, WorkArena-L1 includes 19,912 unique instances drawn from 33 tasks that cover the main components of the ServiceNow user interface, otherwise referred to as "atomic" tasks. WorkArena++ contains 682 tasks, each one sampling among thousands of potential configurations. WorkArena++ uses the atomic components presented in WorkArena, and composes them into real-world use cases evaluating planning, reasoning, and memorizing abilities of agents.

The following videos show an agent built on GPT-4-vision interacting with every atomic component of the benchmark. As our results show, the benchmark is not solved, so the agent does not always succeed.

Knowledge Bases

Goal: The agent must search for specific information in the company knowledge base.

The agent interacts with the user via BrowserGym's conversational interface.

https://github.com/ServiceNow/WorkArena/assets/1726818/352341ba-b501-46ac-bfa6-a6c9be1ac2b7

Forms

Goal: The agent must fill a complex form with specific values for each field.

https://github.com/ServiceNow/WorkArena/assets/1726818/e2c2b5cb-3386-4f3c-b073-c8c619e0e81b

Service Catalogs

Goal: The agent must order items with specific configurations from the company's service catalog.

https://github.com/ServiceNow/WorkArena/assets/1726818/ac64db3b-9abf-4b5f-84a7-e2d9c9cee863

Lists

Goal: The agent must filter a list according to some specifications.

In this example, the agent struggles to manipulate the UI and fails to create the filter.

https://github.com/ServiceNow/WorkArena/assets/1726818/7538b3ef-d39b-4978-b9ea-8b9e106df28e

Menus

Goal: The agent must navigate to a specific application using the main menu.

https://github.com/ServiceNow/WorkArena/assets/1726818/ca26dfaf-2358-4418-855f-80e482435e6e

Dashboards

Goal: The agent must answer a question that requires reading charts and (optionally) performing simple reasoning over them.

Note: For demonstration purposes, a human is controlling the cursor since this is a pure retrieval task

https://github.com/ServiceNow/WorkArena/assets/1726818/0023232c-081f-4be4-99bd-f60c766e6c3f

Live Demo

Run this code to see WorkArena in action.

Note: the following example executes WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to env.step() must be used instead.

To run a demo of WorkArena-L1 (ICML 2024) tasks using BrowserGym, use the following script:

import random
from time import sleep

from browsergym.core.env import BrowserEnv
from browsergym.workarena import ATOMIC_TASKS


random.shuffle(ATOMIC_TASKS)
for task in ATOMIC_TASKS:
    print("Task:", task)

    # Instantiate a new environment
    env = BrowserEnv(task_entrypoint=task,
                    headless=False)
    env.reset()

    # Cheat functions use Playwright to automatically solve the task
    env.chat.add_message(role="assistant", msg="On it. Please wait...")
    cheat_messages = []
    env.task.cheat(env.page, cheat_messages)

    # Send cheat messages to chat
    for cheat_msg in cheat_messages:
        env.chat.add_message(role=cheat_msg["role"], msg=cheat_msg["message"])

    # Post solution to chat
    env.chat.add_message(role="assistant", msg="I'm done!")

    # Validate the solution
    reward, stop, message, info = env.task.validate(env.page, cheat_messages)
    if reward == 1:
        env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
    else:
        env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")

    sleep(3)
    env.close()

To run a demo of WorkArena-L2 (WorkArena++) tasks using BrowserGym, use the following script. Change the filter argument of get_all_tasks_agents from "l2" to "l3" to sample L3 tasks.

from time import sleep

from browsergym.core.env import BrowserEnv
from browsergym.workarena import get_all_tasks_agents

# Index 0 of each sampled entry is used as the task and index 1 as its seed
AGENT_L2_SAMPLED_SET = get_all_tasks_agents(filter="l2")
AGENT_L2_SAMPLED_TASKS = [sampled_set[0] for sampled_set in AGENT_L2_SAMPLED_SET]
AGENT_L2_SEEDS = [sampled_set[1] for sampled_set in AGENT_L2_SAMPLED_SET]

for task, seed in zip(AGENT_L2_SAMPLED_TASKS, AGENT_L2_SEEDS):
    print("Task:", task)

    # Instantiate a new environment
    env = BrowserEnv(task_entrypoint=task,
                    headless=False)
    env.reset()

    # Cheat functions use Playwright to automatically solve the task
    env.chat.add_message(role="assistant", msg="On it. Please wait...")
    
    for i in range(len(env.task)):
        sleep(1)
        env.task.cheat(page=env.page, chat_messages=env.chat.messages, subtask_idx=i)
        sleep(1)
        reward, done, message, info = env.task.validate(page=env.page, chat_messages=env.chat.messages)
   
    if reward == 1:
        env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
    else:
        env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")

    sleep(3)
    env.close()

Note: the examples above execute WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to env.step() must be used instead.
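
For agent evaluation, the oracle calls are replaced by a loop over env.step(). The following is a minimal sketch under stated assumptions: it presumes the environment follows the Gymnasium convention, where reset() returns (observation, info) and step(action) returns (observation, reward, terminated, truncated, info), and it leaves the action format to the policy. The names run_episode, policy, and max_steps are hypothetical and not part of WorkArena or BrowserGym.

```python
from typing import Any, Callable, Tuple


def run_episode(env: Any, policy: Callable[[Any], Any], max_steps: int = 30) -> Tuple[float, int]:
    """Drive one episode with env.step() and return (total reward, steps taken)."""
    obs, info = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)  # the agent picks an action from the current observation
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return total_reward, steps
```

With an environment created as in the scripts above, this would be called as run_episode(env, my_policy), followed by env.close(), where my_policy stands in for the agent under evaluation. The step budget keeps a stuck agent from running indefinitely.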

Citing This Work

Please use the following BibTeX to cite our work:

WorkArena

@misc{workarena2024,
      title={WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?}, 
      author={Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste},
      year={2024},
      eprint={2403.07718},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

WorkArena++

@misc{boisvert2024workarenacompositionalplanningreasoningbased,
      title={WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks}, 
      author={Léo Boisvert and Megh Thakkar and Maxime Gasse and Massimo Caccia and Thibault Le Sellier De Chezelles and Quentin Cappart and Nicolas Chapados and Alexandre Lacoste and Alexandre Drouin},
      year={2024},
      eprint={2407.05291},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.05291}, 
}

Release history

- 0.5.3 (this version): Feb 3, 2026
- 0.5.2: Jan 22, 2026
- 0.5.1: Dec 6, 2025
- 0.5.0: Nov 24, 2025
- 0.4.4: Aug 13, 2025
- 0.4.3: Jul 15, 2025
- 0.4.2: Jul 4, 2025
- 0.4.1: Oct 7, 2024
- 0.3.2: Oct 7, 2024
- 0.3.1: Jun 18, 2024
- 0.3.0: Jun 17, 2024
- 0.2.1: May 10, 2024
- 0.2.0: May 9, 2024
- 0.1.0rc7 (pre-release): Mar 25, 2024
- 0.1.0rc6 (pre-release): Mar 20, 2024
- 0.1.0rc5 (pre-release): Mar 18, 2024
- 0.1.0rc4 (pre-release): Mar 14, 2024
- 0.1.0rc3 (pre-release): Mar 13, 2024
- 0.1.0rc2 (pre-release): Mar 13, 2024
- 0.1.0rc1 (pre-release): Mar 13, 2024
- 0.1.0rc0 (pre-release): Mar 13, 2024
- 0.0.1a10 (pre-release): Mar 12, 2024

Download files

Source Distribution

browsergym_workarena-0.5.3.tar.gz (6.6 MB)

Uploaded Feb 3, 2026 Source

Built Distribution

browsergym_workarena-0.5.3-py3-none-any.whl (6.8 MB)

Uploaded Feb 3, 2026 Python 3

File details

Details for the file browsergym_workarena-0.5.3.tar.gz.

File metadata

Download URL: browsergym_workarena-0.5.3.tar.gz
Upload date: Feb 3, 2026
Size: 6.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for browsergym_workarena-0.5.3.tar.gz
Algorithm	Hash digest
SHA256	`40f643e23ece3f57e89bc66154843d92d5f7c391d793c06df8da2a82f0a143a6`
MD5	`acb7dc8414f3745aff2dde5f8c57bd39`
BLAKE2b-256	`37d6a3949c18d52ba8c5100b85d79b13a57dcee0d84f578738fb349e398a68ae`
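
A downloaded file can be checked against the SHA256 digest in the table above. This is a minimal sketch using only the standard library; the helper name sha256_of and the assumption that the archive sits in the current directory are for illustration only.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for browsergym_workarena-0.5.3.tar.gz
EXPECTED_SHA256 = "40f643e23ece3f57e89bc66154843d92d5f7c391d793c06df8da2a82f0a143a6"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumption: the archive was downloaded into the current directory.
    archive = Path("browsergym_workarena-0.5.3.tar.gz")
    if archive.exists():
        print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
    else:
        print("archive not found:", archive)
```

The same function works for the wheel by swapping in its file name and the SHA256 digest listed for it.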

Provenance

The following attestation bundles were made for browsergym_workarena-0.5.3.tar.gz:

Publisher: pypi.yml on ServiceNow/WorkArena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: browsergym_workarena-0.5.3.tar.gz
- Subject digest: 40f643e23ece3f57e89bc66154843d92d5f7c391d793c06df8da2a82f0a143a6
- Sigstore transparency entry: 908557959
- Sigstore integration time: Feb 3, 2026
Source repository:
- Permalink: ServiceNow/WorkArena@a772230a94cf1caf4166b8ead3983f3b3786455b
- Branch / Tag: refs/tags/v0.5.3
- Owner: https://github.com/ServiceNow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@a772230a94cf1caf4166b8ead3983f3b3786455b
- Trigger Event: push

File details

Details for the file browsergym_workarena-0.5.3-py3-none-any.whl.

File metadata

Download URL: browsergym_workarena-0.5.3-py3-none-any.whl
Upload date: Feb 3, 2026
Size: 6.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for browsergym_workarena-0.5.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eb0bdb58598f5e8aeb8b9f01ac3a68dc26d477e3067f01bb56f640a1016e81a5`
MD5	`1370f245750c4658acedd5e0041ece9b`
BLAKE2b-256	`705b518088566572b6cbc28b8782041ba015c02dd04dcbfcfcd4844556ab1500`

Provenance

The following attestation bundles were made for browsergym_workarena-0.5.3-py3-none-any.whl:

Publisher: pypi.yml on ServiceNow/WorkArena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: browsergym_workarena-0.5.3-py3-none-any.whl
- Subject digest: eb0bdb58598f5e8aeb8b9f01ac3a68dc26d477e3067f01bb56f640a1016e81a5
- Sigstore transparency entry: 908557961
- Sigstore integration time: Feb 3, 2026
Source repository:
- Permalink: ServiceNow/WorkArena@a772230a94cf1caf4166b8ead3983f3b3786455b
- Branch / Tag: refs/tags/v0.5.3
- Owner: https://github.com/ServiceNow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@a772230a94cf1caf4166b8ead3983f3b3786455b
- Trigger Event: push
