
ScreenSuite


A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our Computer Agent) across three areas of ability: perception, single-step, and multi-step agentic behaviour.

ScreenSuite does not aim to compare agent implementations, only the multimodal LLMs (MLLMs) that power them: we therefore provide only simple agent implementations based on smolagents.

GUI Agent Benchmarks Overview

Grounding/Perception Benchmarks

Data Source | Evaluation Type | Platform | Link
ScreenSpot | BBox + click accuracy | Web | HuggingFace
ScreenSpot v2 | BBox + click accuracy | Web | HuggingFace
ScreenSpot-Pro | BBox + click accuracy | Web | HuggingFace
Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | HuggingFace
WebSRC | Web QA | Web | HuggingFace
ScreenQA-short | Mobile QA | Mobile | HuggingFace
ScreenQA-complex | Mobile QA | Mobile | HuggingFace
Showdown-Clicks | Click prediction | Web | HuggingFace
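
Several of the grounding benchmarks above score "BBox + click accuracy". As a rough illustration of how such a metric is typically computed (this is a hedged sketch, not the suite's actual scoring code; BBox, click_hit, and click_accuracy are hypothetical helpers), a predicted click counts as correct when it lands inside the ground-truth bounding box:

# Illustrative sketch only: the exact scoring logic lives inside screensuite.
from dataclasses import dataclass

@dataclass
class BBox:
    left: float
    top: float
    right: float
    bottom: float

def click_hit(x: float, y: float, box: BBox) -> bool:
    """A click is correct if it falls inside the ground-truth bounding box."""
    return box.left <= x <= box.right and box.top <= y <= box.bottom

def click_accuracy(preds: list[tuple[float, float]], boxes: list[BBox]) -> float:
    """Fraction of predicted clicks that land inside their target boxes."""
    hits = sum(click_hit(x, y, b) for (x, y), b in zip(preds, boxes))
    return hits / len(boxes) if boxes else 0.0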

Single-step - Offline Agent Benchmarks

Data Source | Evaluation Type | Platform | Link
Multimodal-Mind2Web | Web navigation | Web | HuggingFace
AndroidControl | Mobile control | Mobile | GitHub

Multi-step - Online Agent Benchmarks

Data Source | Evaluation Type | Platform | Link
Mind2Web-Live | URL matching | Web | HuggingFace
GAIA | Exact match | Web | HuggingFace
BrowseComp | LLM judge | Web | Link
AndroidWorld | Task-specific | Mobile | GitHub
MobileMiniWob | Task-specific | Mobile | GitHub (included in AndroidWorld)
OSWorld | Task-specific | Desktop | GitHub
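
All of these benchmarks are exposed through the screensuite registry and can be selected by tag, using the same registry.get_by_tags call shown in the full example under "Running the benchmarks" below:

# Select a subset of benchmarks by tag; tag names match those used in the
# full example below (e.g. "screenspot-v2-click-prompt", "osworld").
from screensuite import registry

benchmarks = registry.get_by_tags(tags=["screenspot-v2-click-prompt", "osworld"])
for bench in benchmarks:
    print(bench.name)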

Cloning the Repository

Make sure to clone the repository with its required submodules:

git clone --recurse-submodules git@github.com:huggingface/geekagents.git

or

git submodule update --init --recursive  # if you already cloned the repository; also run this after pulling branches to update the submodules

Requirements

  • Docker
  • Python >= 3.11
  • uv

For the multi-step agent benchmarks, we need to spawn containerized environments. To do so, you need KVM virtualization enabled. To check whether your host platform supports KVM, run

egrep -c '(vmx|svm)' /proc/cpuinfo

on Linux. If the printed count is greater than zero, the processor supports KVM. Note: macOS hosts generally do not support KVM.
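
If you prefer to check from Python, a minimal sketch (kvm_available is a hypothetical helper) counts the same vmx/svm flags and additionally verifies that the /dev/kvm device node exists, i.e. that KVM is actually enabled:

import os
import re

def kvm_available() -> bool:
    # Look for hardware-virtualization flags in /proc/cpuinfo (Linux only) and
    # check that the KVM device node exists.
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        return False  # not Linux, or /proc unavailable
    has_vt_flags = bool(re.search(r"\b(vmx|svm)\b", cpuinfo))
    return has_vt_flags and os.path.exists("/dev/kvm")

print("KVM available:", kvm_available())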

Installation

# Using uv (faster)
uv sync --extra submodules --python 3.11

If you encounter issues with the evdev Python package, you can try installing the build-essential package:

sudo apt-get install build-essential

Development

# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure

Running the benchmarks

#!/usr/bin/env python
import os
import json
from datetime import datetime
from dotenv import load_dotenv
from smolagents.models import InferenceClientModel, OpenAIServerModel, LiteLLMModel
from screensuite import registry
from screensuite.basebenchmark import EvaluationConfig

# NOTE: assumed import path; OSWorldEnvironmentConfig is needed for the OSWorld
# environment config below and may live elsewhere in your screensuite version.
from screensuite import OSWorldEnvironmentConfig

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)

def run_benchmarks():
    # Get benchmarks to run
    # benchmarks = registry.list_all()
    benchmarks = registry.get_by_tags(
        tags=[
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )

    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )

    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # see smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 200
    parallel_workers = 4
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Evaluation configuration shared across benchmarks
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")

        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue

if __name__ == "__main__":
    run_benchmarks()
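
Each run appends one JSON line per benchmark to results/results_<run_name>.jsonl, so the results can be read back afterwards; for instance (the file name below is hypothetical, matching a run_name of test_2025-01-01):

import json

# Read a results file written by the script above and print each benchmark's
# metrics. Adjust the path to match your run_name.
with open("results/results_test_2025-01-01.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["benchmark_name"], entry["metrics"])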

OSWorld Google Tasks

To run the OSWorld Google tasks, you need to create a Google account and a Google Cloud project. See the OSWorld documentation for more details.

License

This project is licensed under the terms of the Apache License 2.0.

