
ScreenSuite

A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our Computer Agent) across several areas of ability: perception, single-step agentic behaviour, and multi-step agentic behaviour.

This suite does not aim to compare agent implementations, only the MLLMs that power them: thus we provide only simple agent implementations based on smolagents.

GUI Agent Benchmarks Overview

Grounding/Perception Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| ScreenSpot | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot v2 | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot-Pro | BBox + click accuracy | Web | HuggingFace |
| Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | HuggingFace |
| WebSRC | Web QA | Web | HuggingFace |
| ScreenQA-short | Mobile QA | Mobile | HuggingFace |
| ScreenQA-complex | Mobile QA | Mobile | HuggingFace |
| Showdown-Clicks | Click prediction | Web | HuggingFace |

Single-Step - Offline Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| Multimodal-Mind2Web | Web navigation | Web | HuggingFace |
| AndroidControl | Mobile control | Mobile | GitHub |

Multi-Step - Online Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| Mind2Web-Live | URL matching | Web | HuggingFace |
| GAIA | Exact match | Web | HuggingFace |
| BrowseComp | LLM judge | Web | Link |
| AndroidWorld | Task-specific | Mobile | GitHub |
| MobileMiniWob | Task-specific | Mobile | Included in AndroidWorld (GitHub) |
| OSWorld | Task-specific | Desktop | GitHub |
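
All of these benchmarks are registered in the screensuite registry and can be discovered programmatically. A minimal sketch using the registry calls from the evaluation script below (the example tag is one of the tags listed there):

from screensuite import registry

# List every registered benchmark together with its tags
for benchmark in registry.list_all():
    print(benchmark.name, benchmark.tags)

# Or select a subset by tag, as the evaluation script below does
for benchmark in registry.get_by_tags(tags=["screenspot-v2-click-prompt"]):
    print(benchmark.name)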

Cloning the Repository

Make sure to clone the repository with the required submodules:

git clone --recurse-submodules git@github.com:huggingface/geekagents.git

or

git submodule update --init --recursive  # if you already cloned the repository; also run this after pulling branches to update the submodules

Requirements

  • Docker
  • Python >= 3.11
  • uv

For multi-step agent benchmarks, we need to spawn containerized environments. To do so, you need KVM virtualization enabled. To check whether your host supports KVM, run

egrep -c '(vmx|svm)' /proc/cpuinfo

on Linux. If the printed count is greater than zero, the processor should support KVM. Note: macOS hosts generally do not support KVM.
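
If you prefer a programmatic check, here is a minimal Python sketch (Linux-specific, a convenience only) that mirrors the egrep command above and also verifies that the /dev/kvm device is present:

import os
import re

# Count vmx (Intel) / svm (AMD) CPU flags, like the egrep command above
with open("/proc/cpuinfo") as f:
    flag_count = len(re.findall(r"vmx|svm", f.read()))
print(f"Virtualization flags found: {flag_count}")

# /dev/kvm exists only when the KVM kernel module is loaded
print(f"/dev/kvm present: {os.path.exists('/dev/kvm')}")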

Installation

# Using uv (faster)
uv sync --extra submodules --python 3.11

If you encounter issues building the evdev Python package, you can try installing the build-essential package:

sudo apt-get install build-essential

Development

# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure

Running the benchmarks

#!/usr/bin/env python
import os
import json
from datetime import datetime
from dotenv import load_dotenv
from smolagents.models import InferenceClientModel, OpenAIServerModel, LiteLLMModel
from screensuite import registry
from screensuite.basebenchmark import EvaluationConfig
# Needed for the OSWorld environment configuration below;
# adjust the import path if it differs in your screensuite version
from screensuite import OSWorldEnvironmentConfig

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)

def run_benchmarks():
    # Get benchmarks to run
    # benchmarks = registry.list_all()
    benchmarks = registry.get_by_tags(
        tags=[
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )

    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )

    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # see smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 200
    parallel_workers = 4
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Shared evaluation configuration (the OSWorld env_config is passed separately below)
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")

        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue

if __name__ == "__main__":
    run_benchmarks()
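
Each run appends one JSON object per benchmark to results/results_<run_name>.jsonl. A minimal sketch for reading the metrics back, assuming the entry format written by the script above (the file name is hypothetical; substitute your own run name):

import json

# Hypothetical results file; the name depends on your run_name
with open("results/results_test_2025-01-01.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["benchmark_name"], entry["metrics"])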

OSWorld Google Tasks

To run the OSWorld Google tasks, you need to create a Google account and a Google Cloud project. See the OSWorld documentation for more details.

License

This project is licensed under the terms of the Apache License 2.0.
