# ScreenSuite
A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our Computer Agent) across key areas of ability: perception, single-step, and multi-step agentic behaviour.

ScreenSuite does not aim to compare agent implementations, only the MLLMs that power them: we therefore provide only simple agent implementations based on smolagents.
## GUI Agent Benchmarks Overview

### Grounding/Perception Benchmarks
| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| ScreenSpot | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot v2 | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot-Pro | BBox + click accuracy | Web | HuggingFace |
| Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | HuggingFace |
| WebSRC | Web QA | Web | HuggingFace |
| ScreenQA-short | Mobile QA | Mobile | HuggingFace |
| ScreenQA-complex | Mobile QA | Mobile | HuggingFace |
| Showdown-Clicks | Click prediction | Web | HuggingFace |
### Single-Step - Offline Agent Benchmarks
| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| Multimodal-Mind2Web | Web navigation | Web | HuggingFace |
| AndroidControl | Mobile control | Mobile | GitHub |
### Multi-Step - Online Agent Benchmarks
| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| Mind2Web-Live | URL matching | Web | HuggingFace |
| GAIA | Exact match | Web | HuggingFace |
| BrowseComp | LLM judge | Web | Link |
| AndroidWorld | Task-specific | Mobile | GitHub |
| MobileMiniWob | Task-specific | Mobile | Included in AndroidWorld GitHub |
| OSWorld | Task-specific | Desktop | GitHub |
## Cloning the Repository

Make sure to clone the repository with the required submodules:

```bash
git clone --recurse-submodules git@github.com:huggingface/geekagents.git
```

or, if you already cloned the repository (also run this after pulling branches, to update the submodules):

```bash
git submodule update --init --recursive
```
## Requirements

- Docker
- Python >= 3.11
- uv

The multi-step agent benchmarks spawn containerized environments, which requires KVM virtualization on the host. To check whether your host supports KVM, run the following on Linux:

```bash
egrep -c '(vmx|svm)' /proc/cpuinfo
```

If the returned count is greater than zero, the processor should support KVM. Note: macOS hosts generally do not support KVM.
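If you prefer to run the check from Python, the same flag count can be computed with a minimal sketch (the `count_virt_flags` helper is ours for illustration, not part of ScreenSuite):

```python
import re
from pathlib import Path


def count_virt_flags(cpuinfo_text: str) -> int:
    """Count occurrences of the vmx (Intel) or svm (AMD) CPU flags."""
    return len(re.findall(r"\b(?:vmx|svm)\b", cpuinfo_text))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        count = count_virt_flags(cpuinfo.read_text())
        print(f"Virtualization-capable cores: {count}")
    else:
        print("No /proc/cpuinfo found: not a Linux host")
```

A count of zero means virtualization extensions are absent or disabled in the BIOS/UEFI.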
## Installation

```bash
# Using uv (faster)
uv sync --extra submodules --python 3.11
```

If you encounter issues with the `evdev` Python package, you can try installing the `build-essential` package first:

```bash
sudo apt-get install build-essential
```
## Development

```bash
# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure
```
## Running the benchmarks

```python
#!/usr/bin/env python
import json
import os
from datetime import datetime

from dotenv import load_dotenv
from smolagents.models import InferenceClientModel, LiteLLMModel, OpenAIServerModel

from screensuite import registry
from screensuite.basebenchmark import EvaluationConfig
# NOTE: OSWorldEnvironmentConfig is used below, but its import was missing from the
# original snippet; import it from wherever your screensuite version exposes it.

load_dotenv()

# Set up the results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)


def run_benchmarks():
    # Get the benchmarks to run
    # benchmarks = registry.list_all()
    benchmarks = registry.get_by_tags(
        tags=[
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )
    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )
    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # See the smolagents documentation for more models:
    # https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run the benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 200
    parallel_workers = 4
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Configure based on benchmark type
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test,
        )
        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Append results to a JSON Lines file
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")
        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue


if __name__ == "__main__":
    run_benchmarks()
```
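Results accumulate as JSON Lines, one entry per benchmark. A small sketch for summarizing a run file afterwards (the `summarize_results` helper and the example path are ours; the entry fields match the script above):

```python
import json
from pathlib import Path


def summarize_results(path):
    """Map each benchmark name to its metrics, read from a results_*.jsonl file."""
    summary = {}
    for line in Path(path).read_text().splitlines():
        if line.strip():
            entry = json.loads(line)
            summary[entry["benchmark_name"]] = entry["metrics"]
    return summary


if __name__ == "__main__":
    # Illustrative path; use the file produced by your own run_name
    for name, metrics in summarize_results("results/results_test.jsonl").items():
        print(f"{name}: {metrics}")
```

Because the script appends on each run, re-running with the same `run_name` adds new lines rather than overwriting old ones; later entries for the same benchmark win in this summary.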
## OSWorld Google Tasks

To run the OSWorld Google tasks, you need to create a Google account and a Google Cloud project. See the OSWorld documentation for more details.
## License

This project is licensed under the terms of the Apache License 2.0.