Helper utilities and constants for GroundNext models - Computer Use Agents for grounding tasks
GroundCUA: Grounding Computer Use Agents on Human Demonstrations
🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models
Authors
Aarash Feizi1,2,4*, Shravan Nayak1,3*,
Xiangru Jian5, Kevin Qinghong Lin6, Kaixin Li6,
Rabiul Awal1,3,4, Xing Han Lù1,2, Johan Obando-Ceron1,3, Juan A. Rodriguez1,8,
Nicolas Chapados4, David Vazquez4, Adriana Romero-Soriano1,2, Reihaneh Rabbany1,2,
Perouz Taslakian4, Christopher Pal4, Spandana Gella4, Sai Rajeswar4,1,3
1Mila - Quebec AI Institute, 2McGill University, 3Université de Montréal,
4ServiceNow Research, 5University of Waterloo, 6National University of Singapore,
7Polytechnique Montréal, 8École de Technologie Supérieure, 9CIFAR AI Chair
*Equal contribution
Updates
- [Nov 11 2025] 🎉 We released our project webpage, the GroundCUA dataset, and the GroundNext-7B model!
Introduction
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We address this gap through:
- GroundCUA Dataset: A large-scale, human-annotated desktop grounding dataset with 56K screenshots and 3.56M+ human-verified annotations, collected from over 10,000 real-world human tasks across 87 applications
- GroundNext Models: Vision-language models at 3B and 7B scales achieving state-of-the-art results across five benchmarks
- Efficient Training: SOTA performance using one-tenth the training data of prior work
Key Features
🎯 High-Quality Desktop Dataset
- Densely annotated screenshots labeled by trained experts
- Coverage of almost every visible element, including small icons and controls
- Fine-grained category information (menus, sidebars, etc.) for 50% of UI elements, all fully open-source
⚡ Efficient Model Training
- State-of-the-art performance with 700K datapoints vs 9M+ in prior work
- Two-stage training: supervised fine-tuning + reinforcement learning with fully open-source code
- Models at 3B and 7B scales for efficiency and accuracy
🌐 Cross-Platform Generalization
- Comprehensive evaluation on five challenging benchmarks
- Robust generalization across desktop, mobile, and web environments despite training only on desktop data
Performance
Desktop Grounding Benchmarks
| Model | ScreenSpot-Pro | OSWorld-G | UI-Vision | Avg |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 29.7 | 42.7 | 16.5 | 29.6 |
| UI-TARS-72B | 38.1 | 57.1 | 25.5 | 40.2 |
| GroundNext-3B | 49.8 | 64.2 | 62.1 | 58.7 |
| GroundNext-7B | 52.9 | 67.7 | 60.3 | 60.3 |
Cross-Platform Generalization
| Model | MMBench-GUI | ScreenSpot-v2 | Avg |
|---|---|---|---|
| Qwen2.5-VL-7B | 33.9 | 88.8 | 61.4 |
| UI-TARS-72B | 74.3 | 90.3 | 82.3 |
| GroundNext-3B | 77.1 | 88.5 | 82.8 |
| GroundNext-7B | 81.1 | 90.4 | 85.8 |
Performance numbers demonstrate strong cross-domain (desktop, mobile and web) generalization despite training only on desktop data.
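The `Avg` columns in the two tables above appear to be the unweighted mean of the per-benchmark scores, rounded to one decimal place. That is an assumption, not something stated in the text; the sketch below only checks that the reported numbers are consistent with it. `mean_1dp` is a hypothetical helper, and `Decimal` is used so that values such as 61.35 round half-up rather than being distorted by binary floating point.

```python
from decimal import Decimal, ROUND_HALF_UP


def mean_1dp(scores):
    """Unweighted mean of decimal strings, rounded half-up to one decimal place."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# (per-benchmark scores, reported Avg), copied from the tables above.
DESKTOP = {
    "Qwen2.5-VL-7B": (["29.7", "42.7", "16.5"], "29.6"),
    "UI-TARS-72B": (["38.1", "57.1", "25.5"], "40.2"),
    "GroundNext-3B": (["49.8", "64.2", "62.1"], "58.7"),
    "GroundNext-7B": (["52.9", "67.7", "60.3"], "60.3"),
}
CROSS_PLATFORM = {
    "Qwen2.5-VL-7B": (["33.9", "88.8"], "61.4"),
    "UI-TARS-72B": (["74.3", "90.3"], "82.3"),
    "GroundNext-3B": (["77.1", "88.5"], "82.8"),
    "GroundNext-7B": (["81.1", "90.4"], "85.8"),
}

for table_name, table in (("desktop", DESKTOP), ("cross-platform", CROSS_PLATFORM)):
    for model, (scores, reported) in table.items():
        computed = mean_1dp(scores)
        status = "ok" if computed == Decimal(reported) else "mismatch"
        print(f"{table_name:15s} {model:15s} computed={computed} reported={reported} {status}")
```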
Agentic Performance on OSWorld
GroundNext models also demonstrate strong agentic capabilities when integrated with reasoning models. When combined with OpenAI o3, GroundNext-3B achieves competitive performance on OSWorld, matching or exceeding much larger models.
| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-7B | 41.7 | 22.5 | 35.4 | 46.3 | 9.8 | 26.5 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 (ours) | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |
Task categories: OS (operating system tasks), Office (productivity applications), Daily (common user tasks), Pro (professional software), Workflow (multi-step workflows).
Key Results
- Data Efficiency: Achieves SOTA with only 700K training examples vs 9M+ in prior work
- Cross-Domain Excellence: Strong performance across desktop, mobile, and web despite desktop-only training
- Fine-Grained Grounding: Superior performance on small UI elements and complex workflows
🚀 Quick Start
Installation & Setup
Option 1: Install from PyPI (Recommended)
```bash
# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua
pip install --upgrade pip

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision

# Install GroundCUA package
pip install groundcua

# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation
```
Option 2: Install from Source
```bash
# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua
pip install --upgrade pip

# Clone repository
git clone https://github.com/ServiceNow/GroundCUA.git
cd GroundCUA

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision

# Install in development mode
pip install -e .

# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation
```
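Flash Attention is listed as recommended rather than required, while the inference example in this README passes `attn_implementation="flash_attention_2"` directly. A minimal sketch of an environment check follows. Both helpers (`missing_packages`, `pick_attn_implementation`) are hypothetical and not part of the `groundcua` package, and falling back to `"sdpa"` (PyTorch scaled dot-product attention, a value `transformers` accepts for `attn_implementation`) is an assumption about what works well for this model.

```python
import importlib.util


def missing_packages(names=("torch", "torchvision", "transformers", "groundcua")):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pick_attn_implementation() -> str:
    """Use Flash Attention 2 when flash_attn is installed, otherwise fall back to SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # assumption: SDPA is an acceptable fallback for this model


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    print("attn_implementation =", pick_attn_implementation())
```

The returned string could then be passed as `attn_implementation=pick_attn_implementation()` when loading the model.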
Quick GroundNext Model Inference
```python
import io
from urllib.request import urlopen

import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

import groundcua

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
```
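To act on the response, the `<tool_call>` payload has to be turned into an action name and a coordinate pair. The sketch below assumes the output follows the format shown in the expected-output comment above. `parse_action` and `rescale_point` are hypothetical helpers, not part of the `groundcua` package. Whether the predicted coordinates refer to the image returned by `groundcua.prepare_image` or to the original screenshot is not stated here, so the rescaling step is an assumption to verify before relying on it.

```python
import json
import re

# Matches one <tool_call>...</tool_call> span and captures the JSON object inside it.
_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_action(response: str):
    """Return (action, (x, y)) from a tool-call response, or None if it cannot be parsed."""
    match = _TOOL_CALL.search(response)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    arguments = payload.get("arguments")
    if not isinstance(arguments, dict):
        return None
    coordinate = arguments.get("coordinate")
    if not (isinstance(coordinate, list) and len(coordinate) == 2):
        return None
    return arguments.get("action"), (float(coordinate[0]), float(coordinate[1]))


def rescale_point(point, from_size, to_size):
    """Map (x, y) from an image of from_size=(w, h) to one of to_size=(w, h)."""
    x, y = point
    from_w, from_h = from_size
    to_w, to_h = to_size
    return x * to_w / from_w, y * to_h / from_h


example = ('<tool_call>{"name": "computer_use", "arguments": '
           '{"action": "left_click", "coordinate": [120, 45]}}</tool_call>')
print(parse_action(example))
```

If the coordinates do turn out to be in the resized image space, `rescale_point(point, (width, height), original_size)` would map them back, where `original_size` is the screenshot's size before `prepare_image`.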
🎓 Training
Coming to the training/ folder soon. Stay tuned!
Dataset
GroundCUA Dataset Overview
GroundCUA is a large-scale, human-annotated desktop grounding dataset with dense supervision:
- 📊 Scale: 56K annotated screenshots, 3.56M element annotations
- 🎯 Density: Maximum annotation density covering almost every visible UI element
- ✅ Quality: Human-verified annotations from trained experts
- 🖥️ Coverage: 87 desktop applications across 12 categories
- 📐 Resolution: High-resolution images (500K to 7M pixels)
- 🏷️ Categories: Fine-grained category information for 50% of elements
Dataset Access
Download the GroundCUA dataset:
```bash
pip install -U huggingface_hub
huggingface-cli download ServiceNow/GroundCUA --repo-type dataset --local-dir ./GroundCUA
```
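After the download, a quick sanity check is to count screenshots per application. The sketch below assumes the local copy mirrors the layout implied by the sample image URL in the inference example, that is `images/<application>/<hash>.png`. That layout is inferred from a single URL and may not hold for every file. `screenshots_per_app` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def screenshots_per_app(root):
    """Count PNG files under <root>/images/<application>/ (assumed layout)."""
    images_dir = Path(root) / "images"
    counts = Counter()
    for png in images_dir.glob("*/*.png"):
        counts[png.parent.name] += 1
    return counts


if __name__ == "__main__":
    counts = screenshots_per_app("./GroundCUA")
    print(f"{len(counts)} applications, {sum(counts.values())} screenshots")
    for app, n in counts.most_common(5):
        print(f"{app}: {n}")
```

If the assumed layout is right, the totals should be in the neighborhood of the 87 applications and 56K screenshots quoted above.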
📊 Evaluation
Supported Benchmarks
- ScreenSpot-Pro: Desktop element grounding
- ScreenSpot-v2: Web and mobile interface grounding
- MMBench-GUI: GUI understanding tasks
- OSWorld-G: Operating system grounding
- UI-Vision: Diverse desktop application grounding
Running Evaluations
```bash
cd eval/

# Evaluate on specific benchmark
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark screenspot \
    --data_path /path/to/benchmark/data \
    --output_dir results/

# Evaluate on all benchmarks
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark all \
    --task all \
    --language en
```
Evaluation Metrics
- Accuracy: Precision of GUI element localization
- Success Rate: Percentage of correctly grounded elements
- Cross-Domain Performance: Generalization to unseen platforms
- Fine-Grained Performance: Accuracy on small UI elements
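The metrics above are described only at a high level. A common convention in GUI grounding benchmarks is to count a prediction as correct when the predicted click point falls inside the target element's bounding box. The sketch below illustrates that convention. It is an assumption about how such a score could be computed, not a description of what `eval.py` does, and both function names are hypothetical.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside box = (left, top, right, bottom), edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, boxes):
    """Percentage of predictions that land inside their target box.

    A prediction of None (for example, an unparseable model response) counts as a miss.
    """
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        pred is not None and point_in_box(pred, box)
        for pred, box in zip(predictions, boxes)
    )
    return 100.0 * hits / len(predictions)


# Toy data: two hits, one miss, one unparseable response.
preds = [(10, 10), (50, 50), None, (5, 5)]
targets = [(0, 0, 20, 20), (0, 0, 20, 20), (0, 0, 20, 20), (0, 0, 10, 10)]
print(grounding_accuracy(preds, targets))
```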
Project Structure
```
GroundCUA/
├── README.md          # This file
├── pyproject.toml     # Package configuration
├── PUBLISHING.md      # Guide for publishing to PyPI
├── assets/            # Images and resources
├── groundcua/         # Main package (pip installable)
│   ├── __init__.py    # Package initialization and utilities
│   └── version.py     # Version information
├── eval/              # Evaluation framework
│   ├── eval.py        # Main evaluation script
│   ├── data.py        # Data loading utilities
│   ├── prompts.py     # Prompt processing
│   └── models/        # Model implementations
└── training/          # Training pipeline (documentation coming soon)
```
Acknowledgements
We thank the following projects and teams for their contributions to the open-source community:
- InfiGUI-G1 for the evaluation framework foundation
- LLaMA-Factory for the excellent SFT training framework
- verl for the robust RL infrastructure
- Qwen-2.5-VL for the foundation vision-language models
- The computer use and GUI automation research community
Research Use and Disclaimer
GroundCUA is intended for research and educational purposes only.
Prohibited Uses
- The model, dataset, and code may not be used for any purpose that violates applicable laws or regulations
- Use for illegal, unethical, or harmful activities is strictly prohibited
Disclaimer
- The authors and contributors are not responsible for any illegal, unethical, or harmful use
- Users are solely responsible for ensuring compliance with applicable laws and regulations
Citation
If you use GroundCUA in your research, please cite our work:
```bibtex
@misc{feizi2025groundingcomputeruseagents,
  title={Grounding Computer Use Agents on Human Demonstrations},
  author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
  year={2025},
  eprint={2511.07332},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.07332},
}
```