Helper utilities and constants for GroundNext models - Computer Use Agents for grounding tasks

Maintainers

aarashfeizi xhluca

Project description

GroundCUA: Grounding Computer Use Agents on Human Demonstrations

🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models

Authors

Aarash Feizi¹,²,⁴*, Shravan Nayak¹,³*,
Xiangru Jian⁵, Kevin Qinghong Lin⁶, Kaixin Li⁶, Rabiul Awal¹,³,⁴, Xing Han Lù¹,², Johan Obando-Ceron¹,³, Juan A. Rodriguez¹,⁸, Nicolas Chapados⁴, David Vazquez⁴, Adriana Romero-Soriano¹,², Reihaneh Rabbany¹,²,
Perouz Taslakian⁴, Christopher Pal⁴, Spandana Gella⁴, Sai Rajeswar⁴,¹,³

¹Mila - Quebec AI Institute, ²McGill University, ³Université de Montréal,
⁴ServiceNow Research, ⁵University of Waterloo, ⁶National University of Singapore,
⁷Polytechnique Montréal, ⁸École de Technologie Supérieure, ⁹CIFAR AI Chair

*Equal contribution

Updates

[Nov 11 2025] 🎉 We released our project webpage, the GroundCUA dataset, and the GroundNext-7B model!

Introduction

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We address this gap through:

GroundCUA Dataset: A large-scale, human-annotated desktop grounding dataset with 56K screenshots from over 10,000 real-world human tasks across 87 applications and 3.56M+ human-verified annotations
GroundNext Models: Vision-language models at 3B and 7B scales achieving state-of-the-art results across five benchmarks
Efficient Training: SOTA performance using one-tenth the training data of prior work

Key Features

🎯 High-Quality Desktop Dataset

Densely annotated screenshots, labeled by trained experts
Coverage of almost every visible element, including small icons and controls
Fine-grained category information (menus, sidebars, etc.) for 50% of UI elements, released fully open-source

⚡ Efficient Model Training

State-of-the-art performance with 700K datapoints vs 9M+ in prior work
Two-stage training: supervised fine-tuning + reinforcement learning with fully open-source code
Models at 3B and 7B scales for efficiency and accuracy

🌐 Cross-Platform Generalization

Comprehensive evaluation on five challenging benchmarks
Robust generalization across desktop, mobile, and web environments despite training only on desktop data

Performance

Desktop Grounding Benchmarks

Model	ScreenSpot-Pro	OSWorld-G	UI-Vision	Avg
Qwen2.5-VL-7B	29.7	42.7	16.5	29.6
UI-TARS-72B	38.1	57.1	25.5	40.2
GroundNext-3B	49.8	64.2	62.1	58.7
GroundNext-7B	52.9	67.7	60.3	60.3
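
The Avg column is consistent with an unweighted mean of the three benchmark scores, rounded to one decimal. A quick check, using only the numbers copied from the table above (the helper name `mean_score` is ours, for illustration):

```python
# Scores copied from the desktop grounding table:
# (ScreenSpot-Pro, OSWorld-G, UI-Vision, reported Avg)
rows = {
    "Qwen2.5-VL-7B": (29.7, 42.7, 16.5, 29.6),
    "UI-TARS-72B": (38.1, 57.1, 25.5, 40.2),
    "GroundNext-3B": (49.8, 64.2, 62.1, 58.7),
    "GroundNext-7B": (52.9, 67.7, 60.3, 60.3),
}


def mean_score(scores):
    """Unweighted mean of the benchmark scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


# Recompute the average from the three per-benchmark columns.
recomputed = {name: mean_score(vals[:3]) for name, vals in rows.items()}

for name, vals in rows.items():
    print(f"{name}: reported {vals[3]}, recomputed {recomputed[name]}")
```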

Cross-Platform Generalization

Model	MMBench-GUI	ScreenSpot-v2	Avg
Qwen2.5-VL-7B	33.9	88.8	61.4
UI-TARS-72B	74.3	90.3	82.3
GroundNext-3B	77.1	88.5	82.8
GroundNext-7B	81.1	90.4	85.8

Performance numbers demonstrate strong cross-domain (desktop, mobile and web) generalization despite training only on desktop data.

Agentic Performance on OSWorld

GroundNext models also demonstrate strong agentic capabilities when integrated with reasoning models. When combined with OpenAI o3, GroundNext-3B achieves competitive performance on OSWorld, matching or exceeding much larger models.

Model	OS	Office	Daily	Pro	Workflow	Overall
OpenAI o3	62.5	14.5	21.4	38.8	16.5	23.0
CUA	23.9	34.6	55.1	18.3	18.3	31.4
OpenCUA-7B	41.7	22.5	35.4	46.3	9.8	26.5
OpenCUA-72B	58.3	47.0	53.8	73.5	20.4	46.1
UI-TARS-1.5-7B	33.3	29.9	37.9	53.1	9.1	29.6
JEDI-7B w/ o3	50.0	46.1	61.9	75.5	35.3	51.0
GroundNext-3B w/ o3 (ours)	62.5	47.0	55.0	73.5	36.5	50.6

Task categories: OS (operating system tasks), Office (productivity applications), Daily (common user tasks), Pro (professional software), Workflow (multi-step workflows).

Key Results

Data Efficiency: Achieves SOTA with only 700K training examples vs 9M+ in prior work
Cross-Domain Excellence: Strong performance across desktop, mobile, and web despite desktop-only training
Fine-Grained Grounding: Superior performance on small UI elements and complex workflows

🚀 Quick Start

Installation & Setup

Option 1: Install from PyPI (Recommended)

# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua

pip install --upgrade pip

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision

# Install GroundCUA package
pip install groundcua

# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation

Option 2: Install from Source

# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua

pip install --upgrade pip

# Clone repository
git clone https://github.com/ServiceNow/GroundCUA.git
cd GroundCUA

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision

# Install in development mode
pip install -e .

# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation
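
Flash Attention does not build on every system. A minimal sketch, assuming you want the inference code in the next section to keep working without it: check whether the `flash_attn` module is importable and otherwise fall back to PyTorch's scaled-dot-product attention (`"sdpa"`, a value Transformers accepts for `attn_implementation`). The helper name `pick_attn_implementation` is hypothetical and is not part of the groundcua package.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash-attn is installed, else "sdpa"."""
    # find_spec only looks the module up; it does not import or initialize it.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attn_implementation={attn_implementation!r}")
```

The returned string can be passed as `attn_implementation=` when loading the model, in place of the hard-coded `"flash_attention_2"`.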

Quick GroundNext Model Inference

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
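
To act on the prediction, the coordinates have to be pulled out of the response string. A minimal sketch, assuming the response contains a single `<tool_call>` block whose body is JSON shaped like the expected output above. `parse_tool_call` is a hypothetical helper, not a groundcua function, and the example coordinates are made up. Whether the coordinates refer to the image returned by `groundcua.prepare_image` or to the original screenshot is not stated here, so check that before clicking.

```python
import json
import re

# Capture whatever sits between the opening and closing tool_call tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_call(response):
    """Return (action, (x, y)) from a response, or None if it does not parse."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1).strip())
        args = payload["arguments"]
        x, y = args["coordinate"]
        return args["action"], (x, y)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON or an unexpected structure: treat as "no prediction".
        return None


# Illustrative response with made-up coordinates.
example = (
    '<tool_call>{"name": "computer_use", "arguments": '
    '{"action": "left_click", "coordinate": [120, 45]}}</tool_call>'
)
print(parse_tool_call(example))
```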

🎓 Training

🚧 Coming Soon: We are currently refining the training documentation and code. Complete training instructions, including supervised fine-tuning and reinforcement learning recipes, will be released in the training/ folder soon. Stay tuned!

Dataset

GroundCUA Dataset Overview

GroundCUA is a large-scale, human-annotated desktop grounding dataset with dense supervision:

📊 Scale: 56K annotated screenshots, 3.56M element annotations
🎯 Density: Maximum annotation density covering almost every visible UI element
✅ Quality: Human-verified annotations from trained experts
🖥️ Coverage: 87 desktop applications across 12 categories
📐 Resolution: High-resolution images (500K to 7M pixels)
🏷️ Categories: Fine-grained category information for 50% of elements

Dataset Access

Download the GroundCUA dataset:

pip install -U huggingface_hub
huggingface-cli download ServiceNow/GroundCUA --repo-type dataset --local-dir ./GroundCUA
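
The full dataset is large, so it can be convenient to fetch only part of it from Python. A sketch using `huggingface_hub.snapshot_download` with `allow_patterns`. The `images/<application>/` layout is an assumption inferred from the sample image URL in the Quick Start section, and `download_groundcua_subset` is a hypothetical helper name. Where the annotation files live in the repository is not covered here, so this pattern fetches images only.

```python
def download_groundcua_subset(app="7-Zip", local_dir="./GroundCUA"):
    """Download only the screenshots for one application folder.

    Assumes the dataset repo stores images under images/<application>/,
    as suggested by the sample URL in the Quick Start section.
    """
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="ServiceNow/GroundCUA",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=[f"images/{app}/*"],
    )


# Usage (requires network access and `pip install -U huggingface_hub`):
# path = download_groundcua_subset("7-Zip")
```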

📊 Evaluation

Our evaluation framework builds upon InfiGUI-G1 and provides comprehensive evaluation across multiple benchmarks.

Supported Benchmarks

ScreenSpot-Pro: Desktop element grounding
ScreenSpot-v2: Web and mobile interface grounding
MMBench-GUI: GUI understanding tasks
OSWorld-G: Operating system grounding
UI-Vision: Diverse desktop application grounding

Running Evaluations

cd eval/

# Evaluate on specific benchmark
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark screenspot \
    --data_path /path/to/benchmark/data \
    --output_dir results/

# Evaluate on all benchmarks
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark all \
    --task all \
    --language en

Evaluation Metrics

Accuracy: Precision of GUI element localization
Success Rate: Percentage of correctly grounded elements
Cross-Domain Performance: Generalization to unseen platforms
Fine-Grained Performance: Accuracy on small UI elements
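
For reference, ScreenSpot-style grounding benchmarks commonly count a prediction as correct when the predicted click point falls inside the target element's bounding box. The sketch below illustrates that convention only. It is not taken from the `eval/` code, which may differ, and the function names and the `(left, top, right, bottom)` box format are assumptions.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside box = (left, top, right, bottom), edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, boxes):
    """Fraction of targets whose predicted point falls inside the target box.

    A prediction of None (for example, an unparseable response) counts as a miss.
    """
    if not boxes:
        return 0.0
    hits = sum(
        pred is not None and point_in_box(pred, box)
        for pred, box in zip(predictions, boxes)
    )
    return hits / len(boxes)


# Toy data: one hit, one miss, one missing prediction.
predictions = [(10, 10), (50, 50), None]
boxes = [(0, 0, 20, 20), (0, 0, 20, 20), (0, 0, 5, 5)]
accuracy = grounding_accuracy(predictions, boxes)
print(f"accuracy = {accuracy:.3f}")
```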

Project Structure

GroundCUA/
├── README.md          # This file
├── pyproject.toml     # Package configuration
├── PUBLISHING.md      # Guide for publishing to PyPI
├── assets/            # Images and resources
├── groundcua/         # Main package (pip installable)
│   ├── __init__.py    # Package initialization and utilities
│   └── version.py     # Version information
├── eval/              # Evaluation framework
│   ├── eval.py        # Main evaluation script
│   ├── data.py        # Data loading utilities
│   ├── prompts.py     # Prompt processing
│   └── models/        # Model implementations
└── training/          # Training pipeline (documentation coming soon)

Acknowledgements

We thank the following projects and teams for their contributions to the open-source community:

InfiGUI-G1 for the evaluation framework foundation
LLaMA-Factory for the excellent SFT training framework
verl for the robust RL infrastructure
Qwen-2.5-VL for the foundation vision-language models
The computer use and GUI automation research community

Research Use and Disclaimer

GroundCUA is intended for research and educational purposes only.

Prohibited Uses

The model, dataset, and code may not be used for any purpose that violates applicable laws or regulations
Use for illegal, unethical, or harmful activities is strictly prohibited

Disclaimer

The authors and contributors are not responsible for any illegal, unethical, or harmful use
Users are solely responsible for ensuring compliance with applicable laws and regulations

Citation

If you use GroundCUA in your research, please cite our work:

@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}

Release history

This version

0.0.1rc1 pre-release

Nov 13, 2025

Download files

Source Distribution

groundcua-0.0.1rc1.tar.gz (11.7 kB)

Uploaded Nov 13, 2025 Source

Built Distribution

groundcua-0.0.1rc1-py3-none-any.whl (10.0 kB)

Uploaded Nov 13, 2025 Python 3

File details

Details for the file groundcua-0.0.1rc1.tar.gz.

File metadata

Download URL: groundcua-0.0.1rc1.tar.gz
Upload date: Nov 13, 2025
Size: 11.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for groundcua-0.0.1rc1.tar.gz
Algorithm	Hash digest
SHA256	`59160e0d0fb995da0af311747ce250f637d3f43776200030b645b3d32dbf5948`
MD5	`4327faac87d99de2c69f2ef3e9346bf5`
BLAKE2b-256	`23abca93a114ddf201d181cb50e66180a89593925591702d679e575e2fd8943f`

Provenance

The following attestation bundles were made for groundcua-0.0.1rc1.tar.gz:

Publisher: publish-python.yaml on xhluca/GroundCUA

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: groundcua-0.0.1rc1.tar.gz
- Subject digest: 59160e0d0fb995da0af311747ce250f637d3f43776200030b645b3d32dbf5948
- Sigstore transparency entry: 700011429
- Sigstore integration time: Nov 13, 2025
Source repository:
- Permalink: xhluca/GroundCUA@cde61dc402f7af944652c8a08181b8ebff660bdc
- Branch / Tag: refs/tags/0.0.1rc1
- Owner: https://github.com/xhluca
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-python.yaml@cde61dc402f7af944652c8a08181b8ebff660bdc
- Trigger Event: release

File details

Details for the file groundcua-0.0.1rc1-py3-none-any.whl.

File metadata

Download URL: groundcua-0.0.1rc1-py3-none-any.whl
Upload date: Nov 13, 2025
Size: 10.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for groundcua-0.0.1rc1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`52f1c2ca3b91e8e6243f7acc8079f3050f25d3563bdf53ddd457a46f2ccb8943`
MD5	`a45f680a0f3e110282fc49dc1a548b2b`
BLAKE2b-256	`d4e7606c1a13d16c8ed0f1f3392d424eb2249ee13e6a17efd268fb5bf3f36386`
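
A downloaded file can be checked against the SHA256 digests listed on this page. A minimal sketch using only the standard library. The digests are copied from the tables above, while `sha256_of` and `verify_download` are hypothetical helper names.

```python
import hashlib
import os

# SHA256 digests copied from the "File hashes" tables on this page.
EXPECTED_SHA256 = {
    "groundcua-0.0.1rc1.tar.gz":
        "59160e0d0fb995da0af311747ce250f637d3f43776200030b645b3d32dbf5948",
    "groundcua-0.0.1rc1-py3-none-any.whl":
        "52f1c2ca3b91e8e6243f7acc8079f3050f25d3563bdf53ddd457a46f2ccb8943",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """True if the file's digest matches the listed one for its file name."""
    expected = EXPECTED_SHA256.get(os.path.basename(path))
    if expected is None:
        raise ValueError(f"No known digest for {os.path.basename(path)!r}")
    return sha256_of(path) == expected


# Usage, after downloading the sdist into the current directory:
# print(verify_download("groundcua-0.0.1rc1.tar.gz"))
```

`pip` can enforce the same check at install time through hash-checking mode, where a requirements file pins each package with `--hash=sha256:...`.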

Provenance

The following attestation bundles were made for groundcua-0.0.1rc1-py3-none-any.whl:

Publisher: publish-python.yaml on xhluca/GroundCUA

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: groundcua-0.0.1rc1-py3-none-any.whl
- Subject digest: 52f1c2ca3b91e8e6243f7acc8079f3050f25d3563bdf53ddd457a46f2ccb8943
- Sigstore transparency entry: 700011437
- Sigstore integration time: Nov 13, 2025
Source repository:
- Permalink: xhluca/GroundCUA@cde61dc402f7af944652c8a08181b8ebff660bdc
- Branch / Tag: refs/tags/0.0.1rc1
- Owner: https://github.com/xhluca
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-python.yaml@cde61dc402f7af944652c8a08181b8ebff660bdc
- Trigger Event: release
