OpenAdapt Grounding
Robust UI element localization for automation.
Turn flaky single-frame detections into stable, reliable element coordinates.
The Problem
Vision models like OmniParser miss elements at random from frame to frame ("flickering"), and template matching breaks when the resolution or theme changes.
*Left: raw detections showing frame-to-frame flickering*
The Solution
- Temporal Smoothing: Aggregate detections across frames, keep only stable elements
- Text Anchoring: Match elements by OCR text (resolution-independent)
*Side-by-side: raw flickering (left) vs. stabilized detection (right)*
Results
Detection Stability
| Metric | Raw (30% dropout) | Stabilized |
|---|---|---|
| Avg Detection Rate | ~60-70% | 80-100% |
| Min Detection Rate | ~40% | Consistent |
| Consistency | Flickering | Stable |
Resolution Robustness
| Scale | Resolution | Elements Found | Status |
|---|---|---|---|
| 1.0x | 800x600 | All | ✓ |
| 1.25x | 1000x750 | All | ✓ |
| 1.5x | 1200x900 | All | ✓ |
| 2.0x | 1600x1200 | All | ✓ |
Visual Output
*Stable elements after temporal filtering*
Quick Start
```bash
uv pip install openadapt-grounding
```
Build a Registry (Offline)
```python
from openadapt_grounding import RegistryBuilder, Element

# Add detections from multiple frames
builder = RegistryBuilder()
builder.add_frame([
    Element(bounds=(0.3, 0.2, 0.2, 0.05), text="Login"),
    Element(bounds=(0.3, 0.3, 0.2, 0.05), text="Cancel"),
])
# ... add more frames

# Build registry (keeps elements in >50% of frames)
registry = builder.build(min_stability=0.5)
registry.save("elements.json")
```
Locate Elements (Runtime)
```python
from openadapt_grounding import ElementLocator
from PIL import Image

locator = ElementLocator("elements.json")
screenshot = Image.open("current_screen.png")

result = locator.find("Login", screenshot)
if result.found:
    # Normalized coordinates (0-1)
    print(f"Found at ({result.x:.2f}, {result.y:.2f})")

    # Convert to pixels
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")
```
Run Demo
```bash
uv run python -m openadapt_grounding.demo
```

Output:

```
============================================================
OpenAdapt Grounding Demo Results
============================================================
Registry: 5 stable elements

📊 Detection Stability:
   Raw (with 30% dropout): 70%
   Stabilized (filtered):  100%
   Improvement: +30%

📐 Resolution Robustness:
   ✓ 1.0x (800x600): 5 elements
   ✓ 1.25x (1000x750): 5 elements
   ✓ 1.5x (1200x900): 5 elements
   ✓ 2.0x (1600x1200): 5 elements

📁 Outputs saved to: demo_output/
```
How It Works
Temporal Clustering
```
Frame 1: [Login ✓] [Cancel ✓] [Password ✗] → 2/3 detected
Frame 2: [Login ✓] [Cancel ✗] [Password ✓] → 2/3 detected
Frame 3: [Login ✓] [Cancel ✓] [Password ✓] → 3/3 detected
...

After 10 frames:
- "Login"    seen 9/10 times → KEEP (90% stability)
- "Cancel"   seen 7/10 times → KEEP (70% stability)
- "Password" seen 8/10 times → KEEP (80% stability)
```
Text-Based Matching
At runtime, we use OCR to find text on screen, then match against the registry:
```python
# Registry knows a "Login" button exists
# OCR finds "Login" text at (0.45, 0.35)
# → Return those coordinates with high confidence
```
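A rough sketch of that matching step, assuming OCR returns `(text, x, y)` tuples in normalized coordinates (the actual matcher may differ, e.g. in how it handles fuzzy matches):

```python
import difflib

def match_text(query, ocr_results, cutoff=0.8):
    """Find the on-screen position of a registry label in OCR output."""
    # Exact case-insensitive match first
    candidates = {text.lower(): (x, y) for text, x, y in ocr_results}
    key = query.lower()
    if key in candidates:
        return candidates[key]
    # Fall back to a fuzzy match to tolerate minor OCR errors ("Logln" vs "Login")
    close = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
    return candidates[close[0]] if close else None

ocr = [("Login", 0.45, 0.35), ("Cancel", 0.45, 0.45)]
print(match_text("Login", ocr))  # (0.45, 0.35)
```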
OmniParser Integration
Use with OmniParser for real UI element detection:
Deploy OmniParser Server
```bash
# Install deploy dependencies
uv pip install openadapt-grounding[deploy]

# Set AWS credentials (or use .env file)
cp .env.example .env
# Edit .env with your AWS credentials

# Deploy to EC2 (g6.xlarge with L4 GPU)
uv run python -m openadapt_grounding.deploy start

# Stop when done (terminates instance)
uv run python -m openadapt_grounding.deploy stop
```
Monitor Deployment
```
# Check instance and server status
$ uv run python -m openadapt_grounding.deploy status
Instance: i-0f57529053cb507ca | State: running | URL: http://98.92.234.13:8000
Auto-shutdown: Enabled (60 min timeout)

# Show container status
$ uv run python -m openadapt_grounding.deploy ps
CONTAINER ID   IMAGE               CREATED       STATUS       PORTS                    NAMES
c9343a65e85b   omniparser:latest   2 hours ago   Up 2 hours   0.0.0.0:8000->8000/tcp   omniparser-container

# View container logs
$ uv run python -m openadapt_grounding.deploy logs --lines=5
INFO:     99.230.67.57:61252 - "POST /parse/ HTTP/1.1" 200 OK
start parsing...
image size: (1200, 779)
len(filtered_boxes): 160 124
time: 4.438266754150391

# Test endpoint with synthetic image
$ uv run python -m openadapt_grounding.deploy test
Server is healthy!
Sending test image to server...
Found 5 elements:
  - [text] "Login" at ['0.08', '0.10', '0.38', '0.23']
  - [text] "Cancel" at ['0.08', '0.30', '0.38', '0.43']
  ...
```
Other Commands
```bash
uv run python -m openadapt_grounding.deploy build  # Rebuild Docker image
uv run python -m openadapt_grounding.deploy run    # Start container
uv run python -m openadapt_grounding.deploy ssh    # SSH into instance
```
Test Results
Real screenshot parsed by OmniParser: input alongside output, with 160 elements detected.

Synthetic UI test (input vs. output):

```bash
# Run test with synthetic UI
uv run python -m openadapt_grounding.deploy test --save_output
```
Use OmniParser with Temporal Smoothing
```python
from openadapt_grounding import OmniParserClient, collect_frames
from PIL import Image

# Connect to deployed server
client = OmniParserClient("http://<server-ip>:8000")

# Take a screenshot
screenshot = Image.open("screen.png")

# Run parser 10 times, keep elements in >50% of frames
registry = collect_frames(client, screenshot, num_frames=10, min_stability=0.5)
registry.save("stable_elements.json")
print(f"Found {len(registry)} stable elements")
```
Analyze Detection Stability
```python
from openadapt_grounding import OmniParserClient, analyze_stability

client = OmniParserClient("http://<server-ip>:8000")
stats = analyze_stability(client, screenshot, num_frames=10)

print(f"Average stability: {stats['avg_stability']:.0%}")
for elem in stats['elements']:
    print(f"  {elem['text']}: {elem['stability']:.0%}")
```
UI-TARS Integration
UI-TARS 1.5 is ByteDance's SOTA UI grounding model (61.6% on ScreenSpot-Pro). Use it to localize elements directly from a natural-language instruction.
Deploy UI-TARS Server
```bash
# Install dependencies
uv pip install openadapt-grounding[deploy,uitars]

# Deploy to EC2 (g6.2xlarge with L4 GPU)
uv run python -m openadapt_grounding.deploy.uitars start

# Check status
uv run python -m openadapt_grounding.deploy.uitars status

# Test grounding
uv run python -m openadapt_grounding.deploy.uitars test

# Stop when done
uv run python -m openadapt_grounding.deploy.uitars stop
```
Use UI-TARS for Grounding
```python
from openadapt_grounding import UITarsClient
from PIL import Image

# Connect to deployed server
client = UITarsClient("http://<server-ip>:8001/v1")

# Load screenshot
screenshot = Image.open("screen.png")

# Ground element by instruction
result = client.ground(screenshot, "Click on the Login button")
if result.found:
    # Normalized coordinates (0-1)
    print(f"Found at ({result.x:.2f}, {result.y:.2f})")

    # Convert to pixels
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")

    # Optional: view the model's reasoning
    if result.thought:
        print(f"Thought: {result.thought}")
```
OmniParser vs UI-TARS
| Feature | OmniParser | UI-TARS |
|---|---|---|
| Approach | Parse all elements | Ground by query |
| Output | List of bboxes | Single click point |
| Best for | Enumeration, registry building | Direct element finding |
| Detection Rate (our benchmark) | 99.3% | 70.6% |
| Latency (per element) | ~1.4s | ~6.9s |
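Because the two are complementary, one natural pattern is to build the registry offline with OmniParser and fall back to UI-TARS grounding when a label is missing at runtime. A hypothetical sketch using the clients documented here (server URLs are placeholders):

```python
from PIL import Image
from openadapt_grounding import ElementLocator, UITarsClient

locator = ElementLocator("elements.json")            # registry built offline (OmniParser)
uitars = UITarsClient("http://<server-ip>:8001/v1")  # slower grounding fallback

screenshot = Image.open("screen.png")

# Fast path: OCR + registry lookup
result = locator.find("Login", screenshot)
if not result.found:
    # Slow path: ask UI-TARS to ground the element from an instruction
    result = uitars.ground(screenshot, "Click on the Login button")

if result.found:
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")
```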
Evaluation & Benchmarking
We provide a comprehensive evaluation framework to compare UI grounding methods.
Benchmark Results
Evaluated on a synthetic dataset (100 samples, 1,922 UI elements):
| Method | Detection Rate | IoU | Latency | Attempts |
|---|---|---|---|---|
| OmniParser + screenseeker | 99.3% | 0.690 | 1418ms | 2.0 |
| OmniParser + fixed | 98.1% | 0.681 | 1486ms | 2.2 |
| OmniParser baseline | 97.4% | 0.648 | 724ms | 1.0 |
| UI-TARS + screenseeker | 70.6% | - | 6914ms | 2.3 |
| UI-TARS + fixed | 66.9% | - | 6891ms | 2.4 |
| UI-TARS baseline | 36.1% | - | 2724ms | 1.0 |
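For reference, the IoU column is the standard intersection-over-union between predicted and ground-truth boxes. A minimal computation, assuming boxes are given as `(x, y, w, h)` in normalized coordinates (the apparent format of `Element` bounds above):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width/height of the intersection rectangle (zero if disjoint)
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction offset by 0.01 from its ground-truth box
print(iou((0.30, 0.20, 0.20, 0.05), (0.31, 0.20, 0.20, 0.05)))  # ~0.90
```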
On harder synthetic data (48 samples, 1,035 elements: denser layouts, smaller targets):
| Method | Detection Rate | Change from Standard |
|---|---|---|
| OmniParser + fixed | 98.2% | +0.1% |
| OmniParser + screenseeker | 96.6% | -2.7% |
| OmniParser baseline | 90.1% | -7.3% |
Key Findings:
- Cropping strategies dramatically improve UI-TARS accuracy (detection nearly doubles with ScreenSeekeR, 36.1% → 70.6%) but have minimal effect on OmniParser on standard data
- On harder data, cropping becomes essential: the OmniParser baseline drops 7.3 points (to 90.1%), while fixed cropping maintains 98%+ accuracy
- OmniParser is 3.8-5x faster than UI-TARS while being significantly more accurate
Detection Rate by Method
Detection Rate by Element Size
Small elements (<32px) are hardest for UI-TARS (28.6% baseline → 50% with cropping), while OmniParser maintains ~100% across all sizes.
Accuracy vs Latency Tradeoff
OmniParser offers the best accuracy-latency tradeoff, with near-perfect detection at <1.5s per element.
Synthetic Dataset Samples
The evaluation uses programmatically generated UI screenshots with ground truth, ranging from easy layouts (3-8 elements) to hard ones (20-50 elements, dark theme).
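A minimal sketch of how such a sample could be produced with PIL (illustrative only; not the bundled generator, and the labels and layout here are made up):

```python
import json
import random
from PIL import Image, ImageDraw

def make_sample(width=800, height=600, seed=0):
    """Draw labeled button-like rectangles and record ground-truth boxes."""
    rng = random.Random(seed)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    truth = []
    for label in ["Login", "Cancel", "Save", "Open", "Help"]:
        x, y = rng.randint(20, width - 140), rng.randint(20, height - 60)
        w, h = 120, 36
        draw.rectangle([x, y, x + w, y + h], outline="black", fill="#e0e0e0")
        draw.text((x + 10, y + 10), label, fill="black")
        # Ground truth in normalized (x, y, w, h), matching Element bounds
        truth.append({"text": label, "bounds": [x / width, y / height, w / width, h / height]})
    return img, truth

img, truth = make_sample()
img.save("sample.png")
with open("sample.json", "w") as f:
    json.dump(truth, f, indent=2)
```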
Run Your Own Evaluation
```bash
# Install dependencies
uv pip install openadapt-grounding[eval]

# Generate synthetic dataset
uv run python -m openadapt_grounding.eval generate --type synthetic --count 100

# Run evaluation (requires deployed servers)
uv run python -m openadapt_grounding.eval run --method omniparser --dataset synthetic
uv run python -m openadapt_grounding.eval run --method uitars --dataset synthetic

# With cropping strategies
uv run python -m openadapt_grounding.eval run --method omniparser-screenseeker --dataset synthetic
uv run python -m openadapt_grounding.eval run --method uitars-screenseeker --dataset synthetic

# Generate comparison charts
uv run python -m openadapt_grounding.eval compare --charts-dir evaluation/charts
```
Available Methods
| Method | Description |
|---|---|
| `omniparser` | OmniParser baseline (full image) |
| `omniparser-fixed` | OmniParser + fixed cropping (200, 300, 500px) |
| `omniparser-screenseeker` | OmniParser + heuristic UI region cropping |
| `uitars` | UI-TARS baseline (full image) |
| `uitars-fixed` | UI-TARS + fixed cropping |
| `uitars-screenseeker` | UI-TARS + heuristic UI region cropping |
See Evaluation Documentation for methodology and metrics.
API
RegistryBuilder
- `add_frame(elements)` - Add a frame's detections
- `build(min_stability=0.5)` - Build registry, filtering unstable elements
ElementLocator
- `find(query, screenshot)` - Find element by text
- `find_by_uid(uid, screenshot)` - Find element by registry UID
LocatorResult
- `found: bool` - Whether element was found
- `x, y: float` - Normalized coordinates (0-1)
- `confidence: float` - Match confidence
- `to_pixels(w, h)` - Convert to pixel coordinates
OmniParserClient
- `is_available()` - Check if server is running
- `parse(image)` - Parse screenshot, return elements
- `parse_with_metadata(image)` - Parse with latency info
UITarsClient
- `is_available()` - Check if server is running
- `ground(image, instruction)` - Find element by instruction, return `GroundingResult`
GroundingResult
- `found: bool` - Whether element was found
- `x, y: float` - Normalized coordinates (0-1)
- `confidence: float` - Match confidence
- `thought: str` - Model's reasoning (if `include_thought=True`)
- `to_pixels(w, h)` - Convert to pixel coordinates
`collect_frames(parser, image, num_frames, min_stability)`
- Run parser multiple times, build a stable registry

`analyze_stability(parser, image, num_frames)`
- Report per-element detection stability
Documentation
| Document | Description |
|---|---|
| Evaluation Findings | Analysis of why OmniParser outperforms UI-TARS on our task |
| Literature Review | SOTA analysis: UI-TARS (61.6%), OmniParser (39.6%), ScreenSeekeR cropping |
| Experiment Plan | Comparison methodology: 6 methods, 3 datasets, evaluation metrics |
| Evaluation Harness | Benchmarking framework, dataset formats, CLI usage |
| UI-TARS Deployment | UI-TARS deployment design, vLLM setup, API format |
Key Findings
From our benchmark on synthetic UI data:
- OmniParser dominates on our task: 97-99% detection vs UI-TARS's 36-70%
- Cropping becomes essential on harder data: OmniParser baseline drops to 90%, but fixed cropping maintains 98%+
- OmniParser is 3.8-5x faster than UI-TARS while being more accurate
- Literature benchmarks don't transfer directly: UI-TARS leads on ScreenSpot-Pro (complex instruction-following) but OmniParser wins on element detection
- Small elements (<32px) remain hardest for UI-TARS (28.6% baseline → 50% with cropping)
See Evaluation Findings for analysis of why results differ from literature benchmarks.
Development
```bash
git clone https://github.com/OpenAdaptAI/openadapt-grounding
cd openadapt-grounding
uv sync
uv run pytest
```
License
MIT