OpenAdapt Grounding
Robust UI element localization for automation.
Turn flaky single-frame detections into stable, reliable element coordinates.
The Problem
Vision models like OmniParser miss elements at random from frame to frame ("flickering"), and template matching breaks when the resolution or theme changes.
*Left: raw detections showing frame-to-frame flickering*
The Solution
- Temporal Smoothing: Aggregate detections across frames, keep only stable elements
- Text Anchoring: Match elements by OCR text (resolution-independent)
*Side-by-side: raw flickering (left) vs. stabilized detection (right)*
Results
Detection Stability
| Metric | Raw (30% dropout) | Stabilized |
|---|---|---|
| Avg Detection Rate | ~60-70% | 80-100% |
| Min Detection Rate | ~40% | Consistent |
| Consistency | Flickering | Stable |
Resolution Robustness
| Scale | Resolution | Elements Found | Status |
|---|---|---|---|
| 1.0x | 800x600 | All | ✓ |
| 1.25x | 1000x750 | All | ✓ |
| 1.5x | 1200x900 | All | ✓ |
| 2.0x | 1600x1200 | All | ✓ |
Visual Output
*Stable elements after temporal filtering*
Quick Start
```bash
uv pip install openadapt-grounding
```
Build a Registry (Offline)
```python
from openadapt_grounding import RegistryBuilder, Element

# Add detections from multiple frames
builder = RegistryBuilder()
builder.add_frame([
    Element(bounds=(0.3, 0.2, 0.2, 0.05), text="Login"),
    Element(bounds=(0.3, 0.3, 0.2, 0.05), text="Cancel"),
])
# ... add more frames

# Build registry (keeps elements in >50% of frames)
registry = builder.build(min_stability=0.5)
registry.save("elements.json")
```
Locate Elements (Runtime)
```python
from openadapt_grounding import ElementLocator
from PIL import Image

locator = ElementLocator("elements.json")
screenshot = Image.open("current_screen.png")

result = locator.find("Login", screenshot)
if result.found:
    # Normalized coordinates (0-1)
    print(f"Found at ({result.x:.2f}, {result.y:.2f})")

    # Convert to pixels
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")
```
Run Demo
```bash
uv run python -m openadapt_grounding.demo
```

Output:

```
============================================================
OpenAdapt Grounding Demo Results
============================================================
Registry: 5 stable elements

📊 Detection Stability:
   Raw (with 30% dropout): 70%
   Stabilized (filtered):  100%
   Improvement: +30%

📐 Resolution Robustness:
   ✓ 1.0x (800x600): 5 elements
   ✓ 1.25x (1000x750): 5 elements
   ✓ 1.5x (1200x900): 5 elements
   ✓ 2.0x (1600x1200): 5 elements

📁 Outputs saved to: demo_output/
```
How It Works
Temporal Clustering
```
Frame 1: [Login ✓] [Cancel ✓] [Password ✗] → 2/3 detected
Frame 2: [Login ✓] [Cancel ✗] [Password ✓] → 2/3 detected
Frame 3: [Login ✓] [Cancel ✓] [Password ✓] → 3/3 detected
...

After 10 frames:
- "Login"    seen 9/10 times → KEEP (90% stability)
- "Cancel"   seen 7/10 times → KEEP (70% stability)
- "Password" seen 8/10 times → KEEP (80% stability)
```
Text-Based Matching
At runtime, we use OCR to find text on screen, then match against the registry:
```python
# Registry knows a "Login" button exists
# OCR finds "Login" text at (0.45, 0.35)
# → Return those coordinates with high confidence
```
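A rough sketch of that matching step, assuming OCR returns `(text, x, y)` tuples in normalized coordinates (the actual matcher may differ, e.g. in how it handles fuzzy matches):

```python
import difflib

def match_text(query, ocr_results, cutoff=0.8):
    """Find the on-screen position of a registry label in OCR output."""
    # Exact case-insensitive match first
    candidates = {text.lower(): (x, y) for text, x, y in ocr_results}
    key = query.lower()
    if key in candidates:
        return candidates[key]
    # Fall back to a fuzzy match to tolerate minor OCR errors ("Logln" vs "Login")
    close = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
    return candidates[close[0]] if close else None

ocr = [("Login", 0.45, 0.35), ("Cancel", 0.45, 0.45)]
print(match_text("Login", ocr))  # (0.45, 0.35)
```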
OmniParser Integration
Use with OmniParser for real UI element detection:
Deploy OmniParser Server
```bash
# Install deploy dependencies
uv pip install openadapt-grounding[deploy]

# Set AWS credentials (or use .env file)
cp .env.example .env
# Edit .env with your AWS credentials

# Deploy to EC2 (g6.xlarge with L4 GPU)
uv run python -m openadapt_grounding.deploy start

# Stop when done (terminates instance)
uv run python -m openadapt_grounding.deploy stop
```
Monitor Deployment
```
# Check instance and server status
$ uv run python -m openadapt_grounding.deploy status
Instance: i-0f57529053cb507ca | State: running | URL: http://98.92.234.13:8000
Auto-shutdown: Enabled (60 min timeout)

# Show container status
$ uv run python -m openadapt_grounding.deploy ps
CONTAINER ID   IMAGE               CREATED       STATUS       PORTS                    NAMES
c9343a65e85b   omniparser:latest   2 hours ago   Up 2 hours   0.0.0.0:8000->8000/tcp   omniparser-container

# View container logs
$ uv run python -m openadapt_grounding.deploy logs --lines=5
INFO:     99.230.67.57:61252 - "POST /parse/ HTTP/1.1" 200 OK
start parsing...
image size: (1200, 779)
len(filtered_boxes): 160 124
time: 4.438266754150391

# Test endpoint with synthetic image
$ uv run python -m openadapt_grounding.deploy test
Server is healthy!
Sending test image to server...
Found 5 elements:
  - [text] "Login" at ['0.08', '0.10', '0.38', '0.23']
  - [text] "Cancel" at ['0.08', '0.30', '0.38', '0.43']
  ...
```
Other Commands
```bash
uv run python -m openadapt_grounding.deploy build  # Rebuild Docker image
uv run python -m openadapt_grounding.deploy run    # Start container
uv run python -m openadapt_grounding.deploy ssh    # SSH into instance
```
Test Results
Real screenshot parsed by OmniParser: input alongside output, with 160 elements detected.

Synthetic UI test (input vs. output):

```bash
# Run test with synthetic UI
uv run python -m openadapt_grounding.deploy test --save_output
```
Use OmniParser with Temporal Smoothing
```python
from openadapt_grounding import OmniParserClient, collect_frames
from PIL import Image

# Connect to deployed server
client = OmniParserClient("http://<server-ip>:8000")

# Take a screenshot
screenshot = Image.open("screen.png")

# Run parser 10 times, keep elements in >50% of frames
registry = collect_frames(client, screenshot, num_frames=10, min_stability=0.5)
registry.save("stable_elements.json")
print(f"Found {len(registry)} stable elements")
```
Analyze Detection Stability
```python
from openadapt_grounding import OmniParserClient, analyze_stability

client = OmniParserClient("http://<server-ip>:8000")
stats = analyze_stability(client, screenshot, num_frames=10)

print(f"Average stability: {stats['avg_stability']:.0%}")
for elem in stats['elements']:
    print(f"  {elem['text']}: {elem['stability']:.0%}")
```
UI-TARS Integration
UI-TARS 1.5 is ByteDance's SOTA UI grounding model (61.6% on ScreenSpot-Pro). Use it to localize elements directly from a natural-language instruction.
Deploy UI-TARS Server
```bash
# Install dependencies
uv pip install openadapt-grounding[deploy,uitars]

# Deploy to EC2 (g6.2xlarge with L4 GPU)
uv run python -m openadapt_grounding.deploy.uitars start

# Check status
uv run python -m openadapt_grounding.deploy.uitars status

# Test grounding
uv run python -m openadapt_grounding.deploy.uitars test

# Stop when done
uv run python -m openadapt_grounding.deploy.uitars stop
```
Use UI-TARS for Grounding
```python
from openadapt_grounding import UITarsClient
from PIL import Image

# Connect to deployed server
client = UITarsClient("http://<server-ip>:8001/v1")

# Load screenshot
screenshot = Image.open("screen.png")

# Ground element by instruction
result = client.ground(screenshot, "Click on the Login button")
if result.found:
    # Normalized coordinates (0-1)
    print(f"Found at ({result.x:.2f}, {result.y:.2f})")

    # Convert to pixels
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")

    # Optional: view the model's reasoning
    if result.thought:
        print(f"Thought: {result.thought}")
```
OmniParser vs UI-TARS
| Feature | OmniParser | UI-TARS |
|---|---|---|
| Approach | Parse all elements | Ground by query |
| Output | List of bboxes | Single click point |
| Best for | Enumeration, registry building | Direct element finding |
| Detection Rate (our benchmark) | 99.3% | 70.6% |
| Latency (per element) | ~1.4s | ~6.9s |
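Because the two are complementary, one natural pattern is to build the registry offline with OmniParser and fall back to UI-TARS grounding when a label is missing at runtime. A hypothetical sketch using the clients documented here (server URLs are placeholders):

```python
from PIL import Image
from openadapt_grounding import ElementLocator, UITarsClient

locator = ElementLocator("elements.json")            # registry built offline (OmniParser)
uitars = UITarsClient("http://<server-ip>:8001/v1")  # slower grounding fallback

screenshot = Image.open("screen.png")

# Fast path: OCR + registry lookup
result = locator.find("Login", screenshot)
if not result.found:
    # Slow path: ask UI-TARS to ground the element from an instruction
    result = uitars.ground(screenshot, "Click on the Login button")

if result.found:
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")
```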
Evaluation & Benchmarking
We provide a comprehensive evaluation framework to compare UI grounding methods.
Benchmark Results
Evaluated on a synthetic dataset (100 samples, 1,922 UI elements):
| Method | Detection Rate | IoU | Latency | Attempts |
|---|---|---|---|---|
| OmniParser + screenseeker | 99.3% | 0.690 | 1418ms | 2.0 |
| OmniParser + fixed | 98.1% | 0.681 | 1486ms | 2.2 |
| OmniParser baseline | 97.4% | 0.648 | 724ms | 1.0 |
| UI-TARS + screenseeker | 70.6% | - | 6914ms | 2.3 |
| UI-TARS + fixed | 66.9% | - | 6891ms | 2.4 |
| UI-TARS baseline | 36.1% | - | 2724ms | 1.0 |
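For reference, the IoU column is the standard intersection-over-union between predicted and ground-truth boxes. A minimal computation, assuming boxes are given as `(x, y, w, h)` in normalized coordinates (the apparent format of `Element` bounds above):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width/height of the intersection rectangle (zero if disjoint)
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction offset by 0.01 from its ground-truth box
print(iou((0.30, 0.20, 0.20, 0.05), (0.31, 0.20, 0.20, 0.05)))  # ~0.90
```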
On harder synthetic data (48 samples, 1,035 elements: denser layouts, smaller targets):
| Method | Detection Rate | Change from Standard |
|---|---|---|
| OmniParser + fixed | 98.2% | +0.1% |
| OmniParser + screenseeker | 96.6% | -2.7% |
| OmniParser baseline | 90.1% | -7.3% |
Key Findings:
- Cropping strategies dramatically improve UI-TARS accuracy (detection nearly doubles with ScreenSeekeR, 36.1% → 70.6%) but have minimal effect on OmniParser on standard data
- On harder data, cropping becomes essential: the OmniParser baseline drops 7.3 points (to 90.1%), while fixed cropping maintains 98%+ accuracy
- OmniParser is 3.8-5x faster than UI-TARS while being significantly more accurate
Detection Rate by Method
Detection Rate by Element Size
Small elements (<32px) are hardest for UI-TARS (28.6% baseline → 50% with cropping), while OmniParser maintains ~100% across all sizes.
Accuracy vs Latency Tradeoff
OmniParser offers the best accuracy-latency tradeoff, with near-perfect detection at <1.5s per element.
Synthetic Dataset Samples
The evaluation uses programmatically generated UI screenshots with ground truth, ranging from easy layouts (3-8 elements) to hard ones (20-50 elements, dark theme).
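A minimal sketch of how such a sample could be produced with PIL (illustrative only; not the bundled generator, and the labels and layout here are made up):

```python
import json
import random
from PIL import Image, ImageDraw

def make_sample(width=800, height=600, seed=0):
    """Draw labeled button-like rectangles and record ground-truth boxes."""
    rng = random.Random(seed)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    truth = []
    for label in ["Login", "Cancel", "Save", "Open", "Help"]:
        x, y = rng.randint(20, width - 140), rng.randint(20, height - 60)
        w, h = 120, 36
        draw.rectangle([x, y, x + w, y + h], outline="black", fill="#e0e0e0")
        draw.text((x + 10, y + 10), label, fill="black")
        # Ground truth in normalized (x, y, w, h), matching Element bounds
        truth.append({"text": label, "bounds": [x / width, y / height, w / width, h / height]})
    return img, truth

img, truth = make_sample()
img.save("sample.png")
with open("sample.json", "w") as f:
    json.dump(truth, f, indent=2)
```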
Run Your Own Evaluation
```bash
# Install dependencies
uv pip install openadapt-grounding[eval]

# Generate synthetic dataset
uv run python -m openadapt_grounding.eval generate --type synthetic --count 100

# Run evaluation (requires deployed servers)
uv run python -m openadapt_grounding.eval run --method omniparser --dataset synthetic
uv run python -m openadapt_grounding.eval run --method uitars --dataset synthetic

# With cropping strategies
uv run python -m openadapt_grounding.eval run --method omniparser-screenseeker --dataset synthetic
uv run python -m openadapt_grounding.eval run --method uitars-screenseeker --dataset synthetic

# Generate comparison charts
uv run python -m openadapt_grounding.eval compare --charts-dir evaluation/charts
```
Available Methods
| Method | Description |
|---|---|
| `omniparser` | OmniParser baseline (full image) |
| `omniparser-fixed` | OmniParser + fixed cropping (200, 300, 500px) |
| `omniparser-screenseeker` | OmniParser + heuristic UI region cropping |
| `uitars` | UI-TARS baseline (full image) |
| `uitars-fixed` | UI-TARS + fixed cropping |
| `uitars-screenseeker` | UI-TARS + heuristic UI region cropping |
See Evaluation Documentation for methodology and metrics.
API
RegistryBuilder
- `add_frame(elements)` - Add a frame's detections
- `build(min_stability=0.5)` - Build registry, filtering unstable elements
ElementLocator
- `find(query, screenshot)` - Find element by text
- `find_by_uid(uid, screenshot)` - Find element by registry UID
LocatorResult
- `found: bool` - Whether element was found
- `x, y: float` - Normalized coordinates (0-1)
- `confidence: float` - Match confidence
- `to_pixels(w, h)` - Convert to pixel coordinates
OmniParserClient
- `is_available()` - Check if server is running
- `parse(image)` - Parse screenshot, return elements
- `parse_with_metadata(image)` - Parse with latency info
UITarsClient
- `is_available()` - Check if server is running
- `ground(image, instruction)` - Find element by instruction, return `GroundingResult`
GroundingResult
- `found: bool` - Whether element was found
- `x, y: float` - Normalized coordinates (0-1)
- `confidence: float` - Match confidence
- `thought: str` - Model's reasoning (if `include_thought=True`)
- `to_pixels(w, h)` - Convert to pixel coordinates
`collect_frames(parser, image, num_frames, min_stability)`
- Run parser multiple times, build a stable registry

`analyze_stability(parser, image, num_frames)`
- Report per-element detection stability
Documentation
| Document | Description |
|---|---|
| Evaluation Findings | Analysis of why OmniParser outperforms UI-TARS on our task |
| Literature Review | SOTA analysis: UI-TARS (61.6%), OmniParser (39.6%), ScreenSeekeR cropping |
| Experiment Plan | Comparison methodology: 6 methods, 3 datasets, evaluation metrics |
| Evaluation Harness | Benchmarking framework, dataset formats, CLI usage |
| UI-TARS Deployment | UI-TARS deployment design, vLLM setup, API format |
Key Findings
From our benchmark on synthetic UI data:
- OmniParser dominates on our task: 97-99% detection vs UI-TARS's 36-70%
- Cropping becomes essential on harder data: OmniParser baseline drops to 90%, but fixed cropping maintains 98%+
- OmniParser is 3.8-5x faster than UI-TARS while being more accurate
- Literature benchmarks don't transfer directly: UI-TARS leads on ScreenSpot-Pro (complex instruction-following) but OmniParser wins on element detection
- Small elements (<32px) remain hardest for UI-TARS (28.6% baseline → 50% with cropping)
See Evaluation Findings for analysis of why results differ from literature benchmarks.
Development
```bash
git clone https://github.com/OpenAdaptAI/openadapt-grounding
cd openadapt-grounding
uv sync
uv run pytest
```
License
MIT