AI-powered UI element detection and device automation framework

UIAutoAgent

AI-driven mobile UI automation framework with visual AI element detection and autonomous task execution.

Features

AI-powered visual element detection, no DOM required
Autonomous task planning and execution
Task memory with learning capabilities
Android / iOS device support
Flexible model configuration (different models per scenario with fallback chain)
Visual HTML reports (annotated screenshots, token usage, timing)
AI image content extraction (structured JSON output)
Startup model availability check for all candidates
Before/after screenshot comparison (AI judges whether an action took effect)

Installation

uv sync
cp .env.example .env
# Edit .env to configure API_KEY and model

Configuration

Configure an OpenAI-compatible API in .env:

# Core config (UIAUTO_ prefix recommended; legacy variable names still supported)
UIAUTO_BASE_URL=--openai-compatible--
UIAUTO_API_KEY=sk-xxx
UIAUTO_MODEL_NAME=doubao-seed-2.0-pro,glm-4.6v  # Default model candidates, tried in order

# Optional: different models per scenario
UIAUTO_MODEL_VISION=doubao-seed-2.0-pro  # Vision model candidates (planning + detection)
UIAUTO_MODEL_TEXT=gpt-4o-mini,deepseek-chat          # Text model candidates (summarization, etc.)

# Proxy (optional)
UIAUTO_MODEL_PROXY=http://127.0.0.1:7890

# Request timeout in seconds
UIAUTO_REQUEST_TIMEOUT=60

# Report output directory (optional)
UIAUTO_REPORT_DIR=/path/to/reports   # Reports written directly here; otherwise defaults to uiautoagent_reports/task_xxx/

# OpenRouter request tracking (optional)
OPENROUTER_SITE_URL=https://yoursite.com
OPENROUTER_SITE_NAME=YourAppName
SESSION_ID=my-session-123   # Auto-generated UUID if not set

Note: Environment variables now carry a UIAUTO_ prefix to avoid naming conflicts. Legacy names (e.g. BASE_URL, API_KEY) still work as a fallback, but the prefixed names are recommended.

Model env vars support comma-separated candidate lists; order defines the fallback sequence. When a call fails, the next candidate is tried automatically.
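The fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not the package's actual implementation: parse_candidates and call_with_fallback are hypothetical names, and the sketch assumes that any raised exception counts as a failed call.

```python
from typing import Callable, Sequence


def parse_candidates(raw: str) -> list[str]:
    """Split a comma-separated model list, dropping blanks and surrounding spaces."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def call_with_fallback(candidates: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each candidate in order and return the first successful result."""
    errors: list[str] = []
    for model in candidates:
        try:
            return call(model)
        except Exception as exc:  # assumption: any exception moves on to the next candidate
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all candidates failed: " + "; ".join(errors))
```

Here call stands in for whatever function actually sends the request with a given model name.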

All Environment Variables

Variable	Legacy Fallback	Default	Description
`UIAUTO_BASE_URL`	`BASE_URL`	`https://api.openai.com/v1`	OpenAI-compatible API base URL
`UIAUTO_API_KEY`	`API_KEY`	—	API key
`UIAUTO_MODEL_NAME`	`MODEL_NAME`	`doubao-seed-2.0-pro`	Default model candidates (comma-separated, tried in order)
`UIAUTO_MODEL_VISION`	`MODEL_VISION`	same as `UIAUTO_MODEL_NAME`	Vision model candidates (planning + detection, requires vision capability)
`UIAUTO_MODEL_TEXT`	`MODEL_TEXT`	same as `UIAUTO_MODEL_NAME`	Text model candidates (summarization, clarification, search)
`UIAUTO_MODEL_PROXY`	`MODEL_PROXY`	—	HTTP proxy (e.g. `http://127.0.0.1:7890`)
`UIAUTO_REQUEST_TIMEOUT`	`REQUEST_TIMEOUT`	`60`	Request timeout in seconds
`UIAUTO_REPORT_DIR`	—	—	Report output directory. When set, reports are written directly here instead of `task_xxx/` subdirectories
`OPENROUTER_SITE_URL`	—	—	OpenRouter site URL (request tracking)
`OPENROUTER_SITE_NAME`	—	—	OpenRouter site name (request tracking)
`SESSION_ID`	—	auto-generated UUID	Session ID for request tracking
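One way to read the table is as a lookup order: the prefixed variable first, then the legacy name, then the default. The sketch below illustrates that reading. resolve_setting and resolve_models are hypothetical helpers, and both the precedence of the prefixed name and the treatment of empty strings as unset are assumptions rather than documented behaviour.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    default: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return UIAUTO_<name> if set, else the legacy <name>, else the default."""
    env = os.environ if env is None else env
    for key in (f"UIAUTO_{name}", name):
        value = env.get(key)
        if value:  # assumption: an empty string counts as unset
            return value
    return default


def resolve_models(env: Mapping[str, str]) -> dict[str, Optional[str]]:
    """Vision and text lists fall back to the default model list when unset."""
    default = resolve_setting("MODEL_NAME", "doubao-seed-2.0-pro", env)
    return {
        "default": default,
        "vision": resolve_setting("MODEL_VISION", default, env),
        "text": resolve_setting("MODEL_TEXT", default, env),
    }
```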

Quick Start

# AI autonomous task execution
uv run uiautoagent -m ai -t "Change nickname to kitty"

# Target an iOS device
uv run uiautoagent -m ai -t "Change nickname to kitty" -p ios

# Provide task context for higher success rate
uv run uiautoagent -m ai -t "Change nickname to kitty" -cf knowledge.txt

# Extract image content (structured JSON)
uv run uiautoagent -m extract -i screenshot.png -q "Extract all product prices"

# Extract with output format hint
uv run uiautoagent -m extract -i screenshot.png -q "Extract product info" --example '{"name":"Product","price":0}'

# Other modes
uv run uiautoagent -m find    # Find and click
uv run uiautoagent -m manual  # Manual control

Task Context

Pass background information with --context-file (-cf) to read it from a text file, or with --context (-c) to supply it inline. The extra context helps the AI locate elements and plan actions more accurately.

Example knowledge:

Path to change WeChat nickname: tap "Me" at bottom → tap avatar area → tap "Nickname" → edit and tap "Save"
The settings button is in the top-right corner, a gear icon

Useful when:

You know the specific path and want the AI to follow it directly
The app UI is complex and needs element location hints
The task requires domain-specific knowledge (e.g. special app behaviors)
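When tasks are launched from a script, the knowledge file and the command line can be prepared programmatically. The sketch below uses only the flags shown in this README (-m, -t, -p, -cf). write_context_file and build_command are hypothetical helpers, not part of the package.

```python
from pathlib import Path
from typing import Optional


def write_context_file(lines: list[str], directory: Path) -> Path:
    """Write one hint per line to knowledge.txt and return its path."""
    path = directory / "knowledge.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def build_command(
    task: str,
    context_file: Optional[Path] = None,
    platform: Optional[str] = None,
) -> list[str]:
    """Build the argument list for an AI-mode run."""
    cmd = ["uv", "run", "uiautoagent", "-m", "ai", "-t", task]
    if platform:
        cmd += ["-p", platform]
    if context_file:
        cmd += ["-cf", str(context_file)]
    return cmd
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems with task strings that contain spaces.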

All configured model candidates are checked at startup; at least one must be available per scenario:

🔍 Checking model availability (4 candidates)...
  ✅ 'glm-4.6v' [default #1]
  ❌ 'doubao-seed-2.0-pro' [vision #1]
  ✅ 'glm-5v-turbo' [vision #2]
  ✅ 'gpt-4o-mini' [text #1]
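The check above could be approximated as shown below. This is a sketch under stated assumptions, not the package's code: check_availability and format_report are hypothetical names, and probe stands in for whatever lightweight request decides whether a model responds.

```python
from typing import Callable, Mapping, Sequence

Report = dict[str, list[tuple[str, bool]]]


def check_availability(
    scenarios: Mapping[str, Sequence[str]],
    probe: Callable[[str], bool],
) -> Report:
    """Probe every candidate and require at least one available model per scenario."""
    report: Report = {}
    for scenario, candidates in scenarios.items():
        results = [(model, probe(model)) for model in candidates]
        if not any(ok for _, ok in results):
            raise RuntimeError(f"no available model for scenario '{scenario}'")
        report[scenario] = results
    return report


def format_report(report: Report) -> list[str]:
    """Render one line per candidate, numbered within its scenario."""
    lines: list[str] = []
    for scenario, results in report.items():
        for index, (model, ok) in enumerate(results, start=1):
            mark = "✅" if ok else "❌"
            lines.append(f"{mark} '{model}' [{scenario} #{index}]")
    return lines
```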

Task Reports

After each task execution, the following are generated under uiautoagent_reports/task_xxx/:

File	Description
`report.html`	Visual HTML report with annotated screenshots, raw AI responses, token usage, and timing
`history.json`	Full step-by-step record (with token stats)
`log.txt`	Real-time step log (appended after each step, human-readable text)
`summary.txt`	Text summary
`screenshots/`	Original screenshots
`annotated/`	Screenshots annotated with tap locations and bounding boxes
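A script that post-processes reports may want to confirm that a run produced everything listed in the table. The helper below is a hypothetical sketch that relies only on the file and directory names given above. It makes no assumption about the contents of any file.

```python
from pathlib import Path

EXPECTED_FILES = ["report.html", "history.json", "log.txt", "summary.txt"]
EXPECTED_DIRS = ["screenshots", "annotated"]


def missing_report_entries(report_dir: Path) -> list[str]:
    """Return the expected files and directories that are absent from report_dir."""
    missing = [name for name in EXPECTED_FILES if not (report_dir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (report_dir / name).is_dir()]
    return missing
```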

Screenshot Similarity Feedback

The system compares the screenshots taken before and after each action, computes a similarity score between 0 and 1 (1 = identical), and feeds the score back to the AI. The bands below express the score as a percentage:

Similarity > 95%: Almost no change; the AI may conclude the action had no effect
Similarity 85%–95%: Minor change
Similarity 70%–85%: Notable change; the action likely took effect
Similarity < 70%: Major change

This helps the AI judge whether taps, swipes, etc. actually worked, informing its next move.
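The bands can be expressed as a small classifier. The sketch below is illustrative only. The README does not say how the score is computed, so pixel_similarity is a deliberately naive stand-in (the fraction of identical values), and the handling of the exact boundaries 0.95, 0.85 and 0.70 is an assumption.

```python
from typing import Sequence


def pixel_similarity(before: Sequence[int], after: Sequence[int]) -> float:
    """Naive stand-in metric: the fraction of positions holding identical values."""
    if len(before) != len(after) or not before:
        raise ValueError("inputs must be non-empty and of equal length")
    same = sum(1 for a, b in zip(before, after) if a == b)
    return same / len(before)


def describe_similarity(score: float) -> str:
    """Map a 0-1 similarity score to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > 0.95:
        return "almost no change"
    if score >= 0.85:  # assumption: boundary values fall into the lower band
        return "minor change"
    if score >= 0.70:
        return "notable change"
    return "major change"
```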

Python API

AI Autonomous Task Execution

from uiautoagent import run_ai_task

# Simplest usage — AI completes the task autonomously
result = run_ai_task("Change nickname to kitty")
if result.success:
    print(f"Task completed: {result.result}")
else:
    print(f"Task failed: {result.result}")

# Provide task context for higher success rate
result = run_ai_task(
    "Change nickname to kitty",
    context="WeChat path: tap 'Me' at bottom → tap avatar → tap 'Nickname' → edit → tap 'Save'",
)

# For observation tasks (e.g. "how many friends do I have")
result = run_ai_task("Check how many friends")
if result.success:
    print(f"Friend count: {result.result}")  # e.g. "5 friends"

Image Content Extraction

from uiautoagent import extract_content, ExtractionResult

# Free-form extraction — AI decides the JSON structure
result = extract_content("screenshot.png", "Extract all pricing info")
if result.success:
    print(result.content)  # dict or list

# Typed extraction — AI outputs in the given JSON format
result = extract_content(
    "screenshot.png",
    query="Extract product info",
    example={"name": "Product", "price": 0},
)
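Because the model produces the JSON, it can be worth checking the extracted content against the example before using it. The helper below is a hypothetical sketch, not part of the package. It accepts a dict or a list of dicts and treats int and float as interchangeable, since an example price of 0 may come back as 19.9.

```python
from typing import Any


def _same_kind(value: Any, sample: Any) -> bool:
    """Compare types loosely: any number matches a numeric sample, but bool stays separate."""
    if isinstance(sample, bool):
        return isinstance(value, bool)
    if isinstance(sample, (int, float)):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, type(sample))


def matches_example(content: Any, example: dict[str, Any]) -> bool:
    """True if every item has all of the example's keys with compatible value types."""
    items = content if isinstance(content, list) else [content]
    for item in items:
        if not isinstance(item, dict):
            return False
        for key, sample in example.items():
            if key not in item or not _same_kind(item[key], sample):
                return False
    return True
```

Note that an empty list passes this check, so callers that require at least one item should test for that separately.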

Element Detection

from uiautoagent import detect_element, draw_bbox

# Detect an element
result = detect_element("screenshot.png", "Login button")
if result.found:
    print(f"Position: {result.bbox}")
    draw_bbox("screenshot.png", result, "result.png")
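A detected bounding box is typically turned into a tap point by taking its center. The sketch below assumes the box is an (x1, y1, x2, y2) tuple in pixels. The README does not document the layout of result.bbox, so check the actual format before relying on this. bbox_center is a hypothetical helper.

```python
def bbox_center(bbox: tuple[int, int, int, int]) -> tuple[int, int]:
    """Return the integer center of an (x1, y1, x2, y2) box (assumed layout)."""
    x1, y1, x2, y2 = bbox
    if x2 < x1 or y2 < y1:
        raise ValueError("expected x1 <= x2 and y1 <= y2")
    return (x1 + x2) // 2, (y1 + y2) // 2
```

The resulting pair could then be passed to a controller's tap method.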

Device Control

from uiautoagent import AndroidController, IOSController, SwipeDirection

# Android device control
controller = AndroidController()
controller.tap(500, 1000)
controller.long_press(500, 1000, duration_ms=1200)
controller.swipe_direction(SwipeDirection.UP)
controller.input_text("hello")
controller.back()
controller.app_launch("com.tencent.mm")  # Launch WeChat
controller.app_stop("com.tencent.mm")    # Stop WeChat
controller.app_reboot("com.tencent.mm")  # Restart WeChat

# iOS device control
controller = IOSController()  # Auto-detects USB device
controller.tap(500, 1000)
controller.long_press(500, 1000, duration_ms=1200)
controller.swipe_direction(SwipeDirection.UP)
controller.input_text("hello")
controller.home()
controller.app_launch("com.tencent.xin")  # Launch WeChat
controller.app_stop("com.tencent.xin")    # Stop WeChat
controller.app_reboot("com.tencent.xin")  # Restart WeChat
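Since both controllers expose the same method names in the examples above, automation steps can be written against a shared interface and exercised without a device. The sketch below is an assumption-laden illustration: the Controller protocol and RecordingController are hypothetical, and the return types of the real methods are not documented here.

```python
from typing import Protocol


class Controller(Protocol):
    """Subset of the methods shown above, as a structural type."""

    def tap(self, x: int, y: int) -> None: ...
    def input_text(self, text: str) -> None: ...
    def app_launch(self, app_id: str) -> None: ...


class RecordingController:
    """Device-free stand-in that records calls instead of touching hardware."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def tap(self, x: int, y: int) -> None:
        self.calls.append(("tap", x, y))

    def input_text(self, text: str) -> None:
        self.calls.append(("input_text", text))

    def app_launch(self, app_id: str) -> None:
        self.calls.append(("app_launch", app_id))


def open_and_type(controller: Controller, app_id: str, x: int, y: int, text: str) -> None:
    """Launch an app, tap a field, and type into it."""
    controller.app_launch(app_id)
    controller.tap(x, y)
    controller.input_text(text)
```

The app identifier still differs per platform, as in the examples above (com.tencent.mm on Android, com.tencent.xin on iOS).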

Direct AI Calls

from uiautoagent import Category, chat_completion

response = chat_completion(
    category=Category.TEXT,
    messages=[{"role": "user", "content": "Summarize this text"}],
    max_tokens=500,
)
content = response.choices[0].message.content

# When model is not explicitly passed, candidates for the category are tried in order

# Vision scenario (a real request should carry an image in the message content; only the text part is shown here)
vision_response = chat_completion(
    category=Category.VISION,
    messages=[{"role": "user", "content": "Analyze this image"}],
)
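For vision requests, OpenAI-compatible APIs generally accept an image as a base64 data URL inside the message content. The sketch below builds such a message with the standard library. build_image_message is a hypothetical helper, and whether chat_completion forwards messages unchanged to the API is an assumption to verify.

```python
import base64
from pathlib import Path


def build_image_message(image_path: str, prompt: str, mime: str = "image/png") -> dict:
    """Build a user message with a text part and an inline base64 image part."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}},
        ],
    }
```

If that assumption holds, the returned dict would be passed as the single entry of the messages list in a Category.VISION call.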

Token Statistics

from uiautoagent import TokenTracker

stats = TokenTracker.get_stats()
for category, stat in stats.items():
    print(f"{category}: {stat.total} tokens")

total = TokenTracker.get_total()
print(f"Total: {total.total} tokens")

AI-powered visual detection precisely identifies UI elements on screen:

[Image: original screenshot]

[Image: detection result for the query "close button"]

Requirements

Python 3.10+
OpenAI-compatible API
- Vision scenarios (VISION) require a vision-capable model
- Text scenarios (TEXT) work with any chat model
Android requires ADB
iOS requires WebDriverAgent and wdapy; device listing requires idevice_id (libimobiledevice) or tidevice

Reference

Google Paper: Repeated Prompters Improve Accuracy https://arxiv.org/pdf/2512.14982

License

LICENSE

Release history

0.2.0

Jun 1, 2026

This version

0.1.2

Jun 1, 2026
