AI-powered UI element detection and device automation framework

UIAutoAgent

Chinese documentation

AI-driven mobile UI automation framework with visual AI element detection and autonomous task execution.

Features

  • AI-powered visual element detection, no DOM required
  • Autonomous task planning and execution
  • Task memory with learning capabilities
  • Android / iOS device support
  • Flexible model configuration (different models per scenario with fallback chain)
  • Visual HTML reports (annotated screenshots, token usage, timing)
  • AI image content extraction (structured JSON output)
  • Startup model availability check for all candidates
  • Before/after screenshot comparison (AI judges whether an action took effect)

Installation

uv sync
cp .env.example .env
# Edit .env to configure API_KEY and model

Configuration

Configure an OpenAI-compatible API in .env:

# Core config (UIAUTO_ prefix recommended; legacy variable names still supported)
UIAUTO_BASE_URL=https://api.openai.com/v1   # Any OpenAI-compatible endpoint
UIAUTO_API_KEY=sk-xxx
UIAUTO_MODEL_NAME=doubao-seed-2.0-pro,glm-4.6v  # Default model candidates, tried in order

# Optional: different models per scenario
UIAUTO_MODEL_VISION=doubao-seed-2.0-pro  # Vision model candidates (planning + detection)
UIAUTO_MODEL_TEXT=gpt-4o-mini,deepseek-chat          # Text model candidates (summarization, etc.)

# Proxy (optional)
UIAUTO_MODEL_PROXY=http://127.0.0.1:7890

# Request timeout in seconds
UIAUTO_REQUEST_TIMEOUT=60

# Report output directory (optional)
UIAUTO_REPORT_DIR=/path/to/reports   # Reports written directly here; otherwise defaults to uiautoagent_reports/task_xxx/

# OpenRouter request tracking (optional)
OPENROUTER_SITE_URL=https://yoursite.com
OPENROUTER_SITE_NAME=YourAppName
SESSION_ID=my-session-123   # Auto-generated UUID if not set

Note: Environment variables now use a UIAUTO_ prefix to avoid naming conflicts. Legacy names (e.g. BASE_URL, API_KEY) are still supported, but the prefixed versions are recommended.

Model env vars support comma-separated candidate lists; order defines the fallback sequence. When a call fails, the next candidate is tried automatically.
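The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: parse_candidates, call_with_fallback, and fake_call are hypothetical names, and the real client may distinguish error types that this sketch lumps together.

```python
from typing import Callable, Sequence


def parse_candidates(raw: str) -> list[str]:
    """Split a comma-separated env value into an ordered list of model names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def call_with_fallback(candidates: Sequence[str],
                       call: Callable[[str], str]) -> tuple[str, str]:
    """Try each candidate in order; return (model, response) from the first success."""
    errors: list[str] = []
    for model in candidates:
        try:
            return model, call(model)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all candidates failed: " + "; ".join(errors))


def fake_call(model: str) -> str:
    # Hypothetical stand-in for an API request: the first model is "down".
    if model == "doubao-seed-2.0-pro":
        raise ConnectionError("unavailable")
    return f"ok from {model}"


candidates = parse_candidates("doubao-seed-2.0-pro, glm-4.6v")
used_model, reply = call_with_fallback(candidates, fake_call)
print(used_model, reply)
```

Collecting the per-model errors before raising keeps the final failure message useful when every candidate is unavailable.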

All Environment Variables

| Variable | Legacy Fallback | Default | Description |
|---|---|---|---|
| UIAUTO_BASE_URL | BASE_URL | https://api.openai.com/v1 | OpenAI-compatible API base URL |
| UIAUTO_API_KEY | API_KEY | | API key |
| UIAUTO_MODEL_NAME | MODEL_NAME | doubao-seed-2.0-pro | Default model candidates (comma-separated, tried in order) |
| UIAUTO_MODEL_VISION | MODEL_VISION | same as MODEL_NAME | Vision model candidates (planning + detection, requires vision capability) |
| UIAUTO_MODEL_TEXT | MODEL_TEXT | same as MODEL_NAME | Text model candidates (summarization, clarification, search) |
| UIAUTO_MODEL_PROXY | MODEL_PROXY | | HTTP proxy (e.g. http://127.0.0.1:7890) |
| UIAUTO_REQUEST_TIMEOUT | REQUEST_TIMEOUT | 60 | Request timeout in seconds |
| UIAUTO_REPORT_DIR | | | Report output directory. When set, reports are written directly here instead of task_xxx/ subdirectories |
| OPENROUTER_SITE_URL | | | OpenRouter site URL (request tracking) |
| OPENROUTER_SITE_NAME | | | OpenRouter site name (request tracking) |
| SESSION_ID | | auto-generated UUID | Session ID for request tracking |
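One plausible way to resolve a setting with its legacy fallback and default is sketched below. The helper resolve_setting is hypothetical, and the precedence (prefixed name first, then the legacy name, then the default) is an assumption based on the prefixed names being the recommended ones; the package may resolve values differently.

```python
import os
from typing import Mapping, Optional


def resolve_setting(name: str, default: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return UIAUTO_<name> if set, else legacy <name>, else the default."""
    for key in (f"UIAUTO_{name}", name):
        value = env.get(key)
        if value:  # treat empty strings as unset (an assumption)
            return value
    return default


# Example environment: a legacy BASE_URL plus a prefixed model name.
env = {"BASE_URL": "https://legacy.example/v1", "UIAUTO_MODEL_NAME": "glm-4.6v"}

base_url = resolve_setting("BASE_URL", "https://api.openai.com/v1", env)
model_name = resolve_setting("MODEL_NAME", "doubao-seed-2.0-pro", env)
# Scenario-specific models default to the general model list.
vision_model = resolve_setting("MODEL_VISION", model_name, env)
timeout = int(resolve_setting("REQUEST_TIMEOUT", "60", env))
print(base_url, model_name, vision_model, timeout)
```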

Quick Start

# AI autonomous task execution
uv run uiautoagent -m ai -t "Change nickname to kitty"

# Target an iOS device
uv run uiautoagent -m ai -t "Change nickname to kitty" -p ios

# Provide task context for higher success rate
uv run uiautoagent -m ai -t "Change nickname to kitty" -cf knowledge.txt

# Extract image content (structured JSON)
uv run uiautoagent -m extract -i screenshot.png -q "Extract all product prices"

# Extract with output format hint
uv run uiautoagent -m extract -i screenshot.png -q "Extract product info" --example '{"name":"Product","price":0}'

# Other modes
uv run uiautoagent -m find    # Find and click
uv run uiautoagent -m manual  # Manual control

Task Context

Use --context-file (-cf) to specify a text file, or --context (-c) to pass text directly, providing background information to help the AI locate elements and plan actions more accurately.

Example knowledge:

Path to change WeChat nickname: tap "Me" at bottom → tap avatar area → tap "Nickname" → edit and tap "Save"
The settings button is in the top-right corner, a gear icon

Useful when:

  • You know the specific path and want the AI to follow it directly
  • The app UI is complex and needs element location hints
  • The task requires domain-specific knowledge (e.g. special app behaviors)
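When calling the Python API instead of the CLI, the two context sources can be merged by hand before passing the result as the context argument. The sketch below uses a hypothetical helper, build_context. Whether the CLI itself accepts both flags at once is not stated here, so combining them is an assumption.

```python
from pathlib import Path
from typing import Optional


def build_context(text: Optional[str] = None,
                  file: Optional[str] = None) -> Optional[str]:
    """Combine file context and inline context into one string.

    Returns None when neither source provides any text.
    """
    parts = []
    if file:
        parts.append(Path(file).read_text(encoding="utf-8").strip())
    if text:
        parts.append(text.strip())
    combined = "\n".join(part for part in parts if part)
    return combined or None


inline_only = build_context(text="The settings button is a gear icon in the top-right corner")
print(inline_only)
```

The returned string could then be passed along as run_ai_task("Change nickname to kitty", context=build_context(file="knowledge.txt")).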

All configured model candidates are checked at startup; at least one must be available per scenario:

🔍 Checking model availability (4 candidates)...
  ✅ 'glm-4.6v' [default #1]
  ❌ 'doubao-seed-2.0-pro' [vision #1]
  ✅ 'glm-5v-turbo' [vision #2]
  ✅ 'gpt-4o-mini' [text #1]
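The "at least one available per scenario" rule can be sketched as follows. The function check_candidates and the injected probe callable are hypothetical; how the package actually probes a model (for example, with a small test request) is not specified here.

```python
from typing import Callable


def check_candidates(scenarios: dict[str, list[str]],
                     probe: Callable[[str], bool]) -> dict[str, list[str]]:
    """Probe every candidate and return the available models per scenario.

    Raises RuntimeError if any scenario has no available candidate.
    """
    cache: dict[str, bool] = {}  # probe each distinct model only once
    available: dict[str, list[str]] = {}
    for scenario, models in scenarios.items():
        usable = []
        for index, model in enumerate(models, start=1):
            if model not in cache:
                cache[model] = probe(model)
            mark = "OK  " if cache[model] else "FAIL"
            print(f"  {mark} '{model}' [{scenario} #{index}]")
            if cache[model]:
                usable.append(model)
        if not usable:
            raise RuntimeError(f"no available model for scenario '{scenario}'")
        available[scenario] = usable
    return available


scenarios = {
    "default": ["glm-4.6v"],
    "vision": ["doubao-seed-2.0-pro", "glm-5v-turbo"],
    "text": ["gpt-4o-mini"],
}
# Stand-in probe: pretend one model is unreachable.
result = check_candidates(scenarios, lambda model: model != "doubao-seed-2.0-pro")
```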

Task Reports

After each task execution, the following are generated under uiautoagent_reports/task_xxx/:

| File | Description |
|---|---|
| report.html | Visual HTML report with annotated screenshots, raw AI responses, token usage, and timing |
| history.json | Full step-by-step record (with token stats) |
| log.txt | Real-time step log (appended after each step, human-readable text) |
| summary.txt | Text summary |
| screenshots/ | Original screenshots |
| annotated/ | Screenshots annotated with tap locations and bounding boxes |

Screenshot Similarity Feedback

The system compares the screenshots taken before and after each action, computes a similarity score (0–1, where 1 means identical), and feeds it back to the AI. The bands below express that score as a percentage:

  • Similarity > 95%: Almost no change; AI may conclude the action had no effect
  • Similarity 85%–95%: Minor change
  • Similarity 70%–85%: Notable change; action likely took effect
  • Similarity < 70%: Major change

This helps the AI judge whether taps, swipes, etc. actually worked, informing its next move.
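The bands above can be expressed as a small classifier. The sketch below is illustrative only: pixel_similarity is a hypothetical stand-in metric (the fraction of near-identical grayscale pixels), since the actual similarity algorithm is not described here, and the handling of the exact boundary values (95%, 85%, 70%) is an assumption.

```python
from typing import Sequence


def pixel_similarity(before: Sequence[int], after: Sequence[int],
                     tolerance: int = 8) -> float:
    """Fraction of pixels whose grayscale values differ by at most `tolerance`."""
    if len(before) != len(after) or not before:
        raise ValueError("frames must be non-empty and the same size")
    same = sum(1 for a, b in zip(before, after) if abs(a - b) <= tolerance)
    return same / len(before)


def describe_change(similarity: float) -> str:
    """Map a 0-1 similarity score to the feedback bands listed above."""
    if similarity > 0.95:
        return "almost no change"
    if similarity >= 0.85:
        return "minor change"
    if similarity >= 0.70:
        return "notable change"
    return "major change"


before = [0] * 100
after = [0] * 80 + [255] * 20  # 20 of 100 pixels changed
score = pixel_similarity(before, after)
print(f"{score:.2f}: {describe_change(score)}")
```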

Python API

AI Autonomous Task Execution

from uiautoagent import run_ai_task

# Simplest usage — AI completes the task autonomously
result = run_ai_task("Change nickname to kitty")
if result.success:
    print(f"Task completed: {result.result}")
else:
    print(f"Task failed: {result.result}")

# Provide task context for higher success rate
result = run_ai_task(
    "Change nickname to kitty",
    context="WeChat path: tap 'Me' at bottom → tap avatar → tap 'Nickname' → edit → tap 'Save'",
)

# For observation tasks (e.g. "how many friends do I have")
result = run_ai_task("Check how many friends")
if result.success:
    print(f"Friend count: {result.result}")  # e.g. "5 friends"

Image Content Extraction

from uiautoagent import extract_content, ExtractionResult

# Free-form extraction — AI decides the JSON structure
result = extract_content("screenshot.png", "Extract all pricing info")
if result.success:
    print(result.content)  # dict or list

# Typed extraction — AI outputs in the given JSON format
result = extract_content(
    "screenshot.png",
    query="Extract product info",
    example={"name": "Product", "price": 0},
)

Element Detection

from uiautoagent import detect_element, draw_bbox

# Detect an element
result = detect_element("screenshot.png", "Login button")
if result.found:
    print(f"Position: {result.bbox}")
    draw_bbox("screenshot.png", result, "result.png")
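A detected box is often turned into a tap point by taking its center. The sketch below assumes result.bbox is an (x1, y1, x2, y2) tuple in pixel coordinates; that layout is an assumption, so check the actual return type before relying on it. The helper bbox_center is hypothetical.

```python
from typing import Tuple


def bbox_center(bbox: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """Return the integer center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    if x2 < x1 or y2 < y1:
        raise ValueError("bbox corners are out of order")
    return (x1 + x2) // 2, (y1 + y2) // 2


# Example box standing in for a detection result.
x, y = bbox_center((400, 950, 600, 1050))
print(x, y)
```

The resulting coordinates could then be passed to a controller's tap(x, y) method.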

Device Control

from uiautoagent import AndroidController, IOSController, SwipeDirection

# Android device control
controller = AndroidController()
controller.tap(500, 1000)
controller.long_press(500, 1000, duration_ms=1200)
controller.swipe_direction(SwipeDirection.UP)
controller.input_text("hello")
controller.back()
controller.app_launch("com.tencent.mm")  # Launch WeChat
controller.app_stop("com.tencent.mm")    # Stop WeChat
controller.app_reboot("com.tencent.mm")  # Restart WeChat

# iOS device control
controller = IOSController()  # Auto-detects USB device
controller.tap(500, 1000)
controller.long_press(500, 1000, duration_ms=1200)
controller.swipe_direction(SwipeDirection.UP)
controller.input_text("hello")
controller.home()
controller.app_launch("com.tencent.xin")  # Launch WeChat
controller.app_stop("com.tencent.xin")    # Stop WeChat
controller.app_reboot("com.tencent.xin")  # Restart WeChat

Direct AI Calls

from uiautoagent import Category, chat_completion

response = chat_completion(
    category=Category.TEXT,
    messages=[{"role": "user", "content": "Summarize this text"}],
    max_tokens=500,
)
content = response.choices[0].message.content

# When model is not explicitly passed, candidates for the category are tried in order

# Vision scenario (a real request must include an image in the message content; omitted here for brevity)
vision_response = chat_completion(
    category=Category.VISION,
    messages=[{"role": "user", "content": "Analyze this image"}],
)

Token Statistics

from uiautoagent import TokenTracker

stats = TokenTracker.get_stats()
for category, stat in stats.items():
    print(f"{category}: {stat.total} tokens")

total = TokenTracker.get_total()
print(f"Total: {total.total} tokens")

AI-powered visual detection identifies UI elements on screen:

Original screenshot: sample.png

Detection result for the query "close button": result.png

Requirements

  • Python 3.10+
  • OpenAI-compatible API
    • Vision scenarios (VISION) require a vision-capable model
    • Text scenarios (TEXT) work with any chat model
  • Android requires ADB
  • iOS requires WebDriverAgent and wdapy; device listing requires idevice_id (libimobiledevice) or tidevice

License

LICENSE
