Semantic Desktop Automation Framework for AI Agents via Windows UI Automation.

🐒 Tarsier

Semantic Desktop Automation Framework for AI Agents

The "Playwright" for Windows Desktop Apps.


🎯 What is Tarsier?

Tarsier is an open-source infrastructure layer designed to provide robust, deterministic interaction with Windows desktop applications.

Most "AI Computer Use" agents rely on taking screenshots, sending them to expensive vision models (like GPT-4V or Claude 3.5 Sonnet), and guessing X/Y pixel coordinates to click.

Tarsier takes a completely different approach.

Instead of screenshots, Tarsier hooks directly into the Windows UI Automation (UIA) accessibility layer. It extracts the exact structure of the application into a compact, semantic JSON tree (a "Desktop DOM") and allows interaction via semantic names and roles (e.g., "Click the Save button").

✨ Why use Tarsier over Vision Models?

  • 🚀 Zero Vision Models Needed: Completely eliminates the need for multimodal vision models.
  • 📉 Extremely Low Token Usage: An entire desktop UI JSON tree is often just a few hundred tokens, compared to the thousands of tokens required for an image.
  • 🎯 Deterministic Targeting: Elements are addressed by role and name rather than pixel position, so there are no hallucinated X/Y coordinates and no missed clicks when a window resizes or a button moves.
  • 🧠 LLM Friendly: Large Language Models are fundamentally text-processing engines. Parsing a semantic JSON tree and returning text commands is what they do best!

⚠️ Warning: Tarsier is NOT an autonomous AI agent. It has no intelligence, no reasoning, and no ability to plan. It is purely the deterministic "hands and eyes" infrastructure designed to be controlled by your LLM systems, MCP servers, or automation scripts.


💻 Supported OS

Tarsier is currently built specifically for Windows. Support for macOS and Linux accessibility trees is planned for the future.


🚀 What Tarsier CAN Do

  • Extract UI State: Recursively dump the semantic layout of an app (buttons, textboxes, tabs, menus) into LLM-friendly JSON.
  • Semantic Targeting: Query elements by their semantic properties (e.g., role="button", name="Save").
  • Semantic Actions: Perform clicks, double-clicks, and text input directly on the targeted elements.
  • Cross-App Support: Works on standard Win32 apps (Notepad) and modern UWP apps (Calculator).
  • Electron Support: Can interact with accessibility-enabled Electron apps (like VS Code).

🛑 What Tarsier CANNOT Do

  • ❌ Perform raw coordinate-based clicks (e.g., "click pixel 300x500").
  • ❌ Interact with video games or hardware-accelerated canvases that don't expose accessibility trees.
  • ❌ Solve Captchas, parse raw images, or run OCR pipelines.
  • ❌ Think for itself or plan autonomous agent workflows.

📦 Installation

You can install Tarsier directly from PyPI:

pip install tarsier-ai

Alternatively, to install from source for development:

git clone https://github.com/siddzzzz/Tarsier.git
cd Tarsier
pip install -r requirements.txt
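
Since Tarsier currently targets Windows only, a quick environment check after installation can save some debugging. The sketch below uses only the Python standard library; the helper names `installed_version` and `is_windows` are hypothetical, and the distribution name `tarsier-ai` is taken from the pip command above.

```python
import sys
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def is_windows(platform=sys.platform):
    """True when running on Windows (sys.platform is "win32" there, even on 64-bit)."""
    return platform == "win32"


print("Running on Windows:", is_windows())
print("tarsier-ai version:", installed_version("tarsier-ai"))
```

If the second line prints None, the package is not installed in the interpreter that ran the check.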

🛠️ Usage & Examples

1. Opening an App

Start by creating a Desktop session and launching an application.

from tarsier import Desktop

desktop = Desktop()
notepad = desktop.open_app("notepad.exe", window_name="Notepad")

2. Dumping the "Desktop DOM" (Output Format)

Tarsier serializes the desktop state into a semantic JSON tree. This is exactly what you should feed to your LLM agent.

ui_state_json = notepad.to_json()
print(ui_state_json)

Example JSON Output:

{
  "role": "window",
  "name": "Untitled - Notepad",
  "elements": [
    {
      "role": "document",
      "name": "Text editor"
    },
    {
      "role": "menubar",
      "name": "System"
    },
    {
      "role": "button",
      "name": "Maximize"
    }
  ]
}
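
As a rough illustration of how the dumped tree could be condensed before it goes into a prompt, the sketch below turns the example output into an indented outline. It assumes child nodes always live under an `elements` key and that nodes carry `role` and `name`, as in the sample above. The helper name `outline` is hypothetical and not part of Tarsier.

```python
import json


def outline(node, depth=0):
    """Return indented "role: name" lines for a node and all of its descendants."""
    lines = [f"{'  ' * depth}{node.get('role', '?')}: {node.get('name', '')}"]
    # Assumption: children are stored under "elements", as in the example output.
    for child in node.get("elements", []):
        lines.extend(outline(child, depth + 1))
    return lines


sample = json.loads("""
{
  "role": "window",
  "name": "Untitled - Notepad",
  "elements": [
    {"role": "document", "name": "Text editor"},
    {"role": "menubar", "name": "System"},
    {"role": "button", "name": "Maximize"}
  ]
}
""")

print("\n".join(outline(sample)))
```

In real use, `sample` would come from parsing the string returned by `notepad.to_json()` instead of a literal.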

3. Finding Elements

You can query elements much as you would with query selectors in a browser.

# Generic find by role and name
save_btn = notepad.find(role="button", name="Save")

# Convenience wrappers
my_button = notepad.button("Submit")
my_text_box = notepad.textbox("Username")
my_menu_item = notepad.menu("File")
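
The role-and-name matching idea can be sketched in plain Python over the dumped JSON. This is only an illustration on the example tree shape (children under `elements`), not how Tarsier implements `find`. The helper `find_all` is hypothetical.

```python
def find_all(node, role=None, name=None):
    """Collect every node whose role and name match; None acts as a wildcard."""
    matches = []
    role_ok = role is None or node.get("role") == role
    name_ok = name is None or node.get("name") == name
    if role_ok and name_ok:
        matches.append(node)
    # Assumption: children are stored under "elements", as in the example output.
    for child in node.get("elements", []):
        matches.extend(find_all(child, role, name))
    return matches


tree = {
    "role": "window",
    "name": "Untitled - Notepad",
    "elements": [
        {"role": "document", "name": "Text editor"},
        {"role": "menubar", "name": "System"},
        {"role": "button", "name": "Maximize"},
    ],
}

buttons = find_all(tree, role="button")
print([b["name"] for b in buttons])
```

A check like this can confirm that an element an LLM asked for actually appears in the dump before any action is attempted.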

4. Semantic Interaction

Once you have an element, you can interact with it deterministically. No coordinates required!

# Click a button
notepad.button("Save").click()

# Double click
notepad.button("Folder").double_click()

# Type into a textbox instantly (text is injected via the clipboard to avoid keystroke timing races)
editor = notepad.textbox()
editor.type("Hello from Tarsier!")

# Focus a specific element so that subsequent keystrokes land in it
editor.focus()
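
When an LLM drives these calls, its text output has to be mapped onto them somehow. The sketch below shows one possible dispatcher. The JSON command schema (`action`, `role`, `name`, `text`) and the function `execute_command` are hypothetical; only `find`, `click`, `double_click`, and `type` come from the examples above, and it is assumed that the object returned by `find` supports all three actions.

```python
import json

# Actions this sketch is willing to forward; anything else is rejected.
ALLOWED_ACTIONS = {"click", "double_click", "type"}


def execute_command(window, raw_command):
    """Parse a JSON command string and run it against a window-like object.

    `window` only needs a find(role=..., name=...) method, so an opened
    Tarsier app window or a test double can both be passed in.
    """
    command = json.loads(raw_command)
    action = command["action"]
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    element = window.find(role=command["role"], name=command["name"])
    if action == "type":
        element.type(command["text"])
    else:
        # click and double_click take no arguments
        getattr(element, action)()
    return action


# Example call, assuming `notepad` was opened as shown earlier:
# execute_command(notepad, '{"action": "click", "role": "button", "name": "Save"}')
```

Restricting the dispatcher to an explicit allow-list keeps a malformed or unexpected model response from calling arbitrary attributes on the element.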

🤖 AI Agent Integration (MCP)

Tarsier comes with a built-in Model Context Protocol (MCP) server! This means you can plug Tarsier directly into AI agents like Claude Desktop or Cursor to let them autonomously control your Windows desktop using the semantic tools.

Available MCP Tools:

  • desktop_open_app: Launches or attaches to a window.
  • desktop_get_ui: Dumps the JSON DOM for the AI to "see" the screen.
  • desktop_click: Semantically clicks an element.
  • desktop_type: Types text into an element.
  • desktop_read_text: Reads the internal text of a document or textbox.
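
For orientation, MCP clients invoke tools like these by sending a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request in Python. The tool name is taken from the list above, but the argument names (`role`, `name`) are assumptions, since the tools' input schemas are not documented here. The helper `tool_call_request` is hypothetical.

```python
import json
from itertools import count

# JSON-RPC requests need a unique id per call; a simple counter is enough here.
_request_ids = count(1)


def tool_call_request(tool, arguments):
    """Build an MCP "tools/call" request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Assumption: desktop_click accepts "role" and "name" arguments.
request = tool_call_request("desktop_click", {"role": "button", "name": "Save"})
print(json.dumps(request, indent=2))
```

In practice a client such as Claude Desktop constructs and sends these messages for you; the sketch only shows what travels over the wire.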

Claude Desktop Integration:

Simply add Tarsier to your claude_desktop_config.json:

{
  "mcpServers": {
    "tarsier": {
      "command": "tarsier-mcp"
    }
  }
}

(Note: Ensure the Python environment where you installed Tarsier is on your system PATH so that the tarsier-mcp command can be found.)
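
If the config file already lists other MCP servers, the Tarsier entry should be merged in rather than pasted over them. The following sketch does that on a plain dict and also checks whether `tarsier-mcp` can be found on PATH. The `other` server entry and the helper name `add_tarsier_server` are hypothetical, and reading or writing the actual file is left out because its location depends on your setup.

```python
import json
import shutil


def add_tarsier_server(config):
    """Return a copy of a Claude Desktop config dict with the tarsier entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    servers["tarsier"] = {"command": "tarsier-mcp"}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
print(json.dumps(add_tarsier_server(existing), indent=2))

# shutil.which returns None when the executable is not on PATH.
if shutil.which("tarsier-mcp") is None:
    print("tarsier-mcp was not found on PATH")
```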


🎮 Included Demos

Check out the examples/ directory for full working implementations:

  • notepad_demo.py: Opens Notepad, writes text, saves the file semantically.
  • calculator_demo.py: Operates the modern Windows Calculator app using pure semantic button queries.
  • vscode_demo.py: Opens VS Code, navigates the Windows OS file explorer dialogs, creates a workspace, writes Python code, and runs it!

Built with ❤️ for deterministic local AI.
