
Semantic Desktop Automation Framework for AI Agents via Windows UI Automation.


🐒 Tarsier

Semantic Desktop Automation Framework for AI Agents

The "Playwright" for Windows Desktop Apps.


🎯 What is Tarsier?

Tarsier is an open-source infrastructure layer designed to provide robust, deterministic interaction with Windows desktop applications.

Most "AI Computer Use" agents rely on taking screenshots, sending them to expensive vision models (like GPT-4V or Claude 3.5 Sonnet), and guessing X/Y pixel coordinates to click.

Tarsier takes a completely different approach.

Instead of screenshots, Tarsier hooks directly into the Windows UI Automation (UIA) accessibility layer. It extracts the exact structure of the application into a compact, semantic JSON tree (a "Desktop DOM") and allows interaction via semantic names and roles (e.g., "Click the Save button").

✨ Why use Tarsier over Vision Models?

  • 🚀 Zero Vision Models Needed: Completely eliminates the need for multimodal vision models.
  • 📉 Extremely Low Token Usage: An entire desktop UI JSON tree is often just a few hundred tokens, compared to the thousands of tokens required for an image.
  • 🎯 Deterministic Targeting: Elements are resolved by role and name rather than by pixel position, so there are no hallucinated X/Y coordinates and no missed clicks when a window resizes or a button moves.
  • 🧠 LLM Friendly: Large Language Models are fundamentally text-processing engines. Parsing a semantic JSON tree and returning text commands is what they do best!

⚠️ Warning: Tarsier is NOT an autonomous AI agent. It has no intelligence, no reasoning, and no ability to plan. It is purely the deterministic "hands and eyes" infrastructure designed to be controlled by your LLM systems, MCP servers, or automation scripts.


💻 Supported OS

Tarsier is currently built specifically for Windows. Support for macOS and Linux accessibility trees is planned for the future.
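
Because Tarsier currently targets Windows only, a script may want to check the platform before importing it. The following is a minimal sketch using only the standard library; the helper name `is_supported_platform` is hypothetical and not part of Tarsier.

```python
import sys


def is_supported_platform(platform: str = sys.platform) -> bool:
    """Return True only on Windows, where sys.platform is "win32"."""
    return platform == "win32"


# Report the result instead of importing tarsier unconditionally.
status = "supported" if is_supported_platform() else "not supported"
print(f"Tarsier is {status} on this platform ({sys.platform})")
```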


🚀 What Tarsier CAN Do

  • Extract UI State: Recursively dump the semantic layout of an app (buttons, textboxes, tabs, menus) into LLM-friendly JSON.
  • Semantic Targeting: Query elements by their semantic properties (e.g., role="button", name="Save").
  • Semantic Actions: Perform clicks, double-clicks, and text input directly on the targeted elements.
  • Cross-App Support: Works on standard Win32 apps (Notepad) and modern UWP apps (Calculator).
  • Electron Support: Can interact with accessibility-enabled Electron apps (like VS Code).

🛑 What Tarsier CANNOT Do

  • ❌ Perform raw coordinate-based clicking (e.g., "click pixel 300x500").
  • ❌ Interact with video games or hardware-accelerated canvases that don't expose accessibility trees.
  • ❌ Solve Captchas, parse raw images, or run OCR pipelines.
  • ❌ Think for itself or plan autonomous agent workflows.

📦 Installation

You can install Tarsier directly from PyPI:

pip install tarsier-ai

Alternatively, to install from source for development:

git clone https://github.com/siddzzzz/Tarsier.git
cd Tarsier
pip install -r requirements.txt

🛠️ Usage & Examples

1. Opening an App

Start by creating a Desktop session and launching an application.

from tarsier import Desktop

desktop = Desktop()
notepad = desktop.open_app("notepad.exe", window_name="Notepad")

2. Dumping the "Desktop DOM" (Output Format)

Tarsier serializes the desktop state into a semantic JSON tree. This is exactly what you should feed to your LLM agent.

ui_state_json = notepad.to_json()
print(ui_state_json)

Example JSON Output:

{
  "role": "window",
  "name": "Untitled - Notepad",
  "elements": [
    {
      "role": "document",
      "name": "Text editor"
    },
    {
      "role": "menubar",
      "name": "System"
    },
    {
      "role": "button",
      "name": "Maximize"
    }
  ]
}
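
If the JSON is still larger than you would like for a prompt, it can be condensed further before it reaches the LLM. The sketch below flattens the example output above into one indented line per element. It assumes that nested children, where present, use the same "elements" key as the top-level window; the function name `summarize` is illustrative and not part of Tarsier.

```python
import json

# The example output from above, as returned by notepad.to_json().
SAMPLE = """
{
  "role": "window",
  "name": "Untitled - Notepad",
  "elements": [
    {"role": "document", "name": "Text editor"},
    {"role": "menubar", "name": "System"},
    {"role": "button", "name": "Maximize"}
  ]
}
"""


def summarize(node, depth=0):
    """Yield one indented 'role: name' line per node, depth-first."""
    yield f"{'  ' * depth}{node.get('role', '?')}: {node.get('name', '')}"
    # Assumption: children are stored under "elements" at every level.
    for child in node.get("elements", []):
        yield from summarize(child, depth + 1)


tree = json.loads(SAMPLE)
lines = list(summarize(tree))
print("\n".join(lines))
```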

3. Finding Elements

You can query elements by role and name, much as you would use query selectors in a browser.

# Generic find by role and name
save_btn = notepad.find(role="button", name="Save")

# Convenience wrappers
my_button = notepad.button("Submit")
my_text_box = notepad.textbox("Username")
my_menu_item = notepad.menu("File")
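
The same role/name matching can also be applied to a dumped JSON tree, for example to check that a target proposed by an LLM actually exists before acting on it. This is a sketch over plain dictionaries, not Tarsier's own implementation of find; `find_all` is a hypothetical helper, and it assumes children live under the "elements" key.

```python
def find_all(node, role=None, name=None):
    """Return every node in the tree that matches the given role and/or name."""
    matches = []
    role_ok = role is None or node.get("role") == role
    name_ok = name is None or node.get("name") == name
    if role_ok and name_ok:
        matches.append(node)
    # Assumption: children are stored under "elements" at every level.
    for child in node.get("elements", []):
        matches.extend(find_all(child, role, name))
    return matches


tree = {
    "role": "window",
    "name": "Untitled - Notepad",
    "elements": [
        {"role": "document", "name": "Text editor"},
        {"role": "menubar", "name": "System"},
        {"role": "button", "name": "Maximize"},
    ],
}

buttons = find_all(tree, role="button")
print([b["name"] for b in buttons])
```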

4. Semantic Interaction

Once you have an element, you can interact with it deterministically. No coordinates required!

# Click a button
notepad.button("Save").click()

# Double click
notepad.button("Folder").double_click()

# Type into a textbox in one step (uses clipboard injection to avoid keystroke race conditions)
editor = notepad.textbox()
editor.type("Hello from Tarsier!")

# Focus a specific element to ensure keystrokes land properly
editor.focus()
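
When an LLM drives these calls, its output typically needs to be mapped onto them. The sketch below shows one possible dispatcher. The command format (a dict with "action", "role", "name", and optionally "text") is an assumption made for illustration, and `FakeWindow` and `FakeElement` are stand-ins that record calls so the example runs without Windows; with Tarsier you would pass the window returned by open_app instead.

```python
class FakeElement:
    """Stand-in for a Tarsier element; records calls instead of touching the UI."""

    def __init__(self, role, name, log):
        self.role, self.name, self.log = role, name, log

    def click(self):
        self.log.append(("click", self.role, self.name))

    def double_click(self):
        self.log.append(("double_click", self.role, self.name))

    def type(self, text):
        self.log.append(("type", self.role, self.name, text))


class FakeWindow:
    """Stand-in for a Tarsier window exposing find(role=..., name=...)."""

    def __init__(self):
        self.log = []

    def find(self, role, name):
        return FakeElement(role, name, self.log)


def run_command(window, command):
    """Execute one command dict, e.g. {"action": "click", "role": "button", "name": "Save"}."""
    element = window.find(role=command["role"], name=command["name"])
    action = command["action"]
    if action == "click":
        element.click()
    elif action == "double_click":
        element.double_click()
    elif action == "type":
        element.type(command["text"])
    else:
        # Reject anything outside the known action set rather than guessing.
        raise ValueError(f"unsupported action: {action!r}")


window = FakeWindow()
run_command(window, {"action": "click", "role": "button", "name": "Save"})
run_command(window, {"action": "type", "role": "document", "name": "Text editor", "text": "Hello"})
print(window.log)
```

Rejecting unknown actions keeps the layer deterministic: the LLM can only trigger operations that the dispatcher explicitly lists.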

🤖 AI Agent Integration (MCP)

Tarsier comes with a built-in Model Context Protocol (MCP) server! This means you can plug Tarsier directly into AI agents like Claude Desktop or Cursor to let them autonomously control your Windows desktop using the semantic tools.

Available MCP Tools:

  • desktop_open_app: Launch or attach to a window.
  • desktop_get_ui: Dumps the JSON DOM for the AI to "see" the screen.
  • desktop_click: Semantically clicks an element.
  • desktop_type: Types text into an element.
  • desktop_read_text: Reads the internal text of a document or textbox.
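
For orientation, an MCP client invokes a tool by sending a JSON-RPC "tools/call" request. The sketch below builds such a request for desktop_click. The envelope follows the MCP specification, but the argument names ("role", "name") are assumptions; the actual input schema is whatever the Tarsier server advertises.

```python
import json


def make_tool_call(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Argument names below are hypothetical; check the server's tool schema.
request = make_tool_call(1, "desktop_click", {"role": "button", "name": "Save"})
print(json.dumps(request, indent=2))
```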

Claude Desktop Integration:

Simply add Tarsier to your claude_desktop_config.json:

{
  "mcpServers": {
    "tarsier": {
      "command": "tarsier-mcp"
    }
  }
}

(Note: Ensure that the Python environment where you installed Tarsier is on your system PATH, so that the tarsier-mcp command can be found.)
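
If your configuration file already lists other MCP servers, the Tarsier entry should be merged in rather than replacing them. A minimal sketch of that merge follows; the helper name `add_tarsier_server` and the "other" server entry are hypothetical, and reading or writing the actual file is left out.

```python
import json


def add_tarsier_server(config):
    """Return a copy of the config with a "tarsier" entry under "mcpServers"."""
    updated = dict(config)
    # Copy the server map so the caller's dictionary is not mutated.
    servers = dict(updated.get("mcpServers", {}))
    servers["tarsier"] = {"command": "tarsier-mcp"}
    updated["mcpServers"] = servers
    return updated


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
merged = add_tarsier_server(existing)
print(json.dumps(merged, indent=2))
```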


🎮 Included Demos

Check out the examples/ directory for full working implementations:

  • notepad_demo.py: Opens Notepad, writes text, saves the file semantically.
  • calculator_demo.py: Operates the modern Windows Calculator app using pure semantic button queries.
  • vscode_demo.py: Opens VS Code, navigates the Windows OS file explorer dialogs, creates a workspace, writes Python code, and runs it!

Built with ❤️ for deterministic local AI.
