
Open Computer Use Agent — framework for desktop and browser automation


opendesk

Open Computer Use Agent — gives any AI agent eyes and hands on your desktop.

opendesk runs as an MCP server. Install it, register it with your agent tool, and it adds screenshot, accessibility-based UI control, mouse, keyboard, clipboard, and OCR to every conversation — on macOS, Linux, and Windows.


Quickstart

pip install 'opendesk[core,mcp]'

Claude Code — one command:

claude mcp add opendesk -- opendesk-mcp

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "opendesk": { "command": "opendesk-mcp" }
  }
}

Cursor / Continue — same pattern, point command at opendesk-mcp.

That's it. Your agent can now say "take a screenshot", "click the Save button", or "type Hello World into TextEdit" and it will work.


What it adds to your agent

  • ui: clicks buttons, types text, and reads values by element name, with no pixel coordinates. Uses the platform's native accessibility tree (AppleScript / AT-SPI2 / UI Automation).
  • screenshot: captures the screen. With marks=True, overlays numbered boxes on every interactive element so the agent can say "click mark 3".
  • mouse: pixel-level mouse control with automatic Retina/HiDPI scaling. Last resort when ui has nothing to click.
  • keyboard: types text (full Unicode), presses keys, sends hotkeys.
  • app: opens, closes, and focuses applications.
  • clipboard: reads and writes the system clipboard.
  • ocr: extracts text from any screen region without sending the image to the LLM.

The agent follows a natural priority: ui first (no coordinates needed) → screenshot(marks=True) to see numbered elements → mouse as last resort for unlabelled canvas areas.
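The mark listings that screenshot(marks=True) returns are plain text of the shape shown in the Quick start output ([1] Button "OK" ...), so an agent-side harness can map mark numbers back to elements with a few lines of parsing. A minimal sketch; the exact line format and the parse_marks helper are illustrative assumptions, not part of the opendesk API:

```python
import re

# One mark per line, e.g. '[3] Button "Cancel"'; format assumed from the
# README's example output, not a guaranteed wire format.
MARK_RE = re.compile(r'\[(\d+)\]\s+(\w+)\s+"([^"]*)"')

def parse_marks(listing: str) -> dict[int, tuple[str, str]]:
    """Map mark number -> (role, title)."""
    return {int(n): (role, title) for n, role, title in MARK_RE.findall(listing)}

listing = '[1] Button "OK"\n[2] TextField "Search"\n[3] Button "Cancel"'
print(parse_marks(listing)[3])  # → ('Button', 'Cancel')
```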


How the MCP integration works

Claude Code / Claude Desktop / Cursor / Continue
          |
          | MCP stdio
          v
     opendesk-mcp
          |
          +-- screenshot, ui, mouse, keyboard, app, clipboard, ocr

opendesk starts as a child process, speaks the MCP protocol over stdin/stdout, and the LLM client handles all tool-calling automatically. You never write tool-calling code.
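That stdio traffic is JSON-RPC 2.0. As a rough illustration of what a client sends when it invokes a tool (the real messages are produced by the MCP client library; the screenshot/marks argument values here are made up, not a wire capture):

```python
import json

# Shape of an MCP tools/call request, framed as JSON-RPC 2.0 over stdio.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "screenshot", "arguments": {"marks": True}},
}
print(json.dumps(request))
```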


Why opendesk?

  • MCP-first — works out of the box with any MCP client, zero glue code.
  • Accessibility tree first — the ui tool interacts with apps the same way a screen reader does, without pixel coordinates or Retina scaling headaches.
  • Framework-agnostic — also ships Anthropic SDK, OpenAI, and LangChain adapters.
  • Sandboxed — per-session audit log, app allow-list, screen region constraints.
  • Extensible — one class to add a custom tool; it appears in all integrations automatically.
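The "one class" claim can be pictured with a stand-in sketch. Everything below (the Tool base class, the nested Params dataclass, the dict registry) is hypothetical scaffolding that shows the shape of the pattern, not opendesk's real API:

```python
import asyncio
from dataclasses import dataclass

# Hypothetical stand-ins: the real base class and registry live in opendesk,
# and their exact names/signatures are assumptions here.
class Tool:
    name: str = ""
    async def execute(self, ctx, params):
        raise NotImplementedError

class BeepTool(Tool):
    name = "beep"

    @dataclass
    class Params:
        times: int = 1

    async def execute(self, ctx, params) -> str:
        return "beep " * params.times

registry = {BeepTool.name: BeepTool()}  # a registry is just name -> tool here
print(asyncio.run(registry["beep"].execute(None, BeepTool.Params(times=2))))
```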

Installation

# Minimal (just the framework, no hardware deps)
pip install opendesk

# Core computer use: screen capture + mouse/keyboard
pip install 'opendesk[core]'

# With MCP server support
pip install 'opendesk[core,mcp]'

# Everything
pip install 'opendesk[all]'

System dependencies

  • macOS: Screen Recording permission (System Settings → Privacy → Screen Recording); Accessibility permission for mouse/keyboard
  • Linux: xclip for clipboard; xdotool or pyatspi for keyboard/UI
  • Windows: no extra system dependencies (uses Win32 APIs)

Quick start (Python API)

import asyncio
from opendesk import create_registry, allow_all_context

async def main():
    registry = create_registry()
    ctx = allow_all_context()

    # Take a screenshot with Set-of-Marks overlay
    screenshot = registry.get("screenshot")
    result = await screenshot.execute(ctx, screenshot.Params(marks=True))
    print(result.output)   # lists all interactive elements as [1] Button "OK" ...
    # result.attachments[0].content  -> PNG bytes

    # Click a button by name — no pixel coordinates needed
    ui = registry.get("ui")
    await ui.execute(ctx, ui.Params(action="click", app="Safari", title="Go"))

    # Type text
    kb = registry.get("keyboard")
    await kb.execute(ctx, kb.Params(action="type", text="hello world"))

asyncio.run(main())

Integrations

MCP server (Claude Desktop, Continue, Cursor, ...)

Run the MCP server over stdio:

opendesk-mcp

Add to Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "opendesk": {
      "command": "opendesk-mcp"
    }
  }
}

Or create a server in Python:

import asyncio

from mcp.server.stdio import stdio_server

from opendesk.integrations.mcp import create_mcp_server
from opendesk.registry import create_registry

async def main():
    server = create_mcp_server(create_registry())
    async with stdio_server() as (read, write):
        await server.run(read, write, server.create_initialization_options())

asyncio.run(main())

Claude Code / Anthropic SDK

import anthropic
from opendesk.integrations.claude_code import ClaudeCodeAdapter
from opendesk.registry import create_registry

client = anthropic.Anthropic()
adapter = ClaudeCodeAdapter(create_registry())

messages = [{"role": "user", "content": "Open Safari and take a screenshot"}]

# Full agentic loop (handles tool use automatically).
# run_loop is async, so call it from inside an async function / running event loop.
final_text = await adapter.run_loop(
    client,
    model="claude-opus-4-6",
    messages=messages,
    system="You are a computer use agent. Use the ui tool first, mouse as last resort.",
)
print(final_text)

Manual control:

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    tools=adapter.tool_definitions(),
    messages=messages,
)

# Dispatch all tool_use blocks in parallel
tool_results = await adapter.handle_response(response)
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": tool_results})

OpenAI function calling

Works with OpenAI, Groq, Together AI, Ollama, LiteLLM, and any OpenAI-compatible provider:

from openai import OpenAI
from opendesk.integrations.openai_compat import OpenAIAdapter
from opendesk.registry import create_registry

client = OpenAI()
adapter = OpenAIAdapter(create_registry())

messages = [{"role": "user", "content": "Take a screenshot"}]
final_text = await adapter.run_loop(client, model="gpt-4o", messages=messages)  # await inside an async function

LangChain / LangGraph

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

from opendesk.integrations.langchain_compat import as_langchain_tools
from opendesk.registry import create_registry

tools = as_langchain_tools(create_registry())
agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools)
result = agent.invoke({"messages": [("user", "Take a screenshot")]})

Tools

  • ui: click, type, and read UI elements by name via the accessibility tree. Use this first.
  • screenshot: capture the screen with Set-of-Marks overlay, cursor dot, zoom, and change detection.
  • mouse: click, scroll, and drag, with automatic Retina/HiDPI coordinate scaling.
  • keyboard: type (full Unicode), press keys, send hotkeys, hold keys.
  • app: open, close, focus, or list applications.
  • clipboard: read or write the system clipboard.
  • ocr: extract text from any screen region (pytesseract / Vision / WinRT).

Tool priority

When the agent needs to interact with a UI element:

  1. ui tool — click by element title/role, no coordinates needed. Most reliable.
  2. screenshot with marks=True — if ui doesn't find the element, get a SoM overlay showing numbered bounding boxes.
  3. mouse with image_width/image_height — last resort for unlabelled canvas areas. Always provide image dimensions for correct Retina scaling.
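The escalation above can be sketched as a fallback chain. The three callables below are stand-ins for the real tool invocations; nothing here is the opendesk API, it only shows the control flow:

```python
import asyncio

async def click(title, try_ui, try_marks, click_xy):
    """Escalate: accessibility tree -> SoM marks -> raw mouse coordinates."""
    if await try_ui(title):            # 1. ui tool: click by element title
        return "ui"
    mark = await try_marks(title)      # 2. screenshot(marks=True): numbered boxes
    if mark is not None:
        return f"mark {mark}"
    await click_xy(title)              # 3. mouse: last resort
    return "mouse"

# Stub demo: the ui tool misses, the SoM overlay finds mark 3.
async def no_ui(_): return False
async def find_mark(_): return 3
async def never(_): raise AssertionError("should not reach the mouse")

print(asyncio.run(click("Save", no_ui, find_mark, never)))  # → mark 3
```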

Architecture

opendesk/
├── tools/          # Tool definitions (base.py + one file per tool)
├── computer/       # Low-level helpers: capture, marks (SoM), OCR, sandbox
├── integrations/   # MCP, Claude Code, OpenAI, LangChain adapters
└── registry.py     # ToolRegistry + create_registry()

See docs/architecture.md for a deep dive.


Permission model

Every tool action goes through a ToolContext.check_permission() call before execution.

from opendesk.tools.base import allow_all_context, interactive_context

# Headless / autonomous — approve everything automatically
ctx = allow_all_context()

# Interactive — prompt on stdout before each action
ctx = interactive_context()

# Custom handler: integrate with your own UI or policy engine
from opendesk.tools.base import ToolContext, PermissionDeniedError  # assuming PermissionDeniedError is exported here too

async def my_handler(tool: str, argument: str, description: str) -> None:
    if "production" in description.lower():
        raise PermissionDeniedError("Refusing to act on production.")

ctx = ToolContext(session_id="my-session", permission_handler=my_handler)

Platform support

  • Screenshot: mss + Pillow on all three platforms
  • Mouse control: pyautogui on all three platforms
  • Keyboard (Unicode): pbcopy + cmd+v (macOS); xclip/xsel + ctrl+v (Linux); pyperclip + ctrl+v (Windows)
  • AX tree (ui tool): AppleScript (macOS); AT-SPI2 / xdotool (Linux); pywinauto (Windows)
  • SoM marks: AppleScript (macOS); pyatspi (Linux); pywinauto (Windows)
  • OCR: pytesseract on all platforms, plus Vision (macOS) and WinRT (Windows)
  • App open/close: open -a / AppleScript (macOS); xdg-open / pkill (Linux); start / taskkill (Windows)
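The Unicode-typing row relies on the same trick on every platform: put the text on the clipboard, then send the paste hotkey. A small dispatch sketch keyed on sys.platform-style values; the mapping mirrors the table, and paste_strategy itself is illustrative, not an opendesk function:

```python
def paste_strategy(platform: str) -> tuple[str, str]:
    """(clipboard backend, paste hotkey) per platform, mirroring the table."""
    return {
        "darwin": ("pbcopy", "cmd+v"),     # macOS CLI
        "linux": ("xclip", "ctrl+v"),      # or xsel
        "win32": ("pyperclip", "ctrl+v"),  # pyperclip is a Python library, not a CLI
    }[platform]

print(paste_strategy("darwin"))  # → ('pbcopy', 'cmd+v')
```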

License

MIT

Download files

Download the file for your platform.

Source Distribution

opendesk-0.1.0.tar.gz (55.1 kB)


Built Distribution


opendesk-0.1.0-py3-none-any.whl (63.8 kB)


File details

Details for the file opendesk-0.1.0.tar.gz.

File metadata

  • Download URL: opendesk-0.1.0.tar.gz
  • Upload date:
  • Size: 55.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.16

File hashes

Hashes for opendesk-0.1.0.tar.gz
  • SHA256: 4ffd52c427e0a677626d145537ebb4493042ccfde5c48586d3393c75014d6cca
  • MD5: fb13a3a1fee005040b235fc00efd1f48
  • BLAKE2b-256: bb32565d4fd9a80f42770ac8bcc1e6d534a85886f6b8921104442adc0a34d0ae


File details

Details for the file opendesk-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: opendesk-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 63.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.16

File hashes

Hashes for opendesk-0.1.0-py3-none-any.whl
  • SHA256: 351adc4e2ec4aefb1de7972b3ecaaae69abf0ab90295320199be73c1df8fce68
  • MD5: 6aae7c351d10597c2ae1f3545d1ed3b7
  • BLAKE2b-256: 0496c474d19bf0faf5c75ac41d4f97ab472e06d280a8219417d5957abea01243

