
Pixel-level browser automation MCP server for WSL2 — drive a real Chrome with screenshot + xdotool, no CDP.

hermes-computer-use

License: MIT · Python 3.11+ · Platform: WSL2 Ubuntu

Scope: Windows 11 + WSL2 Ubuntu 22.04 / 24.04 only. This project intentionally limits its support matrix — native Linux / macOS / Windows are not targets. See docs/WSL_SETUP.md for why and for the full setup walkthrough.

Pixel-level browser automation MCP server. Gives any MCP-speaking agent (hermes-agent, Claude Code, Codex, …) 21 tools to drive a real Chrome browser running in an Xvfb display: screenshots as vision input, OS-level mouse/keyboard as output. No CDP. No navigator.webdriver. No DOM shortcuts.

Think of it as the Linux-side reproduction of Anthropic's computer-use-demo — but exposed over stdio MCP so you can pair it with any agent runtime and any vision-capable model.

agent ── stdio MCP ──▶ hermes_computer_use.server ── subprocess ──▶ xdotool / scrot
                                                                          │
                                                                          ▼
                                                                      Xvfb :99
                                                                          │
                                                          ┌───────────────┴────────────────┐
                                                          ▼                                ▼
                                                    x11vnc :5900              websockify + noVNC :6080
                                                (native VNC clients)            (browser viewer)

See docs/ARCHITECTURE.md for the longer version.
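
As a rough illustration of the subprocess hop in the diagram, the sketch below builds the kind of argv lists a server could hand to xdotool and scrot. It is not taken from server.py: the helper names (`x_env`, `click_argv`, `screenshot_argv`) are hypothetical, and the real server may pass different flags. The commands are only printed, so the sketch runs without an X server.

```python
import os

DISPLAY = ":99"  # matches the Xvfb display shown in the diagram


def x_env(display: str = DISPLAY) -> dict[str, str]:
    """Return a copy of the current environment with DISPLAY set for Xvfb."""
    env = dict(os.environ)
    env["DISPLAY"] = display
    return env


def click_argv(x: int, y: int, button: int = 1) -> list[str]:
    """Build an xdotool command: move the pointer, then click (1 = left)."""
    return ["xdotool", "mousemove", "--sync", str(x), str(y), "click", str(button)]


def screenshot_argv(path: str) -> list[str]:
    """Build a scrot command that captures the display to a file."""
    return ["scrot", path]


if __name__ == "__main__":
    # Printed rather than executed so this works without Xvfb running.
    print(click_argv(720, 450))
    print(screenshot_argv("/tmp/shot.png"))
```

To execute one of these for real, something like `subprocess.run(click_argv(720, 450), env=x_env(), check=True)` would do, provided Xvfb is up on the chosen display.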

Why

| | Playwright / CDP | hermes-computer-use |
| --- | --- | --- |
| navigator.webdriver | true (detectable) | undefined |
| CDP endpoint | open | none |
| DOM access | direct (fast, brittle to markup changes) | screenshot only (slower, resilient to selector renames) |
| Anti-bot footprint | large, constantly patched | near-zero: stock Chrome, stock X11 input |
| Best for | reliable flows on sites you own | agents operating unfamiliar sites like a human |

If your automation has to walk a login funnel on a site with Cloudflare, Kasada, or reCAPTCHA sprinkled on it, this stack usually passes where Playwright gets stopped — because the browser is indistinguishable from a stock Chrome driven by a stock X server.

Install

Prerequisites (Windows host): Windows 11, WSL2 with an Ubuntu 22.04 or 24.04 distro, and systemd enabled in WSL. Full walkthrough in docs/WSL_SETUP.md.

Everything below runs inside the WSL shell, not in PowerShell.

git clone https://github.com/Noah3521/hermes-computer-use.git ~/hermes-computer-use
cd ~/hermes-computer-use

# 1. System packages (sudo): Xvfb, fluxbox, x11vnc, xdotool, ydotool, scrot,
#    ImageMagick, CJK fonts, Google Chrome, plus uinput if available.
bash scripts/setup.sh

# 2. Python package
python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[novnc]"       # omit [novnc] if you don't want the web viewer

# 3. Optional browser-based observer at http://localhost:6080/vnc.html
bash scripts/install-novnc.sh

# 4. Persistent services
mkdir -p ~/.config/systemd/user
cp systemd/computer-use.service.example ~/.config/systemd/user/computer-use.service
cp systemd/novnc.service.example        ~/.config/systemd/user/novnc.service
sudo loginctl enable-linger "$USER"
systemctl --user daemon-reload
systemctl --user enable --now computer-use.service novnc.service

Smoke test:

python examples/smoke_test.py
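
If the smoke test fails early, a quick preflight can show whether the system packages from step 1 are actually on PATH. The sketch below is not part of the repository. The binary list is an assumption based on the packages named in step 1, and Chrome is left out because its binary name varies between installs.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed binary names for the packages that scripts/setup.sh installs.
REQUIRED = ("Xvfb", "fluxbox", "x11vnc", "xdotool", "scrot")


def missing_binaries(
    names: Iterable[str] = REQUIRED,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names that cannot be found on PATH, in input order."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all required binaries found")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.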

Wire to hermes-agent

Paste the contents of config/hermes.yaml.example into your ~/.hermes/config.yaml under mcp_servers:, then run hermes gateway run --replace. The model immediately gets the full tool surface.

The same config shape works for any stdio-MCP client (Claude Code, mcp-inspector, custom runners).

Tools

| Category | Tools |
| --- | --- |
| Status | screen_info, cursor_position |
| Capture | screenshot (base64 PNG) |
| Pointer | move, left_click, right_click, double_click, middle_click, drag, scroll |
| Keyboard | type_text, press_key, hold_key |
| Timing | wait |
| Browser | open_url, new_tab, close_tab, back, forward, reload |
| Escape hatch | run_shell |
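
Since screenshot returns a base64 PNG, a client can sanity-check the payload without any imaging library. The sketch below assumes the base64 string has already been pulled out of the tool result (the exact result envelope is not described here) and reads the width and height from the PNG IHDR header. `png_size` is a hypothetical helper name.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(b64_png: str) -> tuple[int, int]:
    """Decode a base64 PNG and return (width, height) from its IHDR chunk."""
    data = base64.b64decode(b64_png)
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("payload is not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Comparing the result against the configured virtual screen size is a cheap way to catch a capture of the wrong display.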

Full signatures live in src/hermes_computer_use/server.py and are discoverable via MCP tools/list.
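
For readers who have not seen what tools/list looks like on the wire, here is a minimal sketch of the JSON-RPC 2.0 messages a stdio MCP client sends before it can list tools. The function names are hypothetical, and the protocol version string and client name are assumptions; a real client would normally use an MCP SDK rather than hand-built messages.

```python
import json
from typing import Optional


def jsonrpc_request(msg_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as a single line."""
    msg = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)


def jsonrpc_notification(method: str) -> str:
    """Serialize a JSON-RPC 2.0 notification (no id, so no reply expected)."""
    return json.dumps({"jsonrpc": "2.0", "method": method})


def handshake_then_list_tools(client_name: str = "example-client") -> list[str]:
    """Return the messages to write to the server's stdin, one per line."""
    return [
        jsonrpc_request(1, "initialize", {
            "protocolVersion": "2024-11-05",  # assumed; use what your client supports
            "capabilities": {},
            "clientInfo": {"name": client_name, "version": "0.0.0"},
        }),
        jsonrpc_notification("notifications/initialized"),
        jsonrpc_request(2, "tools/list"),
    ]


if __name__ == "__main__":
    for line in handshake_then_list_tools():
        print(line)
```

Each message goes out as one newline-terminated line, and the reply to the request with id 2 carries the tool names and their input schemas.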

Demo prompts

examples/demo_prompts.md ships ten graduated prompts from a 5-second sanity check to a 5-hop Google → external site → SSO-login flow that passes without captchas. Open the noVNC tab while running them — watching the pointer interpolate through Google's search box is surprisingly compelling.
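
The pointer interpolation mentioned above can be pictured as a series of intermediate positions between the current pointer location and the target. The sketch below uses plain linear interpolation with the default CU_MOVE_STEPS value of 18; this is an assumption for illustration, and the actual server may use easing or a different path shape. `interpolate` is a hypothetical helper name.

```python
def interpolate(
    start: tuple[int, int],
    end: tuple[int, int],
    steps: int = 18,
) -> list[tuple[int, int]]:
    """Return `steps` points on the straight line from start to end.

    The start point itself is excluded and the end point is always last.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(1, steps + 1):
        t = i / steps  # fraction of the path covered after step i
        points.append((round(x0 + (x1 - x0) * t), round(y0 + (y1 - y0) * t)))
    return points


if __name__ == "__main__":
    for point in interpolate((0, 0), (180, 90)):
        print(point)
```

Each point would then become one pointer move, with a short pause between moves.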

Configuration

All runtime behaviour is controlled by environment variables, each with a sensible default.

| Var | Default | Meaning |
| --- | --- | --- |
| CU_DISPLAY | 99 | X display number |
| CU_WIDTH / CU_HEIGHT | 1440 / 900 | Virtual screen size |
| CU_VNC_PORT | 5900 | x11vnc listen port |
| CU_STATE_DIR | /tmp/hermes-computer-use | Logs, PID files |
| CU_PROFILE_DIR | $CU_STATE_DIR/chrome-profile | Persistent Chrome profile (cookies survive restarts) |
| CU_START_URL | about:blank | First URL Chrome opens |
| CU_INPUT | xdotool | Set to ydotool for kernel /dev/uinput input |
| CU_KEY_DELAY_MS | 25 | Inter-keystroke delay |
| CU_MOVE_STEPS | 18 | Interpolation steps for move(human=True) and drag |
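
To make the defaults concrete, here is a minimal sketch of how these variables could be resolved in Python. It is not the project's own loader: the `Settings` class and `load_settings` function are hypothetical, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    display: int
    width: int
    height: int
    vnc_port: int
    state_dir: str
    profile_dir: str
    start_url: str
    input_backend: str
    key_delay_ms: int
    move_steps: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve each CU_* variable, falling back to the documented default."""
    state_dir = env.get("CU_STATE_DIR", "/tmp/hermes-computer-use")
    return Settings(
        display=int(env.get("CU_DISPLAY", "99")),
        width=int(env.get("CU_WIDTH", "1440")),
        height=int(env.get("CU_HEIGHT", "900")),
        vnc_port=int(env.get("CU_VNC_PORT", "5900")),
        state_dir=state_dir,
        # The profile directory follows CU_STATE_DIR unless set explicitly.
        profile_dir=env.get("CU_PROFILE_DIR", os.path.join(state_dir, "chrome-profile")),
        start_url=env.get("CU_START_URL", "about:blank"),
        input_backend=env.get("CU_INPUT", "xdotool"),
        key_delay_ms=int(env.get("CU_KEY_DELAY_MS", "25")),
        move_steps=int(env.get("CU_MOVE_STEPS", "18")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of `os.environ` makes it easy to check what a given override does, for example that moving CU_STATE_DIR also moves the default Chrome profile.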

Troubleshooting

See docs/TROUBLESHOOTING.md. The usual suspects:

  • scrot: Can't open X display → Xvfb died. systemctl --user restart computer-use.service.
  • Chrome immediately exits → sandbox / dev-shm issue. The scripts/display.sh launcher already sets the right flags; if you hand-roll, copy from there.
  • Stack dies on logout → lingering is not enabled. sudo loginctl enable-linger $USER.
  • Google flags "unusual traffic" → IP reputation, not behavioural. Use a residential proxy or prewarm with a manual login via VNC.
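
For the first item in the list above, it can help to confirm that the X display is really gone before restarting anything. The sketch below probes the conventional X11 Unix socket for display 99. It assumes Xvfb listens on /tmp/.X11-unix/X<display>, which is the usual location but is not confirmed by this README, and the helper names are hypothetical.

```python
import os
import socket


def x_socket_path(display: int, base_dir: str = "/tmp/.X11-unix") -> str:
    """Return the conventional Unix socket path for an X display number."""
    return os.path.join(base_dir, f"X{display}")


def display_is_up(
    display: int = 99,
    base_dir: str = "/tmp/.X11-unix",
    timeout: float = 1.0,
) -> bool:
    """Return True if something accepts connections on the display's socket.

    Connecting, rather than checking that the file exists, avoids being
    fooled by a stale socket left behind after the X server died.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(x_socket_path(display, base_dir))
        return True
    except OSError:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    print("display :99 is up" if display_is_up(99) else "display :99 is down")
```

A "down" result points at the Xvfb service rather than at scrot or Chrome.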

Security

This is an LLM with hands. Read docs/SECURITY.md before pointing it at anything you care about. At minimum:

  • Run in an isolated WSL distro or VM — never your daily driver.
  • Remove the run_shell tool if the agent does not need a shell.
  • Do not persist real credentials in CU_PROFILE_DIR.
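
One way to picture the run_shell advice is as a deny-list applied to the tool names before they reach the model. The sketch below filters on the client side, using the names from the Tools table. Whether the server itself offers a switch for this is not stated here (docs/SECURITY.md is the place to check), and `allowed_tools` is a hypothetical helper.

```python
# Tool names as listed in the Tools table (21 in total).
ALL_TOOLS = (
    "screen_info", "cursor_position",
    "screenshot",
    "move", "left_click", "right_click", "double_click",
    "middle_click", "drag", "scroll",
    "type_text", "press_key", "hold_key",
    "wait",
    "open_url", "new_tab", "close_tab", "back", "forward", "reload",
    "run_shell",
)


def allowed_tools(
    all_tools: tuple[str, ...] = ALL_TOOLS,
    denied: frozenset[str] = frozenset({"run_shell"}),
) -> list[str]:
    """Return the tool names that remain after dropping the denied ones."""
    return [name for name in all_tools if name not in denied]


if __name__ == "__main__":
    print(allowed_tools())
```

Client-side filtering only hides the tool from the model; removing it on the server is the stronger guarantee.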

Contributing

See CONTRIBUTING.md. Scope guardrails are strict: no DOM selectors, no OCR, no anti-detection arms race. The thesis is "emit no abnormal signals" > "emit clever evasions".

License

MIT. See LICENSE.
