Pixel-level browser automation MCP server for WSL2 — drive a real Chrome with screenshot + xdotool, no CDP.
hermes-computer-use
Scope: Windows 11 + WSL2 Ubuntu 22.04 / 24.04 only. This project intentionally limits its support matrix — native Linux / macOS / Windows are not targets. See docs/WSL_SETUP.md for why and for the full setup walkthrough.
Pixel-level browser automation MCP server. Gives any MCP-speaking agent (hermes-agent, Claude Code, Codex, …) 21 tools to drive a real Chrome browser running in an Xvfb display: screenshots as vision input, OS-level mouse/keyboard as output. No CDP. No navigator.webdriver. No DOM shortcuts.
Think of it as the Linux-side reproduction of Anthropic's computer-use-demo — but exposed over stdio MCP so you can pair it with any agent runtime and any vision-capable model.
    agent ── stdio MCP ──▶ hermes_computer_use.server ── subprocess ──▶ xdotool / scrot
                                                                               │
                                                                               ▼
                                                                           Xvfb :99
                                                                               │
                                                               ┌───────────────┴────────────────┐
                                                               ▼                                ▼
                                                         x11vnc :5900               websockify + noVNC :6080
                                                     (native VNC clients)               (browser viewer)
See docs/ARCHITECTURE.md for the longer version.
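The "server ── subprocess ──▶ xdotool / scrot" hop in the diagram can be pictured with a small sketch. This is not the project's actual code: the helper names (`mousemove_cmd`, `click_cmd`, `type_cmd`, `screenshot_cmd`, `run_on_display`) are hypothetical, and only the `xdotool` / `scrot` subcommands themselves are standard. It shows how a tool call might be turned into an argument list and run against the Xvfb display.

```python
import os
import subprocess
from typing import List


def mousemove_cmd(x: int, y: int) -> List[str]:
    """Argument list that moves the X pointer to absolute (x, y)."""
    return ["xdotool", "mousemove", str(x), str(y)]


def click_cmd(button: int = 1) -> List[str]:
    """Argument list for a click; 1 = left, 2 = middle, 3 = right."""
    return ["xdotool", "click", str(button)]


def type_cmd(text: str, delay_ms: int = 25) -> List[str]:
    """Argument list that types text with a per-keystroke delay.

    The "--" keeps text that starts with "-" from being parsed as an option.
    """
    return ["xdotool", "type", "--delay", str(delay_ms), "--", text]


def screenshot_cmd(path: str) -> List[str]:
    """Argument list that captures the whole screen to a file."""
    return ["scrot", path]


def run_on_display(cmd: List[str], display: str = ":99") -> subprocess.CompletedProcess:
    """Run a command with DISPLAY pointed at the virtual screen (assumed :99)."""
    env = dict(os.environ, DISPLAY=display)
    return subprocess.run(cmd, env=env, check=True, capture_output=True)
```

Building argument lists rather than shell strings avoids quoting problems when the agent types arbitrary text. For example, `run_on_display(mousemove_cmd(400, 300))` followed by `run_on_display(click_cmd())` would perform a left click, assuming Xvfb is up on `:99`.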
Why
| | Playwright / CDP | hermes-computer-use |
|---|---|---|
| navigator.webdriver | true (detectable) | undefined |
| CDP endpoint | open | none |
| DOM access | direct (fast, brittle to markup changes) | screenshot only (slower, resilient to selector renames) |
| Anti-bot footprint | large, constantly patched | near-zero: stock Chrome, stock X11 input |
| Best for | reliable flows on sites you own | agents operating unfamiliar sites like a human |
If your automation has to walk a login funnel on a site with Cloudflare, Kasada, or reCAPTCHA sprinkled on it, this stack usually passes where Playwright gets stopped — because the browser is indistinguishable from a stock Chrome driven by a stock X server.
Install
Prerequisites (Windows host): Windows 11, WSL2 with an Ubuntu 22.04 or 24.04 distro, and systemd enabled in WSL. Full walkthrough in docs/WSL_SETUP.md.
Everything below runs inside the WSL shell, not in PowerShell.
git clone https://github.com/Noah3521/hermes-computer-use.git ~/hermes-computer-use
cd ~/hermes-computer-use
# 1. System packages (sudo): Xvfb, fluxbox, x11vnc, xdotool, ydotool, scrot,
# ImageMagick, CJK fonts, Google Chrome, plus uinput if available.
bash scripts/setup.sh
# 2. Python package
python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[novnc]" # omit [novnc] if you don't want the web viewer
# 3. Optional browser-based observer at http://localhost:6080/vnc.html
bash scripts/install-novnc.sh
# 4. Persistent services
mkdir -p ~/.config/systemd/user
cp systemd/computer-use.service.example ~/.config/systemd/user/computer-use.service
cp systemd/novnc.service.example ~/.config/systemd/user/novnc.service
sudo loginctl enable-linger "$USER"
systemctl --user daemon-reload
systemctl --user enable --now computer-use.service novnc.service
Smoke test:
python examples/smoke_test.py
Wire to hermes-agent
Paste config/hermes.yaml.example into your ~/.hermes/config.yaml under mcp_servers:, then run hermes gateway run --replace. The model gets the full tool surface as soon as the gateway restarts.
The same config shape works for any stdio-MCP client (Claude Code, mcp-inspector, custom runners).
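For orientation, here is a rough sketch of what a stdio MCP server entry tends to look like. Everything specific in it is an assumption: the entry name `computer-use`, the `command` / `args` / `env` keys (a common convention among stdio MCP clients), and launching the server with `python -m hermes_computer_use.server` (the module name is taken from the architecture diagram). config/hermes.yaml.example remains the authoritative version.

```python
import json
from typing import Dict


def stdio_server_entry(python_bin: str, display: str = "99") -> Dict[str, object]:
    """Build a hypothetical stdio server entry: a command, its args, and env vars."""
    return {
        "command": python_bin,
        "args": ["-m", "hermes_computer_use.server"],  # assumed launch command
        "env": {"CU_DISPLAY": display},
    }


# "computer-use" is an illustrative entry name, not one mandated by the project.
config = {"mcp_servers": {"computer-use": stdio_server_entry(".venv/bin/python")}}

if __name__ == "__main__":
    print(json.dumps(config, indent=2))
```

Clients differ in the top-level key and file format (YAML vs JSON), so treat the dictionary as a shape to translate, not something to paste verbatim.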
Tools
| Category | Tools |
|---|---|
| Status | screen_info, cursor_position |
| Capture | screenshot (base64 PNG) |
| Pointer | move, left_click, right_click, double_click, middle_click, drag, scroll |
| Keyboard | type_text, press_key, hold_key |
| Timing | wait |
| Browser | open_url, new_tab, close_tab, back, forward, reload |
| Escape hatch | run_shell |
Full signatures live in src/hermes_computer_use/server.py and are discoverable via MCP tools/list.
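As a sketch of what discovery looks like on the wire: MCP speaks JSON-RPC 2.0, and over stdio each message is a single line of JSON. The helpers below build a `tools/list` request and a `tools/call` request. The helper names are hypothetical, and the `url` argument for `open_url` is an assumed parameter name; check `tools/list` output or server.py for the real schema. A real client must also complete the `initialize` handshake before sending either request.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    msg: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def tools_list() -> Dict[str, Any]:
    """Request the server's tool catalogue."""
    return jsonrpc_request("tools/list")


def tools_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Request invocation of one tool with keyword-style arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


def encode_line(msg: Dict[str, Any]) -> bytes:
    """Serialize one message as a newline-terminated line for stdio transport."""
    return (json.dumps(msg) + "\n").encode("utf-8")


if __name__ == "__main__":
    print(encode_line(tools_list()))
    # "url" is an assumed argument name for open_url.
    print(encode_line(tools_call("open_url", {"url": "about:blank"})))
```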
Demo prompts
examples/demo_prompts.md ships ten graduated prompts from a 5-second sanity check to a 5-hop Google → external site → SSO-login flow that passes without captchas. Open the noVNC tab while running them — watching the pointer interpolate through Google's search box is surprisingly compelling.
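The pointer interpolation mentioned above can be approximated with a short sketch. It assumes plain linear interpolation between the start and end points; the server's actual path (easing, jitter, timing) may well differ, and `interpolate` is a hypothetical name. The default of 18 steps mirrors the documented `CU_MOVE_STEPS` default.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def interpolate(start: Point, end: Point, steps: int = 18) -> List[Point]:
    """Return `steps` evenly spaced points from just after `start` up to `end`."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    (x0, y0), (x1, y1) = start, end
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]


if __name__ == "__main__":
    for x, y in interpolate((0, 0), (180, 90)):
        print(x, y)  # each point would become one pointer move
```

Emitting many small moves instead of one jump is what makes the pointer visibly travel across the screen in the noVNC viewer.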
Configuration
All runtime behaviour is controlled by env vars. Sensible defaults everywhere.
| Var | Default | Meaning |
|---|---|---|
| CU_DISPLAY | 99 | X display number |
| CU_WIDTH / CU_HEIGHT | 1440 / 900 | Virtual screen size |
| CU_VNC_PORT | 5900 | x11vnc listen port |
| CU_STATE_DIR | /tmp/hermes-computer-use | Logs, PID files |
| CU_PROFILE_DIR | $CU_STATE_DIR/chrome-profile | Persistent Chrome profile (cookies survive restarts) |
| CU_START_URL | about:blank | First URL Chrome opens |
| CU_INPUT | xdotool | Set to ydotool for kernel /dev/uinput input |
| CU_KEY_DELAY_MS | 25 | Inter-keystroke delay |
| CU_MOVE_STEPS | 18 | Interpolation steps for move(human=True) and drag |
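To make the defaults concrete, here is a minimal sketch of reading these variables in Python. The `Settings` class and `load_settings` function are hypothetical and not the project's actual loader; only the variable names and default values come from the table. Note how the `CU_PROFILE_DIR` default is derived from `CU_STATE_DIR`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    display: int
    width: int
    height: int
    vnc_port: int
    state_dir: str
    profile_dir: str
    start_url: str
    input_backend: str
    key_delay_ms: int
    move_steps: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read CU_* variables from `env` (defaults to os.environ), applying table defaults."""
    env = os.environ if env is None else env
    state_dir = env.get("CU_STATE_DIR", "/tmp/hermes-computer-use")
    return Settings(
        display=int(env.get("CU_DISPLAY", "99")),
        width=int(env.get("CU_WIDTH", "1440")),
        height=int(env.get("CU_HEIGHT", "900")),
        vnc_port=int(env.get("CU_VNC_PORT", "5900")),
        state_dir=state_dir,
        # The profile directory follows the state directory unless overridden.
        profile_dir=env.get("CU_PROFILE_DIR", f"{state_dir}/chrome-profile"),
        start_url=env.get("CU_START_URL", "about:blank"),
        input_backend=env.get("CU_INPUT", "xdotool"),
        key_delay_ms=int(env.get("CU_KEY_DELAY_MS", "25")),
        move_steps=int(env.get("CU_MOVE_STEPS", "18")),
    )
```

Passing a plain dict, as in `load_settings({"CU_WIDTH": "1920"})`, makes it easy to check how an override interacts with the defaults without touching the real environment.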
Troubleshooting
See docs/TROUBLESHOOTING.md. The usual suspects:
- scrot: Can't open X display → Xvfb died. Run systemctl --user restart computer-use.service.
- Chrome immediately exits → sandbox / dev-shm issue. The scripts/display.sh launcher already sets the right flags; if you hand-roll, copy from there.
- Stack dies on logout → sudo loginctl enable-linger $USER.
- Google flags "unusual traffic" → IP reputation, not behavioural. Use a residential proxy or prewarm with a manual login via VNC.
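For the first symptom, a quick way to tell whether the X server is still alive is to look for its Unix socket. The sketch below assumes Xvfb listens on the conventional `/tmp/.X11-unix/X<display>` path, which is the usual X11 default but can be changed by server flags; the function names are hypothetical.

```python
from pathlib import Path


def x_socket_path(display: int = 99, root: str = "/tmp/.X11-unix") -> Path:
    """Conventional Unix socket path for X display number `display`."""
    return Path(root) / f"X{display}"


def display_is_up(display: int = 99, root: str = "/tmp/.X11-unix") -> bool:
    """True if a socket exists at the expected path; False if it is missing."""
    return x_socket_path(display, root).is_socket()


if __name__ == "__main__":
    if display_is_up():
        print("X socket for :99 present")
    else:
        print("X socket for :99 missing; try restarting computer-use.service")
```

A present socket does not prove the server is healthy, but a missing one is a strong hint that Xvfb has exited.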
Security
This is an LLM with hands. Read docs/SECURITY.md before pointing it at anything you care about. At minimum:
- Run in an isolated WSL distro or VM — never your daily driver.
- Remove the run_shell tool if the agent does not need a shell.
- Do not persist real credentials in CU_PROFILE_DIR.
Contributing
See CONTRIBUTING.md. Scope guardrails are strict: no DOM selectors, no OCR, no anti-detection arms race. The thesis is "emit no abnormal signals" > "emit clever evasions".
License
MIT. See LICENSE.
Acknowledgements
- anthropic-quickstarts/computer-use-demo for the reference loop.
- x11vnc + noVNC for the observer pipeline.
- Model Context Protocol for making "tool surface you can point any agent at" a real thing.