Agent-first CLI for native UI automation with Set-of-Marks screenshots.
SoMatic
SoMatic is an agent-first CLI for native desktop UI automation. It runs a local YOLO model to detect and number every interactive element in a screenshot, giving the agent a structured coordinate map it can ground actions against. Elements can be targeted by mark ID, by nearest-mark offset, or by direct pixel coordinate — no guessing.
Every command returns JSON. The public binary is somatic.
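Because every command prints JSON, a thin wrapper is usually all an agent host needs. The following is a minimal sketch, not part of SoMatic itself: the helper names `run_json_command` and `somatic` are hypothetical, and it assumes that failures are also reported as JSON on stdout, which this README implies but does not spell out.

```python
import json
import subprocess


def run_json_command(argv):
    """Run a command and parse its stdout as a single JSON document."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        # Surface stderr so a missing binary or a crash is easy to diagnose.
        raise RuntimeError(
            f"expected JSON on stdout (exit code {proc.returncode}): "
            f"{proc.stderr.strip()}"
        ) from exc


def somatic(*args):
    """Hypothetical convenience wrapper: somatic("click", "3") runs `somatic click 3`."""
    return run_json_command(["somatic", *args])
```

With the CLI installed, `somatic("doctor")` would return the parsed result as a Python dict. The keys inside that dict are not documented here, so inspect a real response before relying on any field.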
Install
npm install -g @somatic-cli/cli
The npm package launches the Python core. Python 3.10+ must be available on PATH.
During npm postinstall, SoMatic creates a package-local virtualenv at .venv/ and pip installs the local source with the [vision] extra (~30 MB). Opt-outs:
- SOMATIC_SKIP_POSTINSTALL=1: skip everything; install just the JS shim.
- SOMATIC_SKIP_PYTHON_BOOTSTRAP=1: skip the venv + pip install.
- SOMATIC_SKIP_VISION=1: install without the [vision] extra (annotated screenshots will not work; raw screenshots still do).
The [vision] extra pulls onnxruntime, numpy, and huggingface-hub. It does not pull torch or ultralytics; ultralytics is AGPL-3.0, and both are kept out of the MIT distribution. The pre-converted YOLO ONNX is downloaded at runtime via somatic vision init; see Licensing below for the AGPL implications.
For Python-only installs:
# From PyPI:
pip install 'somatic-cli[vision]' # runtime only (~30 MB)
pip install 'somatic-cli[vision,mcp]' # add the MCP server (Claude Code, Cursor, Continue)
# From the repo:
pip install -e .[vision,mcp]
To add the SoMatic skill to Claude Code, Cursor, Copilot, and 30+ other agents:
npx skills add Smyan1909/SoMatic
This installs skills/somatic/SKILL.md into your agent's skills directory. The agent will then know the full operating loop without any further prompting.
To wire in the MCP server (inline annotated screenshots):
claude mcp add somatic -- npx @somatic-cli/cli mcp serve
See docs/mcp.md for full MCP setup and the .mcp.json snippet.
Quick Start
somatic doctor
somatic vision init
somatic screenshot --annotate
somatic click 3
somatic type "hello from SoMatic"
somatic hotkey ctrl s
somatic vision stop
somatic vision init is required before --annotate works. The first invocation downloads the pre-converted OmniParser icon-detect YOLO weights as an ONNX file. Subsequent invocations reuse the cached file. The model stays resident in a background daemon until you call somatic vision stop.
Commands
- somatic screenshot [--annotate]
- somatic click <id|x,y>
- somatic click-near <id|x,y> [--dx N] [--dy N]
- somatic double-click <id|x,y>
- somatic right-click <id|x,y>
- somatic middle-click <id|x,y>
- somatic mouse-down [id|x,y]
- somatic mouse-up [id|x,y]
- somatic move <id|x,y>
- somatic drag <id|x,y>
- somatic scroll <amount> [--target <id|x,y>]
- somatic type "text"
- somatic write "text"
- somatic hotkey ctrl s
- somatic press enter
- somatic key-down shift
- somatic key-up shift
- somatic wait 1
- somatic position
- somatic size
- somatic locate image.png [--all]
- somatic center image.png
- somatic windows active|list
- somatic failsafe [--enable|--disable]
- somatic pause 0.1
- somatic doctor
- somatic bootstrap
- somatic vision init|stop|status
- somatic mcp serve (stdio MCP server; see docs/mcp.md)
- somatic skill (prints the operating-loop guidance)
- somatic headless start|stop|status|launch (Linux only; see docs/headless.md)
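Most pointer commands accept the same `<id|x,y>` target, so a caller can normalise targets in one place. This sketch shows one way to build the argument list for `click-near`; the function names are hypothetical, and treating a Python int as a mark ID and a pair as pixel coordinates is a convention chosen for this example.

```python
def format_target(target):
    """Render a target as the CLI's <id|x,y> argument.

    An int is treated as a mark ID; a 2-item sequence as pixel coordinates.
    """
    if isinstance(target, bool):
        raise TypeError("target must be an int mark ID or an (x, y) pair")
    if isinstance(target, int):
        return str(target)
    x, y = target
    return f"{int(x)},{int(y)}"


def click_near_args(target, dx=0, dy=0):
    """Build the arguments for `somatic click-near <id|x,y> [--dx N] [--dy N]`."""
    args = ["click-near", format_target(target)]
    if dx:
        args += ["--dx", str(int(dx))]
    if dy:
        args += ["--dy", str(int(dy))]
    return args
```

The resulting list can be appended to `["somatic"]` and passed to `subprocess.run`.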
Platform Notes
Windows requires an interactive desktop session. Elevated applications may require an elevated terminal or agent host.
macOS requires Accessibility and Screen Recording permissions for the terminal or host app running SoMatic.
Linux works best on X11. Wayland compositors may block screenshot or pointer control unless desktop-specific permissions are configured.
Vision
The vision daemon exposes a local HTTP API:
- GET /health: daemon status, weights path, PID
- POST /parse with { "image_path": "/absolute/path.png" }: runs YOLO ONNX inference, returns { marks: [...], provider: "yolo-onnx", inference_ms: ... }
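For callers that want to talk to the daemon directly, the request is plain JSON over HTTP. The sketch below uses only the standard library. The daemon's host and port are not stated in this README, so `base_url` is left as a required argument, and both function names are hypothetical.

```python
import json
import urllib.request


def build_parse_request(base_url, image_path):
    """Build the POST /parse request for an absolute image path.

    base_url is an assumption supplied by the caller, e.g. taken from
    the output of `somatic vision status`.
    """
    body = json.dumps({"image_path": image_path}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_screenshot(base_url, image_path, timeout=30.0):
    """Send the request and decode the JSON reply (marks, provider, inference_ms)."""
    request = build_parse_request(base_url, image_path)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload shape easy to check without a running daemon.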
Marks contain id, bbox ([x1, y1, x2, y2] in image pixels), center ([x, y]), and confidence. There are no captions or OCR text — agents act on numbered boxes and resolve coordinates via the mark id.
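The mark fields map naturally onto a small record type. This is an illustrative sketch only: the `Mark` class and `marks_at` helper are not part of SoMatic, and it assumes `id` is an integer and `confidence` a float.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    id: int
    bbox: tuple        # (x1, y1, x2, y2) in image pixels
    center: tuple      # (x, y)
    confidence: float

    @classmethod
    def from_dict(cls, raw):
        """Build a Mark from one entry of the `marks` list."""
        return cls(
            id=int(raw["id"]),
            bbox=tuple(raw["bbox"]),
            center=tuple(raw["center"]),
            confidence=float(raw["confidence"]),
        )

    def contains(self, x, y):
        """True if pixel (x, y) lies inside this mark's bounding box."""
        x1, y1, x2, y2 = self.bbox
        return x1 <= x <= x2 and y1 <= y <= y2


def marks_at(marks, x, y, min_confidence=0.0):
    """Return marks whose box contains (x, y), most confident first."""
    hits = [m for m in marks if m.contains(x, y) and m.confidence >= min_confidence]
    return sorted(hits, key=lambda m: m.confidence, reverse=True)
```

A lookup like `marks_at(marks, x, y)` is handy when an agent has a rough pixel location and wants the mark ID that covers it.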
somatic screenshot --annotate embeds the captured PNGs as base64 in the JSON response (image_b64, annotated_image_b64) so agents can pick them up in the same call. Pass --no-image to opt out.
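If the images are needed on disk, the base64 fields can be decoded with the standard library. A minimal sketch, assuming `image_b64` and `annotated_image_b64` are top-level keys of the parsed response; the function name and output file names are made up for this example.

```python
import base64
from pathlib import Path


def save_embedded_images(response, out_dir):
    """Write any embedded screenshots from the response to out_dir as PNG files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for field in ("image_b64", "annotated_image_b64"):
        payload = response.get(field)
        if not payload:  # absent, e.g. when --no-image was passed
            continue
        path = out_dir / (field.removesuffix("_b64") + ".png")
        path.write_bytes(base64.b64decode(payload))
        written[field] = path
    return written
```

The function returns a mapping from field name to the file it wrote, and skips fields that are missing.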
Tuning knobs (env vars):
- SOMATIC_YOLO_CONF (default 0.05): detection confidence threshold
- SOMATIC_YOLO_IOU (default 0.45): NMS IoU threshold
- SOMATIC_YOLO_ONNX_REPO: Hugging Face repo holding a pre-converted ONNX
- SOMATIC_YOLO_ONNX_PATH: point at a local ONNX file to skip download/conversion
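When SoMatic is launched from a script, these variables can be set per invocation rather than globally. The helper below is a hypothetical sketch that builds such an environment; the range checks are this example's own sanity checks, not documented SoMatic behaviour.

```python
import os


def vision_env(conf=None, iou=None, onnx_path=None, base=None):
    """Return a copy of the environment with SoMatic's YOLO knobs applied."""
    env = dict(os.environ if base is None else base)
    if conf is not None:
        if not 0.0 <= conf <= 1.0:
            raise ValueError("conf must be between 0 and 1")
        env["SOMATIC_YOLO_CONF"] = str(conf)
    if iou is not None:
        if not 0.0 <= iou <= 1.0:
            raise ValueError("iou must be between 0 and 1")
        env["SOMATIC_YOLO_IOU"] = str(iou)
    if onnx_path is not None:
        env["SOMATIC_YOLO_ONNX_PATH"] = os.fspath(onnx_path)
    return env
```

For example, `subprocess.run(["somatic", "vision", "init"], env=vision_env(conf=0.25))` would start the daemon with a stricter threshold. This README does not say whether the daemon reads the variables at startup or per request, so restarting it after a change is the safer assumption.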
See platform setup and release checklist.
Headless (Linux)
On Linux, SoMatic can spawn an Xvfb virtual desktop so actions run on a sandbox display rather than your real screen:
somatic headless start --launch discord --vnc
somatic screenshot --annotate # operates on the virtual desktop
somatic headless stop
See docs/headless.md for prerequisites and the full walkthrough.
Benchmarks
SoMatic's local YOLO detection is evaluated on ScreenSpot-Pro and VenusBench-GD against two baselines: raw GPT-5.5 with no detection hints, and SoMatic detection passed as text coordinate hints only (no visual overlay).
| Dataset | SoMatic+marks+GPT | SoMatic+coords+GPT | Raw GPT | Reference |
|---|---|---|---|---|
| ScreenSpot-Pro (n=200) | 68.5% | 73.0% | 52.0% | OmniParser + GPT-4o = 39.6%¹ |
| VenusBench-GD (n=171) | 70.2% | 78.4% | 59.6% | No published baseline available |
¹ From the ScreenSpot-Pro paper. Results here use GPT-5.5 and are subset-tier (n≈200); full-dataset numbers pending.
The coords arm (raw image + YOLO bounding boxes as text) edges out the visual overlay arm (marks) for top-tier VLMs — suggesting that for capable models the detection signal matters more than the annotation drawing. For weaker models and human-in-the-loop workflows, the visual overlay remains the recommended default.
Full per-platform and per-task-type breakdowns: benchmarks/results/RESULTS.md.
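To make the coords arm concrete: it passes detections to the model as text alongside the raw image. The exact prompt format used in the benchmark is not given here, so the line layout below is purely illustrative, and `marks_to_text_hints` is a hypothetical helper.

```python
def marks_to_text_hints(marks):
    """Render detections as one line per mark for inclusion in a text prompt.

    `marks` is a list of dicts with the id, bbox, and center fields
    described in the Vision section.
    """
    lines = []
    for mark in sorted(marks, key=lambda m: m["id"]):
        x1, y1, x2, y2 = mark["bbox"]
        cx, cy = mark["center"]
        lines.append(f"[{mark['id']}] bbox=({x1},{y1},{x2},{y2}) center=({cx},{cy})")
    return "\n".join(lines)
```

The returned string can be appended to the model prompt next to the unannotated screenshot.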
Licensing
SoMatic follows the FFmpeg licensing strategy: a strictly MIT-licensed core, with AGPL-licensed conversion tooling segregated into a separate directory so it never touches the published artifacts.
- src/somatic/: MIT. The CLI, MCP server, vision daemon, automation primitives, and the YOLO ONNX inference path. Zero AGPL imports. A pytest-time check (tests/test_license_boundary.py) fails CI if anyone re-adds import ultralytics here.
- YOLO ONNX weights: AGPL-3.0. Derived from microsoft/OmniParser-v2.0's upstream YOLO checkpoint. SoMatic does not bundle the weights; somatic vision init downloads them at runtime from a separately-licensed Hugging Face repository. Run somatic license for the full notice.
- tools/: AGPL-3.0. The convert_yolo_to_onnx.py script imports ultralytics (AGPL-3.0) and produces an AGPL-3.0 ONNX file. This directory is excluded from both the npm tarball and the PyPI sdist/wheel. See tools/README.md.
Practically: if you pip install somatic-cli or npm install -g @somatic-cli/cli, your install is MIT. The moment you run somatic vision init you accept the AGPL-3.0 obligations on the downloaded weights. If you run anything from tools/, your derivative ONNX is also AGPL-3.0.
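A license-boundary check of the kind described for src/somatic/ can be written with the standard `ast` module. The sketch below is not the contents of tests/test_license_boundary.py, which is not reproduced in this README. It shows one plausible approach, with hypothetical function names.

```python
import ast
from pathlib import Path


def forbidden_imports(source, forbidden=("ultralytics",)):
    """Return sorted (line, module) pairs for imports of forbidden packages."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute `from pkg import ...` only
        else:
            continue
        for name in names:
            if name.split(".")[0] in forbidden:
                found.append((node.lineno, name))
    return sorted(found)


def scan_tree(root, forbidden=("ultralytics",)):
    """Map each offending .py file under root to its forbidden imports."""
    root = Path(root)
    offenders = {}
    for path in sorted(root.rglob("*.py")):
        hits = forbidden_imports(path.read_text(encoding="utf-8"), forbidden)
        if hits:
            offenders[path.relative_to(root).as_posix()] = hits
    return offenders
```

A pytest test could then assert `scan_tree("src/somatic") == {}`. Parsing the syntax tree rather than grepping avoids false positives from comments and docstrings, though it will not catch dynamic imports such as `importlib.import_module("ultralytics")`.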
File details
Details for the file somatic_cli-0.1.1.tar.gz.
File metadata
- Download URL: somatic_cli-0.1.1.tar.gz
- Upload date:
- Size: 42.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d689fdf24f78ad938a5e419c0120ac3fcb31329bf9d0f120af7a03ffee24ecb |
| MD5 | 207d5f7f10f446f1f5f0aaa85068cd90 |
| BLAKE2b-256 | 153c0a93a1db40458f2b85c215af774b734916ef52ffbe81bd0088d08a92cdd8 |
File details
Details for the file somatic_cli-0.1.1-py3-none-any.whl.
File metadata
- Download URL: somatic_cli-0.1.1-py3-none-any.whl
- Upload date:
- Size: 42.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04b15b7a55236060f22092e417b48d9b1c1c4afc5e10eb5e179a865c0b87ea5e |
| MD5 | 8086294b2c5a3e7fdf9b94116522d50e |
| BLAKE2b-256 | 6ae3232b07f22db6fcd71cb3ad2e466768361f5eebe68a8756329414c34423c9 |