Visual Window Control
MCP server & CLI for controlling windows visually — capture screenshots, extract text via OCR (Tesseract), and send keyboard/mouse input to any target window. Designed for remote desktop workflows (RDP, etc.) but works with any window.
Requirements
- Windows 10/11
- Python 3.10+
- Tesseract OCR
Installation
# Install Tesseract OCR (via Chocolatey or manual download)
choco install tesseract
# Install the package
pip install -e .
Usage
CLI (vwctl)
# List all visible windows
vwctl list-windows
# Capture and OCR a window (by title)
vwctl -w "Remote Desktop" ocr
# Type text with inline tags
vwctl -w "Remote Desktop" type "ls -la{enter}"
# Type from stdin (pipe or streaming)
echo "ls -la{enter}" | vwctl -w "Remote Desktop" type
# Send a special key with modifiers
vwctl -w "Remote Desktop" key c -m ctrl
# Send a key with custom delay (wait 800ms after key press)
vwctl -w "Remote Desktop" key f -m alt -d 800
# Send a key sequence with per-step timing control (delay_ms in ms)
vwctl -w "Remote Desktop" keys '[{"key":"tab"},{"key":"enter","delay_ms":500}]'
# Click at coordinates relative to window
vwctl -w "Remote Desktop" click 400 300
# Execute a command and read output via OCR
vwctl -w "Remote Desktop" exec "ls -la" -W 2.0
# Capture screenshot to file (default: JPEG quality 85)
vwctl -w "Remote Desktop" capture
# → Saved: 2026-03-07_22-24-00_vwctl.jpg (1920x1080)
# Capture with custom JPEG quality 1-95 (default: 85)
vwctl -w "Remote Desktop" capture -q 60
# Capture with custom filename (use .png extension for PNG output)
vwctl -w "Remote Desktop" capture -o screen.png
# Capture occluded window without bringing to foreground
# (uses PrintWindow API; may produce black images for hardware-accelerated apps)
vwctl -w "Remote Desktop" capture -b
# Use hwnd instead of title (faster, no search overhead)
vwctl -H 1234567 ocr
# Send input without stealing focus (works with cmd.exe, Git Bash, PuTTY, etc.)
vwctl -w "Command Prompt" -n type "dir{enter}"
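When vwctl is driven from a script, the JSON array for the keys command is easier to build programmatically than to quote by hand. The following is a minimal Python sketch, assuming vwctl is on PATH; build_keys_argv is a hypothetical helper and not part of the package. It only builds the argument list and does not send any input.

```python
import json


def build_keys_argv(window_title, steps):
    """Build an argv list for `vwctl -w TITLE keys JSON`.

    `steps` is a list of dicts with a "key" entry and an optional
    "delay_ms" entry, mirroring the JSON shown in the examples above.
    This helper is hypothetical and not part of vwctl itself.
    """
    for step in steps:
        if "key" not in step:
            raise ValueError(f"step is missing 'key': {step!r}")
    # Compact separators reproduce the form used in the CLI example.
    payload = json.dumps(steps, separators=(",", ":"))
    return ["vwctl", "-w", window_title, "keys", payload]


argv = build_keys_argv(
    "Remote Desktop",
    [{"key": "tab"}, {"key": "enter", "delay_ms": 500}],
)
print(argv[-1])
# To actually send the keys (Windows, vwctl installed):
#   import subprocess; subprocess.run(argv, check=True)
```

Passing the list to subprocess.run without a shell avoids the quoting differences between cmd.exe, PowerShell, and Bash.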
Subcommands
| Command | Description |
|---|---|
| list-windows | List all visible windows with hwnd and title |
| type [TEXT] [-f FILE] | Type text with inline {tag} support (reads from stdin if omitted; -f to read from file, -f - for explicit stdin) |
| key KEY [-m MOD] [-d MS] | Send a single key press with optional modifiers and delay |
| keys JSON | Send a key sequence from JSON array (per-step key, modifiers, delay_ms) |
| click X Y [-b] | Click at position relative to window |
| move X Y [-r] | Move mouse cursor (absolute or relative) |
| drag X1 Y1 X2 Y2 | Drag mouse from start to end position |
| scroll AMOUNT | Scroll mouse wheel (+up, -down) |
| capture [-o FILE] [-q Q] [-b] | Capture window to JPEG file or base64 stdout (.png extension for PNG) |
| ocr [-b] | Capture window and extract text via OCR |
| exec CMD [-W SEC] | Type command, Enter, wait, then OCR output |
Global Options
| Option | Description |
|---|---|
| -w, --window TITLE | Target window by title (partial match) |
| -H, --hwnd HWND | Target window by handle directly |
| -c, --config FILE | Config file path |
| -n, --no-focus | Send input via PostMessage without stealing focus |
Configuration
Settings can be provided via config file, environment variables, or CLI arguments. Priority: CLI args > env vars > config file.
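The priority rule can be illustrated with a short Python sketch. This is not the package's actual code: resolve_setting is a hypothetical name, the VWCTL_ prefix mapping is inferred from the environment variable names documented in this README, and environment values are returned as raw strings without type conversion.

```python
def resolve_setting(name, cli_args, env, config):
    """Return the effective value for `name`, or None if unset.

    Lookup order follows the stated priority: CLI args, then
    environment variables, then the config file.
    """
    cli_value = cli_args.get(name)
    if cli_value is not None:
        return cli_value
    # Assumption: setting "jpeg_quality" maps to VWCTL_JPEG_QUALITY, etc.
    env_value = env.get("VWCTL_" + name.upper())
    if env_value is not None:
        return env_value
    return config.get(name)


config = {"window": "Remote Desktop", "jpeg_quality": 85}
env = {"VWCTL_JPEG_QUALITY": "60"}
cli = {"window": "Command Prompt"}

print(resolve_setting("window", cli, env, config))        # CLI wins
print(resolve_setting("jpeg_quality", cli, env, config))  # env beats config
```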
Config File
TOML format. Search order (first found wins):
1. --config FILE / VWCTL_CONFIG env var
2. ./vwctl.toml (current directory)
3. ~/.config/vwctl/config.toml (Linux) / %APPDATA%\vwctl\config.toml (Windows)
Example vwctl.toml:
window = "Remote Desktop"
ocr_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
capture_log_dir = "./captures"
jpeg_quality = 85
no_focus = false
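The search order for the config file could be sketched in Python as follows. The function names are hypothetical, the per-user directory is passed in rather than resolved, and the sketch assumes that an explicit path which does not exist simply falls through to the next candidate; the real tool may report an error instead.

```python
from pathlib import Path


def config_candidates(cli_path, env, cwd, user_config_dir):
    """List config paths in search order; the first existing one wins.

    `user_config_dir` stands for the per-user vwctl directory
    (under ~/.config on Linux or under APPDATA on Windows);
    resolving it is left to the caller.
    """
    candidates = []
    # Assumption: --config takes precedence over VWCTL_CONFIG.
    explicit = cli_path or env.get("VWCTL_CONFIG")
    if explicit:
        candidates.append(Path(explicit))
    candidates.append(Path(cwd) / "vwctl.toml")
    candidates.append(Path(user_config_dir) / "config.toml")
    return candidates


def find_config(candidates):
    """Return the first candidate that exists as a file, else None."""
    for path in candidates:
        if path.is_file():
            return path
    return None
```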
Environment Variables
| Variable | Description |
|---|---|
| VWCTL_WINDOW | Default target window title |
| VWCTL_HWND | Default target window handle |
| VWCTL_OCR_CMD | Tesseract executable path |
| VWCTL_CAPTURE_LOG_DIR | Default directory for capture output |
| VWCTL_JPEG_QUALITY | JPEG quality 1-95 (default: 85) |
| VWCTL_NO_FOCUS | Send input via PostMessage without stealing focus (1/true) |
| VWCTL_CONFIG | Config file path |
MCP Server
Add to your MCP client configuration (e.g. .claude.json):
{
"mcpServers": {
"visual-window-control": {
"type": "stdio",
"command": "mcp-visual-window-control"
}
}
}
The MCP server exposes the same functionality as the CLI through these tools: list_windows, set_target_window, get_screen_text, get_screen_image, send_keys, send_special_key, send_key_sequence, click, mouse_move, mouse_drag, mouse_scroll, execute_and_read, list_child_windows, get_focus_info.
send_keys and send_key_sequence automatically detect focus loss: if the target window loses foreground focus during input, the operation is aborted and the tool returns an "Aborted: target window lost focus (sent X/Y ...)" message instead of the normal result.
Inline Tags (send_keys / type)
Text input supports {tag} syntax for special keys:
"ls -la{enter}" → types "ls -la" then presses Enter
"awk '{print $1}' file.txt{enter}" → braces pass through (not a known tag)
"echo {{enter}}" → types "echo {enter}" (escaped)
"{ctrl+c}" → sends Ctrl+C
Whitelist-based: Only recognized key names are interpreted as tags. Unknown {content} passes through literally, so code with curly braces (awk, Python, shell) works without escaping.
Supported keys: {enter}, {tab}, {escape}, {backspace}, {delete}, {up}, {down}, {left}, {right}, {home}, {end}, {pageup}, {pagedown}, {space}, {f1}–{f12}
Modifiers: {ctrl+c}, {alt+f4}, {shift+tab}
Escaping: {{ → literal {, }} → literal }
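To make the whitelist behaviour concrete, here is a small Python tokenizer that reproduces the four examples above. It is a sketch under assumptions, not vwctl's parser: the names are hypothetical, tag names are matched case-sensitively, and a single character is accepted as the key only when a modifier is present (as in {ctrl+c}).

```python
KEY_NAMES = {
    "enter", "tab", "escape", "backspace", "delete", "up", "down",
    "left", "right", "home", "end", "pageup", "pagedown", "space",
} | {f"f{n}" for n in range(1, 13)}
MODIFIERS = {"ctrl", "alt", "shift"}


def parse_tag(content):
    """Return (modifiers, key) if `content` is a recognized tag, else None."""
    *mods, key = content.split("+")
    if any(m not in MODIFIERS for m in mods):
        return None
    # Assumption: with a modifier, the key may also be a single character;
    # without one, only named keys count as tags.
    if key in KEY_NAMES or (mods and len(key) == 1):
        return tuple(mods), key
    return None


def tokenize(text):
    """Split text into ("text", str) and ("key", modifiers, key) tokens."""
    tokens, buf, i = [], [], 0
    while i < len(text):
        if text.startswith("{{", i) or text.startswith("}}", i):
            buf.append(text[i])      # escaped brace -> one literal brace
            i += 2
            continue
        if text[i] == "{":
            close = text.find("}", i + 1)
            tag = parse_tag(text[i + 1:close]) if close != -1 else None
            if tag is not None:
                if buf:
                    tokens.append(("text", "".join(buf)))
                    buf = []
                tokens.append(("key", *tag))
                i = close + 1
                continue
        buf.append(text[i])          # ordinary char or unrecognized brace
        i += 1
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

Because unrecognized brace content is copied through character by character, the awk example needs no escaping, while {{enter}} is the only way to type the literal text {enter}.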
Supported Characters
Each mode accepts a specific set of characters. Text containing unsupported characters (e.g. escape sequences, null bytes) will be rejected with an error before any keystrokes are sent.
| Mode | Accepted characters | Special keys |
|---|---|---|
| Tag mode (default for text arg) | Printable characters (U+0020–U+007E, U+0080+) | Via {tag} syntax: {enter}, {tab}, {ctrl+c}, etc. |
| Raw mode (-r, default for stdin/file) | Printable characters + \t (Tab) + line endings (\n, \r\n, \r → Enter) | None (modifier combos like Ctrl+C not available) |
Choosing a mode: Use raw mode (-r) for multi-line or long text input where modifier key combinations are not needed. Use tag mode for interactive sequences that require special keys or modifiers.
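A pre-flight check along these lines can be written in a few lines of Python. This sketch only mirrors the character ranges listed in the table; the function names and the error message are hypothetical, and vwctl's own validation may differ in detail.

```python
def is_printable(ch):
    """True for the documented printable range: U+0020-U+007E, U+0080+."""
    code = ord(ch)
    return 0x20 <= code <= 0x7E or code >= 0x80


def validate_text(text, raw=False):
    """Raise ValueError at the first unsupported character.

    Raw mode additionally accepts tab and line-ending characters.
    """
    extra = "\t\n\r" if raw else ""
    for index, ch in enumerate(text):
        if not (is_printable(ch) or ch in extra):
            raise ValueError(
                f"unsupported character U+{ord(ch):04X} at offset {index}"
            )


validate_text("ls -la{enter}")               # tag mode: accepted
validate_text("echo hello\n", raw=True)      # raw mode: newline accepted
```

Validating the whole string before sending anything matches the stated behaviour that unsupported text is rejected before any keystrokes are sent.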
Sending arbitrary data: If your text contains control characters or escape sequences (e.g. ANSI codes), encode it as base64 and decode on the remote side:
# Encode locally, type via raw mode, decode on remote
base64 -w0 binary_file.dat | vwctl -H HWND type -f -
# Then on the remote side: echo "<pasted>" | base64 -d > file
# Or as a single pipeline command:
echo "echo '$(base64 -w0 binary_file.dat)' | base64 -d > /tmp/file{enter}" | vwctl -H HWND type -t
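The same encoding step can be done from Python when the payload is produced by a script rather than a file. The sketch below uses only the standard library; to_typable_base64 is a hypothetical helper name, and its output corresponds to what base64 -w0 prints.

```python
import base64


def to_typable_base64(data: bytes) -> str:
    """Encode bytes as single-line base64, i.e. printable ASCII only."""
    return base64.b64encode(data).decode("ascii")


original = b"\x1b[31mred\x1b[0m\x00"
payload = to_typable_base64(original)

# Every character is one of A-Z, a-z, 0-9, '+', '/', '=', so the payload
# stays inside the printable range accepted by both input modes.
typable = all(0x20 <= ord(ch) <= 0x7E for ch in payload)

# Round trip, as `base64 -d` would do on the remote side.
restored = base64.b64decode(payload)
```

The resulting string can then be written to the stdin of vwctl type -f -, as in the shell example above.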
Raw Mode
Raw mode disables all tag interpretation. Line endings (\n, \r\n, \r) are sent as Enter key presses, and tab characters (\t) are sent as Tab. Raw mode (-r) is recommended for multi-line or long text input where modifier key combinations (e.g. {ctrl+c}) are not needed.
# CLI
vwctl -w "Remote Desktop" type -r "echo hello
echo world
"
# MCP: {"text": "echo hello\necho world\n", "raw": true}
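The raw-mode translation can be pictured as a split on line endings and tabs. The Python sketch below is illustrative only; raw_events and its event tuples are hypothetical names, not vwctl internals.

```python
import re

# \r\n must come first so it counts as one line ending, not two.
_SPLIT = re.compile(r"(\r\n|\n|\r|\t)")


def raw_events(text):
    """Translate raw-mode text into ("text", s) and ("key", name) events."""
    events = []
    for part in _SPLIT.split(text):
        if part == "":
            continue
        if part == "\t":
            events.append(("key", "tab"))
        elif part in ("\r\n", "\n", "\r"):
            events.append(("key", "enter"))
        else:
            events.append(("text", part))   # braces stay literal in raw mode
    return events


print(raw_events("echo hello\necho world\n"))
```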
Stdin and File Input
When the text argument is omitted, type reads from stdin line by line. Use --file/-f to read from a file, or -f - for explicit stdin.
Stdin and file input default to raw mode (no tag interpretation), since the typical use case is piping file/program output. Use -t/--tags to enable tag interpretation for these sources.
# Pipe from another command (raw by default)
echo "ls -la" | vwctl -w "Remote Desktop" type
# Explicit stdin with "-f -"
cat commands.txt | vwctl -w "Remote Desktop" type -f -
# Read from a file directly (raw by default)
vwctl -w "Remote Desktop" type -f commands.txt
# File input with tag interpretation
vwctl -w "Remote Desktop" type -f commands.txt -t
# Streaming (line-by-line as data arrives)
tail -f commands.fifo | vwctl -w "Remote Desktop" type
If both a text argument and stdin are present, the text argument wins (stdin is ignored).
-r/--raw and -t/--tags are mutually exclusive.
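The rules in this section (source selection, per-source default mode, and the mutually exclusive flags) can be summarised in one small function. This Python sketch is a reading of the rules above, with a hypothetical function name; it additionally assumes that a text argument also wins over -f, which the README does not state.

```python
def resolve_input(text_arg=None, file_arg=None, raw=False, tags=False):
    """Return (source, mode) for a `type` invocation.

    source is "arg", "file", or "stdin"; mode is "tags" or "raw".
    """
    if raw and tags:
        raise ValueError("-r/--raw and -t/--tags are mutually exclusive")
    if text_arg is not None:
        source = "arg"              # a text argument wins over stdin
    elif file_arg is not None and file_arg != "-":
        source = "file"
    else:
        source = "stdin"            # omitted text, or explicit "-f -"
    if raw:
        mode = "raw"
    elif tags:
        mode = "tags"
    else:
        # Defaults: tag mode for a text argument, raw mode for stdin/file.
        mode = "tags" if source == "arg" else "raw"
    return source, mode


print(resolve_input(text_arg="ls -la{enter}"))
print(resolve_input(file_arg="commands.txt", tags=True))
```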
Focus Loss Detection
The type and keys commands (and MCP send_keys / send_key_sequence tools) monitor whether the target window remains in the foreground during input. If another window takes focus, input is immediately aborted:
# type command
Aborted: target window lost focus (typed 42 characters)
# keys command
Aborted: target window lost focus (sent 2/5 key steps)
This prevents keystrokes from being sent to an unintended window. Focus checking is disabled in no-focus mode (-n for CLI, no_focus: true for MCP).
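The abort logic can be sketched as a loop that checks focus before each character. In this Python example, send_char and is_foreground are hypothetical caller-supplied callables standing in for the real input and focus APIs, checking before every character is an assumption, and the success message is a placeholder; only the abort message format is taken from the output above.

```python
def type_with_focus_check(text, send_char, is_foreground):
    """Send characters one by one, aborting as soon as focus is lost.

    Returns (completed, message).
    """
    typed = 0
    for ch in text:
        if not is_foreground():
            return False, (
                f"Aborted: target window lost focus (typed {typed} characters)"
            )
        send_char(ch)
        typed += 1
    return True, f"Typed {typed} characters"   # placeholder success message


# Simulated run: focus is lost before the fourth character.
sent = []
focus_states = iter([True, True, True, False])
ok, message = type_with_focus_check(
    "hello", sent.append, lambda: next(focus_states)
)
print(ok, message, sent)
```

Injecting the two callables keeps the loop testable without a real window.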
Key Delay (delay_ms)
After each key press, vwctl waits for a configurable delay before proceeding to the next action. This gives the target application time to process the keystroke (especially important for remote desktop apps, menus, and GUI transitions).
Default delays (when delay_ms is not specified):
| Context | Default delay |
|---|---|
| key / keys commands (focus mode) | 600 ms |
| key / keys commands (no-focus mode, -n) | 100 ms |
| Inline {tag} in type command | 100 ms |
| Plain text in type command | 20 ms (per character) |
Overriding the delay:
- keys command (CLI): set delay_ms per step in the JSON array.
    vwctl -H HWND keys '[{"key":"alt+f","delay_ms":800},{"key":"s","delay_ms":200}]'
- send_special_key MCP tool: set the delay_ms parameter directly.
- send_key_sequence MCP tool: set delay_ms per step in the steps array.
- key command (CLI): set --delay/-d in milliseconds.
    vwctl -H HWND key f -m alt -d 800
When to adjust: Increase the delay for slow UI transitions (e.g. menu opening, dialog loading). Decrease it for fast sequential keypresses where the default 600 ms is too slow.
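To estimate how long a key sequence will take, the defaults and overrides described in this section can be combined as in the Python sketch below. The function names are hypothetical, and the sketch assumes the 100 ms and 20 ms defaults for the type command are the same in focus and no-focus mode, which the table does not specify.

```python
def default_delay_ms(kind, no_focus=False):
    """Documented default delay for "key", "tag", or "char" contexts."""
    if kind == "key":               # key / keys commands
        return 100 if no_focus else 600
    if kind == "tag":               # inline {tag} in the type command
        return 100
    if kind == "char":              # plain text, per character
        return 20
    raise ValueError(f"unknown context: {kind!r}")


def effective_delay_ms(kind, no_focus=False, delay_ms=None):
    """Return the explicit delay if given, else the default."""
    if delay_ms is not None:
        if delay_ms < 0:
            raise ValueError("delay_ms must be non-negative")
        return delay_ms
    return default_delay_ms(kind, no_focus)


def total_delay_ms(steps, no_focus=False):
    """Sum the waits for a `keys` step list, one delay per step."""
    return sum(
        effective_delay_ms("key", no_focus, step.get("delay_ms"))
        for step in steps
    )


steps = [{"key": "tab"}, {"key": "enter", "delay_ms": 500}]
print(total_delay_ms(steps))                  # 600 + 500
print(total_delay_ms(steps, no_focus=True))   # 100 + 500
```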
Limitations
- Focus stealing: When sending input to the target window, focus is moved to that window by default. This is required for the input to be received by the target application.
- No-focus mode (-n/--no-focus): An option exists to send input via PostMessage without stealing focus, but this only works with certain native Windows applications (e.g. cmd.exe, Git Bash, PuTTY). Remote desktop applications (RDP, Guacamole, VNC, etc.) do not support no-focus input; they require the window to be focused and in the foreground to receive keyboard/mouse events.
- Admin privileges: When the target application runs as admin, the controlling process must also run as admin due to Windows UIPI restrictions.
OCR Tips
- Use monospace fonts (JetBrains Mono, Hack, Fira Code) at 24pt+
- Use high-contrast terminal themes
- Larger window sizes improve accuracy
For LLM Agents
See LLM.md for a CLI reference designed for LLM agents — recommended workflow, command examples, and common patterns.
Tested With
- Windows 11, Python 3.12+, Tesseract 5.x
- Remote Desktop (mstsc.exe)