PyVisionAuto: Cross-platform desktop automation toolkit with visual image matching, mouse/keyboard control, and screen recording
PyVisionAuto
PyVisionAuto is an end-to-end desktop automation toolkit. It is centered on visual image matching and also includes screen recording, mouse automation, and keyboard automation capabilities.
Runtime screenshot
[Screenshot: highlighted match region during runtime]
Scope
- Linux (X11 session) and Windows
- Real physical display required
Install
pip install pyvisionauto
Main APIs
- Screen.find(image, confidence=None, region=None)
- Screen.wait(image, timeout=10, confidence=None, region=None)
- Screen.click(image, timeout=10, confidence=None, button="left", region=None)
- Screen.wait_vanish(image, timeout=10, confidence=None, region=None)
- Screen.click_and_wait_vanish(image, timeout=10, confidence=None, region=None)
- Screen.activate_window(title_substring)
- Screen.region(x, y, w, h) for region-scoped operations
- MatchHandle.highlight().click() for fluent chaining
- Input.type_text(text), Input.press(key), Input.hotkey(*keys)
- Recorder.start(output_file, fps=15), Recorder.stop()
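The timeout parameters on wait and wait_vanish suggest a poll-until-deadline loop. The sketch below shows that general pattern in plain Python. It is not PyVisionAuto's implementation; poll_until, interval, and the probe callable are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(probe: Callable[[], Optional[T]],
               timeout: float = 10.0,
               interval: float = 0.2) -> T:
    """Call probe() repeatedly until it returns a non-None value.

    Hypothetical helper: raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        # Check the deadline only after probing, so probe() runs at least once
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

With a real display, probe could be something like lambda: screen.find("login_button.png"), assuming find returns None when nothing matches. That return convention is an assumption, not something this README states.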
System dependencies
Linux
- python3-tk: Required for border overlay highlight
- xdotool: Preferred for window activation
- wmctrl: Fallback for window activation
- ffmpeg: Required for screen recording; install via sudo apt install ffmpeg
Windows
- tkinter: Bundled with most Python installations
- ffmpeg: Required for screen recording; download from ffmpeg.org, extract the archive, and add the bin folder to the system PATH
Verify ffmpeg installation
# Check if ffmpeg is installed and accessible
ffmpeg -version
Note: Screen recording (via the Recorder API) requires ffmpeg. On Linux it uses the x11grab input device; on Windows it uses the gdigrab input device. Both are included in typical ffmpeg builds for their respective platforms.
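A script can check these prerequisites before starting a recording. The following is a minimal sketch using only the standard library; grab_format and ffmpeg_available are hypothetical helper names and are not part of PyVisionAuto.

```python
import platform
import shutil
from typing import Optional


def grab_format(system: Optional[str] = None) -> str:
    """Return the ffmpeg screen-grab input named in this README for a platform."""
    system = system or platform.system()
    formats = {"Linux": "x11grab", "Windows": "gdigrab"}
    if system not in formats:
        raise RuntimeError(f"unsupported platform: {system}")
    return formats[system]


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None
```

Calling ffmpeg_available() before Recorder.start() lets a script fail early with a clear message instead of partway through an automation run.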
Quick start
Basic usage: Find and click
from pyvisionauto import Screen
screen = Screen()
# Wait for image to appear on screen, highlight it, then click
screen.wait("login_button.png", timeout=10).highlight().click()
Advanced example: Record automation with screen capture
This example demonstrates screen recording combined with visual automation:
from pyvisionauto import Screen, Recorder
from pathlib import Path
screen = Screen()
recorder = Recorder()
recorder.start(output_file=Path("automation_demo.mp4"))
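try:
    # Bring the target window to the front so keyboard input reaches it
    screen.activate_window("Calculator")
    screen.wait("button_1.png", timeout=10).highlight().click()
    screen.click("button_plus.png", timeout=5)
    # Keyboard input goes through the Input API; 1 + 6 = 7
    screen.input.type_text("6")
    screen.wait("button_equals.png", timeout=5).highlight().click()
    screen.wait("result_7.png", timeout=3).highlight()
finally:
    # Always stop the recorder, even if a step above fails
    recorder.stop()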
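The start/stop pairing above can also be wrapped in a context manager so that the recorder is stopped on every exit path. This is a sketch under the assumption that the recorder object exposes start(output_file, fps=...) and stop() as listed under Main APIs; the recording helper itself is hypothetical and not part of the package.

```python
from contextlib import contextmanager


@contextmanager
def recording(recorder, output_file, fps=15):
    """Start recording on entry and always stop on exit (hypothetical helper)."""
    recorder.start(output_file, fps=fps)
    try:
        yield recorder
    finally:
        # Runs on normal exit and when the body raises
        recorder.stop()
```

Usage would then look like: with recording(Recorder(), Path("automation_demo.mp4")): followed by the automation steps in the indented body.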
Activate a window before matching
screen.activate_window("Calculator")
screen.click("button.png")
Platform differences
| Feature | Linux | Windows |
|---|---|---|
| Screen capture & template matching | Supported | Supported |
| Mouse / keyboard automation | Supported | Supported |
| Highlight overlay | Supported | Supported |
| Window activation | xdotool / wmctrl | pyautogui (pygetwindow) |
| Screen recording | ffmpeg + x11grab | ffmpeg + gdigrab |
Screen recording requires ffmpeg to be installed and on the system PATH. Linux uses x11grab; Windows uses gdigrab.
Window focus on Linux (X11)
On X11 systems, mouse clicks alone do not automatically change keyboard focus. The window manager only reassigns focus in response to real hardware events or explicit window activation requests. This means:
- click() moves the cursor to the correct coordinates and clicks, but the keyboard focus stays wherever it was before.
- Any subsequent keyboard action (press(), type_text(), hotkeys) is delivered to whichever window currently has focus, which may not be the window you just clicked.
Rule of thumb: always call activate_window() before any keyboard action, targeting the exact window that should receive it.
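One way to make this rule hard to forget is to route keyboard input through a small wrapper that activates the window first. The helper below is a hypothetical sketch; it assumes the screen object exposes activate_window() and input.type_text(), as used elsewhere in this README.

```python
def type_into(screen, window_title, text):
    """Activate the target window, then type into it (hypothetical helper).

    Assumes `screen` provides activate_window() and input.type_text().
    """
    # Explicit activation first, so the keystrokes reach the intended window
    screen.activate_window(window_title)
    screen.input.type_text(text)
```

For example, type_into(screen, "Calculator", "5") would replace a bare type_text call that relies on whatever window happens to hold focus.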
Use xdotool to find the precise window name while the application is running:
xdotool search --name "" 2>/dev/null | while read id; do
  printf "ID=%-12s %s\n" "$id" "$(xdotool getwindowname "$id" 2>/dev/null)"
done
Pick the shortest substring that uniquely identifies the target window and use it in activate_window().
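Choosing that substring can be automated once the window titles have been collected, for example from the xdotool listing above. The function below is a sketch only: shortest_unique_substring and min_length are hypothetical names, and it assumes case-sensitive substring matching, which this README does not confirm for activate_window().

```python
from typing import List, Optional


def shortest_unique_substring(target: str, titles: List[str],
                              min_length: int = 4) -> Optional[str]:
    """Return the shortest substring of target found in no other title.

    Returns None when every candidate also occurs in another title.
    min_length avoids fragile one- or two-character matches.
    """
    others = [t for t in titles if t != target]
    first = min(min_length, len(target))
    for length in range(first, len(target) + 1):
        for start in range(len(target) - length + 1):
            candidate = target[start:start + length]
            if not candidate.strip():
                continue  # skip empty or whitespace-only candidates
            if all(candidate not in other for other in others):
                return candidate
    return None
```

A None result means the target title is fully contained in another window's title, so a substring alone cannot single it out.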
Main window vs. dialogs
When a modal dialog is open, activate the dialog directly — do not activate the main window and rely on the WM to forward focus:
from pyvisionauto import Screen
screen = Screen()
# --- Interacting with a dialog ---
# 1. Wait for the dialog image to appear and click it
screen.wait("open_project_dialog.png", timeout=30).click()
# 2. Activate the dialog window so keyboard input goes to it
screen.activate_window("Open Project") # activate the dialog, not the main window
# 3. Now keyboard actions are reliably delivered to the dialog
screen.input.press("esc")
# --- Interacting with the main window ---
screen.activate_window("My App 2026")
screen.wait("toolbar_button.png", timeout=10).click()
Why not just activate the main window? On GNOME/Mutter, activating the main window does propagate focus to a modal child dialog — but this is WM-specific behaviour. Activating the dialog directly is explicit, portable, and not dependent on WM modal-focus rules.
highlight() and focus
highlight() launches a temporary tkinter overlay window. On some window managers this overlay can briefly steal keyboard focus. To avoid side effects:
- Prefer .click() before .highlight(), not after. The API supports chaining in both directions.
- Do not rely on focus being intact after .highlight() returns; call activate_window() again if keyboard actions follow.
# Safer pattern: click first, highlight after (for visual feedback only)
screen.wait("button.png", timeout=10).click().highlight()
# Risky pattern: highlight steals focus, click lands on wrong window
# screen.wait("button.png", timeout=10).highlight().click() # avoid
Notes
- Wayland-only and headless environments are not currently supported.
- On Windows with high-DPI scaling, coordinate accuracy may be affected.
Acknowledgments
This project is inspired by SikuliX and built with: