KAI Agent — AI-powered vision-based desktop automation using Claude Vision
AIK (AI Keyboard) - Vision-Based Keyboard Automation
A Windows-based AI agent that uses Claude Vision (Haiku 4.5) to understand your screen and perform keyboard-only automation tasks. The agent captures screenshots, analyzes them with AI, and executes keyboard actions to accomplish your goals.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ User Space (Python Agent) │
├─────────────────────────────────────────────────────────────┤
│ User Goal Input → Agent Controller (main.py) │
│ ↓ │
│ AI Logic Loop │
│ ├── Window Manager (pywin32) ← Context │
│ ├── Vision Module (mss/PIL) ← Capture │
│ └── LLM Client (Anthropic) → Action Plan │
│ ↓ │
│ Driver Interface (ctypes) → IOCTL (Scancodes) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Kernel Space (Ring 0) │
├─────────────────────────────────────────────────────────────┤
│ Kernel Keyboard Filter Driver (KMDF) │
│ └── Inject → Windows Input Stack (kbdclass) │
│ ↓ │
│ Target Environment: Any App / System Prompts (UAC) │
└─────────────────────────────────────────────────────────────┘
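The capture, plan, execute cycle shown in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual agent.py: `run_loop`, `capture`, `plan`, and `execute` are hypothetical names standing in for the vision module, the LLM client, and the key injector. The defaults mirror the documented `--max-steps` and `--interval` values.

```python
import time
from typing import Callable

Action = dict  # e.g. {"type": "key_press", "key": "enter"}


def run_loop(
    goal: str,
    capture: Callable[[], bytes],                 # hypothetical: returns a screenshot
    plan: Callable[[str, bytes], list[Action]],   # hypothetical: asks the model for actions
    execute: Callable[[Action], None],            # hypothetical: injects one action
    max_steps: int = 40,
    interval: float = 0.8,
    dry_run: bool = False,
) -> int:
    """Run planning cycles until a stop action or max_steps; return cycles used."""
    for step in range(1, max_steps + 1):
        screenshot = capture()
        for action in plan(goal, screenshot):
            if action.get("type") == "stop":
                return step
            if dry_run:
                # Dry-run only reports what would have been injected.
                print(f"[dry-run] step {step}: {action}")
            else:
                execute(action)
        time.sleep(interval)
    return max_steps
```

Passing the three stages in as callables keeps the loop testable without a screen, an API key, or a driver.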
Features
- Vision-based AI: Uses Claude Vision to understand screen content
- Keyboard-only automation: Executes type_text, key_press, hotkey actions
- Kill switch: Press Ctrl+Alt+Backspace to stop immediately
- User-mode injection: Works with most applications via SendInput
- Kernel driver support (optional): For bypassing UIPI restrictions
- History-aware memory: Persists step-by-step execution history (with screenshot context), summarizes older steps, and avoids immediate repeat-loops
Requirements
- Windows 10/11 (64-bit)
- Python 3.11+
- Anthropic API key with vision access
Quick Start
1. Install dependencies
pip install mss pywin32 pynput httpx pillow python-dotenv
Or use the requirements file:
pip install -r requirements.txt
2. Configure API key
Edit the .env file:
ANTHROPIC_API_KEY=your-api-key-here
ANTHROPIC_MODEL=claude-haiku-4-5-20251001
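For illustration, the two settings above could be read at startup along these lines. This is a sketch only: `load_settings` is a hypothetical helper, and treating the documented default model as the fallback when `ANTHROPIC_MODEL` is unset is an assumption. Loading the .env file itself (for example with python-dotenv, which is in the dependency list) is left out so the snippet stays self-contained.

```python
import os

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # default from the options table


def load_settings(env=None) -> dict:
    """Read the API key and model ID from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        # Same wording as the error listed under Troubleshooting.
        raise RuntimeError("Missing ANTHROPIC_API_KEY")
    return {"api_key": api_key, "model": env.get("ANTHROPIC_MODEL", DEFAULT_MODEL)}
```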
3. Run the agent
Dry-run (prints actions without executing):
python main.py --goal "Open Notepad and type Hello World" --dry-run
Live mode (actually types):
python main.py --goal "Type 'Hello World' and press Enter"
Elevated mode (type into admin apps):
python main.py --elevate --goal "Type: Hello from elevated context"
Note: Elevation still cannot interact with the UAC secure desktop or login screen.
Interactive terminal mode (re-enter goals without retyping the full command):
python tools/interactive_run.py
Voice Terminal (Multilingual)
You can run the voice-to-terminal utility with multilingual speech recognition.
python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --languages "en-IN,hi-IN,ta-IN"
Enable AI fallback for natural Hindi/Hinglish instructions:
python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --languages "en-IN,hi-IN" --ai-command-map
For multi-step spoken tasks (for example, "open excel then type data save and email"), the tool now delegates to the main agent automatically:
python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --delegate-to-agent
Single-language usage:
python tools/voice_type_terminal.py --provider google --run-command --language "en-US"
Command-line Options
| Option | Default | Description |
|---|---|---|
| --goal | (required) | What you want the agent to accomplish |
| --dry-run | False | Print actions without injecting keys |
| --max-steps | 40 | Maximum planning cycles |
| --interval | 0.8 | Seconds between planning cycles |
| --monitor | 1 | mss monitor index (1=primary) |
| --screenshot-max-width | 1280 | Downscale screenshots for API |
| --model | claude-haiku-4-5-20251001 | Anthropic model ID |
| --log-level | INFO | Logging verbosity |
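The options in the table map naturally onto an `argparse` parser. The sketch below is one way to express them, assuming the documented names and defaults; `build_parser` is a hypothetical function and may not match how main.py is actually written. The `--elevate` flag used in Quick Start is left out because the table does not document it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering only the options listed in the table."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--goal", required=True,
                        help="What you want the agent to accomplish")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print actions without injecting keys")
    parser.add_argument("--max-steps", type=int, default=40,
                        help="Maximum planning cycles")
    parser.add_argument("--interval", type=float, default=0.8,
                        help="Seconds between planning cycles")
    parser.add_argument("--monitor", type=int, default=1,
                        help="mss monitor index (1=primary)")
    parser.add_argument("--screenshot-max-width", type=int, default=1280,
                        help="Downscale screenshots for API")
    parser.add_argument("--model", default="claude-haiku-4-5-20251001",
                        help="Anthropic model ID")
    parser.add_argument("--log-level", default="INFO",
                        help="Logging verbosity")
    return parser
```

For example, `build_parser().parse_args(["--goal", "Open Notepad", "--dry-run"])` yields a namespace with `dry_run=True` and the remaining defaults.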
Action Schema
The AI returns JSON with keyboard actions:
{
"actions": [
{"type": "type_text", "text": "Hello World"},
{"type": "key_press", "key": "enter"},
{"type": "hotkey", "keys": ["ctrl", "s"]},
{"type": "wait_ms", "ms": 500},
{"type": "stop", "reason": "Task completed"}
]
}
Supported Actions
| Action | Fields | Description |
|---|---|---|
| type_text | text | Type a string |
| key_press | key | Press a single key (enter, tab, f1-f24, a-z, 0-9) |
| hotkey | keys | Press key combo (["ctrl", "c"]) |
| wait_ms | ms | Wait milliseconds (0-60000) |
| stop | reason | Stop the agent |
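Because the action plan arrives as model-generated JSON, it is worth validating before anything is injected. The following is a minimal sketch of such a check based only on the fields and the 0-60000 range documented above; `parse_actions` and `REQUIRED_FIELDS` are hypothetical names, and the project's own actions.py may validate differently.

```python
import json

# Action type -> (required field, expected Python type), per the table above.
REQUIRED_FIELDS = {
    "type_text": ("text", str),
    "key_press": ("key", str),
    "hotkey": ("keys", list),
    "wait_ms": ("ms", int),
    "stop": ("reason", str),
}


def parse_actions(raw: str) -> list[dict]:
    """Parse a JSON reply and reject actions that do not match the schema."""
    payload = json.loads(raw)
    actions = payload.get("actions") if isinstance(payload, dict) else None
    if not isinstance(actions, list):
        raise ValueError("'actions' must be a list")
    for action in actions:
        if not isinstance(action, dict):
            raise ValueError("each action must be an object")
        kind = action.get("type")
        if kind not in REQUIRED_FIELDS:
            raise ValueError(f"unsupported action type: {kind!r}")
        field, expected = REQUIRED_FIELDS[kind]
        value = action.get(field)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{kind}.{field} must be {expected.__name__}")
        if kind == "wait_ms" and not 0 <= value <= 60000:
            raise ValueError("wait_ms.ms must be within 0-60000")
    return actions
```

Rejecting the whole plan on the first invalid action is a conservative choice: a partially executed plan is harder to reason about than a retried one.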
History-Aware Agent Memory
The agent maintains an internal conversation history so it can remember what it already did across steps:
- Keeps the original goal pinned
- Stores per-step memory (observations, planned actions, executed actions, success/failure, timestamps)
- Summarizes older steps to avoid token blowups (keeps recent steps with screenshots)
- Performs conservative dedup (skips immediate repeat actions that just succeeded in the prior step)
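The summarization and dedup behaviour described above can be sketched as a small data structure. This is an illustration under assumptions, not the agent's real implementation: `Step`, `History`, the `keep_recent` window size, and the one-line summary format are all hypothetical, and screenshots and timestamps are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    actions: list[dict]
    success: bool
    observation: str = ""


@dataclass
class History:
    goal: str                      # the original goal stays pinned
    keep_recent: int = 3           # assumed window size, not from the project
    summary: list[str] = field(default_factory=list)
    recent: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        """Store a step; collapse steps that fall out of the window to one line."""
        self.recent.append(step)
        while len(self.recent) > self.keep_recent:
            old = self.recent.pop(0)
            kinds = ", ".join(a.get("type", "?") for a in old.actions)
            status = "ok" if old.success else "failed"
            self.summary.append(f"{kinds} ({status})")

    def is_immediate_repeat(self, actions: list[dict]) -> bool:
        """True only if the prior step succeeded with exactly these actions."""
        if not self.recent:
            return False
        last = self.recent[-1]
        return last.success and last.actions == actions
```

Only the immediately preceding successful step is compared, which keeps the dedup conservative: a failed step may legitimately be retried with the same actions.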
Project Structure
├── main.py # Entry point
├── aik/
│ ├── agent.py # Main agent loop
│ ├── anthropic_client.py # Claude API client
│ ├── capture.py # Screen capture (mss)
│ ├── window_context.py # Active window info (pywin32)
│ ├── input_injector.py # User-mode key injection
│ ├── driver_bridge.py # Kernel driver communication
│ ├── actions.py # Action parsing
│ ├── prompt.py # System prompts
│ └── kill_switch.py # Emergency stop
├── driver_stub/ # KMDF driver source
│ └── AikKmdfIoctl/
├── tools/
│ └── driver_ping.py # Driver test utility
└── requirements.txt
Kernel Driver (Advanced)
The driver stub in driver_stub/ provides kernel-level scancode injection that can bypass UIPI restrictions (type into UAC prompts, admin terminals, etc.).
Building the Driver
- Install Windows Driver Kit (WDK)
- Open driver_stub/AikKmdfIoctl/ in Visual Studio
- Build for your target (x64 Release)
Loading the Driver (Test Mode)
# Enable test signing (requires reboot)
bcdedit /set testsigning on
# Load driver
sc create AikKmdf type= kernel binPath= "C:\path\to\AikKmdfIoctl.sys"
sc start AikKmdf
# Test connectivity
python tools/driver_ping.py
Driver IOCTLs
| IOCTL | Function |
|---|---|
| IOCTL_AIK_PING | Returns "PONG" |
| IOCTL_AIK_ECHO | Echoes input buffer |
| IOCTL_AIK_INJECT_SCANCODE | Inject single scancode |
| IOCTL_AIK_INJECT_SCANCODES | Inject scancode batch |
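On the Python side, a driver bridge needs the numeric control codes that correspond to these names. Windows builds them with the `CTL_CODE` macro, which can be reproduced in Python as sketched below. The device type, function numbers, method, and access values used here are assumptions for illustration only; the real values are defined in the driver's header and are not given in this README.

```python
FILE_DEVICE_UNKNOWN = 0x00000022
METHOD_BUFFERED = 0
FILE_ANY_ACCESS = 0


def ctl_code(device_type: int, function: int, method: int, access: int) -> int:
    """Python equivalent of the Windows CTL_CODE macro."""
    return (device_type << 16) | (access << 14) | (function << 2) | method


# Hypothetical function numbers (0x800 and up is the vendor-defined range).
# These are NOT taken from the driver source.
IOCTL_AIK_PING = ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
IOCTL_AIK_ECHO = ctl_code(FILE_DEVICE_UNKNOWN, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

print(hex(IOCTL_AIK_PING))  # 0x222000
print(hex(IOCTL_AIK_ECHO))  # 0x222004
```

The resulting integers are what a ctypes call to `DeviceIoControl` would take as its control-code argument.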
Safety
- Kill Switch: Ctrl+Alt+Backspace stops the agent immediately
- Dry Run: Test with --dry-run before live execution
- Max Steps: Agent stops after 40 steps by default
- No Mouse: Intentionally keyboard-only to limit scope
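The kill switch boils down to tracking which keys are currently held and signalling the agent once the full combination is down. The sketch below shows that logic in isolation. It is a hypothetical illustration rather than the project's kill_switch.py: key names are plain strings by assumption, and a real keyboard listener (pynput is in the dependency list) would need to translate its key objects into these names before calling `on_press` and `on_release`.

```python
import threading


class KillSwitch:
    """Sets an event once every key in the combo is held down at the same time."""

    def __init__(self, combo=("ctrl", "alt", "backspace")):
        self.combo = frozenset(combo)
        self.pressed: set[str] = set()
        self.triggered = threading.Event()

    def on_press(self, key: str) -> None:
        self.pressed.add(key)
        if self.combo <= self.pressed:
            # The event stays set; the agent loop can poll is_set() each step.
            self.triggered.set()

    def on_release(self, key: str) -> None:
        self.pressed.discard(key)
```

Using a `threading.Event` lets a listener thread raise the stop signal while the agent loop checks it between actions.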
Troubleshooting
"Missing ANTHROPIC_API_KEY"
- Set the key in .env or as an environment variable
Keys don't work in elevated apps
- Run the Python script as Administrator
- Or use the kernel driver for UIPI bypass
Driver won't load
- Enable test signing: bcdedit /set testsigning on
- Check DebugView for kernel logs
License
MIT