KAI Agent — AI-powered vision-based desktop automation using Claude Vision

AIK (AI Keyboard) - Vision-Based Keyboard Automation

A Windows-based AI agent that uses Claude Vision (Haiku 4.5) to understand your screen and perform keyboard-only automation tasks. The agent captures screenshots, analyzes them with AI, and executes keyboard actions to accomplish your goals.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    User Space (Python Agent)                 │
├─────────────────────────────────────────────────────────────┤
│  User Goal Input → Agent Controller (main.py)               │
│         ↓                                                    │
│    AI Logic Loop                                             │
│    ├── Window Manager (pywin32) ← Context                   │
│    ├── Vision Module (mss/PIL) ← Capture                    │
│    └── LLM Client (Anthropic) → Action Plan                 │
│         ↓                                                    │
│    Driver Interface (ctypes) → IOCTL (Scancodes)            │
└─────────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│                  Kernel Space (Ring 0)                       │
├─────────────────────────────────────────────────────────────┤
│    Kernel Keyboard Filter Driver (KMDF)                      │
│    └── Inject → Windows Input Stack (kbdclass)              │
│                        ↓                                     │
│    Target Environment: Any App / System Prompts (UAC)       │
└─────────────────────────────────────────────────────────────┘
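As a rough illustration of the loop in the diagram, the sketch below wires capture, planning, and execution together through plain callables. The names run_loop and LoopResult and the callable signatures are assumptions made for this example; they are not taken from aik/agent.py.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    steps: int                     # planning cycles that were started
    stopped_reason: str            # why the loop ended
    executed: list = field(default_factory=list)


def run_loop(
    goal: str,
    capture: Callable[[], bytes],                     # hypothetical: returns a screenshot
    plan: Callable[[str, bytes, list], list],         # hypothetical: returns action dicts
    execute: Callable[[dict], None],                  # hypothetical: injects one action
    max_steps: int = 40,
    interval: float = 0.8,
    should_stop: Callable[[], bool] = lambda: False,  # e.g. a kill-switch check
) -> LoopResult:
    executed: list = []
    for step in range(1, max_steps + 1):
        if should_stop():
            return LoopResult(step - 1, "kill switch", executed)
        screenshot = capture()
        # The planner sees the goal, the current screen, and what was already done.
        for action in plan(goal, screenshot, executed):
            if action["type"] == "stop":
                return LoopResult(step, action.get("reason", "stop"), executed)
            execute(action)
            executed.append(action)
        time.sleep(interval)
    return LoopResult(max_steps, "max steps reached", executed)
```

Passing the three stages in as callables keeps the loop testable without a screen, an API key, or real key injection.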

Features

  • Vision-based AI: Uses Claude Vision to understand screen content
  • Keyboard-only automation: Executes type_text, key_press, hotkey actions
  • Kill switch: Press Ctrl+Alt+Backspace to stop immediately
  • User-mode injection: Works with most applications via SendInput
  • Kernel driver support (optional): For bypassing UIPI restrictions
  • History-aware memory: Persists step-by-step execution history (with screenshot context), summarizes older steps, and avoids immediate repeat-loops

Requirements

  • Windows 10/11 (64-bit)
  • Python 3.11+
  • Anthropic API key with vision access

Quick Start

1. Install dependencies

pip install mss pywin32 pynput httpx pillow python-dotenv

Or use the requirements file:

pip install -r requirements.txt

2. Configure API key

Edit the .env file:

ANTHROPIC_API_KEY=your-api-key-here
ANTHROPIC_MODEL=claude-haiku-4-5-20251001
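The two settings above could be read along these lines. This is a standard-library sketch only: parse_env_text and load_settings are hypothetical helpers, and the project itself may simply rely on python-dotenv, which appears in the dependency list. The fallback model ID matches the default shown for --model.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "claude-haiku-4-5-20251001"


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the API key and model, failing early if the key is absent."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("Missing ANTHROPIC_API_KEY")
    model = env.get("ANTHROPIC_MODEL", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model": model}
```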

3. Run the agent

Dry-run (prints actions without executing):

python main.py --goal "Open Notepad and type Hello World" --dry-run

Live mode (actually types):

python main.py --goal "Type 'Hello World' and press Enter"

Elevated mode (type into admin apps):

python main.py --elevate --goal "Type: Hello from elevated context"

Note: Elevation still cannot interact with the UAC secure desktop or login screen.

Interactive terminal mode (re-enter goals without retyping the full command):

python tools/interactive_run.py

Voice Terminal (Multilingual)

You can run the voice-to-terminal utility with multilingual speech recognition.

python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --languages "en-IN,hi-IN,ta-IN"

Enable AI fallback for natural Hindi/Hinglish instructions:

python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --languages "en-IN,hi-IN" --ai-command-map

For multi-step spoken tasks (for example, "open excel then type data save and email"), pass --delegate-to-agent so the tool hands the task to the main agent:

python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --delegate-to-agent

Single-language usage:

python tools/voice_type_terminal.py --provider google --run-command --language "en-US"

Command-line Options

Option                   Default                      Description
--goal                   (required)                   What you want the agent to accomplish
--dry-run                False                        Print actions without injecting keys
--max-steps              40                           Maximum planning cycles
--interval               0.8                          Seconds between planning cycles
--monitor                1                            mss monitor index (1=primary)
--screenshot-max-width   1280                         Downscale screenshots for API
--model                  claude-haiku-4-5-20251001    Anthropic model ID
--log-level              INFO                         Logging verbosity
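For reference, the options in the table map onto argparse roughly as follows. This sketch mirrors only the documented names and defaults. build_parser is a hypothetical helper, not the parser in main.py, and the --elevate flag shown in Quick Start is left out because the table does not describe it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--goal", required=True,
                        help="What you want the agent to accomplish")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print actions without injecting keys")
    parser.add_argument("--max-steps", type=int, default=40,
                        help="Maximum planning cycles")
    parser.add_argument("--interval", type=float, default=0.8,
                        help="Seconds between planning cycles")
    parser.add_argument("--monitor", type=int, default=1,
                        help="mss monitor index (1=primary)")
    parser.add_argument("--screenshot-max-width", type=int, default=1280,
                        help="Downscale screenshots for API")
    parser.add_argument("--model", default="claude-haiku-4-5-20251001",
                        help="Anthropic model ID")
    parser.add_argument("--log-level", default="INFO",
                        help="Logging verbosity")
    return parser
```

Calling build_parser().parse_args(["--goal", "Open Notepad", "--dry-run"]) yields a namespace with dry_run set to True and every other option at its documented default.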

Action Schema

The AI returns JSON with keyboard actions:

{
  "actions": [
    {"type": "type_text", "text": "Hello World"},
    {"type": "key_press", "key": "enter"},
    {"type": "hotkey", "keys": ["ctrl", "s"]},
    {"type": "wait_ms", "ms": 500},
    {"type": "stop", "reason": "Task completed"}
  ]
}

Supported Actions

Action      Fields   Description
type_text   text     Type a string
key_press   key      Press a single key (enter, tab, f1-f24, a-z, 0-9)
hotkey      keys     Press key combo (["ctrl", "c"])
wait_ms     ms       Wait milliseconds (0-60000)
stop        reason   Stop the agent
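A response in this schema could be checked before execution along the following lines. The sketch enforces only what the table states (one required field per action type, and the 0-60000 range for wait_ms). parse_actions is a hypothetical helper and may differ from what aik/actions.py actually does.

```python
import json

# Required field for each documented action type.
REQUIRED_FIELDS = {
    "type_text": "text",
    "key_press": "key",
    "hotkey": "keys",
    "wait_ms": "ms",
    "stop": "reason",
}


def parse_actions(raw: str) -> list[dict]:
    """Parse the model's JSON reply and validate each action."""
    data = json.loads(raw)
    actions = data.get("actions") if isinstance(data, dict) else None
    if not isinstance(actions, list):
        raise ValueError("expected an object with an 'actions' list")
    for i, action in enumerate(actions):
        kind = action.get("type") if isinstance(action, dict) else None
        if kind not in REQUIRED_FIELDS:
            raise ValueError(f"action {i}: unknown type {kind!r}")
        name = REQUIRED_FIELDS[kind]
        if name not in action:
            raise ValueError(f"action {i}: missing field {name!r}")
        value = action[name]
        if kind == "wait_ms":
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 60000:
                raise ValueError(f"action {i}: ms must be an integer in 0-60000")
        elif kind == "hotkey":
            if not (isinstance(value, list) and value and all(isinstance(k, str) for k in value)):
                raise ValueError(f"action {i}: keys must be a non-empty list of strings")
        elif not isinstance(value, str):
            raise ValueError(f"action {i}: {name} must be a string")
    return actions
```

Rejecting a malformed plan before any key is injected means a bad model reply costs one planning cycle rather than stray keystrokes.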

History-Aware Agent Memory

The agent maintains an internal conversation history so it can remember what it already did across steps:

  • Keeps the original goal pinned
  • Stores per-step memory (observations, planned actions, executed actions, success/failure, timestamps)
  • Summarizes older steps to avoid token blowups (keeps recent steps with screenshots)
  • Performs conservative dedup (skips immediate repeat actions that just succeeded in the prior step)
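The behaviour listed above can be sketched with a small data structure. AgentMemory, Step, and keep_recent are names invented for this example, and the one-line text summary is a stand-in for whatever summarization the agent really performs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    index: int
    actions: list
    success: bool
    screenshot: Optional[bytes] = None


@dataclass
class AgentMemory:
    goal: str                                           # pinned, never summarized away
    keep_recent: int = 3                                # steps kept in full detail
    summary: list = field(default_factory=list)         # one line per older step
    recent: list = field(default_factory=list)          # full Step records

    def record(self, actions: list, success: bool,
               screenshot: Optional[bytes] = None) -> None:
        index = len(self.summary) + len(self.recent) + 1
        self.recent.append(Step(index, actions, success, screenshot))
        # Collapse the oldest detailed steps into short text, dropping screenshots.
        while len(self.recent) > self.keep_recent:
            old = self.recent.pop(0)
            kinds = ", ".join(a["type"] for a in old.actions)
            status = "ok" if old.success else "failed"
            self.summary.append(f"step {old.index}: {kinds} ({status})")

    def is_immediate_repeat(self, actions: list) -> bool:
        """True only if the previous step ran these exact actions successfully."""
        if not self.recent:
            return False
        last = self.recent[-1]
        return last.success and last.actions == actions
```

The dedup check is deliberately narrow: a repeat is skipped only when it matches the immediately preceding step and that step succeeded, so retries after a failure still go through.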

Project Structure

├── main.py              # Entry point
├── aik/
│   ├── agent.py         # Main agent loop
│   ├── anthropic_client.py  # Claude API client
│   ├── capture.py       # Screen capture (mss)
│   ├── window_context.py    # Active window info (pywin32)
│   ├── input_injector.py    # User-mode key injection
│   ├── driver_bridge.py     # Kernel driver communication
│   ├── actions.py       # Action parsing
│   ├── prompt.py        # System prompts
│   └── kill_switch.py   # Emergency stop
├── driver_stub/         # KMDF driver source
│   └── AikKmdfIoctl/
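├── tools/
│   ├── driver_ping.py           # Driver test utility
│   ├── interactive_run.py       # Interactive terminal mode
│   └── voice_type_terminal.py   # Voice-to-terminal utility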
└── requirements.txt

Kernel Driver (Advanced)

The driver stub in driver_stub/ provides kernel-level scancode injection that can bypass UIPI restrictions (type into UAC prompts, admin terminals, etc.).
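To make "scancode injection" concrete, here is a small sketch that turns key_press and hotkey actions into PS/2 Set 1 scancode sequences, where a key release is the press code with the high bit (0x80) set. The lookup table covers only a handful of keys, and the helper names are hypothetical. The buffer layout the driver actually expects is not documented here, so treat the byte sequences as an illustration of the idea rather than the driver's wire format.

```python
# PS/2 Set 1 make (press) codes for a few keys; extend as needed.
MAKE_CODES = {
    "ctrl": 0x1D,
    "shift": 0x2A,
    "alt": 0x38,
    "enter": 0x1C,
    "tab": 0x0F,
    "a": 0x1E,
    "s": 0x1F,
    "c": 0x2E,
}
BREAK_BIT = 0x80  # in Set 1, release code = make code | 0x80


def key_press_scancodes(key: str) -> list[int]:
    """Press then release a single key."""
    make = MAKE_CODES[key]
    return [make, make | BREAK_BIT]


def hotkey_scancodes(keys: list[str]) -> list[int]:
    """Press keys in order, then release them in reverse order."""
    makes = [MAKE_CODES[k] for k in keys]
    return makes + [m | BREAK_BIT for m in reversed(makes)]
```

Releasing in reverse order keeps the modifier held until the final key is up, which is how a person would press Ctrl+S.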

Building the Driver

  1. Install Windows Driver Kit (WDK)
  2. Open driver_stub/AikKmdfIoctl/ in Visual Studio
  3. Build for your target (x64 Release)

Loading the Driver (Test Mode)

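# Run these from an elevated Command Prompt.
# In Windows PowerShell, call sc.exe instead of sc (sc is an alias for Set-Content there).

# Enable test signing (requires reboot)
bcdedit /set testsigning on

# Load driver
sc create AikKmdf type= kernel binPath= "C:\path\to\AikKmdfIoctl.sys"
sc start AikKmdf

# Test connectivity
python tools/driver_ping.py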

Driver IOCTLs

IOCTL                        Function
IOCTL_AIK_PING               Returns "PONG"
IOCTL_AIK_ECHO               Echoes input buffer
IOCTL_AIK_INJECT_SCANCODE    Inject single scancode
IOCTL_AIK_INJECT_SCANCODES   Inject scancode batch
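User-mode code needs the numeric value of each IOCTL before it can call DeviceIoControl. The sketch below reproduces the bit layout of the Windows CTL_CODE macro in Python. The device type, function numbers (0x800 to 0x803), transfer method, and access mode are assumptions chosen for illustration; the real values are whatever the driver source in driver_stub/ defines.

```python
# Constants from the Windows DDK headers.
FILE_DEVICE_UNKNOWN = 0x22
METHOD_BUFFERED = 0
FILE_ANY_ACCESS = 0


def ctl_code(device_type: int, function: int, method: int, access: int) -> int:
    """Python equivalent of the CTL_CODE macro's bit packing."""
    return (device_type << 16) | (access << 14) | (function << 2) | method


# Hypothetical function numbers; 0x800 and above is the range reserved for vendors.
IOCTL_AIK_PING = ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
IOCTL_AIK_ECHO = ctl_code(FILE_DEVICE_UNKNOWN, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)
IOCTL_AIK_INJECT_SCANCODE = ctl_code(FILE_DEVICE_UNKNOWN, 0x802, METHOD_BUFFERED, FILE_ANY_ACCESS)
IOCTL_AIK_INJECT_SCANCODES = ctl_code(FILE_DEVICE_UNKNOWN, 0x803, METHOD_BUFFERED, FILE_ANY_ACCESS)
```

The Python constants must match the driver's own definitions exactly; a mismatch typically shows up as DeviceIoControl failing with an invalid-function error.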

Safety

  • Kill Switch: Ctrl+Alt+Backspace stops the agent immediately
  • Dry Run: Test with --dry-run before live execution
  • Max Steps: Agent stops after 40 steps by default
  • No Mouse: Intentionally keyboard-only to limit scope
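The kill switch can be thought of as a small state machine: track which keys are held and set a flag once Ctrl, Alt, and Backspace are all down. The class below is a sketch under that assumption, with key names passed in as plain strings. A real listener (for example pynput, which is in the dependency list) would deliver key objects that need normalizing first, and aik/kill_switch.py may be structured differently.

```python
import threading


class KillSwitch:
    """Sets `triggered` once every key in COMBO is held at the same time."""

    COMBO = frozenset({"ctrl", "alt", "backspace"})

    def __init__(self) -> None:
        self._pressed: set = set()
        self.triggered = threading.Event()

    def on_press(self, key: str) -> None:
        self._pressed.add(key)
        if self.COMBO <= self._pressed:
            self.triggered.set()

    def on_release(self, key: str) -> None:
        self._pressed.discard(key)
```

Because triggered is a threading.Event, the listener thread can set it while the agent loop polls triggered.is_set() between actions.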

Troubleshooting

"Missing ANTHROPIC_API_KEY"

  • Set the key in .env or environment variable

Keys don't work in elevated apps

  • Run the Python script as Administrator
  • Or use the kernel driver for UIPI bypass

Driver won't load

  • Enable test signing, then reboot: bcdedit /set testsigning on
  • Check DebugView for kernel logs

License

MIT

Download files

Source Distribution

kai_agent-0.1.0.tar.gz (59.5 kB)

Built Distribution

kai_agent-0.1.0-py3-none-any.whl (64.5 kB)

File details

Details for the file kai_agent-0.1.0.tar.gz.

File metadata

  • Download URL: kai_agent-0.1.0.tar.gz
  • Upload date:
  • Size: 59.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for kai_agent-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        078683d8aef4e6f623a81570a848e82cf7bf6052040bc7bf317d1c3563de52e1
MD5           d8ace85dc7fa6e37853092f7a7ab2435
BLAKE2b-256   9d65114b910d935a68095fd70f714bd36c72894ff91e913c5334693a5eb973fb

File details

Details for the file kai_agent-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: kai_agent-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 64.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for kai_agent-0.1.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        aa6a2825e92fde440143829356cf573fb3b37b1e5eedf879d729d88d57afcd30
MD5           deb356e47df2087a3f4bf7137301e4ee
BLAKE2b-256   7b7375bbffc7c27ae2864ea50254cb82bdb99585753a408ff4e6f20a895b2409
