KAI Agent — AI-powered vision-based desktop automation using Claude Vision

AIK (AI Keyboard) - Vision-Based Keyboard Automation

A Windows-based AI agent that uses Claude Vision (Haiku 4.5) to understand your screen and perform keyboard-only automation tasks. The agent captures screenshots, analyzes them with AI, and executes keyboard actions to accomplish your goals.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    User Space (Python Agent)                 │
├─────────────────────────────────────────────────────────────┤
│  User Goal Input → Agent Controller (main.py)               │
│         ↓                                                    │
│    AI Logic Loop                                             │
│    ├── Window Manager (pywin32) ← Context                   │
│    ├── Vision Module (mss/PIL) ← Capture                    │
│    └── LLM Client (Anthropic) → Action Plan                 │
│         ↓                                                    │
│    Driver Interface (ctypes) → IOCTL (Scancodes)            │
└─────────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│                  Kernel Space (Ring 0)                       │
├─────────────────────────────────────────────────────────────┤
│    Kernel Keyboard Filter Driver (KMDF)                      │
│    └── Inject → Windows Input Stack (kbdclass)              │
│                        ↓                                     │
│    Target Environment: Any App / System Prompts (UAC)       │
└─────────────────────────────────────────────────────────────┘
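The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in aik/agent.py: run_agent, StepResult, and the three injected callables (capture, plan, execute) are hypothetical names chosen so the sketch runs without a screen, an API key, or a driver.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    step: int
    actions: list[dict]  # actions actually executed in this step


def run_agent(
    goal: str,
    capture: Callable[[], bytes],
    plan: Callable[[str, bytes, list], list[dict]],
    execute: Callable[[dict], None],
    max_steps: int = 40,
) -> list[StepResult]:
    """Capture -> plan -> execute, until a stop action or max_steps."""
    results: list[StepResult] = []
    for step in range(1, max_steps + 1):
        screenshot = capture()                     # Vision Module
        actions = plan(goal, screenshot, results)  # LLM Client -> Action Plan
        executed: list[dict] = []
        done = False
        for action in actions:
            if action.get("type") == "stop":
                done = True
                break
            execute(action)                        # injector or driver bridge
            executed.append(action)
        results.append(StepResult(step, executed))
        if done:
            break
    return results
```

Passing print as execute gives dry-run style behaviour: actions are shown but no keys are injected.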

Features

Vision-based AI: Uses Claude Vision to understand screen content
Keyboard-only automation: Executes type_text, key_press, hotkey actions
Kill switch: Press Ctrl+Alt+Backspace to stop immediately
User-mode injection: Works with most applications via SendInput
Kernel driver support (optional): For bypassing UIPI restrictions
History-aware memory: Persists step-by-step execution history (with screenshot context), summarizes older steps, and avoids immediate repeat-loops

Requirements

Windows 10/11 (64-bit)
Python 3.11+
Anthropic API key with vision access

Quick Start

1. Install dependencies

pip install mss pywin32 pynput httpx pillow python-dotenv

Or use the requirements file:

pip install -r requirements.txt

2. Configure API key

Edit the .env file:

ANTHROPIC_API_KEY=your-api-key-here
ANTHROPIC_MODEL=claude-haiku-4-5-20251001
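A minimal sketch of how these two settings might be read at startup. load_settings is a hypothetical helper, not the project's actual loader. It assumes the .env values have already been placed in the process environment (python-dotenv's load_dotenv() is the usual way to do that) and that the model falls back to the documented default when ANTHROPIC_MODEL is unset.

```python
import os

DEFAULT_MODEL = "claude-haiku-4-5-20251001"


def load_settings(env=None) -> dict:
    """Read the API key and model name from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        # Same wording as the Troubleshooting entry.
        raise RuntimeError("Missing ANTHROPIC_API_KEY")
    model = env.get("ANTHROPIC_MODEL", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model": model}
```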

3. Run the agent

Dry-run (prints actions without executing):

python main.py --goal "Open Notepad and type Hello World" --dry-run

Live mode (actually types):

python main.py --goal "Type 'Hello World' and press Enter"

Elevated mode (type into admin apps):

python main.py --elevate --goal "Type: Hello from elevated context"

Note: Elevation still cannot interact with the UAC secure desktop or login screen.

Interactive terminal mode (re-enter goals without retyping the full command):

python tools/interactive_run.py

Voice Terminal (Multilingual)

You can run the voice-to-terminal utility with multilingual speech recognition.

python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --languages "en-IN,hi-IN,ta-IN"

Enable AI fallback for natural Hindi/Hinglish instructions:

python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --languages "en-IN,hi-IN" --ai-command-map

For multi-step spoken tasks (for example, "open excel then type data save and email"), the tool delegates to the main agent when run with --delegate-to-agent:

python tools/voice_type_terminal.py --provider sarvam --run-command --continuous --delegate-to-agent

Single-language usage:

python tools/voice_type_terminal.py --provider google --run-command --language "en-US"

Command-line Options

Option	Default	Description
`--goal`	(required)	What you want the agent to accomplish
`--dry-run`	False	Print actions without injecting keys
`--max-steps`	40	Maximum planning cycles
`--interval`	0.8	Seconds between planning cycles
`--monitor`	1	mss monitor index (1=primary)
`--screenshot-max-width`	1280	Downscale screenshots for API
`--model`	claude-haiku-4-5-20251001	Anthropic model ID
`--log-level`	INFO	Logging verbosity
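For reference, the table above maps onto argparse roughly like this. It is a sketch built only from the documented options and defaults; the real main.py may define more flags (for example --elevate) or structure its parser differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options and their defaults."""
    p = argparse.ArgumentParser(prog="main.py")
    p.add_argument("--goal", required=True, help="What the agent should accomplish")
    p.add_argument("--dry-run", action="store_true", help="Print actions without injecting keys")
    p.add_argument("--max-steps", type=int, default=40, help="Maximum planning cycles")
    p.add_argument("--interval", type=float, default=0.8, help="Seconds between planning cycles")
    p.add_argument("--monitor", type=int, default=1, help="mss monitor index (1=primary)")
    p.add_argument("--screenshot-max-width", type=int, default=1280, help="Downscale screenshots for API")
    p.add_argument("--model", default="claude-haiku-4-5-20251001", help="Anthropic model ID")
    p.add_argument("--log-level", default="INFO", help="Logging verbosity")
    return p


args = build_parser().parse_args(["--goal", "Open Notepad", "--dry-run"])
```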

Action Schema

The AI returns JSON with keyboard actions:

{
  "actions": [
    {"type": "type_text", "text": "Hello World"},
    {"type": "key_press", "key": "enter"},
    {"type": "hotkey", "keys": ["ctrl", "s"]},
    {"type": "wait_ms", "ms": 500},
    {"type": "stop", "reason": "Task completed"}
  ]
}

Supported Actions

Action	Fields	Description
`type_text`	`text`	Type a string
`key_press`	`key`	Press a single key (enter, tab, f1-f24, a-z, 0-9)
`hotkey`	`keys`	Press key combo (["ctrl", "c"])
`wait_ms`	`ms`	Wait milliseconds (0-60000)
`stop`	`reason`	Stop the agent
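A reply in this schema can be checked before anything is typed. The sketch below is one possible validator, not the code in aik/actions.py: parse_actions and REQUIRED_FIELD are hypothetical names, and it assumes each action type requires exactly the field listed in the table.

```python
import json

# Required field per action type, taken from the table above.
REQUIRED_FIELD = {
    "type_text": "text",
    "key_press": "key",
    "hotkey": "keys",
    "wait_ms": "ms",
    "stop": "reason",
}


def parse_actions(raw: str) -> list[dict]:
    """Parse the model's JSON reply and reject malformed actions."""
    data = json.loads(raw)
    actions = data.get("actions") if isinstance(data, dict) else None
    if not isinstance(actions, list):
        raise ValueError("reply must be an object with an 'actions' list")
    for action in actions:
        kind = action.get("type") if isinstance(action, dict) else None
        if not isinstance(kind, str) or kind not in REQUIRED_FIELD:
            raise ValueError(f"unsupported action: {action!r}")
        name = REQUIRED_FIELD[kind]
        if name not in action:
            raise ValueError(f"{kind} is missing '{name}'")
        if kind == "wait_ms":
            ms = action["ms"]
            if isinstance(ms, bool) or not isinstance(ms, int) or not 0 <= ms <= 60000:
                raise ValueError("wait_ms.ms must be an integer in 0-60000")
        if kind == "hotkey":
            keys = action["keys"]
            if not isinstance(keys, list) or not keys or not all(isinstance(k, str) for k in keys):
                raise ValueError("hotkey.keys must be a non-empty list of key names")
    return actions
```

Rejecting the whole reply on the first bad action is a conservative choice: nothing is injected unless the full plan is well-formed.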

History-Aware Agent Memory

The agent maintains an internal conversation history so it can remember what it already did across steps:

Keeps the original goal pinned
Stores per-step memory (observations, planned actions, executed actions, success/failure, timestamps)
Summarizes older steps to avoid token blowups (keeps recent steps with screenshots)
Performs conservative dedup (skips immediate repeat actions that just succeeded in the prior step)
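The behaviour described above could look roughly like this. It is an illustrative sketch only: AgentMemory, StepRecord, keep_recent, and the summary format are hypothetical, and timestamps and screenshots are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    step: int
    observation: str
    actions: list[dict]
    success: bool


@dataclass
class AgentMemory:
    goal: str                      # the original goal stays pinned
    keep_recent: int = 3           # steps kept in full detail
    summary: list[str] = field(default_factory=list)
    recent: list[StepRecord] = field(default_factory=list)

    def record(self, rec: StepRecord) -> None:
        """Store a step; fold the oldest ones into one-line summaries."""
        self.recent.append(rec)
        while len(self.recent) > self.keep_recent:
            old = self.recent.pop(0)
            status = "ok" if old.success else "failed"
            self.summary.append(f"step {old.step}: {len(old.actions)} action(s), {status}")

    def is_immediate_repeat(self, actions: list[dict]) -> bool:
        """True only if the previous step succeeded with identical actions."""
        if not self.recent:
            return False
        last = self.recent[-1]
        return last.success and last.actions == actions
```

The dedup check only looks at the immediately preceding step, so a plan that failed, or one that succeeded several steps ago, can still be retried.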

Project Structure

├── main.py              # Entry point
├── aik/
│   ├── agent.py         # Main agent loop
│   ├── anthropic_client.py  # Claude API client
│   ├── capture.py       # Screen capture (mss)
│   ├── window_context.py    # Active window info (pywin32)
│   ├── input_injector.py    # User-mode key injection
│   ├── driver_bridge.py     # Kernel driver communication
│   ├── actions.py       # Action parsing
│   ├── prompt.py        # System prompts
│   └── kill_switch.py   # Emergency stop
├── driver_stub/         # KMDF driver source
│   └── AikKmdfIoctl/
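├── tools/
│   ├── driver_ping.py   # Driver test utility
│   ├── interactive_run.py   # Interactive goal prompt
│   └── voice_type_terminal.py   # Voice-to-terminal utility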
└── requirements.txt

Kernel Driver (Advanced)

The driver stub in driver_stub/ provides kernel-level scancode injection that can bypass UIPI restrictions (type into UAC prompts, admin terminals, etc.).
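To make "scancode injection" concrete, here is a sketch of how a hotkey such as ["ctrl", "s"] expands into ordered press and release events. The make codes are standard scancode set 1 values for a handful of keys. Everything else is an assumption for illustration: hotkey_events is a hypothetical helper, and the (make_code, is_release) tuple is not the driver's documented wire format.

```python
# Scancode set 1 make codes for a small illustrative subset of keys.
MAKE_CODES = {
    "ctrl": 0x1D,   # left Ctrl
    "shift": 0x2A,  # left Shift
    "alt": 0x38,    # left Alt
    "enter": 0x1C,
    "tab": 0x0F,
    "c": 0x2E,
    "s": 0x1F,
}


def hotkey_events(keys: list[str]) -> list[tuple[int, bool]]:
    """Press keys in the given order, then release them in reverse order.

    Each event is (make_code, is_release).
    """
    codes = [MAKE_CODES[k.lower()] for k in keys]
    presses = [(code, False) for code in codes]
    releases = [(code, True) for code in reversed(codes)]
    return presses + releases
```

Releasing in reverse order keeps the modifier held for the whole combination, which is how a person presses Ctrl+S.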

Building the Driver

Install Windows Driver Kit (WDK)
Open driver_stub/AikKmdfIoctl/ in Visual Studio
Build for your target (x64 Release)

Loading the Driver (Test Mode)

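# Run these from an elevated prompt. In Windows PowerShell, plain "sc" is an
# alias for Set-Content, so call sc.exe explicitly.

# Enable test signing (requires reboot)
bcdedit /set testsigning on

# Load driver
sc.exe create AikKmdf type= kernel binPath= "C:\path\to\AikKmdfIoctl.sys"
sc.exe start AikKmdf

# Test connectivity
python tools/driver_ping.py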

Driver IOCTLs

IOCTL	Function
`IOCTL_AIK_PING`	Returns "PONG"
`IOCTL_AIK_ECHO`	Echoes input buffer
`IOCTL_AIK_INJECT_SCANCODE`	Inject single scancode
`IOCTL_AIK_INJECT_SCANCODES`	Inject scancode batch
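The numeric values of these IOCTLs are not listed here. As background, Windows control codes are normally built with the CTL_CODE macro, which can be reproduced in Python as below. The function number 0x800 is a hypothetical example (the start of the range Microsoft reserves for vendor-defined codes), not the driver's actual value for any of the IOCTLs above.

```python
# Constants from the Windows DDK headers.
FILE_DEVICE_UNKNOWN = 0x22
METHOD_BUFFERED = 0
FILE_ANY_ACCESS = 0


def ctl_code(device_type: int, function: int, method: int, access: int) -> int:
    """Python equivalent of the CTL_CODE macro."""
    return (device_type << 16) | (access << 14) | (function << 2) | method


# Hypothetical example code; check the driver source for the real numbers.
EXAMPLE_IOCTL = ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
```

With these inputs EXAMPLE_IOCTL is 0x222000. The user-mode side and the driver must agree on the exact same value, so the Python constants need to match the driver's header.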

Safety

Kill Switch: Ctrl+Alt+Backspace stops the agent immediately
Dry Run: Test with --dry-run before live execution
Max Steps: Agent stops after 40 steps by default
No Mouse: Intentionally keyboard-only to limit scope
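A kill switch of this kind can be sketched as a small state tracker fed by key press and release callbacks. This is an illustration, not the code in aik/kill_switch.py: KillSwitch is a hypothetical class, and key names are plain strings here. A real keyboard listener (such as pynput's, which is among the dependencies) delivers key objects that would need to be normalised first.

```python
import threading

KILL_COMBO = frozenset({"ctrl", "alt", "backspace"})


class KillSwitch:
    """Sets stop_event once Ctrl+Alt+Backspace are all held at the same time."""

    def __init__(self) -> None:
        self._pressed: set[str] = set()
        self.stop_event = threading.Event()

    def on_press(self, key: str) -> None:
        self._pressed.add(key)
        if KILL_COMBO <= self._pressed:
            self.stop_event.set()

    def on_release(self, key: str) -> None:
        self._pressed.discard(key)
```

The agent loop can then check stop_event.is_set() between actions and exit as soon as it is set.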

Troubleshooting

"Missing ANTHROPIC_API_KEY"

Set the key in .env or as an environment variable.

Keys don't work in elevated apps

Run the Python script as Administrator (see the --elevate example under Quick Start)
Or use the kernel driver for UIPI bypass

Driver won't load

Enable test signing, then reboot: bcdedit /set testsigning on
Check DebugView for kernel logs

License

MIT

This version

0.1.0

Feb 15, 2026

Download files

Source Distribution

kai_agent-0.1.0.tar.gz (59.5 kB view details)

Uploaded Feb 15, 2026 Source

Built Distribution

kai_agent-0.1.0-py3-none-any.whl (64.5 kB view details)

Uploaded Feb 15, 2026 Python 3

File details

Details for the file kai_agent-0.1.0.tar.gz.

File metadata

Download URL: kai_agent-0.1.0.tar.gz
Upload date: Feb 15, 2026
Size: 59.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for kai_agent-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`078683d8aef4e6f623a81570a848e82cf7bf6052040bc7bf317d1c3563de52e1`
MD5	`d8ace85dc7fa6e37853092f7a7ab2435`
BLAKE2b-256	`9d65114b910d935a68095fd70f714bd36c72894ff91e913c5334693a5eb973fb`

File details

Details for the file kai_agent-0.1.0-py3-none-any.whl.

File metadata

Download URL: kai_agent-0.1.0-py3-none-any.whl
Upload date: Feb 15, 2026
Size: 64.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for kai_agent-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa6a2825e92fde440143829356cf573fb3b37b1e5eedf879d729d88d57afcd30`
MD5	`deb356e47df2087a3f4bf7137301e4ee`
BLAKE2b-256	`7b7375bbffc7c27ae2864ea50254cb82bdb99585753a408ff4e6f20a895b2409`
