Vision-powered desktop automation framework with OCR text recognition using Tesseract.

██╗  ██╗██████╗  ██████╗ ███╗   ██╗███████╗████████╗███████╗███████╗███╗   ██╗
██║ ██╔╝██╔══██╗██╔═══██╗████╗  ██║██╔════╝╚══██╔══╝██╔════╝██╔════╝████╗  ██║
█████╔╝ ██████╔╝██║   ██║██╔██╗ ██║███████╗   ██║   █████╗  █████╗  ██╔██╗ ██║
██╔═██╗ ██╔══██╗██║   ██║██║╚██╗██║╚════██║   ██║   ██╔══╝  ██╔══╝  ██║╚██╗██║
██║  ██╗██║  ██║╚██████╔╝██║ ╚████║███████║   ██║   ███████╗███████╗██║ ╚████║
╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝ ╚═╝  ╚═══╝╚══════╝   ╚═╝   ╚══════╝╚══════╝╚═╝  ╚═══╝

"The perfect automation is invisible"

Vision-Powered Desktop Automation Framework

Python 3.8+ · License: MIT · Cross-Platform


🎯 Mission Brief

Named after the chess grandmaster and master planner from Ian Fleming's From Russia with Love (SPECTRE's No. 5 in the film adaptation), Kronsteen is your strategic automation framework. Like its namesake, it operates with precision, intelligence, and flawless execution.

Kronsteen combines computer vision (OCR) with human-like automation to interact with any desktop application—no API required. It sees what you see, clicks what you click, and types what you type.

Why Kronsteen?

  • 🎯 Vision-First - Uses OCR to find and interact with UI elements
  • 🚀 Universal - Works with any application, any platform
  • 🧠 Intelligent - Template matching, window focus monitoring, smart retries
  • Fast - Tesseract OCR processes screens in ~100ms
  • 🛡️ Reliable - Built-in error handling and logging
  • 🌍 Cross-Platform - macOS, Windows, Linux support
  • 📐 Resolution-Independent - Works on any screen size or DPI automatically

✨ Key Features

🎭 Core Capabilities

Feature | Description
--- | ---
🔍 OCR Text Finding | Find and click text anywhere on screen using Tesseract OCR
🤖 AI Vision (NEW!) | YOLO object detection, segmentation & classification
🖼️ Template Matching | Match images and click on them with confidence thresholds
🚀 Universal Launcher | Launch apps by name on any platform (no paths needed)
🖱️ Mouse & Keyboard | Full control with human-like timing and movements
🪟 Window Monitoring | Pause automation when target window loses focus
📸 Smart Screenshots | Capture screens with automatic Retina display scaling
🎨 Color Detection | Find UI elements by color patterns
📊 Logging System | Automatic logging with optional screenshot capture
⚙️ Configurable | Timeouts, retries, confidence levels, and more

🧩 Framework Architecture

kronsteen/
├── 🎯 client.py              # Main orchestrator
├── 🔍 ocr_tesseract.py       # Tesseract OCR engine (Retina support)
├── 🖼️ ocr.py                 # DeepSeek OCR engine (GPU/CPU)
├── 🤖 vision.py              # YOLO object detection & segmentation
├── 🎪 finders.py             # Text/image/template finding
├── 🎬 actions.py             # Mouse/keyboard automation
├── 🚀 launcher.py            # Cross-platform app launcher
├── 🪟 window_monitor.py      # Window focus tracking
├── 📝 logging_config.py      # Logging & screenshots
├── 🎨 models.py              # Data structures
└── ⚙️ config.py              # Configuration management

🚀 Installation

Two simple steps:

1. Install Tesseract OCR

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt update && sudo apt install tesseract-ocr

# Windows
# Download installer: https://github.com/UB-Mannheim/tesseract/wiki

2. Install Kronsteen

pip install kronsteen

That's it! One command installs everything:

Core Automation:

  • pyautogui - Mouse and keyboard automation
  • pytesseract - Python wrapper for Tesseract
  • opencv-python - Computer vision and template matching
  • Pillow - Image processing
  • numpy - Numerical operations

AI Vision (Included!):

  • ultralytics - YOLO models (YOLOv8, YOLOv9, YOLOv10, YOLOv11)
  • torch - PyTorch deep learning framework
  • roboflow - Custom model integration

Done! OCR + AI Vision + Automation in one package!


⚡ Quick Start

Your First Mission

import kronsteen

# Setup logging (optional)
kronsteen.setup_logging(enable_screenshots=False)

# Launch Chrome - works on all platforms!
kronsteen.launch("Chrome")

# Wait for page to load using OCR
kronsteen.wait_for_text("Google", timeout=10)

# Click on text found by OCR
kronsteen.click_on_text("Search", match_mode="contains")

# Type like a human
kronsteen.type_text("Hello World", press_enter=True)

# Mission accomplished! 🎯

30-Second Demo

import kronsteen

# Configure for your mission
kronsteen.configure(default_timeout=20)

# Launch target application
kronsteen.launch("Chrome")
kronsteen.sleep(2)

# Use OCR to find and interact
match = kronsteen.find_text("Sign In")
print(f"Found at: {match.region.center()}")

# Click on it
kronsteen.click_on_text("Sign In")

# Type credentials
kronsteen.type_text("agent007@mi6.gov.uk")
kronsteen.press("tab")
kronsteen.type_text("martini_shaken")
kronsteen.press("enter")

🤖 AI Vision (NEW!)

Kronsteen now includes state-of-the-art computer vision powered by YOLO models!

🎯 Object Detection

Detect and interact with any object on screen:

import kronsteen

# Initialize vision client
vision = kronsteen.VisionClient(
    model="yolov8n.pt",  # Fast, lightweight model
    task="detect",
    confidence_threshold=0.5
)

# Take screenshot
screenshot = kronsteen.screenshot()

# Detect all objects
detections = vision.detect(screenshot)
for det in detections:
    print(f"{det.class_name}: {det.confidence:.2%} at {det.center}")

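# Detect specific objects
people_or_cars = vision.detect(screenshot, classes=["person", "car"])
print(f"Found {len(people_or_cars)} people or cars")

# Find and click on an object
# Note: "button" is not one of the COCO classes listed below, so this call
# needs a custom-trained model (see Custom Models) rather than yolov8n.pt
if vision.find_and_click(screenshot, "button", min_confidence=0.7):
    print("✅ Clicked on button!")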
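
Choosing which detection to act on usually comes down to filtering by class and confidence, then taking the strongest candidate. The sketch below shows that selection logic on its own. `Detection` and `best_detection` are hypothetical stand-ins that only mirror the `class_name`, `confidence`, and `center` fields printed above, and the `(x1, y1, x2, y2)` box layout is an assumption, not Kronsteen's actual data structure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Detection:
    """Hypothetical stand-in for a detection result (not Kronsteen's class)."""
    class_name: str
    confidence: float
    box: Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2)

    @property
    def center(self) -> Tuple[int, int]:
        x1, y1, x2, y2 = self.box
        return ((x1 + x2) // 2, (y1 + y2) // 2)


def best_detection(
    detections: Sequence[Detection],
    classes: Sequence[str],
    min_confidence: float = 0.5,
) -> Optional[Detection]:
    """Return the highest-confidence detection among the wanted classes."""
    candidates = [
        d for d in detections
        if d.class_name in classes and d.confidence >= min_confidence
    ]
    return max(candidates, key=lambda d: d.confidence, default=None)


sample = [
    Detection("person", 0.91, (10, 20, 110, 220)),
    Detection("car", 0.65, (300, 400, 500, 480)),
    Detection("person", 0.40, (0, 0, 50, 50)),
]
target = best_detection(sample, classes=["person"], min_confidence=0.7)
```

Returning `None` when nothing qualifies lets the caller decide whether to retry, lower the threshold, or fall back to OCR.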

✂️ Image Segmentation

Get pixel-perfect masks for objects:

# Initialize segmentation model
vision = kronsteen.VisionClient(
    model="yolov8n-seg.pt",
    task="segment"
)

# Segment objects
segments = vision.segment(screenshot, classes=["person"])
for seg in segments:
    print(f"{seg.class_name} - Mask shape: {seg.mask.shape}")
    # Use seg.mask for pixel-perfect interaction
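
One way to turn a mask into a click target is to take the centroid of its foreground pixels. A minimal sketch, assuming the mask is a 2D grid of 0/1 values; plain nested lists are used here so the example runs without extra packages, while the real `seg.mask` is presumably a NumPy array. `mask_centroid` is a hypothetical helper, not part of Kronsteen.

```python
from typing import Optional, Sequence, Tuple


def mask_centroid(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int]]:
    """Return the (x, y) centroid of non-zero pixels, or None for an empty mask."""
    total = sum_x = sum_y = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                total += 1
                sum_x += x
                sum_y += y
    if total == 0:
        return None
    return (round(sum_x / total), round(sum_y / total))


# A 3x3 block of foreground pixels inside a 5x5 mask
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
click_point = mask_centroid(mask)
```

For concave shapes the centroid can land outside the object, so it is worth checking that the mask is set at the returned point before clicking.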

🏷️ Image Classification

Classify entire screenshots or regions:

# Initialize classification model
vision = kronsteen.VisionClient(
    model="yolov8n-cls.pt",
    task="classify"
)

# Classify image (get top 3 predictions)
results = vision.classify(screenshot, top_k=3)
for result in results:
    print(f"{result.class_name}: {result.confidence:.2%}")

🎨 Custom Models (Roboflow)

Use your own trained models:

# Set your Roboflow API key
import os
os.environ["ROBOFLOW_API_KEY"] = "your_api_key"

# Load custom model
vision = kronsteen.VisionClient(
    model="workspace/project/1",  # Your Roboflow model
    task="detect"
)

# Detect custom objects
detections = vision.detect(screenshot)

⚡ Quick Vision Functions

For one-off detections:

# Quick object detection
detections = kronsteen.detect_objects(
    screenshot,
    model="yolov8n.pt",
    classes=["person", "car"],
    confidence=0.5
)

# Quick segmentation
segments = kronsteen.segment_objects(
    screenshot,
    model="yolov8n-seg.pt"
)

# Quick classification
results = kronsteen.classify_image(
    screenshot,
    model="yolov8n-cls.pt",
    top_k=3
)

🎯 Available YOLO Models

Detection Models:

  • yolov8n.pt - Nano (6MB, fastest)
  • yolov8s.pt - Small (22MB)
  • yolov8m.pt - Medium (52MB)
  • yolov8l.pt - Large (88MB)
  • yolov8x.pt - Extra Large (136MB, most accurate)

Segmentation Models:

  • yolov8n-seg.pt - Nano segmentation
  • yolov8s-seg.pt - Small segmentation
  • yolov8m-seg.pt - Medium segmentation

Classification Models:

  • yolov8n-cls.pt - Nano classification
  • yolov8s-cls.pt - Small classification

COCO Classes (80 objects): person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier, toothbrush


📚 Complete Guide

🔍 OCR Text Finding

Kronsteen's vision system can find any text on screen:

# Find text with OCR
match = kronsteen.find_text("Login")
print(f"Found '{match.text}' at {match.region.center()}")
print(f"Confidence: {match.confidence}")

# Find all text on screen
all_matches = kronsteen.find_all_text(None)
for match in all_matches:
    print(f"- {match.text}")

# Click on text
kronsteen.click_on_text("Submit", match_mode="contains")

# Wait for text to appear
kronsteen.wait_for_text("Welcome", timeout=30)

# Wait for text to disappear
kronsteen.wait_for_text_to_disappear("Loading...", timeout=10)

# Search in specific region only
match = kronsteen.find_text(
    "Button",
    region=(0, 0, 500, 500),  # 500x500 px area at the top-left corner
    min_confidence=0.8
)

Match Modes:

  • "contains" - Text contains the query (default)
  • "equals" - Exact match
  • "starts-with" - Text starts with query
  • "regex" - Regular expression match
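
The four modes can be read as four string predicates. The sketch below is one plausible interpretation, not Kronsteen's implementation: `text_matches` is a hypothetical helper, and the case-insensitive comparison is an assumption (OCR output often varies in case, but the library may compare differently).

```python
import re


def text_matches(candidate: str, query: str, match_mode: str = "contains") -> bool:
    """Check one OCR result against a query using the four modes listed above."""
    # Assumption: matching ignores case; the real behavior may differ.
    text, needle = candidate.casefold(), query.casefold()
    if match_mode == "contains":
        return needle in text
    if match_mode == "equals":
        return text == needle
    if match_mode == "starts-with":
        return text.startswith(needle)
    if match_mode == "regex":
        return re.search(query, candidate, flags=re.IGNORECASE) is not None
    raise ValueError(f"Unknown match_mode: {match_mode!r}")
```

With these rules, "Sign In to continue" matches the query "sign in" in "contains" and "starts-with" modes but not in "equals" mode.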

🖼️ Template Matching

Find and click images on screen:

# Find template image
match = kronsteen.find_template(
    "button.png",
    confidence=0.8,
    grayscale=True
)

# Wait for template to appear
match = kronsteen.wait_for_template(
    "loading_icon.png",
    timeout=10
)

# Find and click in one step
match = kronsteen.click_on_template(
    "submit_button.png",
    confidence=0.9
)
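
To make the `confidence` parameter concrete, here is a deliberately simplified template search in pure Python. It slides the template over a grayscale image and scores each position by mean absolute difference. This is a stand-in for illustration only: OpenCV-based matching (which Kronsteen lists as a dependency) typically uses normalized correlation scores, and `find_template_sad` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple

Gray = Sequence[Sequence[int]]  # rows of 0-255 grayscale values


def find_template_sad(
    image: Gray, template: Gray, confidence: float = 0.8
) -> Optional[Tuple[int, int, float]]:
    """Return (x, y, score) of the best match, or None if below confidence."""
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best: Optional[Tuple[int, int, float]] = None
    for y in range(img_h - tpl_h + 1):
        for x in range(img_w - tpl_w + 1):
            # Sum of absolute differences over the template window
            diff = sum(
                abs(image[y + dy][x + dx] - template[dy][dx])
                for dy in range(tpl_h)
                for dx in range(tpl_w)
            )
            # 1.0 means identical pixels, 0.0 means maximally different
            score = 1.0 - diff / (255 * tpl_h * tpl_w)
            if best is None or score > best[2]:
                best = (x, y, score)
    if best is not None and best[2] >= confidence:
        return best
    return None


image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 200, 250, 0],
    [0, 0, 250, 200, 0],
]
template = [[200, 250], [250, 200]]
match = find_template_sad(image, template, confidence=0.9)
```

A higher threshold rejects near misses at the cost of missing templates that render slightly differently, which is why pixel-exact templates are fragile across themes and resolutions.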

🚀 Universal Launcher

Launch apps by name—no paths needed:

# Launch by name (cross-platform)
kronsteen.launch("Chrome")    # Works everywhere
kronsteen.launch("Safari")    # macOS
kronsteen.launch("Firefox")   # All platforms
kronsteen.launch("Terminal")  # macOS/Linux

# Launch with arguments
kronsteen.launch("Chrome", args=["--incognito"])

# Find app path
path = kronsteen.find_application("Chrome")
print(f"Chrome is at: {path}")

# Close app when done
kronsteen.close_app("Chrome")
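
Launching by name generally means translating the name into a platform-specific command. The sketch below only builds the command list and does not run anything. It is an assumption about how such a launcher could work, not Kronsteen's code: `build_launch_command` is hypothetical, the Linux branch assumes the lowercase name is an executable on PATH, and mapping friendly names (for example "Chrome" to "Google Chrome") is left out.

```python
import sys
from typing import List, Optional, Sequence


def build_launch_command(
    app: str,
    args: Optional[Sequence[str]] = None,
    platform: str = sys.platform,
) -> List[str]:
    """Build (but do not run) a command that starts an app by name."""
    extra = list(args or [])
    if platform == "darwin":
        # `open -a <App>` resolves .app bundles by name; --args forwards the rest
        return ["open", "-a", app] + (["--args"] + extra if extra else [])
    if platform.startswith("win"):
        # `start` is a cmd.exe builtin; the empty string is the window title
        return ["cmd", "/c", "start", "", app] + extra
    # Assumed Linux fallback: the lowercase name is an executable on PATH
    return [app.lower()] + extra
```

The resulting list can be passed to `subprocess.Popen`; keeping command construction separate from execution makes the platform branches easy to test.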

🪟 Window Focus Monitoring

Pause automation when target window loses focus:

# Start monitoring Chrome window
monitor = kronsteen.start_window_monitoring(
    window_name="Chrome",
    check_interval=0.5  # Check every 0.5s
)

# Automation pauses if Chrome loses focus
kronsteen.click_on_text("Button")  # Pauses if Chrome not active
kronsteen.type_text("Hello")       # Resumes when Chrome regains focus

# Stop monitoring
kronsteen.stop_window_monitoring()
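
The pause-and-resume behavior can be pictured as a gate that each action passes through before it runs. A minimal sketch under stated assumptions: `wait_for_focus` is a hypothetical helper, the title lookup is injected as a callable so the example runs anywhere, and matching is a case-insensitive substring test.

```python
import time
from typing import Callable


def wait_for_focus(
    get_active_title: Callable[[], str],
    window_name: str,
    check_interval: float = 0.5,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the active window title contains window_name.

    Returns True once the window is focused, False if timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if window_name.lower() in get_active_title().lower():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(check_interval)


# Simulated focus changes: the target window becomes active on the third check
titles = iter(["Terminal", "Finder", "Google Chrome - Inbox"])
focused = wait_for_focus(lambda: next(titles), "Chrome", sleep=lambda _: None)
```

Injecting `get_active_title` and `sleep` keeps the gate testable without a real window manager.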

🖱️ Mouse Control

# Click
kronsteen.click(x=100, y=200)
kronsteen.double_click(x=100, y=200)
kronsteen.right_click(x=100, y=200)

# Move mouse
kronsteen.move_to(x=500, y=300, duration=0.5)

# Drag
kronsteen.click_and_drag(
    start_x=100, start_y=100,
    end_x=500, end_y=500,
    duration=1.0
)

# Scroll
kronsteen.scroll(clicks=5)   # Scroll down
kronsteen.scroll(clicks=-5)  # Scroll up

⌨️ Keyboard Control

# Type text
kronsteen.type_text("Hello World")
kronsteen.type_text("Search query", press_enter=True)

# Press keys
kronsteen.press("enter")
kronsteen.press("tab")
kronsteen.press("escape")

# Hotkeys (keyboard shortcuts)
kronsteen.hotkey("command", "c")  # Copy on macOS
kronsteen.hotkey("ctrl", "c")     # Copy on Windows/Linux
kronsteen.hotkey("command", "l")  # Focus browser address bar on macOS (use "ctrl" on Windows/Linux)

📸 Screenshots & Colors

# Capture full screen
img = kronsteen.screenshot()

# Capture region
img = kronsteen.screenshot(region=(0, 0, 500, 500))

# Save screenshot
kronsteen.save_screenshot("screenshot.png")

# Find color on screen
match = kronsteen.find_color(
    color=(255, 0, 0),  # RGB red
    tolerance=10
)
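
The `tolerance` parameter is easiest to understand as a per-channel distance check. The sketch below assumes exactly that (each of R, G, and B must be within `tolerance` of the target) and scans a small grid of RGB tuples; the real implementation may measure distance differently, and `find_color_in_pixels` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple

RGB = Tuple[int, int, int]


def find_color_in_pixels(
    pixels: Sequence[Sequence[RGB]], color: RGB, tolerance: int = 10
) -> Optional[Tuple[int, int]]:
    """Return (x, y) of the first pixel with every channel within tolerance."""
    for y, row in enumerate(pixels):
        for x, pixel in enumerate(row):
            if all(abs(p - c) <= tolerance for p, c in zip(pixel, color)):
                return (x, y)
    return None


pixels = [
    [(0, 0, 0), (10, 10, 10)],
    [(250, 5, 3), (255, 255, 255)],
]
red_at = find_color_in_pixels(pixels, color=(255, 0, 0), tolerance=10)
```

A small tolerance absorbs anti-aliasing and compression noise; a large one starts matching unrelated UI elements.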

📝 Logging & Configuration

# Setup logging with screenshots
kronsteen.setup_logging(
    log_dir="logs",
    enable_screenshots=True
)

# Get logger
logger = kronsteen.get_logger()
logger.info("Starting automation")

# Configure global settings
kronsteen.configure(
    default_timeout=20,
    retry_interval=0.5,
    fail_safe=True,
    default_pause=0.1
)

# Switch OCR engines
kronsteen.use_ocr_engine("tesseract")  # Fast (default)
kronsteen.use_ocr_engine("deepseek")   # Accurate (GPU)
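
Settings like `default_timeout` and `retry_interval` suggest a poll-until-deadline loop behind the `wait_for_*` calls. The following sketch shows that pattern in isolation; `wait_until` is a hypothetical helper and the exact retry semantics in Kronsteen may differ.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    probe: Callable[[], Optional[T]],
    timeout: float = 20.0,
    retry_interval: float = 0.5,
) -> T:
    """Call probe repeatedly until it returns a non-None value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        # Stop if another sleep would carry us past the deadline
        if time.monotonic() + retry_interval > deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(retry_interval)
```

A shorter `retry_interval` reacts faster but runs OCR more often, so the interval is a trade-off between latency and CPU load.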

🎬 Real-World Examples

Example 1: Web Automation

"""Automate Google search."""
import sys

import kronsteen

# Setup
kronsteen.setup_logging()
kronsteen.configure(default_timeout=25)

# Launch Chrome
kronsteen.launch("Chrome")
kronsteen.sleep(3)

# Wait for Google to load
kronsteen.wait_for_text("Google", timeout=30)

# Focus address bar and search (Cmd+L on macOS, Ctrl+L on Windows/Linux)
modifier = "command" if sys.platform == "darwin" else "ctrl"
kronsteen.hotkey(modifier, "l")
kronsteen.sleep(0.5)
kronsteen.type_text("Kronsteen automation", press_enter=True)

# Wait for results
kronsteen.sleep(3)
print("✓ Search completed!")

Example 2: Form Filling

"""Fill out a web form."""
import kronsteen

# Find and fill form fields
kronsteen.click_on_text("Email")
kronsteen.type_text("agent@mi6.gov.uk")

kronsteen.press("tab")  # Move to next field
kronsteen.type_text("SecretPassword123")

kronsteen.press("tab")
kronsteen.type_text("James Bond")

# Submit
kronsteen.click_on_text("Submit")
kronsteen.wait_for_text("Success", timeout=10)
print("✓ Form submitted!")

Example 3: Multi-Step Workflow

"""Complete multi-step automation workflow."""
import sys

import kronsteen

def automate_workflow():
    # Setup
    kronsteen.setup_logging(enable_screenshots=True)
    logger = kronsteen.get_logger()

    try:
        # Step 1: Launch application
        logger.info("Step 1: Launching application")
        kronsteen.launch("Chrome")
        kronsteen.sleep(2)

        # Step 2: Navigate (Cmd+L on macOS, Ctrl+L on Windows/Linux)
        logger.info("Step 2: Navigating to site")
        modifier = "command" if sys.platform == "darwin" else "ctrl"
        kronsteen.hotkey(modifier, "l")
        kronsteen.type_text("https://example.com", press_enter=True)
        
        # Step 3: Wait for page load
        logger.info("Step 3: Waiting for page load")
        kronsteen.wait_for_text("Welcome", timeout=30)
        
        # Step 4: Interact with UI
        logger.info("Step 4: Clicking login")
        kronsteen.click_on_text("Login")
        
        # Step 5: Fill credentials
        logger.info("Step 5: Entering credentials")
        kronsteen.type_text("username")
        kronsteen.press("tab")
        kronsteen.type_text("password")
        kronsteen.press("enter")
        
        # Step 6: Verify success
        logger.info("Step 6: Verifying login")
        kronsteen.wait_for_text("Dashboard", timeout=20)
        
        logger.info("✓ Workflow completed successfully!")
        return True
        
    except Exception as e:
        logger.error(f"✗ Workflow failed: {e}")
        return False
    
    finally:
        # Cleanup
        kronsteen.close_app("Chrome")

if __name__ == "__main__":
    success = automate_workflow()
    raise SystemExit(0 if success else 1)

🌍 Platform Support

macOS

  • Retina Display Support - Automatic coordinate scaling
  • Universal Launcher - .app bundle detection
  • Spotlight Integration - Fallback app search
  • AppleScript Support - Window management

Windows

  • Program Files Search - Auto-detect installed apps
  • System PATH - Command-line app support
  • Registry Integration - Browser detection
  • PowerShell Support - Window management

Linux

  • Standard Directories - /usr/bin, /usr/local/bin
  • Snap/Flatpak - Modern package format support
  • Desktop Files - .desktop file integration
  • xdotool/wmctrl - Window management

⚡ Performance

Feature | Speed | Notes
--- | --- | ---
Tesseract OCR | ~100ms | Fast, CPU-based
DeepSeek OCR | ~500ms (GPU) / ~5s (CPU) | Accurate, GPU recommended
Screenshot | ~10ms | Instant capture
Template Match | ~50-200ms | Depends on image size
Mouse/Keyboard | Instant | PyAutoGUI
App Launch | ~1-3s | Platform dependent

Optimization Tips

  • ✅ Use Tesseract for speed (default)
  • ✅ Use DeepSeek for accuracy (GPU recommended; runs on CPU at roughly 5s per screen)
  • ✅ Specify regions to limit search area
  • ✅ Use template matching for repeated UI elements
  • ✅ Enable window monitoring to prevent errors
  • ✅ Cache app paths for faster launches
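
The last tip, caching app paths, can be done with the standard library alone. In this sketch `shutil.which` stands in for whatever platform-specific search the launcher performs, and `cached_app_path` is a hypothetical helper rather than a Kronsteen function.

```python
import shutil
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=None)
def cached_app_path(executable: str) -> Optional[str]:
    """Resolve an executable on PATH once, then serve repeats from the cache."""
    # shutil.which stands in for a slower platform-specific search
    return shutil.which(executable)
```

Note that a failed lookup (`None`) is cached too, so call `cached_app_path.cache_clear()` after installing an application mid-run.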

🔧 Troubleshooting

Tesseract Not Found

The tesseract package should bundle the binary automatically. If you still get errors:

Option 1: Reinstall

pip uninstall kronsteen tesseract pytesseract
pip install kronsteen

Option 2: System Installation (fallback)

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

Verify:

import pytesseract
print(pytesseract.get_tesseract_version())
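
If the `pytesseract` check itself fails, it helps to test for the binary directly. A small sketch using only the standard library; `tesseract_version` is a hypothetical helper, and it assumes the executable is named `tesseract` and is expected on PATH.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    binary = shutil.which("tesseract")
    if binary is None:
        return None
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
print(version or "Tesseract binary not found on PATH")
```

A `None` result points to a missing or unlinked system install rather than a Python packaging problem.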

Text Not Found

# Lower confidence threshold
match = kronsteen.find_text("text", min_confidence=0.5)

# Use different match mode
match = kronsteen.find_text("text", match_mode="contains")

# Search in specific region
match = kronsteen.find_text("text", region=(0, 0, 500, 500))

# Try different OCR engine
kronsteen.use_ocr_engine("deepseek")  # More accurate

Retina Display Issues

Kronsteen automatically handles Retina scaling. To verify:

from kronsteen.ocr_tesseract import TesseractOCRClient
ocr = TesseractOCRClient()
print(f"Scale factor: {ocr.scale_factor}")  # Should be 2.0 on Retina
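
On a Retina display a screenshot has more pixels than the screen has logical points, so positions found in the image have to be divided by the scale factor before clicking. A minimal sketch of that conversion; `to_screen_point` is a hypothetical helper, not Kronsteen's internal function.

```python
from typing import Tuple


def to_screen_point(
    pixel_x: float, pixel_y: float, scale_factor: float = 2.0
) -> Tuple[int, int]:
    """Convert screenshot pixel coordinates to logical screen coordinates."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return (round(pixel_x / scale_factor), round(pixel_y / scale_factor))


# Text found at pixel (1000, 600) in a 2x screenshot is clicked at point (500, 300)
point = to_screen_point(1000, 600, scale_factor=2.0)
```

With a scale factor of 1.0 the coordinates pass through unchanged, so the same code path works on non-Retina screens.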

Different Screen Resolutions

How Kronsteen handles different screens:

Works automatically:

  • Different screen sizes (1920x1080, 2560x1440, 4K, etc.)
  • Retina vs non-Retina displays
  • Multiple monitors (uses active screen)
  • Dynamic resolution changes

How it works:

  1. OCR reads text from current screen in real-time
  2. Coordinates are relative to current screen size
  3. No hardcoded positions - everything is dynamic

Example:

# This works on ANY screen resolution
kronsteen.click_on_text("Login")  # Finds "Login" wherever it is

# Screen size is detected automatically
width, height = kronsteen.get_screen_size()
print(f"Your screen: {width}x{height}")
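
Hard-coded regions such as `(0, 0, 500, 500)` are one place where a fixed resolution can leak back in. One way around that is to express regions as fractions of the screen and convert them at runtime. This sketch assumes the region tuple is `(x, y, width, height)`, which this document does not confirm; `relative_region` is a hypothetical helper.

```python
from typing import Tuple


def relative_region(
    screen_size: Tuple[int, int],
    left: float,
    top: float,
    width: float,
    height: float,
) -> Tuple[int, int, int, int]:
    """Scale fractional bounds (0.0-1.0) to a pixel region for this screen."""
    for value in (left, top, width, height):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0.0 and 1.0")
    screen_w, screen_h = screen_size
    # Assumed region layout: (x, y, width, height)
    return (
        int(left * screen_w),
        int(top * screen_h),
        int(width * screen_w),
        int(height * screen_h),
    )


# Top-left quadrant of a 1920x1080 screen
quadrant = relative_region((1920, 1080), 0.0, 0.0, 0.5, 0.5)
```

The result could then be passed as `region=` together with the size reported by `kronsteen.get_screen_size()`.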

⚠️ Limitation: Template Matching

Pre-captured template images may not match on different resolutions. Solution:

# Use OCR instead of templates for cross-resolution compatibility
kronsteen.click_on_text("Button")  # ✅ Works on any resolution

# Or capture templates at runtime
template = kronsteen.screenshot(region=(100, 100, 200, 150))
kronsteen.click_on_template(template)  # ✅ Works

Window Focus Not Working

# Check if window name is correct
active = kronsteen.get_active_window_title()
print(f"Active window: {active}")

# Use partial match
kronsteen.start_window_monitoring("Chrome", partial_match=True)

📁 Examples

Check out the examples/ directory for complete working examples:

  • example.py - Google search automation with window monitoring and OCR

🎓 Why Kronsteen?

The SPECTRE Connection

Named after Kronsteen, the master strategist from Ian Fleming's From Russia with Love, who appears as SPECTRE's No. 5 in the film adaptation. Like the chess grandmaster who planned the perfect operation, this framework executes automation with precision and intelligence.

"The plan is perfect. The execution will be flawless." - Kronsteen

Why This Framework?

  • 🎯 No API Required - Works with any application
  • 🧠 Vision-Based - Sees the UI like a human
  • 🚀 Fast Development - Write automation in minutes
  • 🛡️ Reliable - Built-in error handling and retries
  • 🌍 Universal - One codebase, all platforms
  • 📚 Well-Documented - Clear examples and guides

Why Tesseract OCR?

  • Fast - ~100ms per screenshot
  • Accurate - In development since 1985 and open source since 2005
  • Portable - Bundle binary with your app
  • Multi-language - Supports 100+ languages
  • Lightweight - ~10MB binary + language data
  • Free - Open source, Apache License 2.0
  • Battle-tested - Used by Google, Microsoft, and more

🤝 Contributing

Contributions are welcome! Whether it's:

  • 🐛 Bug reports
  • 💡 Feature requests
  • 📝 Documentation improvements
  • 🔧 Code contributions

Please feel free to open issues and pull requests.


📄 License

MIT License - see LICENSE file for details.


🙏 Credits

Inspired By:

  • Ian Fleming's From Russia with Love
  • SPECTRE's master planner, Kronsteen
  • The need for intelligent, vision-based automation

"The perfect automation is invisible"

Made with ❤️ by Roman Klym

Star ⭐ this repo if you find it useful!

