Open World Agents

Everything you need to build a state-of-the-art foundation multimodal desktop agent, end-to-end.

License: MIT | Python 3.11+

⚠️ Active Development Notice: This codebase is under active development. APIs and components may change, and some may be moved to separate repositories. Documentation may be incomplete or reference features still in development.

📄 Research Paper: This project was first introduced and developed for the D2E project. For more details, see D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI. If you find this work useful, please cite our paper.

🚀 Quick Start: Record → Train in 3 Steps

# 1. Record desktop interaction
$ ocap my-session.mcap

# 2. Process to training format
$ python scripts/01_raw_events_to_event_dataset.py --train-dir ./

# 3. Train your model
$ python train.py --dataset ./event-dataset

📖 Detailed Guide: Complete Quick Start Tutorial - Step-by-step walkthrough with examples and troubleshooting

Overview

Open World Agents is a comprehensive framework for building AI agents that interact with desktop applications through vision, keyboard, and mouse control. It provides a complete toolkit, from data capture to model training and evaluation:

  • Environment Framework: "USB-C of desktop agents" - a universal interface for native desktop automation, with pre-built plugins for desktop control, high-performance screen capture (about 6x faster than common Python capture libraries in the project's benchmark), and a zero-configuration plugin system
  • 📊 Data Infrastructure: Complete desktop agent data pipeline from recording to training with the OWAMcap format - a universal standard powered by mcap
  • 🛠️ CLI Tools: Command-line utilities (owl) for recording, analyzing, and managing agent data
  • 🤖 Examples: Complete implementations and training pipelines for multimodal agents

Why OWA?

Fragmented tools make desktop AI development painful. Most solutions force you to:

  • Stitch together incompatible recording tools
  • Build custom data pipelines from scratch
  • Handle real-time performance issues yourself
  • Start agent development with no examples

OWA solves this with a unified framework: record with ocap, train with standardized datasets, deploy with real-time environment components, and learn from community examples.

What Can You Build?

Anything that runs on a desktop. If a human can do it on a computer, you can build an AI agent to automate it.

🤖 Desktop Automation: Navigate applications, automate workflows, interact with any software
🎮 Game AI: Master complex games through visual understanding and real-time decision making
📊 Training Datasets: Capture high-quality human-computer interaction data for foundation models
🤗 Community Datasets: Access and contribute to growing OWAMcap datasets on HuggingFace
📈 Benchmarks: Create and evaluate desktop agent performance across diverse tasks

Project Structure

The repository is organized as a monorepo with multiple sub-repositories under the projects/ directory. Each sub-repository is a self-contained Python package installable via pip or uv and follows namespace packaging conventions.

open-world-agents/
├── projects/
│   ├── mcap-owa-support/     # OWAMcap format support
│   ├── owa-core/             # Core framework and registry system
│   ├── owa-msgs/             # Core message definitions with automatic discovery
│   ├── owa-cli/              # Command-line tools (ocap, owl)
│   ├── owa-env-desktop/      # Desktop environment plugin
│   ├── owa-env-example/      # Example environment implementations
│   ├── owa-env-gst/          # GStreamer-based screen capture
│   └── [your-plugin]/        # Contribute your own plugins!
├── docs/                     # Documentation
└── README.md

Core Packages

The easiest way to get started is to install the owa meta-package, which includes all core components and environment plugins:

pip install owa

All OWA packages use namespace packaging and are installed in the owa namespace (e.g., owa.core, owa.cli, owa.env.desktop). For more detail, see Packaging namespace packages. We recommend using uv as the package manager.

| Name | PyPI | Conda | Description |
|---|---|---|---|
| owa.core | owa-core | owa-core | Framework foundation with registry system |
| owa.msgs | owa-msgs | owa-msgs | Core message definitions with automatic discovery |
| owa.cli | owa-cli | owa-cli | Command-line tools (owl) for data analysis |
| mcap-owa-support | mcap-owa-support | mcap-owa-support | OWAMcap format support and utilities |
| ocap 🎥 | ocap | ocap | Desktop recorder for multimodal data capture |
| owa.env.desktop | owa-env-desktop | owa-env-desktop | Mouse, keyboard, window event handling |
| owa.env.gst 🎥 | owa-env-gst | owa-env-gst | GStreamer-powered screen capture (6x faster) |
| owa.env.example | - | - | Reference implementations for learning |

🎥 Video Processing Packages: Packages marked with 🎥 require GStreamer dependencies. Run conda install open-world-agents::gstreamer-bundle first for full functionality.

📦 Lockstep Versioning: All first-party OWA packages follow lockstep versioning, meaning they share the same version number to ensure compatibility and simplify dependency management.

💡 Extensible Design: Built for the community! Easily create custom plugins like owa-env-minecraft or owa-env-web to extend functionality.
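
As a rough illustration of the lockstep versioning note above, the following sketch checks whether the OWA distributions installed in the current environment all report the same version. It relies only on the standard library's importlib.metadata; the helper names installed_versions and is_lockstep are hypothetical, and the list of distributions is taken from the PyPI column of the table above (which of them count as first-party is an assumption here).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names from the package table above (assumed to be the lockstep set).
OWA_PACKAGES = [
    "owa-core", "owa-msgs", "owa-cli", "mcap-owa-support",
    "ocap", "owa-env-desktop", "owa-env-gst",
]


def installed_versions(packages):
    """Map each installed distribution to its version; skip missing ones."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass  # not installed in this environment
    return found


def is_lockstep(versions):
    """True when every listed distribution reports the same version."""
    return len(set(versions.values())) <= 1


versions = installed_versions(OWA_PACKAGES)
print(versions)
print("lockstep:", is_lockstep(versions))
```

An environment with none of the packages installed trivially counts as lockstep in this sketch, since there is nothing to disagree.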

Community Packages

Help us grow the ecosystem! 🌱 Community-contributed environment plugins extend OWA's capabilities to specialized domains.

Example plugin ideas from the community:

| Example Name | Description |
|---|---|
| owa.env.minecraft | Minecraft automation & bot framework |
| owa.env.web | Browser automation via WebDriver |
| owa.env.mobile | Android/iOS device control |
| owa.env.cad | CAD software automation (AutoCAD, SolidWorks) |
| owa.env.trading | Financial trading platform integration |

💡 Want to contribute? Check our Plugin Development Guide to create your own owa.env.* package!

💭 These are just examples! The community decides what plugins to build. Propose your own ideas or create plugins for any domain you're passionate about.

Desktop Recording with ocap

ocap (Omnimodal CAPture) is a high-performance desktop recorder that captures screen video, audio, keyboard/mouse events, and window events in synchronized formats. Built with Windows APIs and GStreamer for hardware-accelerated recording with H265/HEVC encoding.

  • Complete recording: Video + audio + keyboard/mouse + window events
  • High performance: Hardware-accelerated, ~100MB/min for 1080p
  • Simple usage: ocap my-recording (stop with Ctrl+C)
  • Modern formats: OWAMcap with flexible MediaRef system (supports MKV, images, URLs, embedded data)
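
For capacity planning, the ~100MB/min figure above can be converted into an average bitrate and a disk-space estimate. The sketch below is plain arithmetic under stated assumptions: decimal units (1 MB = 10^6 bytes, 1 GB = 1000 MB) and a constant data rate. The function names are illustrative and not part of ocap; real sizes will vary with screen content, resolution, and encoder settings.

```python
MB_PER_MIN_1080P = 100  # approximate figure quoted above for 1080p


def average_bitrate_mbps(mb_per_min: float = MB_PER_MIN_1080P) -> float:
    """Average bitrate in megabits per second for a given MB/min rate."""
    return mb_per_min * 8 / 60


def estimated_size_gb(minutes: float, mb_per_min: float = MB_PER_MIN_1080P) -> float:
    """Estimated recording size in gigabytes (1 GB = 1000 MB)."""
    return minutes * mb_per_min / 1000


print(f"{average_bitrate_mbps():.1f} Mbit/s")       # about 13.3 Mbit/s
print(f"{estimated_size_gb(60):.1f} GB per hour")   # about 6 GB per hour
```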

📖 Detailed Documentation: See Desktop Recording Guide for complete setup, usage examples, and troubleshooting.

Quick Start

Environment Usage: Three Types of Components

OWA's Environment provides three types of components for real-time agent interaction:

Callables - Direct function calls for immediate actions

from owa.core import CALLABLES
# Components automatically available - zero configuration!

# Get current time, capture screen, click mouse
current_time = CALLABLES["std/time_ns"]()
screen = CALLABLES["desktop/screen.capture"]()
CALLABLES["desktop/mouse.click"]("left", 2)  # Double-click
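
The lookups above follow a familiar pattern: a dictionary-like registry maps string names such as "namespace/name" to callables. The following is a minimal, self-contained sketch of that idea, not OWA's actual implementation. The Registry class, its register decorator, DEMO_CALLABLES, and the demo/add component are all hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Dict


class Registry:
    """Hypothetical name -> callable registry (illustration only)."""

    def __init__(self) -> None:
        self._items: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str):
        """Return a decorator that stores a function under the given name."""
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._items:
                raise KeyError(f"{name!r} is already registered")
            self._items[name] = func
            return func
        return decorator

    def __getitem__(self, name: str) -> Callable[..., Any]:
        return self._items[name]

    def __contains__(self, name: str) -> bool:
        return name in self._items


DEMO_CALLABLES = Registry()


@DEMO_CALLABLES.register("std/time_ns")
def time_ns() -> int:
    """Current wall-clock time in nanoseconds."""
    return time.time_ns()


@DEMO_CALLABLES.register("demo/add")
def add(a: int, b: int) -> int:
    return a + b


# Look up a component by name, then call it like a normal function.
print(DEMO_CALLABLES["demo/add"](2, 3))
print(DEMO_CALLABLES["std/time_ns"]())
```

Keeping lookup by name separate from the implementation is what lets a plugin add components without the calling code importing the plugin module directly.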

Listeners - Event monitoring with user-defined callbacks

from owa.core import CALLABLES, LISTENERS  # CALLABLES is used in on_tick below
import time

# Monitor keyboard events
def on_key(event):
    print(f"Key pressed: {event.vk}")

listener = LISTENERS["desktop/keyboard"]().configure(callback=on_key)
with listener.session:
    input("Press Enter to stop...")

# Periodic tasks
def on_tick():
    print(f"Tick: {CALLABLES['std/time_ns']()}")

with LISTENERS["std/tick"]().configure(callback=on_tick, interval=1).session:
    time.sleep(3)  # Prints every second for 3 seconds
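
To show what a listener with a session context might look like internally, here is a small stand-alone sketch of a periodic tick listener built on a background thread. It is an assumption-laden illustration, not OWA's code: TickListener and its session property are hypothetical, and they only mimic the shape of the usage above (a callback invoked while the with block is active, stopped on exit).

```python
import threading
from contextlib import contextmanager


class TickListener:
    """Hypothetical periodic listener (illustration only)."""

    def __init__(self, callback, interval: float) -> None:
        self._callback = callback
        self._interval = interval
        self._stop = threading.Event()
        self._thread = None

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the callback fires once per
        # interval until stop is requested.
        while not self._stop.wait(self._interval):
            self._callback()

    @property
    @contextmanager
    def session(self):
        """Start the worker thread on entry and stop it on exit."""
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        try:
            yield self
        finally:
            self._stop.set()
            self._thread.join()


ticks = []
done = threading.Event()


def on_tick() -> None:
    ticks.append(len(ticks))
    if len(ticks) >= 3:
        done.set()


with TickListener(on_tick, interval=0.01).session:
    done.wait(timeout=5)  # block until three ticks arrive, or give up after 5 s

print(f"received {len(ticks)} ticks")
```

The callback runs on the worker thread, so anything it touches should be safe to use from another thread.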

Runnables - Background processes that can be started/stopped

from owa.core import RUNNABLES

# Periodic screen capture
capture = RUNNABLES["gst/screen_capture"]().configure(fps=60)
with capture.session:
    frame = capture.grab()

Message Types - Access structured message definitions

from owa.core import MESSAGES

# Message types automatically available
KeyboardEvent = MESSAGES["desktop/KeyboardEvent"]
ScreenCaptured = MESSAGES["desktop/ScreenCaptured"]

High-Performance Screen Capture

import time
from owa.core import CALLABLES, LISTENERS, MESSAGES

# Components and messages automatically available - no activation needed!

def on_screen_update(frame, metrics):
    print(f"📸 New frame: {frame.frame_arr.shape}")
    print(f"⚡ Latency: {metrics.latency*1000:.1f}ms")

    # Access screen message type from registry
    ScreenCaptured = MESSAGES['desktop/ScreenCaptured']
    print(f"Frame message type: {ScreenCaptured}")

# Start real-time screen capture
screen = LISTENERS["gst/screen"]().configure(
    callback=on_screen_update, fps=60, show_cursor=True
)

with screen.session:
    print("🎯 Agent is watching your screen...")
    time.sleep(5)

Powered by GStreamer and the Windows API, our implementation is about 6x faster than the alternatives in the table below.

| Library | Avg. Time per Frame | Relative Speed |
|---|---|---|
| owa.env.gst | 5.7 ms | ⚡ 1× (Fastest) |
| pyscreenshot | 33 ms | 🚶‍♂️ 5.8× slower |
| PIL | 34 ms | 🚶‍♂️ 6.0× slower |
| MSS | 37 ms | 🚶‍♂️ 6.5× slower |
| PyQt5 | 137 ms | 🐢 24× slower |

📌 Tested on: Intel i5-11400, GTX 1650
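
The relative-speed column can be reproduced from the average times per frame: each library's time is divided by the owa.env.gst baseline. The sketch below does that arithmetic and also derives the frame rate each time would allow if frames were captured back to back, which is an assumption about the measurement rather than a reported number.

```python
# Average time per frame in milliseconds, copied from the table above.
avg_ms_per_frame = {
    "owa.env.gst": 5.7,
    "pyscreenshot": 33,
    "PIL": 34,
    "MSS": 37,
    "PyQt5": 137,
}
baseline = avg_ms_per_frame["owa.env.gst"]


def slowdown(library: str) -> float:
    """How many times slower a library is than the baseline."""
    return avg_ms_per_frame[library] / baseline


def frames_per_second(library: str) -> float:
    """Upper bound on FPS implied by the per-frame time."""
    return 1000 / avg_ms_per_frame[library]


for name in avg_ms_per_frame:
    print(f"{name:<13} {slowdown(name):5.1f}x  {frames_per_second(name):6.1f} FPS")
```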

Beyond achieving higher FPS, owa.env.gst also maintains lower CPU/GPU usage, making it a strong choice for screen recording. The same applies to ocap, since it uses owa.env.gst internally.

📊 See detailed benchmarks and methodology →

Desktop Recording & Dataset Sharing

Record your desktop usage data and share with the community:

# Install GStreamer dependencies (for video recording) and ocap
conda install open-world-agents::gstreamer-bundle && pip install ocap

# Record desktop activity (includes video, audio, events)
ocap my-session

# Upload to HuggingFace, browse community datasets!
# Visit: https://huggingface.co/datasets?other=OWA

🤗 Community Datasets: Democratizing Desktop Agent Data

Browse Available Datasets: 🤗 datasets?other=OWA

  • Growing Collection: Hundreds of community-contributed datasets
  • Standardized Format: All use OWAMcap for seamless integration
  • Interactive Preview: Hugging Face Spaces Visualizer
  • Easy Sharing: Upload recordings directly with one command

🚀 Impact: OWA has democratized desktop agent data, growing from zero to hundreds of public datasets in the unified OWAMcap format.

Access Community Datasets:

# Load datasets from HuggingFace
from owa.data import load_dataset

# Browse available OWAMcap datasets
datasets = load_dataset.list_available(format="OWA")

# Load a specific dataset
data = load_dataset("open-world-agents/example_dataset")

Data Format Preview

OWAMcap combines the robustness of MCAP with specialized desktop interaction schemas. It synchronizes screen captures, input events, and window context using nanosecond-precision timestamps.
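
One practical consequence of timestamping every channel with nanosecond precision is that an input event can be matched to the frame that was on screen when it occurred. The sketch below shows that lookup as a binary search over sorted frame timestamps. The timestamps are made-up values and latest_frame_at is a hypothetical helper, not part of mcap-owa-support.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def latest_frame_at(frame_ts_ns: Sequence[int], event_ts_ns: int) -> Optional[int]:
    """Index of the last frame captured at or before the event, or None."""
    i = bisect_right(frame_ts_ns, event_ts_ns)
    return i - 1 if i > 0 else None


# Made-up frame timestamps about 16.7 ms apart (roughly 60 FPS), in nanoseconds.
frames = [0, 16_700_000, 33_400_000, 50_100_000]

print(latest_frame_at(frames, 20_000_000))  # event at 20 ms falls in frame 1
print(latest_frame_at(frames, -1))          # event before the first frame: None
```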

Key Features:

  • 🔄 Universal Standard: Unlike fragmented formats, enables seamless dataset combination for large-scale foundation models (OWAMcap)
  • 🎯 High-Performance Multimodal Storage: Lightweight MCAP container with nanosecond precision for synchronized data streams (MCAP)
  • 🔗 Flexible MediaRef: Smart references to both external and embedded media (file paths, URLs, data URIs, video frames) with lazy loading - keeps metadata files small while supporting rich media (OWAMcap) → Learn more
  • 🤗 Training Pipeline Ready: Native HuggingFace integration, seamless dataset loading, and direct compatibility with ML frameworks (Ecosystem) → Browse datasets | Data pipeline
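
To make the kinds of media reference listed above more concrete, here is a small sketch that classifies a reference string as embedded (data URI), remote (URL), or local (file path), and optionally carries a position within a video. This is an illustration only: the DemoMediaRef class and its fields are hypothetical and are not the MediaRef API that ships with OWA.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class DemoMediaRef:
    """Hypothetical media reference (illustration only)."""

    uri: str
    pts_ns: Optional[int] = None  # position within a video, if any

    @property
    def kind(self) -> str:
        scheme = urlparse(self.uri).scheme.lower()
        if scheme == "data":
            return "embedded"
        if scheme in ("http", "https"):
            return "remote"
        return "local"  # plain paths, including Windows drive letters

    @property
    def is_video_frame(self) -> bool:
        return self.pts_ns is not None


refs = [
    DemoMediaRef("data:image/png;base64,iVBORw0KGgo="),
    DemoMediaRef("https://example.com/frame.png"),
    DemoMediaRef("session.mkv", pts_ns=1_500_000_000),
]
for ref in refs:
    print(ref.kind, ref.is_video_frame)
```

Because only the reference is stored, the message stays small; the media itself would be read when something actually asks for the pixels.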

📖 Learn More: Why OWAMcap? | Complete Format Guide | vs Other Formats

$ owl mcap info example.mcap
library:   mcap-owa-support 0.5.1; mcap 1.3.0
profile:   owa
messages:  864
duration:  10.3574349s
start:     2025-06-27T18:49:52.129876+09:00 (1751017792.129876000)
end:       2025-06-27T18:50:02.4873109+09:00 (1751017802.487310900)
compression:
        zstd: [1/1 chunks] [116.46 KiB/16.61 KiB (85.74%)] [1.60 KiB/sec]
channels:
        (1) window           11 msgs (1.06 Hz)    : desktop/WindowInfo [jsonschema]
        (2) keyboard/state   11 msgs (1.06 Hz)    : desktop/KeyboardState [jsonschema]
        (3) mouse/state      11 msgs (1.06 Hz)    : desktop/MouseState [jsonschema]
        (4) screen          590 msgs (56.96 Hz)   : desktop/ScreenCaptured [jsonschema]
        (5) mouse           209 msgs (20.18 Hz)   : desktop/MouseEvent [jsonschema]
        (6) keyboard         32 msgs (3.09 Hz)    : desktop/KeyboardEvent [jsonschema]
channels: 6
attachments: 0
metadata: 0
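
The figures in this output are internally consistent, and checking them is a quick way to learn how to read it. The sketch below recomputes the total message count, the per-channel rates (messages divided by duration), and the compression percentage from the numbers shown. Reading the percentage as space saved is an interpretation that happens to match the arithmetic, not something the output states.

```python
# Values copied from the `owl mcap info` output above.
duration_s = 10.3574349
channel_msgs = {
    "window": 11,
    "keyboard/state": 11,
    "mouse/state": 11,
    "screen": 590,
    "mouse": 209,
    "keyboard": 32,
}

total_msgs = sum(channel_msgs.values())
rates_hz = {name: count / duration_s for name, count in channel_msgs.items()}

uncompressed_kib, compressed_kib = 116.46, 16.61
saved_pct = (1 - compressed_kib / uncompressed_kib) * 100

print(f"messages: {total_msgs}")
for name, rate in rates_hz.items():
    print(f"{name:<15} {rate:6.2f} Hz")
print(f"space saved by zstd: {saved_pct:.2f}%")
```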

🛠️ CLI Tools (owl)

# Data analysis
owl mcap info session.mcap              # File overview & statistics
owl mcap cat session.mcap --n 10        # View messages
owl video probe session.mkv             # Video analysis

# Environment management
owl env list                            # List plugins
owl env search "mouse.*click"           # Search components
owl messages show desktop/KeyboardEvent # View schemas

💡 Complete CLI Reference: For detailed information about all CLI commands and options, see the CLI Tools documentation.

Installation

Quick Start

# Install all OWA packages
pip install owa

# For video recording/processing, install GStreamer dependencies first:
conda install open-world-agents::gstreamer-bundle
pip install owa

💡 When do you need GStreamer?

  • Video recording with ocap desktop recorder
  • Real-time screen capture with owa.env.gst
  • Video processing capabilities

Skip GStreamer if you only need:

  • Data processing and analysis
  • ML training on existing datasets
  • Headless server environments

Editable Install (Development)

For development or contributing to the project, you can install packages in editable mode. For detailed development setup instructions, see the Installation Guide.

Features

Environment Framework: "USB-C of Desktop Agents"

  • ⚡ Real-time Performance: Optimized for responsive agent interactions (GStreamer components achieve <30ms latency)
  • 🔌 Zero-Configuration: Automatic plugin discovery via Python Entry Points
  • Event-Driven: Asynchronous processing that mirrors real-world dynamics
  • 🧩 Extensible: Community-driven plugin ecosystem
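
Discovery via Python entry points generally works by asking the installed-package metadata for every entry registered under a named group. The sketch below shows that mechanism with the standard library's importlib.metadata. The discover_plugins helper is a hypothetical name, and the group OWA itself uses is not stated in this document, so the example queries the standard console_scripts group instead.

```python
from importlib.metadata import entry_points


def discover_plugins(group: str) -> dict:
    """Map entry point names to their 'module:attr' targets for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        selected = eps.select(group=group)  # Python 3.10+
    else:
        selected = eps.get(group, [])       # older dict-style API
    return {ep.name: ep.value for ep in selected}


# "console_scripts" is a standard group; a plugin system would use its own name.
scripts = discover_plugins("console_scripts")
print(f"found {len(scripts)} console scripts")
print(discover_plugins("no.such.group.for.demo"))  # unknown group: empty dict
```

Since the metadata is written at install time, installing a package is enough to make its components discoverable; no registration call is needed in user code.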

→ View Environment Framework Guide

📊 Data Infrastructure: Complete Pipeline

  • 🔄 Universal Standard: Unlike fragmented formats, enables seamless dataset combination for large-scale foundation models (OWAMcap)
  • 🎯 High-Performance Multimodal Storage: Lightweight MCAP container with nanosecond precision for synchronized data streams (MCAP)
  • 🔗 Flexible MediaRef: Smart references to both external and embedded media (file paths, URLs, data URIs, video frames) with lazy loading - keeps metadata files small while supporting rich media (OWAMcap) → Learn more
  • 🤗 Training Pipeline Ready: Native HuggingFace integration, seamless dataset loading, and direct compatibility with ML frameworks (Ecosystem) → Browse datasets | Data pipeline

→ View Data Infrastructure Guide

🤗 Community & Ecosystem

  • 🌱 Growing Ecosystem: Hundreds of community datasets in unified OWAMcap format
  • 🤗 HuggingFace Integration: Native dataset loading, sharing, and interactive preview tools
  • 🧩 Extensible Architecture: Modular design for custom environments, plugins, and message types
  • 💡 Community-Driven: Plugin ecosystem spanning gaming, web automation, mobile control, and specialized domains

→ View Community Datasets

Documentation

Contributing

We welcome contributions! Whether you're:

  • Building new environment plugins
  • Improving performance
  • Adding documentation
  • Reporting bugs

Please see our Contributing Guide for details.

License

This project is released under the MIT License. See the LICENSE file for details.

Citation

If you find this work useful, please cite our paper:

@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suwhan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
