
DeltaVision-OS

Delta-first computer-use agent framework for OS-level and OSWorld environments.

Sibling to deltavision, which targets browsers via Playwright. This project extends the DeltaVision observation middleware to the full desktop: any native application, any OS task, any OSWorld VM benchmark.

Status

  • 238 tests passing (229 offline, 9 need a real display)
  • Live V2 E2E: real Qwen2.5-VL on an RTX 5080 via SSH tunnel, driving a real Mac desktop through the agent loop — 5 steps, 4.6% median diff, 56% hypothetical token savings vs full-frame. Video at benchmarks/v2_live_demo.mp4.
  • 4 benchmarks checked in: idle-desktop observation, pipeline perf, classifier sensitivity sweep, and the live-demo recorder.
  • OSWorld integration still stubbed.

Scope

                    deltavision (V1)                            deltavision-os (V2, this repo)
Observation source  Playwright screenshots                      mss OS-level capture, OSWorld VM frames
Action space        click, type, scroll, key, wait              + drag, double-click, right-click, hotkey
Eval targets        Wikipedia, TodoMVC, GitHub,                 OSWorld 369-task suite
                    classifier sites
Model backends      Claude, OpenAI, Ollama                      + llama.cpp / OpenAI-compat server
                                                                (Qwen2.5-VL verified, MAI-UI-8B /
                                                                Qwen3-VL targeted)
Dependencies        Playwright + 5 pip packages                 + mss, pyautogui, OSWorld harness
Status              Frozen @ paper artifact                     Active development

If you want browser automation, use V1. If you want desktop / OS / OSWorld, use this.

Quick Start

git clone https://github.com/ddavidgao/deltavision-os.git
cd deltavision-os
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# 238 tests (9 need a real display; those skip on CI)
pytest tests/ -q

# Live desktop benchmark (pure CV, no model needed)
python benchmarks/desktop_idle_observe.py --rounds 5 --interval 0.5

See TESTS.md for a full breakdown of what each test covers.

Expected benchmark output on a quiet desktop:

step   1  delta     diff=0.000  phash= 0  anchor=1.00  trigger=none
step   2  delta     diff=0.000  phash= 0  anchor=1.00  trigger=none
step   3  delta     diff=0.000  phash= 0  anchor=1.00  trigger=none
step   4  delta     diff=0.000  phash= 0  anchor=1.00  trigger=none
step   5  delta     diff=0.000  phash= 0  anchor=1.00  trigger=none

Observed 5 steps in 3.1s
DELTA:     5 (100.0%)
NEW_PAGE:  0

Token savings if paired with a VLM at 1600 tok/full_frame, ~400 tok/delta:
  Full frame every step:  8,000 tokens
  DeltaVision gated:      2,000 tokens (6,000 saved)
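The savings arithmetic above can be reproduced in a few lines. This is a sketch of the accounting only, using the per-frame costs quoted in the output (1600 tokens per full frame, ~400 per delta); `gated_cost` is an illustrative helper, not a function from the repo:

```python
# Hypothetical token accounting for the 5-step idle-desktop run above.
FULL_FRAME_TOKENS = 1600
DELTA_TOKENS = 400

def gated_cost(steps: int, deltas: int) -> int:
    """Cost when DELTA steps send a thumbnail+crops instead of a full frame."""
    full_frames = steps - deltas
    return full_frames * FULL_FRAME_TOKENS + deltas * DELTA_TOKENS

steps = 5
baseline = steps * FULL_FRAME_TOKENS   # full frame every step: 8,000
gated = gated_cost(steps, deltas=5)    # all 5 steps classified DELTA: 2,000
print(baseline, gated, baseline - gated)
```

On a quiet desktop every step is a DELTA, which is why the gated total is a flat `steps * DELTA_TOKENS`.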

Running the CLI

# Scripted model: real capture, no real actions (safe demo)
python main.py --task "observe desktop" --platform os --backend scripted --max-steps 5

# Claude API (real model, but DRIVES YOUR MOUSE — start with small max-steps)
export ANTHROPIC_API_KEY=sk-...
python main.py --task "..." --platform os --backend claude --safety strict --max-steps 10

# Local VLM over an OpenAI-compatible endpoint (llama.cpp / vLLM / SGLang / Ollama via tunnel)
python main.py --task "..." --platform os --backend llamacpp \
    --host 127.0.0.1 --port 11434 --model qwen2.5vl:7b

# Ablation: force full-frame (disable delta gating)
python main.py --task "..." --platform os --backend claude --force-full-frame

Warning: --platform os drives the REAL mouse and keyboard. Start with --safety strict and small --max-steps until you trust the model.

Architecture

deltavision-os/
├── capture/          # Platform abstraction (5-method ABC)
│   ├── base.py           Platform class
│   ├── os_native.py      mss + pyautogui impl (macOS/Linux/Windows)
│   └── osworld.py        OSWorld VM wrapper (stub)
├── vision/           # Zero-LLM CV pipeline (ported from V1)
│   ├── diff.py, classifier.py, phash.py, crops.py
├── agent/
│   ├── loop.py           Platform-agnostic agent loop
│   ├── state.py          Observation + response history
│   └── actions.py        10 typed actions (V1's 6 + DRAG/DOUBLE_CLICK/RIGHT_CLICK/HOTKEY)
├── observation/      # FullFrame + Delta observation types
├── model/            # Pluggable backends
│   ├── base.py, _response_parser.py  shared
│   ├── llamacpp.py       V2 new: OpenAI-compat for local VLMs
│   ├── scripted.py       for testing without API costs
│   └── claude.py, openai.py, ollama.py  (carried from V1)
├── safety.py         # Model-agnostic action validation
├── config.py         # All thresholds, validated at construction
├── results/          # SQLite result store
├── benchmarks/       # desktop_idle_observe, pipeline_perf, classifier_sensitivity, record_live_demo
├── main.py           # CLI entrypoint
└── tests/            # 238 passing (229 offline, 9 need display)

Shared concept with V1

The core insight is identical: a zero-LLM CV pipeline gates what the model sees. Full frame on NEW_PAGE, delta thumbnail + crops on DELTA. Same 4-layer classifier cascade. Same ~80% token savings on sticky-context tasks.
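The gating decision itself is small; a minimal sketch, assuming a two-way classification (the `Observation` shape and `build_observation` name are illustrative, not the repo's actual API in observation/):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    kind: str      # "NEW_PAGE" or "DELTA"
    payload: dict  # full frame, or delta thumbnail + changed-region crops

def build_observation(classification, frame, thumbnail, crops):
    """Gate what the model sees: full frame only on NEW_PAGE,
    a delta thumbnail plus crops of the changed regions otherwise."""
    if classification == "NEW_PAGE":
        return Observation("NEW_PAGE", {"frame": frame})
    return Observation("DELTA", {"thumbnail": thumbnail, "crops": crops})
```

The classifier cascade that produces `classification` never calls the model, which is what makes the savings free of extra LLM cost.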

The platform abstraction is new. V1 had three Playwright-specific callsites in its loop; V2 replaces them with a generic Platform interface that any capture+execute backend can implement.
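The architecture section mentions a 5-method ABC with an async context manager lifecycle; a plausible shape, assuming these method names (they are guesses, not the actual signatures in capture/base.py):

```python
from abc import ABC, abstractmethod

class Platform(ABC):
    """Illustrative capture+execute backend interface."""

    @abstractmethod
    async def start(self) -> None: ...             # acquire display / VM

    @abstractmethod
    async def stop(self) -> None: ...              # release resources

    @abstractmethod
    async def screenshot(self) -> bytes: ...       # raw frame capture

    @abstractmethod
    async def execute(self, action) -> None: ...   # click / type / drag / ...

    @abstractmethod
    async def info(self) -> dict: ...              # screen size, platform name

    # Async context manager lifecycle, as the README describes.
    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, *exc):
        await self.stop()
```

Any backend that can capture a frame and execute an action (mss+pyautogui, an OSWorld VM, or Playwright itself) can slot in behind this interface without the loop changing.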

What's working

  • Platform ABC with async context manager lifecycle
  • OSNativePlatform: mss capture + pyautogui actions (macOS verified)
  • OSWorld platform stub (waits for env harness)
  • 4-layer CV classifier cascade (URL → diff → pHash → anchor) ported from V1
  • Agent loop with force-refresh on no-effect streaks
  • 10 action types including DRAG with x2/y2
  • Safety layer (credential / URL / action limits)
  • Model backends: Claude, OpenAI, Ollama, llama.cpp server, scripted
  • 238 passing tests
  • Live desktop benchmark proves CV pipeline works without browser
  • First real V2 E2E: Qwen2.5-VL on remote RTX 5080 via SSH tunnel, 5-step Mac-desktop run with 56% hypothetical token savings (benchmarks/v2_live_demo.mp4)
  • Classifier sensitivity sweep (synthetic damage 0%→99%) confirms pHash is the first layer to fire on real transitions
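The DRAG action's x2/y2 endpoints from the list above might be represented like this (field names are assumptions; the real definitions live in agent/actions.py):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    """Illustrative typed action record."""
    kind: str                  # one of the 10 action types
    x: Optional[int] = None    # primary coordinate
    y: Optional[int] = None
    x2: Optional[int] = None   # DRAG endpoint
    y2: Optional[int] = None

# Drag from (100, 200) to (400, 200), e.g. moving a window.
drag = Action("drag", x=100, y=200, x2=400, y2=200)
```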

What's next

  • OSWorld VM integration (needs OSWorld env install)
  • First V1 benchmark port (run_ablation.py equivalent with OS-native driver)
  • Production VLM endpoint with MAI-UI-8B / Qwen3-VL-8B on the 5080 box (Qwen2.5-VL is the current stand-in)
  • Migration of V1 paper section 5 (OS-level experiments)

License

MIT. Same as V1.

