
A clean, learnable browser automation framework — Chat mode + Preset replay via Playwright CDP


Yak Browser-Use

CHAT · BROWSER · AUTOMATE

An AI Agent framework that chats with you while operating the browser

Python ≥3.12 · MIT License · Alpha · Playwright · Electron Desktop · Web UI · Chinese

English · Simplified Chinese


What is Ybu?

yak-browser-use (aliased as ybu) is a browser automation AI Agent framework. Its core interaction model:

You chat with the Agent → Agent controls the browser → You watch it happen in real-time

Two modes:

  • Chat Mode — natural language conversational control, Agent browses while chatting
  • Preset Mode — replay recorded pipelines, Agent executes pre-defined steps autonomously

Built on Playwright connect_over_cdp() and an OpenAI-compatible LLM client.

Built from the ground up. ybu is an independent codebase with its own conversation loop, progressive snapshot engine, CDP integration, and pipeline compiler — designed entirely from scratch for this project. It has zero browser-use dependencies and shares no code with any other browser automation framework.


Features

  1. Live DOM highlighting with cross-tab isolation: Two-layer overlay (container + floating divs), RAF-throttled repaint, and a MutationObserver for lightweight re-render. A periodic background guard prevents desync across tabs. Every tab gets its own highlight state.
     Why it matters: Most browser AI tools have no live highlights, or use inline styles that tear on scroll and leak across tabs. Ybu's system survived real production stress tests: it stays put after navigation, scroll, and SPA transitions.

  2. Three snapshot strategies for different page types: progressive (density-adaptive DOM walk, ≤200 elements, folds dense containers with expand_branch) for normal pages; a11y (accessibility tree, works in iframes and locked DOM) for tricky pages; simplified (structured summary: headings, links, lists, tables, body text) for a low-token overview. The LLM selects the right mode, so you never have to make the choice.
     Why it matters: Single-strategy snapshots fail on some page types (SPA, iframe-heavy, locked DOM). Three strategies maximize coverage without the LLM having to figure out the page's quirks; it just picks the right mode for the job.

  3. Progressive snapshot's adaptive density disclosure: Not truncation. The walker reads the document depth-first, measures container density per depth, folds anything above the threshold, and presents a flattened view with expand_branch handles the LLM can pull on demand.
     Why it matters: Other frameworks truncate at N elements and lose the rest. Progressive's fold-and-expand lets the LLM see the page shape and dig into relevant sections without wasting tokens on boilerplate.

  4. Pipeline as byproduct: Ybu doesn't require pre-defined pipelines. Chat first, record later. pipeline.yaml is a recording artifact from chat sessions, not a design starting point. Useful flows get saved and replayed.
     Why it matters: Lowers the adoption bar. You don't plan automation flows, you just chat and the Agent writes them for you. Pipeline design emerges from real interaction instead of an upfront spec.

  5. Shared Store dual-syntax template resolution: {path} (whole-value reference, preserves type) and ${path} (inline string interpolation; the $ prefix disambiguates from JSON braces). Designed as two separate needs, not accidental inconsistency.
     Why it matters: Pass entire data structures between tools ({step_3}), or interpolate values inside URLs and templates (https://${host}/api). Each syntax has clear semantics and failure modes.

  6. Scratchpad for heavy data: HTML dumps, screenshot base64, and element lists go to an in-memory scratchpad. The LLM sees summaries and fetches detail on demand via browser_source(cached=true) or browser_get_element_by_number(@e5).
     Why it matters: Keeps the LLM context window clean without discarding data. The Agent decides what detail it needs rather than guessing up front.

  7. Eval Agent + Shared Store data bridge: The eval subagent inherits the main conversation's shared_store. Tools write results via source_key, and eval reads them through {path} / ${path} template resolution. Eval can verify tool outputs inline, and tool flows can trigger eval as a verification step.
     Why it matters: Eval is not a separate post-hoc system; it lives in the same data flow as tools. The shared store bridges tool production and eval consumption, enabling real-time verification loops.

  8. Three-step pipeline with programmatic checks: Pipeline steps are goal → ops → check, where check supports url_contains, element_exists, text_contains, and element_visible. This is deterministic programmatic verification, not LLM opinion.
     Why it matters: Most pipeline frameworks leave verification to the LLM. Ybu's programmatic checks are fast, deterministic, and independent of LLM cost/latency; a trivial check doesn't need a model call.

  9. Structured error recovery ecosystem: error_classifier (categorizes failures) → retry_utils (configurable backoff) → turn_context (per-turn retry counters), guided by the error_recovery system prompt. All wired together, not ad-hoc try/except.
     Why it matters: Real browser automation fails constantly (network timeout, element not found, CDP disconnect). A structured recovery pipeline means the Agent survives real-world chaos without dumping errors on the user.

  10. Guardian approval gate + circuit breaker + compensation rollback: Three-layer safety lifecycle. Guardian gates sensitive operations for human approval, the circuit breaker prevents cascading failures, and compensation undoes changes on rollback.
      Why it matters: Browser automation can break things. The safety lifecycle means destructive operations require approval, repeated failures don't cascade, and rollback is possible, not just "oops."

  11. Chat + Browser Sync & Streaming LLM: User types commands → Agent operates browser → reasoning, text deltas, and tool calls stream back via WebSocket in real time.
      Why it matters: No config files, no scripts. Just natural language driving the browser. See the Agent think as it works, not just the final result.

  12. Rich Browser Toolkit: 22 browser atomics (goto, click, fill, snapshot, scroll, eval, hover, tab…) covering daily automation.
      Why it matters: Broad enough for real-world tasks, granular enough for precise control.

  13. Custom Tool Scripts: Hot-load Python scripts via ToolRegistry; built-in captcha, file I/O, and format conversion.
      Why it matters: Extend the agent without modifying core code. Drop in a script and it just works.

  14. Electron Desktop + Web UI: React + Vite + Monaco Editor frontend with diff editor; FastAPI backend serving REST endpoints, WebSocket event streams, and the static frontend. Run as an Electron desktop app, or run uvx yak-browser-use for an instant browser-based UI.
      Why it matters: An IDE-like environment for building pipelines, with an API that integrates into any frontend or CI pipeline. One-command web launch removes the Electron dependency for quick demos.

  15. Connection Health & Session Persistence: CDP heartbeat + process watcher + auto-disconnect handling; per-pipeline session directories with full conversation history.
      Why it matters: Keeps long-running automation alive through network blips and browser restarts. Never lose context; pick up where you left off.

  16. Flexible Providers: DeepSeek / OpenAI / any OpenAI-compatible provider via flat JSON config.
      Why it matters: Use the model you want, not the one we chose for you.
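
The fold-and-expand idea behind the progressive snapshot (feature 3) can be pictured with a toy walker. This is only a sketch under loud assumptions: "density" is reduced to a plain child count, the Node type and the dotted handle format are invented for the example, and the real engine's density metric and 200-element budget are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def snapshot(
    root: Node,
    max_children: int = 3,
    expanded: frozenset[str] = frozenset(),
) -> list[str]:
    """Depth-first walk that folds containers with more than max_children children."""
    lines: list[str] = []

    def walk(node: Node, path: str, depth: int) -> None:
        indent = "  " * depth
        dense = len(node.children) > max_children
        if dense and path not in expanded:
            # Folded: report the size and a handle instead of the subtree.
            lines.append(
                f"{indent}{node.tag} "
                f"[folded {len(node.children)} children, expand_branch={path}]"
            )
            return
        lines.append(f"{indent}{node.tag}")
        for index, child in enumerate(node.children):
            walk(child, f"{path}.{index}", depth + 1)

    walk(root, "0", 0)
    return lines
```

A first call shows the page shape with dense containers collapsed; passing a handle back through expanded opens only that branch, which is the behavior the feature description contrasts with truncation.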

Quick Start

Prerequisites

Dependency         Version   Install
Python             ≥ 3.12    python.org
uv                 ≥ 0.4     powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Node.js            ≥ 18      nodejs.org
Chrome / Chromium  ≥ 120     Your existing Chrome, or uv run playwright install chromium

Install

# Windows one-click
install.bat

# Or manual three steps
cd backend
uv sync                              # Install Python deps
uv run playwright install chromium   # Install Playwright Chromium
cd ../electron
npm install                          # Install Electron frontend deps

Start

# CLI mode
cd backend
uv run python __main__.py --help

# Start REST API server
uv run python __main__.py serve --port 8080

# Start Web UI (browser-based, no Electron needed)
uv run python __main__.py web
# Or one-command: uvx yak-browser-use

# Start Electron desktop
cd ../electron
npm run electron:dev

Configure Provider

Create userdata/provider.json (or configure via Electron Settings → LLM Provider):

{
  "model": "deepseek-chat",
  "api_key": "sk-xxx...xxxx",
  "api_base": "https://api.deepseek.com"
}

Commands

ybu run <path>                 Execute a pipeline.yaml
ybu serve [--port PORT]        Start the REST API server
ybu web                        Start the Web UI (browser, no Electron)
ybu logs [-f] [--source all]   View unified logs

The CLI exposes four commands: run, serve, web, and logs. Configuration is handled in the Web UI or Electron Settings, not through CLI subcommands.


How It Works

Two-Layer Architecture

┌─────────────────────────────────────────────────────┐
│              Orchestration Layer                     │
│  conversation_loop → LLM decides → tool_executor     │
│  chat mode / preset mode / error recovery            │
└──────────────────┬──────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────┐
│              Browser Control Layer (CDP)              │
│  PlaywrightBridge → connect_over_cdp() → Chrome      │
│  CDPHelpers / ToolContext / ToolCDPHelpers           │
└─────────────────────────────────────────────────────┘
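
The split between the two layers can be sketched with a typing.Protocol: the orchestration side depends only on an interface, and any bridge that satisfies it can sit underneath. The method signatures, argument names (url, selector), and the RecordingBridge stand-in are assumptions for illustration; only the names BrowserBridge, goto, click, browser_goto, and browser_click come from this README.

```python
from typing import Protocol


class BrowserBridge(Protocol):
    """Interface the orchestration layer depends on (signatures assumed)."""

    def goto(self, url: str) -> str: ...
    def click(self, selector: str) -> str: ...


class RecordingBridge:
    """Stand-in for a CDP-backed bridge; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def goto(self, url: str) -> str:
        self.calls.append(("goto", url))
        return f"navigated to {url}"

    def click(self, selector: str) -> str:
        self.calls.append(("click", selector))
        return f"clicked {selector}"


def execute_tool(bridge: BrowserBridge, name: str, args: dict[str, str]) -> str:
    """Orchestration side: map a tool call chosen by the LLM onto a bridge method."""
    handlers = {
        "browser_goto": lambda: bridge.goto(args["url"]),
        "browser_click": lambda: bridge.click(args["selector"]),
    }
    if name not in handlers:
        raise KeyError(f"unknown tool: {name}")
    return handlers[name]()
```

Because execute_tool never imports a concrete driver, a recording or fake bridge can replace the real one in unit tests without a browser.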

Two Execution Modes

Chat Mode (Interactive)

POST /api/chat { message: "Open Baidu and search for coffee" }
  └→ service.process_chat_message()
       └→ run_conversation_loop()
            ├→ Load chat/system.md + pipeline context
            ├→ LLM call (browser_* / goal_run / todo / skill / expand_branch)
            ├→ LLM returns tool calls → tool_executor (with shared_store)
            │     ├→ browser_goto  → ops.py → PlaywrightBridge.goto()
            │     ├→ browser_click → ops.py → PlaywrightBridge.click()
            │     ├→ browser_snapshot → progressive/a11y/raw snapshot
            │     └→ record_step   → append to pipeline.yaml
            └→ LLM returns text → end turn
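
From a client's point of view, the entry point of this flow is a single POST. The sketch below builds that request with the standard library. The base URL assumes the server was started with serve --port 8080 as in the Start section, and the response format is not documented here, so send returns the raw body as text.

```python
import json
import urllib.request


def build_chat_request(
    message: str,
    base_url: str = "http://127.0.0.1:8080",  # assumed host and port
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/chat request with a JSON body."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```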

Key Points:

  • User watches the browser and types commands; Agent operates autonomously
  • Streaming LLM response (reasoning + text) pushed in real-time
  • WebSocket event stream: turn_start / tool_start / text_chunk
  • Agent auto-records operation steps to pipeline.yaml
  • Tool-to-tool data passing via shared_store (${} templates / _source_key)
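
The two template syntaxes used for shared_store data passing can be sketched in a few lines. This is a minimal illustration, assuming dotted paths into nested dicts and a plain dict as the store; the function names and regular expressions are invented for the example and do not describe ybu's resolver.

```python
import re
from typing import Any

WHOLE = re.compile(r"^\{([\w.]+)\}$")   # "{step_3}": return the stored value itself
INLINE = re.compile(r"\$\{([\w.]+)\}")  # "${host}":  splice str(value) into the text


def lookup(store: dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'step_3.items' through nested dicts."""
    value: Any = store
    for part in path.split("."):
        value = value[part]  # a KeyError here surfaces an unknown path
    return value


def resolve(template: Any, store: dict[str, Any]) -> Any:
    """Resolve one parameter value against the shared store."""
    if not isinstance(template, str):
        return template
    whole = WHOLE.match(template)
    if whole:
        # Whole-value reference: the original type (list, dict, int) is preserved.
        return lookup(store, whole.group(1))
    # Inline interpolation: every ${path} becomes text inside the string.
    return INLINE.sub(lambda m: str(lookup(store, m.group(1))), template)
```

A string that merely looks like JSON, such as '{"a": 1}', matches neither pattern and passes through unchanged, which is the ambiguity the $ prefix is meant to avoid.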

Preset Mode (Pipeline Replay)

POST /api/run { pipeline: "..." }
  └→ run_pipeline() / run_preset_loop()
       ├→ Load previously recorded pipeline.yaml
       ├→ Feed step list into conversation_loop
       ├→ System prompt = build_system_prompt() + Step list
       ├→ error_recovery.md loaded unconditionally in Agent init
       ├→ LLM sees the full step list
       ├→ Executes steps one by one with browser_* tools
       ├→ shared_store passthrough for data flow
       └→ Guided error recovery via error_recovery.md prompt + retry utilities
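
The classify-then-retry pattern behind the retry utilities can be sketched as below. The category names, the mapping from Python exception types, and the exponential backoff schedule are all assumptions for illustration; the README only states that failures are classified and retried with configurable backoff.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed set of categories that are worth retrying.
RETRYABLE = {"timeout", "disconnect", "element_not_found"}


def classify(error: Exception) -> str:
    """Map an exception to a coarse failure category (mapping is hypothetical)."""
    if isinstance(error, TimeoutError):
        return "timeout"
    if isinstance(error, ConnectionError):
        return "disconnect"
    if isinstance(error, LookupError):
        return "element_not_found"
    return "fatal"


def with_retry(
    action: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, retrying retryable failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            # Give up on fatal errors and on the last allowed attempt.
            if classify(error) not in RETRYABLE or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

Injecting sleep as a parameter keeps the backoff schedule testable without real waiting.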

Key Points:

  • Repeatable automation workflow
  • Pipeline three-step design: goal → ops (browser ops) → check (programmatic verification)
  • check supports: url_contains / element_exists / text_contains / element_visible
  • Pipeline context injected into system prompt for workspace awareness
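
The four check kinds listed above lend themselves to a simple dispatch table. The sketch below evaluates them against a pre-captured PageState, which is a hypothetical structure introduced for the example; a real implementation would query the live page instead.

```python
from dataclasses import dataclass, field


@dataclass
class PageState:
    """Hypothetical snapshot of what a check needs to know about the page."""
    url: str
    text: str
    present_selectors: set[str] = field(default_factory=set)
    visible_selectors: set[str] = field(default_factory=set)


def run_check(kind: str, expected: str, page: PageState) -> bool:
    """Evaluate one deterministic check without any model call."""
    checks = {
        "url_contains": lambda: expected in page.url,
        "text_contains": lambda: expected in page.text,
        "element_exists": lambda: expected in page.present_selectors,
        "element_visible": lambda: expected in page.visible_selectors,
    }
    if kind not in checks:
        raise ValueError(f"unsupported check: {kind}")
    return checks[kind]()
```

Each check is a pure function of the page state, so the same inputs always give the same verdict, which is the property the programmatic checks are meant to provide.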

Project Structure

yak-browser-use/
├── __main__.py              # CLI entry (run/serve/web/logs)
├── pyproject.toml            # Project config + deps
│
├── api/                      # FastAPI REST + WebSocket
│   ├── routes.py             # Route registration
│   ├── service.py            # Business logic
│   ├── server.py             # Server lifecycle
│   └── state.py / errors.py  # Engine state & error types
│
├── engine/                   # Core execution engine ★
│   ├── agent.py              # Agent entry + streaming LLM call
│   ├── runner.py             # Chat mode runner
│   ├── runner_preset.py      # Preset mode orchestrator
│   ├── executor.py           # Pipeline wrappers (browser/tool/goal)
│   ├── ops.py                # Browser op dispatcher via BrowserBridge
│   ├── scratchpad.py         # In-memory data cache
│   ├── step_machine.py       # Pipeline DAG walker
│   ├── eval_agent.py         # Eval Agent for verification
│   ├── delivery.py / events.py / state.py
│   ├── _param_resolver.py    # Templated param resolution
│   │
│   ├── _harness/             # Conversation loop infrastructure ★
│   │   ├── conversation_loop.py   # Core agent turn loop
│   │   ├── tools.py               # Tool definitions (browser_*/goal_run/…)
│   │   ├── tool_executor.py       # Sequential dispatcher + shared_store
│   │   ├── pipeline_tools.py      # Pipeline CRUD tools
│   │   ├── pipeline_events.py     # Centralized WS event propagation
│   │   ├── iteration_budget.py    # LLM turn budget control
│   │   ├── tool_guardrails.py     # Tool call guardrails
│   │   ├── turn_context.py        # Per-turn context (retry counters)
│   │   ├── error_classifier.py    # Error classification
│   │   ├── retry_utils.py         # Retry utilities
│   │   └── skill_tools.py         # Skill injection
│   │
│   └── _lifecycle/           # Pipeline lifecycle management
│       ├── guardian.py       # Approval gate + circuit breaker
│       └── compensation.py   # Rollback / undo support
│
├── cdp/                      # Chrome DevTools Protocol layer ★
│   ├── playwright_bridge.py  # PlaywrightBridge — unified driver
│   │                        #   (health check / process watch / disconnect)
│   ├── helpers.py            # CDPHelpers high-level API
│   ├── protocols.py          # BrowserBridge protocol interface
│   ├── profiles.py / session.py  # Profile & session management
│   ├── discover.py           # Chrome discovery / connection
│   └── launcher.py           # Chrome launch / port mgmt
│
├── compiler/                 # Pipeline compilation
│   ├── models.py / schema.py # Data classes & Pydantic models
│   ├── parser.py             # YAML parser
│   ├── graph.py / resolver.py # DAG builder + dependency resolver
│   ├── prepare.py            # Pre-execution step preparation
│   ├── step_type.py          # Unified step type inference
│   ├── diff.py               # Op diff computation
│   └── generator.py          # Handler prompt & code generation
│
├── tools/                    # Tool registry + implementations
│   ├── registry.py           # ToolRegistry — central dispatch (43 tools)
│   ├── adapters.py           # Tool data adaptation (csv↔json, field mapping)
│   ├── captcha.py            # DOM-based CAPTCHA recognition (ddddocr)
│   ├── file_read.py / file_write.py / format_convert.py
│   ├── extract.py / data.py  # Data extraction & processing
│   ├── todo.py / todo_store.py  # Todo list management
│   ├── record_step.py        # Pipeline step recording
│   ├── edit_pipeline.py      # Pipeline editing with rollback
│   └── _path_utils.py        # Path traversal prevention
│
├── llm/                      # LLM client layer
│   ├── client.py             # LLMClient — OpenAI-compatible adapter
│   └── messages.py           # Message types (vendored OpenAI format)
│
├── prompts/                  # Prompt templates (Markdown)
│   ├── _loader.py            # Prompt loader (load_prompt / build_system_prompt)
│   ├── chat/system.md        # Chat mode system prompt (main)
│   ├── eval_agent/           # Eval Agent prompts
│   │   ├── system.md
│   │   └── js_lib.js
│   ├── guidance/             # Strategy & recovery guidance
│   │   ├── tool_strategy.md  #   Tool selection strategy
│   │   └── error_recovery.md #   Error recovery instructions
│   ├── guardrails/           # Guardrail prompt fragments
│   │   ├── blocked.md / exact_failure.md / no_progress.md
│   │   └── same_tool_failure.md / warning_prefix.md
│   ├── skill/                # System skills
│   │   ├── goal-execution/SKILL.md
│   │   ├── skill-authoring/SKILL.md
│   │   └── web-standard-paths/SKILL.md
│   ├── planner-plan.md / planner-expand.md
│   ├── replan-on-failure.md / generate-handler.md
│   └── _archived/            # Deprecated prompts
│
├── params/                   # Persistent parameter manager (ParamManager)
├── workspace/                # Workspace management (manager/version/path/session)
│   └── session_store.py      # Per-pipeline session persistence
├── cli/                      # CLI (run.py / serve.py / logs.py / web.py)
├── utils/                    # Utilities (browser/logging/tool_cdp/skill_loader/…)
├── tests/                    # 800+ unit & integration tests
│
├── electron/                 # Electron desktop frontend
│   └── src/
│       └── renderer/         # React + Vite + Monaco Editor (diff)
│
├── docs/                     # Documentation
│   └── architecture-overview.md  # Full architecture deep-dive
│
├── logo.png                  # Project logo
├── install.bat               # Windows one-click installer
├── run.bat                   # Quick launch script
├── README.md                 # This file (English)
└── README.zh-CN.md           # Chinese translation

Key Design Decisions

  1. PlaywrightBridge Unified Driver — All browser operations go through PlaywrightBridge (connect_over_cdp()), gaining auto-wait / auto-scroll / auto-retry, plus health check heartbeat, process watcher, disconnect handling, and SSRF guard. BrowserBridge protocol (cdp/protocols.py) defines the interface contract.

  2. File as Contract — pipeline.yaml is a static contract, strictly validated at compile time (DAG cycle detection, file reference validation), minimizing surprises at runtime.
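
Compile-time cycle detection of the kind described in the second decision can be sketched with the standard library's graphlib. The depends_on mapping and the step names are hypothetical, since the pipeline.yaml schema is not shown in this README.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return a valid step order, or raise ValueError for bad references or cycles.

    depends_on maps each step name to the steps that must run before it.
    """
    declared = set(depends_on)
    unknown = {dep for deps in depends_on.values() for dep in deps} - declared
    if unknown:
        raise ValueError(f"unknown step reference: {sorted(unknown)}")
    try:
        # static_order() yields every step after all of its dependencies.
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as error:
        # The second element of args is the list of steps forming the cycle.
        raise ValueError(f"pipeline has a dependency cycle: {error.args[1]}") from error
```

Running this kind of validation before any browser is opened means a malformed pipeline fails in milliseconds rather than halfway through a run.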


Development

# Create and activate venv
cd backend
uv venv
source .venv/bin/activate   # Linux/macOS
.venv\Scripts\activate      # Windows

# Install dev dependencies
uv sync --dev

# Run tests
uv run pytest

# Coverage
uv run pytest --cov=.

# Open Chrome remote debugging port
chrome.exe --remote-debugging-port=9222

Dev Commands

Command                                               Description
uv run python __main__.py serve --port 8080           Start API server
uv run python __main__.py web                         Start Web UI (browser)
uv run python __main__.py run path/to/pipeline.yaml   Run a pipeline
uv run python __main__.py logs -f                     Tail logs live
cd electron && npm run electron:dev                   Start Electron frontend
cd electron && npm run dev:web                        Start Web frontend dev server (Vite HMR + proxy)

Architecture Docs

For a full architectural deep-dive (data flow diagrams, design principles, execution paths), see docs/architecture-overview.md.


License

MIT © 2026 Yak Browser-Use Contributors

See ACKNOWLEDGMENTS.md for project references and contributor credits.


Built with yak power · Chat · Browser · Automate
