A robust browser automation tool for AI agents - control browsers via CLI or IPC

agent-browser

Browser automation for AI agents. Control browsers via MCP (Model Context Protocol) or CLI.

License: GPL v3 | Python 3.10+

Installation

pip install ai-agent-browser
playwright install chromium

Quick Start by AI Tool

Most AI coding assistants support MCP. Add agent-browser to your tool's config and the assistant can call the browser tools directly, with no extra glue code.

Claude Code

claude mcp add agent-browser -- agent-browser-mcp --allow-private

Or manually edit ~/.claude/claude_desktop_config.json:

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    }
  }
}

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    }
  }
}

Cursor

Edit ~/.cursor/mcp.json:

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    }
  }
}

Windsurf

Edit ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    }
  }
}

VS Code + Cline

Open Cline settings and add to MCP Servers:

{
  "agent-browser": {
    "command": "agent-browser-mcp",
    "args": ["--allow-private"]
  }
}

gemini-cli

Edit ~/.gemini/settings.json:

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    }
  }
}

OpenAI Codex CLI

Create mcp.json in your project:

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    }
  }
}

Then start Codex with that config:

codex --mcp-config mcp.json

Aider (CLI Mode)

Aider doesn't support MCP yet. Use CLI mode instead:

# Add to your aider config or prompt:
# "You can control a browser using agent-browser CLI commands"

# In one terminal, start the browser:
agent-browser start http://localhost:3000 --session dev

# Aider can then run commands like:
agent-browser cmd screenshot --session dev
agent-browser cmd click "#login" --session dev
agent-browser cmd fill "#email" "test@example.com" --session dev

Other MCP Clients

For any MCP-compatible client, the server command is:

agent-browser-mcp [OPTIONS]

Options:
  --allow-private  Allow localhost/private IPs (for local development)
  --visible        Show browser window (for debugging)

What Can It Do?

agent-browser provides 68 browser automation tools organized into categories:

Category | Tools | Examples
Navigation | 5 | goto, back, forward, reload, get_url
Interactions | 9 | click, fill, type, select, press, hover, upload
Waiting | 6 | wait_for, wait_for_text, wait_for_url, wait_for_change
Data Extraction | 6 | screenshot, text, value, attr, count, evaluate
Assertions | 3 | assert_visible, assert_text, assert_url
Page State | 5 | scroll, viewport, cookies, storage, clear
Debugging | 3 | console, network, dialog
Agent Utilities | 7 | page_state, validate_selector, suggest_next_actions, browser_status
Perception | 3 | get_page_markdown, get_accessibility_tree, find_relative
Advanced | 3 | highlight, mock_network, clear_mocks
Video Production | 18 | start_recording, generate_voiceover, camera_zoom, merge_audio_video

For AI agents: See AGENT.md for a concise reference with selector syntax, common patterns, and tool safety levels.

Feature Showcase

Demo | Category | What it shows
Claude Demo | Research & Extraction | Autonomous web research
Gemini Demo | Cross-Page Audits | Multi-page analysis
Interpreter Demo | Data Operations | Table scraping to structured data

Common Use Cases

Test a Local Web App

You: "Test the login flow on my React app at localhost:3000"

AI: I'll test the login flow for you.
    [navigates to localhost:3000]
    [takes screenshot of login page]
    [fills email and password fields]
    [clicks submit button]
    [waits for dashboard to load]
    [asserts URL contains /dashboard]
    [takes screenshot of result]

    Login flow works correctly. The user is redirected to /dashboard after successful login.

Scrape Data from a Website

You: "Extract the product names and prices from this page"

AI: I'll extract the product data.
    [uses get_page_markdown to read structured content]
    [or uses evaluate to run JavaScript extraction]

    Found 24 products:
    - Product A: $29.99
    - Product B: $49.99
    ...

Debug a Visual Issue

You: "The submit button isn't visible on mobile viewport"

AI: Let me check that.
    [sets viewport to 375x667 (iPhone)]
    [navigates to the page]
    [uses validate_selector to check button exists]
    [uses highlight to show button location]
    [takes screenshot]

    The button exists but is positioned off-screen. Here's a screenshot
    with the button highlighted. The CSS needs adjustment for mobile.

Create a Marketing Video

You: "Create a demo video of our checkout flow with voiceover"

AI: I'll create a marketing video with narration.
    [generates voiceovers for each scene]
    [starts recording at 1920x1080]
    [navigates to checkout page]
    [moves cursor smoothly to cart button]
    [adds annotation "Click to checkout"]
    [camera zooms into form fields]
    [fills form with human-like typing]
    [stops recording]
    [merges video with voiceover audio]

    Created checkout_demo.mp4 (45 seconds) with synchronized narration.

Cinematic Engine (Video Production)

Create marketing-grade videos with AI-controlled browser recordings, voiceovers, and post-production.

Installation

pip install ai-agent-browser[video]

Requirements:

  • OPENAI_API_KEY environment variable (for TTS)
  • ffmpeg installed (for post-production)

Capabilities

Phase | Tools | Description
Voice & Timing | generate_voiceover, get_audio_duration | Generate TTS audio, get timing for sync
Recording | start_recording, stop_recording, recording_status | Capture video with virtual cursor
Annotations | annotate, clear_annotations | Floating text callouts
Camera | camera_zoom, camera_pan, camera_reset | Ken Burns-style zoom/pan effects
Post-Production | merge_audio_video, add_background_music | Combine video + audio tracks
Polish | smooth_scroll, type_human, set_presentation_mode | Human-like interactions

Example Workflow

# 1. Generate voiceovers first (for timing)
vo1 = generate_voiceover("Welcome to our product demo", voice="nova")
vo2 = generate_voiceover("Here's how to get started", voice="nova")

# 2. Record browser session with effects
start_recording(width=1920, height=1080)
goto("https://example.com")
annotate("Our Landing Page", style="dark")
camera_zoom("#hero", level=1.5)
smooth_scroll("down", amount=500)
type_human("#search", "AI automation", wpm=60)
stop_recording()

# 3. Merge voiceovers at specific timestamps
merge_audio_video(
    video="recording.webm",
    audio_tracks=[
        {"path": vo1["data"]["path"], "start_ms": 0},
        {"path": vo2["data"]["path"], "start_ms": 8000}
    ],
    output="final_demo.mp4"
)
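The start_ms offsets passed to merge_audio_video can be derived from the clip lengths instead of being hand-picked. The following is a minimal sketch, assuming you already have each clip's duration in milliseconds (for example from get_audio_duration, whose exact return shape is not shown here). The helper name build_audio_tracks and the gap_ms parameter are illustrative, not part of agent-browser.

```python
from typing import Dict, List


def build_audio_tracks(paths: List[str], durations_ms: List[int],
                       gap_ms: int = 500, start_ms: int = 0) -> List[Dict[str, object]]:
    """Lay voiceover clips end to end, separated by gap_ms of silence.

    Hypothetical helper: returns entries shaped like the audio_tracks
    list used in the workflow above ({"path": ..., "start_ms": ...}).
    """
    if len(paths) != len(durations_ms):
        raise ValueError("paths and durations_ms must have the same length")
    tracks = []
    cursor = start_ms
    for path, duration in zip(paths, durations_ms):
        tracks.append({"path": path, "start_ms": cursor})
        # The next clip starts after this one ends, plus the pause.
        cursor += duration + gap_ms
    return tracks


tracks = build_audio_tracks(["vo1.mp3", "vo2.mp3"], [7500, 3000])
print(tracks)
```

With a 7500 ms first clip and the default 500 ms gap, the second clip lands at 8000 ms, the same offset used in the workflow above.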

Virtual Cursor

The recording includes a virtual cursor with smooth, human-like movement:

// Cursor is controlled via JavaScript injection
window.__agentCursor.moveTo(x, y, duration_ms)  // Smooth move
window.__agentCursor.click(x, y)                 // Click with ripple effect

The cursor uses cubic-bezier easing for natural motion, not robotic linear movement.
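To illustrate what cubic-bezier easing means for cursor motion, here is a small Python sketch of the math. It is not the project's implementation, which runs as injected JavaScript. The control points are those of the CSS "ease" preset and are an assumption, as are the function names.

```python
def cubic_bezier_ease(progress: float, x1: float = 0.25, y1: float = 0.1,
                      x2: float = 0.25, y2: float = 1.0) -> float:
    """Map linear progress in [0, 1] to eased progress (CSS-style cubic Bezier)."""
    progress = min(max(progress, 0.0), 1.0)

    def bezier(t: float, p1: float, p2: float) -> float:
        # One coordinate of a cubic Bezier whose endpoints are fixed at 0 and 1.
        return 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3

    # Solve x(t) = progress by bisection; x(t) is monotonic for 0 <= x1, x2 <= 1.
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if bezier(mid, x1, x2) < progress:
            lo = mid
        else:
            hi = mid
    return bezier((lo + hi) / 2, y1, y2)


def cursor_path(start, end, steps: int = 30):
    """Return eased (x, y) points from start to end, both endpoints included."""
    sx, sy = start
    ex, ey = end
    points = []
    for i in range(steps + 1):
        eased = cubic_bezier_ease(i / steps)
        points.append((sx + (ex - sx) * eased, sy + (ey - sy) * eased))
    return points
```

Linear movement would place the cursor at the halfway point when half the time has elapsed. With these control points the cursor has already covered roughly 80% of the distance by then and decelerates into the target.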

Security Features

agent-browser is designed for safe use with AI agents:

  • SSRF Protection: Blocks dangerous schemes (file://, javascript:, data:) and private IPs by default
  • DNS Rebinding Protection: Resolved IPs are validated against private ranges
  • Cloud Metadata Protection: Blocks AWS/GCP metadata endpoints (169.254.169.254)
  • Path Sandboxing: File operations restricted to working directory
  • Credential Rejection: URLs with embedded user:pass are blocked
  • Sensitive Field Masking: Password fields masked in page_state output

Use --allow-private only when testing local development servers.
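The URL checks listed above can be sketched with the Python standard library. This is an illustration of the general technique, not agent-browser's actual code. The names check_url and resolve_host are hypothetical, and blocking the metadata address even when private IPs are allowed is an assumption.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
METADATA_IP = "169.254.169.254"


def resolve_host(host):
    """Resolve a hostname to the set of IP address strings it maps to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def check_url(url, allow_private=False, resolver=resolve_host):
    """Return (ok, reason) for a URL. Hypothetical helper, not the library API."""
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False, f"scheme {parts.scheme!r} not allowed"
    if parts.username is not None or parts.password is not None:
        return False, "embedded credentials not allowed"
    host = parts.hostname
    if not host:
        return False, "missing host"
    try:
        # Literal IPs need no DNS lookup.
        addresses = {str(ipaddress.ip_address(host))}
    except ValueError:
        addresses = resolver(host)
    for text in addresses:
        ip = ipaddress.ip_address(text)
        if str(ip) == METADATA_IP:
            # Assumption: metadata stays blocked even with allow_private.
            return False, "cloud metadata endpoint blocked"
        if not allow_private and (ip.is_private or ip.is_loopback or ip.is_link_local):
            return False, f"private address {ip} blocked"
    return True, "ok"
```

Checking the resolved addresses rather than the hostname is what catches a public-looking name that points at an internal IP. A complete DNS rebinding defense would also need to connect to the address that was validated, which this sketch does not attempt.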

Advanced Configuration

Dual Instances (Headless + Visible)

Run two browser instances side by side, a headless one for speed and a visible one for debugging:

{
  "mcpServers": {
    "agent-browser": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private"]
    },
    "agent-browser-visible": {
      "command": "agent-browser-mcp",
      "args": ["--allow-private", "--visible"]
    }
  }
}

Configuration Options

Use Case | Args
Production (SSRF protected) | []
Local development | ["--allow-private"]
Debugging (visible browser) | ["--allow-private", "--visible"]
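If you maintain configs for several tools, the JSON can be generated rather than copied by hand. The sketch below builds the mcpServers block from the three argument sets in the table. The profile names ("production", "local", "debug") and the helper functions are made up for illustration; only the command name and the flags come from this README.

```python
import json

# Argument sets from the Configuration Options table; the keys are hypothetical labels.
PROFILES = {
    "production": [],
    "local": ["--allow-private"],
    "debug": ["--allow-private", "--visible"],
}


def server_entry(profile):
    """Build one mcpServers entry for the given profile name."""
    return {"command": "agent-browser-mcp", "args": list(PROFILES[profile])}


def build_config(servers):
    """servers maps an MCP server name to a profile name."""
    return {"mcpServers": {name: server_entry(profile)
                           for name, profile in servers.items()}}


# Reproduces the dual-instance layout: one headless server, one visible.
config = build_config({"agent-browser": "local", "agent-browser-visible": "debug"})
print(json.dumps(config, indent=2))
```

Write the printed JSON to whichever config path your client uses.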

CLI Mode

For tools that don't support MCP, or for scripting:

Basic Usage

# Terminal 1: Start browser (blocks while running)
agent-browser start http://localhost:8080

# Terminal 2: Send commands
agent-browser cmd screenshot home
agent-browser cmd click "#submit"
agent-browser cmd fill "#email" "test@example.com"
agent-browser cmd assert_visible ".success"

# When done
agent-browser stop

Session Management

Run multiple browsers concurrently:

# Start separate sessions
agent-browser start http://localhost:3000 --session app1
agent-browser start http://localhost:4000 --session app2

# Commands target specific sessions
agent-browser cmd screenshot --session app1
agent-browser cmd click "#btn" --session app2

# Stop individually
agent-browser stop --session app1
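For scripting, the same commands can be driven from Python with subprocess. This is a minimal sketch: build_cmd and run_cmd are hypothetical wrappers, and the sketch assumes the CLI exits with a non-zero status on failure, which this README does not state.

```python
import shlex
import subprocess


def build_cmd(action, *args, session=None):
    """Build the argv list for `agent-browser cmd <action> ...`."""
    argv = ["agent-browser", "cmd", action, *args]
    if session:
        argv += ["--session", session]
    return argv


def run_cmd(action, *args, session=None, timeout=30):
    """Run one CLI command and return its stdout.

    check=True raises CalledProcessError on a non-zero exit status
    (assumed here to signal a failed command).
    """
    completed = subprocess.run(
        build_cmd(action, *args, session=session),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return completed.stdout


# Show the equivalent shell command without running anything.
print(shlex.join(build_cmd("click", "#btn", session="app2")))
```

Passing the arguments as a list bypasses the shell, so selectors such as "#btn" need no extra quoting.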

Interactive Mode

REPL for manual testing:

agent-browser interact http://localhost:8080

> screenshot initial
> click #login
> fill #email "test@example.com"
> assert_visible .dashboard
> quit

CLI Command Reference

Browser Control

Command | Description
start <url> | Start browser session
start <url> --visible | Start with visible window
stop | Close browser
status | Check if browser running

Navigation

Command | Description
cmd goto <url> | Navigate to URL
cmd back | Go back
cmd forward | Go forward
cmd reload | Reload page

Interactions

Command | Description
cmd click <selector> | Click element
cmd fill <selector> <text> | Fill input
cmd type <selector> <text> | Type with key events
cmd select <selector> <value> | Select dropdown
cmd press <key> | Press key (Enter, Tab, etc.)
cmd scroll <direction> | Scroll (up/down/top/bottom)

Screenshots & Data

Command | Description
cmd screenshot [name] | Take screenshot
cmd text <selector> | Get text content
cmd value <selector> | Get input value
cmd count <selector> | Count elements

Assertions

Command | Description
cmd assert_visible <sel> | Check visibility
cmd assert_text <sel> <text> | Check text content
cmd assert_url <pattern> | Check URL

Debugging

Command | Description
cmd console | View JS console
cmd network | View network log
cmd wait <ms> | Wait milliseconds
cmd wait_for <selector> | Wait for element

Architecture

┌─────────────────┐      MCP/JSON-RPC       ┌─────────────────┐
│   AI Assistant  │ ◄──────────────────────►│  agent-browser  │
│ (Claude, Cursor,│                         │   MCP Server    │
│  Gemini, etc.)  │                         │                 │
└─────────────────┘                         └────────┬────────┘
                                                     │
                                                     ▼
                                            ┌─────────────────┐
                                            │   Playwright    │
                                            │   (Chromium)    │
                                            └─────────────────┘

The MCP server manages browser lifecycle automatically. For CLI mode, a file-based IPC system coordinates between the CLI process and a persistent browser process.
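To make the idea of file-based IPC concrete, here is a generic sketch of the pattern: the client drops a command file into a shared directory and polls for a matching result file, while the long-running process picks commands up and writes results back. The file names, JSON fields, and function names are all hypothetical; agent-browser's actual on-disk protocol is not described in this README.

```python
import json
import os
import time
import uuid
from pathlib import Path


def _write_atomic(path: Path, payload: dict) -> None:
    """Write JSON to a temp file, then rename, so readers never see a partial file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, path)


def send(ipc_dir: Path, command: str, timeout_s: float = 5.0) -> dict:
    """Client side: write a command file and poll for its result file."""
    request_id = uuid.uuid4().hex
    _write_atomic(ipc_dir / f"{request_id}.cmd.json",
                  {"id": request_id, "command": command})
    result_path = ipc_dir / f"{request_id}.result.json"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if result_path.exists():
            result = json.loads(result_path.read_text())
            result_path.unlink()
            return result
        time.sleep(0.05)
    raise TimeoutError(f"no result for {command!r}")


def serve_once(ipc_dir: Path, handler) -> int:
    """Server side: handle every pending command file once; return how many ran."""
    handled = 0
    for cmd_path in sorted(ipc_dir.glob("*.cmd.json")):
        request = json.loads(cmd_path.read_text())
        cmd_path.unlink()
        _write_atomic(ipc_dir / f"{request['id']}.result.json",
                      {"id": request["id"], "output": handler(request["command"])})
        handled += 1
    return handled
```

In a real setup the persistent browser process would call something like serve_once in a loop, with a handler that dispatches each command to the browser. The per-request ID keeps concurrent clients from reading each other's results.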

Troubleshooting

Problem | Solution
"Private IP blocked" | Add --allow-private for localhost testing
"Element not found" | Use validate_selector to check selector
"Timeout waiting" | Increase timeout or use wait_for first
Browser not responding | Check browser_status or restart
MCP not connecting | Verify config path and restart AI tool

Python API

from agent_browser import BrowserDriver

driver = BrowserDriver(session_id="test")
result = driver.send_command("screenshot home")
print(result)
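A short script usually sends several commands in sequence. The sketch below wraps that loop in a helper that works with any object exposing send_command. It assumes command strings follow the same syntax as the interactive REPL, and it treats results as opaque because their structure is not documented here. run_steps and RecordingDriver are illustrative names; RecordingDriver is a stand-in for trying the helper without a browser.

```python
def run_steps(driver, steps):
    """Send each command string in order and collect (command, result) pairs."""
    results = []
    for command in steps:
        results.append((command, driver.send_command(command)))
    return results


class RecordingDriver:
    """Stand-in with the same send_command method, for dry runs without a browser."""

    def __init__(self):
        self.sent = []

    def send_command(self, command):
        self.sent.append(command)
        return f"ok: {command}"


dry_run = RecordingDriver()
for command, result in run_steps(dry_run, ["screenshot home", "click #login"]):
    print(command, "->", result)
```

To run against a real session, pass a BrowserDriver instance in place of RecordingDriver.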

Contributing

See CONTRIBUTING.md for guidelines.

License

GNU General Public License v3.0 - see LICENSE for details.
