MCP server for browser automation with Set of Marks (SoM) - AI agents can see and interact with web pages using numbered element IDs

🌐 BrowserControl

Give your AI agent real browser superpowers.

Python 3.11+ License: MIT MCP GitHub Stars

Quick Start • Features • Tools • Configuration • Examples


Ever wished Claude, Gemini, or your custom AI agent could actually browse the web? Not just fetch URLs, but truly see, click, type, and interact with any website like a human?

BrowserControl is an MCP server that gives your AI agent full browser access with a vision-first approach inspired by Google's AntiGravity IDE.

✨ What Makes This Different

| Traditional Web Access | BrowserControl |
| --- | --- |
| Fetch static HTML | See the rendered page |
| Parse complex DOM | Point at numbered elements |
| Guess at selectors | Just say "click 5" |
| No JavaScript support | Full dynamic content |
| No login persistence | Persistent sessions |
| No debugging tools | Console, Network, Errors |

🎯 The Secret: Set of Marks (SoM)

Every screenshot comes annotated with numbered red boxes on interactive elements:

Found 15 interactive elements:
  [1] button - Sign In
  [2] input - Search...
  [3] a - Products
  [4] a - Pricing
  [5] button - Get Started

Your agent sees the numbers and simply calls click(1) to sign in. No CSS selectors. No XPath. No guessing.


🏆 Why BrowserControl Beats Every Alternative

Head-to-Head Comparison

| Feature | BrowserControl | Playwright MCP | Stagehand | Browser-Use | AgentQL |
| --- | --- | --- | --- | --- | --- |
| Vision-First (SoM) | ✅ Numbered boxes | ❌ Text tree | ⚠️ AI vision | ⚠️ AI vision | ❌ Selectors |
| No Extra AI Calls | ✅ Zero | ❌ Parses tree | ❌ GPT-4V per action | ❌ Vision model | ❌ Query model |
| Developer Tools | ✅ 6 tools | ❌ None | ❌ None | ❌ None | ❌ None |
| Session Recording | ✅ Built-in | ❌ Manual | ❌ None | ❌ None | ❌ None |
| Persistent Sessions | ✅ Automatic | ⚠️ Manual setup | ❌ None | ❌ None | ❌ None |
| MCP Native | ✅ FastMCP | ✅ Official | ❌ Python SDK | ⚠️ Custom | ❌ REST API |
| Install Complexity | ✅ pip install | ⚠️ npx + config | ❌ Docker + setup | ⚠️ Docker | ❌ Cloud signup |
| Token Efficiency | ✅ Tiny IDs | ⚠️ Large tree | ❌ Full images | ❌ Full images | ⚠️ Query results |
| Cost per Action | ✅ $0 | ✅ $0 | ❌ ~$0.01-0.05 | ❌ ~$0.01-0.05 | ❌ API fees |
| Offline/Local | ✅ 100% local | ✅ Local | ⚠️ Needs LLM API | ⚠️ Needs LLM API | ❌ Cloud only |

🎯 Key Advantages

1. Token Efficiency = Faster + Cheaper

Other tools send:        BrowserControl sends:
─────────────────        ─────────────────────
Full DOM tree            "click(5)"
(5,000+ tokens)          (3 tokens)
     or
Base64 screenshot        Element ID + summary
(10,000+ tokens)         (100 tokens)

Result: 50-100x fewer tokens per action = faster responses, lower costs.
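The 50-100x figure can be reproduced from the approximate token counts quoted above. The sketch below assumes the comparison is against the ~100-token element summary rather than the 3-token click call; the constant names are illustrative only.

```python
# Approximate token counts quoted in the comparison above.
DOM_TREE_TOKENS = 5_000
SCREENSHOT_TOKENS = 10_000
SOM_SUMMARY_TOKENS = 100  # element ID + summary


def reduction_factor(baseline_tokens: int, som_tokens: int) -> float:
    """How many times smaller the SoM payload is than the baseline payload."""
    return baseline_tokens / som_tokens


low = reduction_factor(DOM_TREE_TOKENS, SOM_SUMMARY_TOKENS)
high = reduction_factor(SCREENSHOT_TOKENS, SOM_SUMMARY_TOKENS)
print(f"{low:.0f}x to {high:.0f}x fewer tokens")  # 50x to 100x fewer tokens
```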

2. No Extra AI Calls Required

| Tool | AI Calls per Click |
| --- | --- |
| BrowserControl | 0 (just click(5)) |
| Stagehand | 1-2 (vision + action) |
| Browser-Use | 1-2 (vision + planning) |
| AgentQL | 1 (query interpretation) |

Result: BrowserControl adds no vision API costs and no rate limits of its own, and the server itself runs locally.

3. Developer Tools No One Else Has

# Only BrowserControl can do this:
get_console_logs()      # See browser errors
get_network_requests()  # Monitor API calls  
get_page_errors()       # Catch JS exceptions
run_in_console(code)    # Debug in real-time
inspect_element(5)      # Get computed styles
get_page_performance()  # Core Web Vitals

Other tools: Navigate, click, type... that's it.

4. Session Recording Built-In

start_recording()   →   Browse around   →   stop_recording()
                                              ↓
                               📹 session_20260108.zip
                               (View with Playwright trace viewer)

Other tools: No recording. Debug from memory.

5. True Persistence

| What Persists | BrowserControl | Others |
| --- | --- | --- |
| Cookies | ✅ | ❌ |
| localStorage | ✅ | ❌ |
| Session tokens | ✅ | ❌ |
| Login state | ✅ | ❌ |
| Browser history | ✅ | ❌ |

Result: Log in once, stay logged in across sessions.

6. Simpler Mental Model

❌ Other tools:
   "Find the button with class 'btn-primary' that contains text 'Submit' 
    and is a descendant of form#contact-form..."

✅ BrowserControl:
   "click(7)"

📊 Real-World Performance

| Scenario | BrowserControl | Vision-Based Tools |
| --- | --- | --- |
| Click a button | ~50ms | ~2-5 seconds |
| Fill a form (5 fields) | ~500ms | ~15-30 seconds |
| Navigate + act | ~1 second | ~5-10 seconds |
| Debug console errors | ✅ Instant | ❌ Not possible |

💰 Cost Comparison (1000 actions/month)

| Tool | Monthly Cost |
| --- | --- |
| BrowserControl | $0 (fully local) |
| Stagehand (GPT-4V) | ~$30-50 |
| Browser-Use (Claude Vision) | ~$20-40 |
| AgentQL | ~$50+ (API fees) |
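As a sanity check on these monthly figures, the per-action range quoted earlier for vision-based tools (~$0.01-0.05) can be scaled to 1000 actions. This is only a back-of-the-envelope sketch; actual pricing depends on the model and image size, and the names below are illustrative.

```python
ACTIONS_PER_MONTH = 1_000
# Per-action range quoted for vision-based tools in the head-to-head table (USD).
COST_PER_ACTION_RANGE = (0.01, 0.05)


def monthly_cost_range(actions: int, per_action_range: tuple[float, float]) -> tuple[float, float]:
    """Scale a (low, high) per-action cost to a monthly (low, high) cost."""
    low, high = per_action_range
    return round(actions * low, 2), round(actions * high, 2)


low, high = monthly_cost_range(ACTIONS_PER_MONTH, COST_PER_ACTION_RANGE)
print(f"${low:.0f} to ${high:.0f} per month")  # $10 to $50 per month
```

The Stagehand and Browser-Use estimates in the table fall inside this $10 to $50 envelope.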

🚀 Quick Start

Installation

# Install with pip
pip install browsercontrol

# Or with uv (recommended)
uv add browsercontrol

# That's it! Chromium is auto-installed on first run

Run the Server

# Using the CLI
browsercontrol

# Or as a module
python -m browsercontrol

# Or with FastMCP
fastmcp run browsercontrol.server:mcp

Connect to Claude Desktop

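Add to ~/.config/Claude/claude_desktop_config.json (on macOS the file lives at ~/Library/Application Support/Claude/claude_desktop_config.json, on Windows at %APPDATA%\Claude\claude_desktop_config.json):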

{
  "mcpServers": {
    "browsercontrol": {
      "command": "browsercontrol"
    }
  }
}

Then just ask Claude:

"Go to GitHub and star the browsercontrol repo"

Claude will navigate, find the star button, and click it, showing you screenshots along the way!


🎯 Features

1. Set of Marks (SoM) - Vision-First Interaction

Every action returns an annotated screenshot with numbered elements. Your AI agent can:

  • See the page exactly as a human would
  • Identify clickable elements by number
  • Act with simple commands like click(5)

2. 🔧 Developer Tools

Built-in debugging tools for web development:

| Tool | Description |
| --- | --- |
| get_console_logs() | Capture browser console (errors, warnings, logs) |
| get_network_requests() | Monitor API calls, status codes, timing |
| get_page_errors() | See JavaScript exceptions and crashes |
| run_in_console(code) | Execute JS in browser console |
| inspect_element(id) | Get computed styles, dimensions, properties |
| get_page_performance() | Page load time, Core Web Vitals, memory |

3. 🎬 Session Recording

Record browser sessions for debugging and documentation:

| Tool | Description |
| --- | --- |
| start_recording() | Begin recording the session |
| stop_recording() | Save recording (Playwright trace format) |
| take_snapshot() | Save screenshot + HTML + URL |
| list_recordings() | View all saved sessions |

View recordings with:

npx playwright show-trace ~/.browsercontrol/recordings/session.zip
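To find the newest recording without calling list_recordings(), the archives can also be listed straight from disk. This sketch assumes recordings are stored as .zip files directly under ~/.browsercontrol/recordings/, as the command above suggests; `list_trace_archives` is a hypothetical helper.

```python
from pathlib import Path


def list_trace_archives(recordings_dir: Path) -> list[Path]:
    """Return .zip archives in `recordings_dir`, newest first by modification time."""
    if not recordings_dir.is_dir():
        return []
    archives = [p for p in recordings_dir.iterdir() if p.is_file() and p.suffix == ".zip"]
    return sorted(archives, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Assumed default location, taken from the show-trace command above.
    default_dir = Path.home() / ".browsercontrol" / "recordings"
    for archive in list_trace_archives(default_dir):
        print(f"npx playwright show-trace {archive}")
```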

4. 💾 Persistent Sessions

  • Cookies, localStorage, and session data persist across restarts
  • Stay logged into websites
  • Maintain shopping carts, preferences, etc.

🛠️ Available Tools

Navigation

| Tool | Description |
| --- | --- |
| navigate_to(url) | Go to a URL |
| go_back() | Navigate back |
| go_forward() | Navigate forward |
| refresh_page() | Reload the page |
| scroll(direction, amount) | Scroll the page |

Interaction

| Tool | Description |
| --- | --- |
| click(element_id) | Click element by number |
| click_at(x, y) | Click at coordinates |
| type_text(element_id, text) | Type into input |
| press_key(key) | Press keyboard key (Enter, Tab, etc.) |
| hover(element_id) | Hover over element |
| scroll_to_element(element_id) | Scroll element into view |
| wait(seconds) | Wait for loading |

Forms

| Tool | Description |
| --- | --- |
| select_option(element_id, option) | Select dropdown option |
| check_checkbox(element_id) | Toggle checkbox |

Content

| Tool | Description |
| --- | --- |
| get_page_content() | Get page as markdown |
| get_text(element_id) | Get element text |
| get_page_info() | Get URL and title |
| run_javascript(script) | Execute JavaScript |
| screenshot(annotate, full_page) | Take screenshot |

Developer Tools

| Tool | Description |
| --- | --- |
| get_console_logs() | Browser console output |
| get_network_requests() | API calls and responses |
| get_page_errors() | JavaScript errors |
| run_in_console(code) | Execute JS in console |
| inspect_element(id) | Element styles/properties |
| get_page_performance() | Load times, Web Vitals |

Recording

| Tool | Description |
| --- | --- |
| start_recording() | Begin session recording |
| stop_recording() | Save recording |
| take_snapshot() | Save screenshot + HTML |
| list_recordings() | View saved sessions |

⚙️ Configuration

Configure via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| BROWSER_HEADLESS | true | Run without visible window |
| BROWSER_VIEWPORT_WIDTH | 1280 | Viewport width in pixels |
| BROWSER_VIEWPORT_HEIGHT | 720 | Viewport height in pixels |
| BROWSER_TIMEOUT | 30000 | Navigation timeout (ms) |
| BROWSER_USER_DATA_DIR | ~/.browsercontrol/user_data | Browser profile path |
| BROWSER_EXTENSION_PATH | - | Path to browser extension |
| LOG_LEVEL | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |

Examples

# Run with visible browser (for debugging)
BROWSER_HEADLESS=false browsercontrol

# Custom viewport for mobile testing
BROWSER_VIEWPORT_WIDTH=375 BROWSER_VIEWPORT_HEIGHT=812 browsercontrol

# Verbose logging
LOG_LEVEL=DEBUG browsercontrol

📚 Examples

Example 1: Web Research

You: "Go to Wikipedia and find out when Python was created"

Claude: *navigates to wikipedia.org*
        *types "Python programming language" in search*
        *clicks search result*
        *reads the page*
        
        "Python was created by Guido van Rossum and first released in 1991."

Example 2: Debug a Web App

You: "Go to my app at localhost:3000 and check for any errors"

Claude: *navigates to localhost:3000*
        *calls get_console_logs()*
        
        "I found 2 errors:
         [ERROR] Uncaught TypeError: Cannot read property 'map' of undefined
         [ERROR] Failed to load resource: 404 /api/users"
        
        *calls get_network_requests()*
        
        "The /api/users endpoint returned 404. Check your API route."
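The triage in this example can be sketched as a small filter over console entries. The return format of get_console_logs() is not documented here, so the sketch assumes one "[LEVEL] message" string per entry, as shown in the transcript; `extract_errors` and `failed_resources` are hypothetical helpers.

```python
import re

# Assumed format: one "[LEVEL] message" string per entry, as in the transcript above.
ERROR_PREFIX = "[ERROR]"
FAILED_RESOURCE = re.compile(r"Failed to load resource: (\d{3}) (\S+)")


def extract_errors(console_lines: list[str]) -> list[str]:
    """Return the messages of [ERROR] entries with the level tag stripped."""
    errors = []
    for line in console_lines:
        text = line.strip()
        if text.startswith(ERROR_PREFIX):
            errors.append(text[len(ERROR_PREFIX):].strip())
    return errors


def failed_resources(messages: list[str]) -> list[tuple[int, str]]:
    """Return (status, path) pairs for 'Failed to load resource' messages."""
    pairs = []
    for message in messages:
        match = FAILED_RESOURCE.search(message)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


sample = [
    "[LOG] App mounted",  # hypothetical non-error entry
    "[ERROR] Uncaught TypeError: Cannot read property 'map' of undefined",
    "[ERROR] Failed to load resource: 404 /api/users",
]
errors = extract_errors(sample)
print(failed_resources(errors))  # [(404, '/api/users')]
```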

Example 3: Automated Testing

You: "Test the login flow on my site. Start recording."

Claude: *calls start_recording()*
        *navigates to site*
        *types email in element 3*
        *types password in element 4*
        *clicks Login button (element 5)*
        *verifies successful redirect*
        *calls stop_recording()*
        
        "Login test passed! Recording saved to session_20260108.zip"

Example 4: Form Automation

You: "Fill out the contact form on example.com"

Claude: *navigates to example.com/contact*
        *types name in element 2*
        *types email in element 3*
        *types message in element 4*
        *clicks Submit (element 5)*
        
        "Form submitted successfully!"

🏗️ Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│   AI Agent      │────▶│  BrowserControl  │────▶│   Browser   │
│ (Claude/Gemini) │◀────│   MCP Server     │◀────│ (Chromium)  │
└─────────────────┘     └──────────────────┘     └─────────────┘
        │                        │                      │
        │   "click(5)"           │   mouse.click()      │
        │◀───────────────────────│◀─────────────────────│
        │   [annotated           │   [screenshot +      │
        │    screenshot]         │    element map]      │

How It Works

  1. AI sends command: click(5)
  2. Server finds element: Looks up element #5 from the last screenshot
  3. Browser acts: Clicks at the element's coordinates
  4. Capture state: Takes new screenshot, detects elements
  5. Annotate: Draws numbered boxes on interactive elements
  6. Return to AI: Sends annotated image + element list
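Steps 2 and 3 above amount to a lookup in an element map followed by a coordinate click. The sketch below illustrates that idea in plain Python; `MarkedElement`, `ElementMap`, and this `click` function are hypothetical, clicking the center of the bounding box is an assumption, and the injected `mouse_click` callable stands in for a real browser call such as Playwright's `page.mouse.click(x, y)`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MarkedElement:
    element_id: int
    tag: str
    label: str
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box
    width: float
    height: float

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


class ElementMap:
    """Registry of numbered elements, rebuilt after every screenshot."""

    def __init__(self, elements: Iterable[MarkedElement]):
        self._by_id = {e.element_id: e for e in elements}

    def resolve(self, element_id: int) -> MarkedElement:
        try:
            return self._by_id[element_id]
        except KeyError:
            raise ValueError(f"No element {element_id} in the last screenshot") from None


def click(
    element_map: ElementMap,
    element_id: int,
    mouse_click: Callable[[float, float], None],
) -> tuple[float, float]:
    """Look up the element (step 2), then click its center (step 3)."""
    x, y = element_map.resolve(element_id).center
    mouse_click(x, y)
    return (x, y)


clicks: list[tuple[float, float]] = []
element_map = ElementMap(
    [MarkedElement(5, "button", "Get Started", x=100, y=200, width=80, height=40)]
)
click(element_map, 5, lambda x, y: clicks.append((x, y)))
print(clicks)  # [(140.0, 220.0)]
```

Because the map is rebuilt on every screenshot, an ID is only meaningful for the most recent annotated image.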

📦 Project Structure

browsercontrol/
├── __init__.py          # Package exports
├── __main__.py          # CLI entry point
├── server.py            # MCP server setup
├── browser.py           # BrowserManager with SoM
├── config.py            # Environment configuration
└── tools/
    ├── navigation.py    # Navigation tools
    ├── interaction.py   # Click, type, hover tools
    ├── forms.py         # Form handling tools
    ├── content.py       # Content extraction tools
    ├── devtools.py      # Developer tools
    └── recording.py     # Session recording tools

🔧 Troubleshooting

"Missing X server" Error

Set BROWSER_HEADLESS=true or run with xvfb:

xvfb-run browsercontrol

Browser Not Starting

Chromium auto-installs on first run. If it fails, install manually:

python -m playwright install chromium

Session Not Persisting

Check that BROWSER_USER_DATA_DIR is writable:

ls -la ~/.browsercontrol/

Connection Refused

Ensure no other instance is running:

pkill -f browsercontrol
browsercontrol

🤝 Contributing

Contributions are welcome! Some ideas:

  • Multi-tab support
  • Firefox/WebKit support
  • DOM diffing (detect changes)
  • Accessibility audit
  • Mobile emulation presets
  • Cookie import/export

# Clone and install
git clone https://github.com/adityasasidhar/browsercontrol
cd browsercontrol
uv sync

# Run tests
uv run pytest

# Run in development
uv run fastmcp dev browsercontrol/server.py

📄 License

MIT License - Use it however you want.


🙏 Acknowledgments

  • Inspired by the browser control capabilities in Google's AntiGravity IDE
  • Built with FastMCP and Playwright
  • Thanks to the MCP community for making AI-tool integration accessible

Built with ❤️ for the AI agent community.

⭐ Star on GitHub • Report Bug • Request Feature
