Give an LLM a URL and a goal — it drives a real browser, fills forms, and returns structured data. The browser that scripts itself.

🦾 Browsewright

The browser that scripts itself.

Give an LLM a URL and a goal. It drives a real Chrome browser, fills out forms, gets past bot walls, and hands you structured data, not raw HTML.


Playwright automates a browser you script. Browsewright is the browser that scripts itself.

You don't write selectors. You don't maintain scrapers that break every time a site ships a redesign. You give it intent — "find the pricing", "enrich this lead", "fill out this form" — and an LLM drives a real browser to get it done.

pip install browsewright
bw "https://stripe.com" "what does this company do and who is it for"
============================================================
RESULT  [api]  412 tokens  3.1s
------------------------------------------------------------
Stripe is financial infrastructure for the internet. It provides
payment processing, billing, and treasury APIs for businesses from
startups to enterprises like Amazon and Shopify...
============================================================

🤯 It doesn't just read the web. It does things on the web.

Most "AI scrapers" hand you text. Browsewright acts. Point it at a real government records form with no API, give it a profile, and walk away:

bw-tasks form \
  "https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx" \
  --profile examples/sample_profile.json

It reads the field labels, maps your profile onto the form with an LLM, picks valid dropdown options, submits the form, and comes back with:

Page 1 of 815 results — real names and dates, extracted as JSON.

No selectors. No XPath. No API. The form has none — it's a 20-year-old ASP.NET page that's invisible to every HTTP scraper. Browsewright drives it like a human.


💸 And it's almost free

Benchmark — 50 real, diverse websites in one run: 50 / 50 extracted successfully · $0.047 total · ~1,200 tokens & ~20s median per site. 28% were answered by the free API/archive shortcut with no browser at all. (Reproduce it: python examples/batch_test.py.)
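
As a quick sanity check, the per-site cost and the number of shortcut hits follow from the benchmark figures by simple arithmetic. The variable names below are illustrative; the inputs are the numbers quoted above.

```python
# Figures quoted in the benchmark summary above.
total_sites = 50
total_cost_usd = 0.047
shortcut_share = 0.28  # fraction answered by the API/archive shortcut

cost_per_site = total_cost_usd / total_sites
shortcut_sites = round(total_sites * shortcut_share)
remaining_sites = total_sites - shortcut_sites

print(f"average cost per site: ${cost_per_site:.5f}")
print(f"sites answered by a shortcut: {shortcut_sites}")
print(f"sites not answered by a shortcut: {remaining_sites}")
```

On these figures the average works out to under a tenth of a cent per site.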

It tries the cheapest path first (open APIs, RSS, public archives) and only spins up Chrome when a page actually needs it. You pay pennies for the pages a shortcut can answer and reserve a real browser for the ones it can't.


How it stacks up

Browsewright Firecrawl Browser-Use Tavily
Returns structured JSON from intent ⚠️ scripted
Fills & submits real forms
Drives a real Chrome (human motor layer)
Gets past Cloudflare/DataDome bot walls ⚠️ ⚠️
Free API/archive shortcut before any browser
Runs fully local, your own API key ❌ SaaS ❌ SaaS
5 ready-made business tasks built in
MIT, self-hostable partial

Comparisons reflect typical default usage; all four are good tools. Browsewright's bet is intent in → action + structured data out, run locally for pennies.


Install

pip install browsewright          # core
pip install "browsewright[mcp]"   # + MCP server (Claude Desktop / Code / any client)

Or from source:

git clone https://github.com/krishnashakula/browsewright && cd browsewright
python -m venv .venv && . .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -e .

Add your Anthropic API key:

cp .env.example .env
# edit .env and paste your key from https://console.anthropic.com/settings/keys

The first browser run launches Chrome via nodriver (Chrome must be installed).

bw "not recognized" after install? pip put the scripts in a folder that isn't on your PATH (common on Windows). Use the module form, which does not depend on PATH:

python -m browsewright "<url>" "<goal>"
python -m browsewright.tasks_cli enrich "<url>"


Use it

CLI

bw "https://news.ycombinator.com" "the top story right now"
bw "https://example.com" "find the pricing" --json
bw "https://example.com" "debug this" --no-headless --verbose

Python

import asyncio
from browsewright import search

res = asyncio.run(search("https://stripe.com", "what does this company do"))
print(res.answer)         # synthesized answer
print(res.stage)          # "api" | "browser" | "common_crawl" | "blocked" | "error"
print(res.tokens_total, res.elapsed_s)
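
One way to consume the result is to branch on res.stage. The sketch below uses a stand-in SearchResult dataclass (hypothetical, mirroring only the four attributes shown above) so that it runs without a browser or an API key. Treating "blocked" and "error" as runs with no usable answer is an assumption based on the stage names, and the real object returned by search may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical stand-in with only the attributes shown above."""
    answer: str
    stage: str
    tokens_total: int
    elapsed_s: float


# Assumption: these two stage values mean no usable answer was produced.
FAILED_STAGES = {"blocked", "error"}


def describe(res: SearchResult) -> str:
    """Return a one-line summary, flagging runs that did not produce an answer."""
    if res.stage in FAILED_STAGES:
        return f"no answer (stage={res.stage})"
    return f"[{res.stage}] {res.tokens_total} tokens, {res.elapsed_s:.1f}s: {res.answer}"


demo = SearchResult("Stripe is financial infrastructure for the internet.", "api", 412, 3.1)
print(describe(demo))
```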

As an MCP tool (Claude Desktop / Claude Code / any MCP client)

{ "mcpServers": { "browsewright": { "command": "bw-mcp" } } }

Your LLM now has a read_page(url, goal) tool.
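
If a client already has other MCP servers configured, the entry can be merged in rather than pasted over the file. A minimal sketch, assuming the client stores its configuration as a JSON file with a top-level mcpServers object; the add_server helper and the file name are illustrative and not part of Browsewright.

```python
import json
import tempfile
from pathlib import Path


def add_server(config_path: Path, name: str, command: str) -> dict:
    """Merge one server entry into an MCP client config, keeping existing entries."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command}
    config_path.write_text(json.dumps(config, indent=2))
    return config


# Demo against a temporary file; point this at your client's real config instead.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcp_config.json"
    path.write_text(json.dumps({"mcpServers": {"other": {"command": "other-mcp"}}}))
    merged = add_server(path, "browsewright", "bw-mcp")
    print(json.dumps(merged, indent=2))
```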


The 5 built-in tasks — bw-tasks

One pipeline — fetch → structured extract (JSON) → diff/aggregate → action — exposed as five business workflows. Each is a CLI subcommand and a library function.

Task | Command | Output
🕵️ Competitor watch | bw-tasks watch <url> | Baseline now, change alerts later
🎯 Lead enrichment | bw-tasks enrich <url> | CRM fields + a personalized cold-email line
📝 Agentic form fill | bw-tasks form <url> --profile p.json | Understands fields, fills, submits, reads results
💰 Price/stock tracking | bw-tasks track <url> | Price & availability change alerts
📣 Brand monitoring | bw-tasks brand <name> <urls…> | Mentions + sentiment digest

Common flags: --json, --out FILE, --slack <webhook>, --no-headless, --aggressive.
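
When calling the tasks from another program, it can help to build the command line as a list rather than a string. The sketch below uses only the subcommands and flags listed above; the build_command helper is hypothetical, and nothing is executed here.

```python
import shlex

TASKS = {"watch", "enrich", "form", "track", "brand"}


def build_command(task, targets, *, json_out=False, out=None, slack=None, profile=None):
    """Assemble a bw-tasks argv list from the subcommands and flags listed above."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    argv = ["bw-tasks", task, *targets]
    if profile:
        argv += ["--profile", profile]
    if json_out:
        argv.append("--json")
    if out:
        argv += ["--out", out]
    if slack:
        argv += ["--slack", slack]
    return argv


cmd = build_command("track", ["https://example.com/item"], json_out=True, out="price.json")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute it
```

Passing a list to subprocess avoids shell quoting problems with URLs that contain characters such as & or ?.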

Real enrich output (trimmed):

{
  "company_name": "Tavily",
  "industry": "AI/SaaS - Developer Tools",
  "tech_stack_or_integrations": ["OpenAI", "Anthropic", "Groq", "Databricks"],
  "recent_news_or_signals": ["Raised $25M Series A", "Databricks MCP partnership"],
  "icp_fit_score_1_to_10": 7,
  "personalized_cold_email_first_line": "I noticed Tavily just partnered with Databricks on the MCP Marketplace—looks like you're doubling down on enterprise adoption after your $25M Series A."
}

Build your own task with the core primitive

Every task is a thin wrapper over extract_structured(url, schema). Define any schema, get JSON back:

import asyncio
from browsewright import extract_structured

schema = {"headline": "string",
          "open_roles": [{"title": "string", "team": "string", "location": "string"}]}
data = asyncio.run(extract_structured(
    "https://example.com/careers", schema,
    instruction="Extract the page headline and every open job posting."))
print(data["open_roles"])
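
Because the output comes from an LLM, it can be worth checking the returned JSON against the schema before using it. A minimal sketch, assuming the schema convention shown above (a "string" leaf, a dict for an object, a one-element list for a list of items); the conforms helper is hypothetical and not part of the library.

```python
def conforms(data, schema) -> bool:
    """Check that `data` has the shape described by `schema`."""
    if schema == "string":
        return isinstance(data, str)
    if isinstance(schema, dict):
        return (isinstance(data, dict)
                and all(key in data and conforms(data[key], sub)
                        for key, sub in schema.items()))
    if isinstance(schema, list) and len(schema) == 1:
        return isinstance(data, list) and all(conforms(item, schema[0]) for item in data)
    return False  # schema construct this sketch does not understand


schema = {"headline": "string",
          "open_roles": [{"title": "string", "team": "string", "location": "string"}]}
good = {"headline": "Join us",
        "open_roles": [{"title": "Engineer", "team": "Core", "location": "Remote"}]}
bad = {"headline": "Join us", "open_roles": [{"title": "Engineer"}]}  # missing fields

print(conforms(good, schema), conforms(bad, schema))
```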

Scheduling

Tasks are single-shot; snapshot/diff state persists between runs, so change detection works across invocations. Run on cron, n8n/Make/Zapier, or /loop:

# every 6h, alert on competitor pricing changes
0 */6 * * * bw-tasks watch "https://competitor.com/pricing" --slack https://hooks.slack.com/services/XXX
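
The change detection described above can be pictured as comparing each run's extracted JSON with the snapshot saved by the previous run. The sketch below is not Browsewright's implementation; it is a minimal illustration with a hypothetical detect_change helper and a snapshot file chosen for the demo.

```python
import json
import tempfile
from pathlib import Path


def detect_change(snapshot_path: Path, current: dict) -> dict:
    """Compare `current` with the stored snapshot, then overwrite the snapshot.

    Returns {key: (old, new)} for every top-level key whose value differs.
    The first run has no baseline, so it reports no changes.
    """
    previous = json.loads(snapshot_path.read_text()) if snapshot_path.exists() else None
    snapshot_path.write_text(json.dumps(current, sort_keys=True))
    if previous is None:
        return {}
    keys = previous.keys() | current.keys()
    return {k: (previous.get(k), current.get(k))
            for k in sorted(keys) if previous.get(k) != current.get(k)}


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "pricing.json"
    first = detect_change(snap, {"plan": "Pro", "price": 49})   # baseline run
    second = detect_change(snap, {"plan": "Pro", "price": 59})  # later run
    print(first, second)
```

Because the state lives in a file, each scheduled invocation can stay single-shot and still notice what changed since the last one.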

How it works

search(url, goal)
   │
   ├─ Polite gate ........ robots.txt check + per-host rate limit
   │
   ├─ Pre-flight pipeline (cheapest path first)
   │     1. Common Crawl ... public archive            (opt-in)
   │     2. Open API ....... RSS / wp-json / *.json     (no browser, ~1.5k tokens)
   │     3. Origin IP ...... CDN bypass                 (skipped in polite mode)
   │     4. Classifier ..... detect Cloudflare/Akamai/DataDome/…
   │
   └─ Browser session (only if no shortcut hit)
         • real headless Chrome via nodriver (native TLS fingerprint)
         • human motor layer — Bézier mouse, typing cadence, scroll pacing
         • LLM decides actions only at junctions (~1 call/page)
         • blind-scene shortcut: extract directly when the DOM scan is blocked
         • visual recovery: a vision call clears interstitials/challenges
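
The cheapest-path-first ordering in the diagram can be sketched as a list of strategies tried in turn, with the browser as the last resort. Everything here is a stand-in: the function names, the stubbed strategies, and the return shape are assumptions for illustration, not the library's internals.

```python
from typing import Callable, Optional

Strategy = Callable[[str], Optional[str]]


def try_common_crawl(url: str) -> Optional[str]:
    """Stub: pretend the public archive never has this page."""
    return None


def try_open_api(url: str) -> Optional[str]:
    """Stub: pretend only blog-like URLs expose a feed or JSON endpoint."""
    return "content from feed" if "blog" in url else None


def run_browser(url: str) -> str:
    """Stub for the expensive fallback."""
    return "content from browser"


def fetch(url: str, shortcuts: list[tuple[str, Strategy]]) -> tuple[str, str]:
    """Return (stage, content) from the first shortcut that hits, else the browser."""
    for stage, strategy in shortcuts:
        content = strategy(url)
        if content is not None:
            return stage, content
    return "browser", run_browser(url)


shortcuts = [("common_crawl", try_common_crawl), ("api", try_open_api)]
print(fetch("https://example.com/blog", shortcuts))
print(fetch("https://example.com/app", shortcuts))
```

Keeping the shortcuts in an ordered list makes it easy to add or skip a step without touching the fallback logic.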

Polite by default

Polite mode is the default and what you should ship. It checks robots.txt, rate-limits per host, and does not bypass CDN bot protection. --aggressive (polite=False) enables origin-IP discovery and ignores robots.txt; use it only on targets you own or are authorized to test.
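
The two polite-mode checks can be illustrated with the standard library alone. This is a sketch under assumptions, not the project's code: the PoliteGate class, its method names, and the interval are hypothetical, and the robots.txt text is passed in directly so the example needs no network access.

```python
import time
from typing import Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Combine a robots.txt check with a minimum delay between hits per host."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._last_hit = {}  # host -> time of the last recorded request

    def allowed_by_robots(self, robots_txt: str, url: str, agent: str = "*") -> bool:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())  # text already fetched elsewhere
        return parser.can_fetch(agent, url)

    def wait_time(self, url: str, now: Optional[float] = None) -> float:
        """Seconds to wait before hitting this URL's host; records the hit when 0."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self._last_hit.get(host)
        remaining = 0.0 if last is None else max(0.0, self.min_interval_s - (now - last))
        if remaining == 0.0:
            self._last_hit[host] = now
        return remaining


robots = "User-agent: *\nDisallow: /private/\n"
gate = PoliteGate(min_interval_s=2.0)
print(gate.allowed_by_robots(robots, "https://example.com/pricing"))
print(gate.allowed_by_robots(robots, "https://example.com/private/x"))
print(gate.wait_time("https://example.com/a", now=0.0))
print(gate.wait_time("https://example.com/b", now=0.5))
```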

⚠️ You are responsible for complying with each site's Terms of Service, applicable law (CFAA and equivalents), and data-protection rules (GDPR/CCPA). Browsewright is for authorized research, your own properties, and sites whose terms permit automated access. The authors accept no liability for misuse.


⭐ Star it / contribute

If Browsewright saved you a scraper, drop a star — it's the whole reason this is open source. Issues and PRs welcome: pre-flight vendors, new tasks, more sites in the benchmark.

MIT licensed. Built on nodriver.

Download files

Source Distribution

browsewright-0.1.0.tar.gz (53.4 kB)

Uploaded Source

Built Distribution

browsewright-0.1.0-py3-none-any.whl (56.0 kB)

Uploaded Python 3

File details

Details for the file browsewright-0.1.0.tar.gz.

File metadata

  • Download URL: browsewright-0.1.0.tar.gz
  • Upload date:
  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for browsewright-0.1.0.tar.gz
Algorithm Hash digest
SHA256 97d70a7e97be4cf7f72ba3cf9f1d8d758dff0e53b458d3745e84a17a4268a6fc
MD5 02f67356748d0a93520b993f0ddaed6e
BLAKE2b-256 a9f9daea876849f61288d1709e59cdd0e9f411c02d0bcbbf97a8eb3c11bf7746

Provenance

The following attestation bundles were made for browsewright-0.1.0.tar.gz:

Publisher: publish.yml on krishnashakula/browsewright

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file browsewright-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: browsewright-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 56.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for browsewright-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d4a437044d799e19c26a12c3d84369e657795343c027bd11a7efa5a16f1bf608
MD5 1d558a3247271207b2f31b90ad300954
BLAKE2b-256 4a04b969b5f78595bfedf36c8fa16c6db43232e0efd59914ba5c5fba600009da

Provenance

The following attestation bundles were made for browsewright-0.1.0-py3-none-any.whl:

Publisher: publish.yml on krishnashakula/browsewright

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
