
web-speed-agent

Local browser automation + Web Speed API integration for authenticated web extraction.

Point an AI agent at any website — including ones that require login — and get back clean, structured data. Credentials stay on your machine. Only extracted HTML goes to the server.

pip install web-speed-agent
playwright install chromium

How it works

Your machine                           Web Speed server
─────────────────────────────────      ──────────────────────────
Playwright browser (local)
  ↓ navigates, logs in, clicks
  ↓ gets page HTML
  ↓ (no passwords sent)
agent.extract(html)         ────────→  Advanced extraction engine
                            ←────────  Structured JSON

Credentials never leave your machine. The server only sees HTML.


Quickstart

import asyncio
from web_speed_agent import Agent

async def main():
    agent = Agent(api_key="wsp_...")       # or set WEBSPEED_API_KEY env var

    # Public pages — no browser needed
    result = await agent.map("https://techcrunch.com/some-article/")
    print(result["article"]["sections"])

    # Authenticated pages — browser runs locally
    agent.store_credential("mysite", "me@example.com", "mypassword")

    async with agent.browser(session_name="mysite") as browser:
        page = await browser.new_page()
        await page.goto("https://mysite.com/login")

        username, password = agent.get_credential("mysite")
        await page.fill('[name="email"]', username)
        await page.fill('[name="password"]', password)
        await page.click('button[type="submit"]')
        await page.wait_for_load_state("networkidle")

        # Now on a logged-in page — extract it
        html = await page.content()
        result = await agent.extract(html, page_type="listing")
        print(result["listing"]["items"])

asyncio.run(main())

Get an API key at getwebspeed.io.


Installation

Requirements: Python 3.10+, a Web Speed API key

pip install web-speed-agent
playwright install chromium
export WEBSPEED_API_KEY="wsp_..."

Core concepts

Agent

The main class. Manages credentials, browser sessions, and API calls.

from web_speed_agent import Agent

# API key from argument
agent = Agent(api_key="wsp_...")

# API key from environment variable (recommended)
# export WEBSPEED_API_KEY="wsp_..."
agent = Agent()

# Use as async context manager (auto-closes HTTP client)
async with Agent() as agent:
    ...

Extracting public pages

No browser needed for pages that don't require login:

# Fetch + extract in one call
result = await agent.map("https://example.com/article")

# With JavaScript rendering (for heavy SPAs)
result = await agent.map("https://example.com/spa", js=True)

Extracting authenticated pages

Use a local browser session. The browser runs on your machine:

async with agent.browser(session_name="mysite") as browser:
    page = await browser.new_page()
    await page.goto("https://mysite.com/dashboard")
    html = await page.content()

result = await agent.extract(html)

The session_name persists cookies to ~/.webspeed/sessions/<name>/ so subsequent runs skip the login step.
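
To skip the login step when the persisted session is still valid, you can probe for a redirect first. A minimal sketch, assuming the site bounces unauthenticated visitors to a /login URL (the URL heuristic and selectors are illustrative; adapt them to your target site):

```python
import asyncio

def needs_login(current_url: str) -> bool:
    # Heuristic, not part of the SDK: many sites redirect
    # unauthenticated visitors to a /login URL.
    return "/login" in current_url

async def fetch_dashboard() -> dict:
    from web_speed_agent import Agent

    async with Agent() as agent:
        async with agent.browser(session_name="mysite") as browser:
            page = await browser.new_page()
            await page.goto("https://mysite.com/dashboard")

            if needs_login(page.url):
                # Persisted cookies were missing or expired; log in once
                user, pwd = agent.get_credential("mysite")
                await page.fill('[name="email"]', user)
                await page.fill('[name="password"]', pwd)
                await page.click('button[type="submit"]')
                await page.wait_for_load_state("networkidle")
                await page.goto("https://mysite.com/dashboard")

            html = await page.content()
        return await agent.extract(html)
```

On the second run the cookies restored from ~/.webspeed/sessions/mysite/ usually make needs_login() return False, so the login branch is skipped entirely.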


Credential management

Credentials are stored in your system keychain (macOS Keychain, Windows Credential Manager, Linux secret-tool). They are never sent to Web Speed servers.

# Store once
agent.store_credential("mysite", "me@example.com", "mypassword")

# Retrieve anywhere
username, password = agent.get_credential("mysite")

# Remove
agent.delete_credential("mysite")

Extraction output

The server returns page-type-aware structured data:

# Article
result = await agent.extract(html, page_type="article")
# result["page_type"]    → "article"
# result["title"]        → "Article Title"
# result["author"]       → "Jane Smith"
# result["published_date"] → "2026-05-06"
# result["article"]["sections"] → [{"heading": "...", "paragraphs": [...]}]
# result["article"]["links"]    → [{"text": "...", "url": "..."}]

# Product
result = await agent.extract(html, page_type="product")
# result["product"]["name"]         → "Wireless Headphones"
# result["product"]["price"]        → "$99.99"
# result["product"]["availability"] → "In Stock"
# result["product"]["rating"]       → "4.5"
# result["product"]["specs"]        → {"Battery": "30h", ...}

# Listing (search results, category pages)
result = await agent.extract(html, page_type="listing")
# result["listing"]["items"] → [{"title": "...", "url": "...", "price": "..."}]

# Auto-detect (default)
result = await agent.extract(html)
# result["page_type"] → "article" | "product" | "listing" | "other"

All results include engine: "advanced" — 60–85% more token-efficient than raw HTML.
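
Because page_type is always present in the result, downstream code can branch on it. A small sketch using the result shapes shown above; summarize() is our own helper, not part of the SDK:

```python
def summarize(result: dict) -> str:
    # Branch on the detected page type; the field names follow
    # the documented result shapes above.
    kind = result.get("page_type", "other")
    if kind == "article":
        return f"Article: {result.get('title', '?')}"
    if kind == "product":
        product = result.get("product", {})
        return f"{product.get('name', '?')} at {product.get('price', '?')}"
    if kind == "listing":
        items = result.get("listing", {}).get("items", [])
        return f"Listing with {len(items)} items"
    return "Unrecognized page"
```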


Examples

Price monitor

import asyncio
from web_speed_agent import Agent

async def check_price(url: str, site_name: str) -> str:
    async with Agent() as agent:
        agent.store_credential(site_name, "me@example.com", "password", overwrite=True)

        async with agent.browser(session_name=site_name) as browser:
            page = await browser.new_page()

            # Login
            await page.goto(f"https://{site_name}.com/login")
            user, pwd = agent.get_credential(site_name)
            await page.fill('[name="email"]', user)
            await page.fill('[name="password"]', pwd)
            await page.click('button[type="submit"]')
            await page.wait_for_load_state("networkidle")

            # Check product
            await page.goto(url)
            await page.wait_for_load_state("networkidle")
            html = await page.content()

        result = await agent.extract(html, page_type="product")
        return result.get("product", {}).get("price", "unknown")

price = asyncio.run(check_price("https://example.com/product/123", "example"))
print(f"Current price: {price}")

Read a private dashboard

import asyncio
from web_speed_agent import Agent

async def get_dashboard_data():
    async with Agent() as agent:
        async with agent.browser(session_name="analytics") as browser:
            page = await browser.new_page()

            # Login (first run only — session persists after)
            creds = agent.get_credential("analytics")
            if not creds:
                agent.store_credential("analytics", "me@company.com", "password")
                creds = agent.get_credential("analytics")

            await page.goto("https://analytics.company.com/login")
            await page.fill('[name="email"]', creds[0])
            await page.fill('[name="password"]', creds[1])
            await page.click('button[type="submit"]')
            await page.wait_for_load_state("networkidle")

            # Navigate to dashboard
            await page.goto("https://analytics.company.com/dashboard")
            await page.wait_for_selector(".metrics-table", timeout=10000)
            html = await page.content()

        result = await agent.extract(html)
        return result

asyncio.run(get_dashboard_data())

Multi-page scrape while logged in

import asyncio
from web_speed_agent import Agent

async def scrape_inbox():
    async with Agent() as agent:
        async with agent.browser(session_name="webmail") as browser:
            page = await browser.new_page()

            # Login
            await page.goto("https://mail.example.com/login")
            user, pwd = agent.get_credential("webmail")
            await page.fill('[name="username"]', user)
            await page.fill('[name="password"]', pwd)
            await page.click('[type="submit"]')
            await page.wait_for_load_state("networkidle")

            # Scrape multiple pages
            emails = []
            for page_num in range(1, 4):
                await page.goto(f"https://mail.example.com/inbox?page={page_num}")
                await page.wait_for_load_state("networkidle")
                html = await page.content()
                result = await agent.extract(html, page_type="listing")
                emails.extend(result.get("listing", {}).get("items", []))

        return emails

asyncio.run(scrape_inbox())

AI agent integration (MCP)

The included MCP server lets Claude Desktop, Gemini CLI, and any MCP-compatible agent use the SDK directly. The agent can log in, navigate, click, and extract — all through natural language.

Start the MCP server:

WEBSPEED_API_KEY="wsp_..." python3 agent_mcp_server.py

Add to Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "web-speed-agent": {
      "command": "python3",
      "args": ["/path/to/agent_mcp_server.py"],
      "env": {
        "WEBSPEED_API_KEY": "wsp_..."
      }
    }
  }
}

Add to Gemini CLI (~/.gemini/settings.json):

{
  "mcpServers": {
    "web-speed-agent": {
      "command": "python3.11",
      "args": ["/path/to/agent_mcp_server.py"],
      "env": {
        "WEBSPEED_API_KEY": "wsp_...",
        "PYTHONPATH": "/path/to/web-speed-agent"
      }
    }
  }
}

Then tell the agent:

"Store my credentials for united — username me@example.com, password mypassword"

"Log into united.com and find me the cheapest flight from SFO to JFK next Friday"

Available MCP tools:

Tool              Description
store_credential  Save a login to the system keychain
login             Open a browser and sign in
navigate          Go to a URL in the active session
extract_page      Get structured data from the current page
click             Click a button or link
fill_field        Type into a form field
submit_form       Submit a form
close_browser     End the browser session
account_info      Check the API credit balance

API reference

Agent

Agent(
    api_key: str | None = None,
    server_url: str | None = None,
    config_dir: str = "~/.webspeed",
    headless: bool = True,
)

Parameter   Description
api_key     Web Speed API key. Falls back to the WEBSPEED_API_KEY env var.
server_url  Override the API server URL. Default: https://api.getwebspeed.io.
config_dir  Directory for config, sessions, and logs. Default: ~/.webspeed.
headless    Run the browser headlessly. Default: True.

agent.browser()

agent.browser(
    session_name: str | None = None,
    headless: bool | None = None,
    proxy: str | None = None,
) -> ManagedBrowser

Returns an async context manager. Inside the block, call .new_page() to get a Playwright Page.

Parameter     Description
session_name  Persist cookies to ~/.webspeed/sessions/<name>/. None = no persistence.
headless      Override the instance-level headless setting for this session.
proxy         Proxy URL, e.g. "socks5://localhost:1080".

Session names must be alphanumeric + hyphens/underscores, max 64 chars.


agent.extract()

await agent.extract(
    html: str,
    page_type: str = "auto",
) -> dict

Sends HTML to the Web Speed API. Costs 1 credit.

Parameter  Description
html       Raw HTML string (e.g. from page.content()).
page_type  "article", "product", "listing", or "auto".

agent.map()

await agent.map(
    url: str,
    js: bool = False,
) -> dict

Fetches and extracts a public URL via the server. No local browser needed. Costs 1 credit.

Parameter  Description
url        Page URL. Must be http:// or https://.
js         Render JavaScript before extracting.

agent.account()

await agent.account() -> dict

Returns: credits, tier, status, lifetime (total/hits/misses).
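
Since each extract() or map() call costs one credit, it can be worth checking the balance before starting a batch run. A sketch: enough_credits() is our own helper and assumes only the documented "credits" field:

```python
import asyncio

def enough_credits(account: dict, pages_to_extract: int) -> bool:
    # One credit per extract()/map() call, per the pricing above
    return account.get("credits", 0) >= pages_to_extract

async def guarded_batch(urls: list[str]) -> list[dict]:
    from web_speed_agent import Agent

    async with Agent() as agent:
        info = await agent.account()
        if not enough_credits(info, len(urls)):
            raise RuntimeError(
                f"Need {len(urls)} credits, have {info.get('credits', 0)}"
            )
        return [await agent.map(url) for url in urls]
```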


agent.store_credential()

agent.store_credential(
    site: str,
    username: str,
    password: str,
    overwrite: bool = False,
) -> None

Saves to system keychain. Raises CredentialError if credential exists and overwrite=False.


agent.get_credential()

agent.get_credential(site: str) -> tuple[str, str] | None

Returns (username, password) or None if not found.
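
One convenient pattern is to fall back to an interactive prompt when nothing is stored yet. ensure_credential() below is our own wrapper around the documented calls, not part of the SDK:

```python
import getpass

def ensure_credential(agent, site: str) -> tuple[str, str]:
    # Reuse the stored credential, or prompt once and save it
    creds = agent.get_credential(site)
    if creds is None:
        username = input(f"Username for {site}: ")
        password = getpass.getpass(f"Password for {site}: ")
        agent.store_credential(site, username, password)
        creds = (username, password)
    return creds
```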


agent.delete_credential()

agent.delete_credential(site: str) -> None

Removes credential from keychain.


Exceptions

from web_speed_agent import (
    WebSpeedError,          # Base exception
    AuthenticationError,    # Invalid/missing API key
    InsufficientCreditsError, # No credits remaining
    APIError,               # API returned 4xx/5xx
    RateLimitError,         # 429 Too Many Requests
    CredentialError,        # Keychain error
    BrowserError,           # Playwright error
    NetworkError,           # Timeout or DNS failure
    PlaywrightNotInstalledError, # Run: playwright install chromium
)

from web_speed_agent import Agent, InsufficientCreditsError, NetworkError

try:
    result = await agent.extract(html)
except InsufficientCreditsError:
    print("Out of credits — top up at getwebspeed.io")
except NetworkError as e:
    print(f"Connection failed: {e}")

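Transient 429s can usually be retried after a short pause. A sketch: only RateLimitError comes from the SDK; the exponential backoff schedule is our own choice:

```python
import asyncio

def backoff_delays(attempts: int) -> list[int]:
    # 1s, 2s, 4s, ... between tries (our own schedule)
    return [2 ** i for i in range(attempts - 1)]

async def extract_with_retry(agent, html: str, attempts: int = 3) -> dict:
    from web_speed_agent import RateLimitError

    delays = backoff_delays(attempts)
    for attempt in range(attempts):
        try:
            return await agent.extract(html)
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the 429
            await asyncio.sleep(delays[attempt])
```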
Configuration

Environment variables

Variable             Description
WEBSPEED_API_KEY     API key (recommended over the config file)
WEBSPEED_SERVER_URL  Override the server URL (must be https://)

Config file

~/.webspeed/config.yaml — created automatically on first run. Permissions set to 0o600 (owner-only).

api:
  server_url: https://api.getwebspeed.io
  timeout: 30

browser:
  headless: true

Session files

Persisted browser sessions are stored in ~/.webspeed/sessions/<name>/storage.json.

  • Permissions: 0o600 (owner-only)
  • Contains: cookies, localStorage, sessionStorage
  • Safe to delete: agent will re-authenticate on next run
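
Forcing a fresh login is therefore just a matter of removing that directory. A sketch that mirrors the documented path layout and session-name allowlist; the helpers themselves are not part of the SDK:

```python
import re
import shutil
from pathlib import Path

def session_dir(name: str, config_dir: str = "~/.webspeed") -> Path:
    # Mirror the SDK's [a-zA-Z0-9_-] allowlist so an untrusted name
    # can never build a path-traversal path
    if not re.fullmatch(r"[a-zA-Z0-9_-]{1,64}", name):
        raise ValueError(f"invalid session name: {name!r}")
    return Path(config_dir).expanduser() / "sessions" / name

def reset_session(name: str) -> None:
    # Safe to delete: the agent simply re-authenticates on the next run
    shutil.rmtree(session_dir(name), ignore_errors=True)
```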

Security

What leaves your machine

When you call agent.extract(html), the page HTML is sent to the Web Speed API for processing. Everything else stays local.

Data                       Where it goes
Login credentials          Never leave your machine (system keychain only)
Browser cookies / session  Never leave your machine (local Playwright)
Page HTML                  Sent over HTTPS to the Web Speed API for extraction
Extracted JSON             Returned to you

HTML scrubbing (on by default)

Before any HTML is transmitted, the SDK automatically scrubs it locally:

  • Inline <script> and <style> blocks removed
  • Hidden form fields with auth-related names (csrf, token, nonce, session, etc.) have their values blanked
  • Sensitive <meta> content attributes cleared
  • HTML comments removed

Visible content — text, links, tables, headings, product data — is untouched.

# Default: scrubbing is on
result = await agent.extract(html)

# Turn off only if the page has no sensitive data
result = await agent.extract(html, scrub=False)

# Or scrub manually and inspect before sending
from web_speed_agent import scrub
clean_html = scrub(raw_html)
print(clean_html)  # inspect what will be sent
result = await agent.extract(clean_html, scrub=False)

Server-side data handling

  • HTML processed in-memory only — never written to disk, never logged, never cached
  • Auth-gated pages never cached — pages requiring login are explicitly excluded from the shared registry
  • Usage logs store only: a hash of your API key, a hash of the URL (or "sdk-extract"), timestamp, and detected page type — no content
  • No raw HTML in error responses — exceptions are sanitized before any error is returned

Other protections

  • Credentials stored in system keychain, never in files, never sent to servers
  • Session files written with 0o600 permissions (owner-only read/write)
  • Config directory created with 0o700 permissions
  • TLS always verified — verify=True on all HTTP calls, cannot be disabled
  • HTTPS enforced — server_url must start with https://, plain HTTP rejected
  • Path traversal prevention — session names validated against [a-zA-Z0-9_-] allowlist
  • No credential logging — passwords never appear in logs or error messages

License

GNU General Public License v3.0 — see LICENSE.

Web Speed API usage is subject to the Web Speed Terms of Service.
