Python SDK for the Webclaw web extraction API

webclaw

Note: The webclaw Cloud API is public. Create an API key at webclaw.io or use the open-source CLI/MCP for local extraction.


Installation

pip install webclaw

Requires Python 3.9+. The only dependency is httpx.

Quick Start

Sync

from webclaw import Webclaw

client = Webclaw("wc-YOUR_API_KEY")

result = client.scrape("https://example.com", formats=["markdown"])
print(result.markdown)

Async

import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        result = await client.scrape("https://example.com", formats=["markdown"])
        print(result.markdown)

asyncio.run(main())

Both clients expose the same methods with identical signatures, so every sync method has an async equivalent. The examples below use the sync client for brevity.

Endpoints

Scrape

Extract content from a single URL. Supports multiple output formats: "markdown", "text", "llm", "json".

result = client.scrape(
    "https://example.com",
    formats=["markdown", "text", "llm"],
    include_selectors=["article", ".content"],
    exclude_selectors=["nav", "footer"],
    only_main_content=True,
    no_cache=True,
)

result.url        # str
result.markdown   # str | None
result.text       # str | None
result.llm        # str | None
result.json_data  # Any | None
result.metadata   # dict
result.cache      # CacheInfo | None  (.status: "hit" | "miss" | "bypass")
result.warning    # str | None
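
Because each format field is None unless it was requested, calling code often needs a fallback order. The sketch below is illustrative only: first_available and its preference order are hypothetical helpers, not part of the SDK, and a SimpleNamespace stands in for a real scrape result so the snippet runs offline.

```python
from types import SimpleNamespace
from typing import Optional

# Preference order is an illustrative choice, not something the SDK defines.
PREFERRED_FIELDS = ("markdown", "llm", "text")


def first_available(result, fields=PREFERRED_FIELDS) -> Optional[str]:
    """Return the first non-empty text field on a scrape result, or None."""
    for name in fields:
        value = getattr(result, name, None)
        if value:
            return value
    return None


# Stand-in for a scrape result where only "text" came back populated.
demo = SimpleNamespace(url="https://example.com", markdown=None, text="Hello", llm=None)
print(first_available(demo))
```

With a real client, the same helper would be applied to the object returned by client.scrape(...).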

Vertical extractors

28 site-specific extractors that return typed JSON (GitHub, Reddit, Amazon, YouTube, PyPI, HuggingFace, Trustpilot, etc.) instead of generic markdown. See the catalog for the full list.

# Discover available extractors
catalog = client.list_extractors()
for e in catalog["extractors"]:
    print(e["name"], "-", e["label"])

# Run a specific extractor
pr = client.scrape_vertical(
    "github_pr",
    "https://github.com/rust-lang/rust/pull/123456",
)
print(pr["data"])  # {title, state, author, commits, reviews, ...}

# Amazon product as typed JSON
product = client.scrape_vertical(
    "amazon_product",
    "https://www.amazon.com/dp/B0C6KKQ7ND",
)
print(product["data"]["price"], product["data"]["rating"])

The data field is extractor-specific; call list_extractors() to discover what each returns. Both methods have async equivalents on AsyncWebclaw.
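
One way to avoid calling an extractor that does not exist is to check the catalog first. This is a minimal sketch under stated assumptions: scrape_if_supported is a hypothetical helper, and FakeClient is an offline stand-in that mimics only the two documented methods and the response shapes shown above.

```python
def scrape_if_supported(client, extractor: str, url: str):
    """Call scrape_vertical only when the catalog lists the extractor."""
    catalog = client.list_extractors()
    names = {e["name"] for e in catalog["extractors"]}
    if extractor not in names:
        raise ValueError(f"unknown extractor {extractor!r}; available: {sorted(names)}")
    return client.scrape_vertical(extractor, url)


class FakeClient:
    """Offline stand-in; real responses will carry extractor-specific data."""

    def list_extractors(self):
        return {"extractors": [{"name": "github_pr", "label": "GitHub PR"}]}

    def scrape_vertical(self, extractor, url):
        return {"data": {"extractor": extractor, "url": url}}


response = scrape_if_supported(FakeClient(), "github_pr", "https://github.com/rust-lang/rust/pull/123456")
print(response["data"])
```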

Search

Web search with optional topic filtering.

results = client.search("web scraping tools 2026", num_results=10, topic="tech")

for r in results["results"]:
    print(r["title"], r["url"])

Parameters: query (str), num_results (int, optional), topic (str, optional).

Map

Discover URLs via sitemap.

result = client.map("https://example.com")

print(result.count)
for url in result.urls:
    print(url)
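
A mapped URL list is often narrowed to one section of a site before scraping. The helper below is a hypothetical sketch that uses only the standard library; it assumes result.urls is a list of absolute URL strings, as the loop above suggests.

```python
from urllib.parse import urlsplit


def under_path(urls, prefix):
    """Keep only URLs whose path component starts with the given prefix."""
    return [u for u in urls if urlsplit(u).path.startswith(prefix)]


# Illustrative input; with the SDK this would be result.urls.
sample_urls = [
    "https://example.com/blog/a",
    "https://example.com/about",
    "https://example.com/blog/b?x=1",
]
blog_urls = under_path(sample_urls, "/blog/")
print(blog_urls)
```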

Batch

Scrape multiple URLs in parallel.

result = client.batch(
    ["https://a.com", "https://b.com", "https://c.com"],
    formats=["markdown"],
    concurrency=5,
)

for item in result.results:
    print(item.url, item.markdown, item.error or "ok")

Parameters: urls (list[str]), formats (optional), concurrency (int, default 5).
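
Since individual URLs in a batch can fail, it can help to separate failures before processing. The following sketch assumes item.error is falsy (for example None) on success, which matches the item.error or "ok" idiom above; partition_batch is a hypothetical helper and the items are stand-ins.

```python
from types import SimpleNamespace


def partition_batch(items):
    """Split batch items into (succeeded, failed) based on item.error."""
    succeeded = [i for i in items if not i.error]
    failed = [i for i in items if i.error]
    return succeeded, failed


# Stand-ins for result.results; the error text is made up for illustration.
items = [
    SimpleNamespace(url="https://a.com", markdown="# A", error=None),
    SimpleNamespace(url="https://b.com", markdown=None, error="timeout"),
]
ok, bad = partition_batch(items)
print([i.url for i in ok], [(i.url, i.error) for i in bad])
```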

Endpoints

Discover the API endpoints a page calls at runtime by scanning its inline JavaScript and external <script src> bundles. This surfaces the routes a single-page app hits that map (sitemap-based) can't see: relative paths, absolute URLs, GraphQL operations, and WebSocket endpoints.

result = client.endpoints(
    "https://app.example.com",
    include_third_party=False,  # default: skip analytics/CDN hosts
    max_bundles=20,             # default & server max: external scripts to scan
)

print(result.bundles_scanned, result.endpoint_count, result.truncated)
print(result.hosts)  # list[str] of hosts seen across endpoints

for e in result.endpoints:
    print(e.kind, e.value, "first-party" if e.first_party else "third-party", "via", e.source)

Each endpoint's kind is one of "relative_path", "absolute_url", "graph_ql", "web_socket" (the values in EndpointKind). truncated is True when more bundles existed than max_bundles allowed.

Parameters: url (str), include_third_party (bool, default False), max_bundles (int, default 20, capped at 20).
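
To get an overview of what a page calls, the endpoints can be grouped by kind. This is an illustrative sketch: first_party_by_kind is a hypothetical helper, and the sample entries (including the source values) are invented stand-ins for result.endpoints.

```python
from collections import defaultdict
from types import SimpleNamespace


def first_party_by_kind(endpoints):
    """Group first-party endpoint values by their kind."""
    grouped = defaultdict(list)
    for e in endpoints:
        if e.first_party:
            grouped[e.kind].append(e.value)
    return dict(grouped)


# Invented sample data shaped like the documented endpoint fields.
demo = [
    SimpleNamespace(kind="relative_path", value="/api/users", first_party=True, source="inline"),
    SimpleNamespace(kind="graph_ql", value="query GetUser", first_party=True, source="inline"),
    SimpleNamespace(kind="absolute_url", value="https://cdn.example.net/t.js", first_party=False, source="inline"),
]
grouped = first_party_by_kind(demo)
print(grouped)
```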

Extract

LLM-powered structured data extraction. Use either a JSON schema or a natural language prompt.

# Schema-based extraction
result = client.extract(
    "https://example.com/pricing",
    schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            }
        },
    },
)
print(result.data)  # dict matching your schema

# Prompt-based extraction
result = client.extract(
    "https://example.com/pricing",
    prompt="Extract all pricing tiers with names and monthly prices",
)
print(result.data)
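
Nested JSON schemas get verbose quickly. As a sketch, small builder functions can assemble the same pricing schema used above; obj, array_of, and string are hypothetical helpers that produce plain dicts and are not part of the SDK.

```python
def string():
    """Schema for a string value (a fresh dict each call, to avoid aliasing)."""
    return {"type": "string"}


def obj(**fields):
    """Schema for an object, built from name -> sub-schema pairs."""
    return {"type": "object", "properties": dict(fields)}


def array_of(item_schema):
    """Schema for an array whose items all follow item_schema."""
    return {"type": "array", "items": item_schema}


# Equivalent to the hand-written pricing schema in the example above.
pricing_schema = obj(plans=array_of(obj(name=string(), price=string())))
print(pricing_schema)
```

The resulting dict could then be passed as the schema argument to client.extract(...).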

Summarize

Summarize page content with an optional sentence limit.

result = client.summarize("https://example.com", max_sentences=3)
print(result.summary)

Diff

Detect content changes at a URL since the last check.

result = client.diff("https://example.com/status")

print(result["has_changed"])  # bool
print(result["diff"])         # str, unified diff of changes
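
A unified diff can be summarized before deciding whether a change is worth acting on. The sketch below counts added and removed lines; diff_stats is a hypothetical helper, the sample diff is invented, and the exact diff text the API returns may differ in its headers.

```python
def diff_stats(unified_diff: str):
    """Count (added, removed) content lines in a unified diff string.

    Limitation: lines starting with "+++" or "---" are treated as file
    headers, so a changed line whose own content starts with "++" or "--"
    would be skipped.
    """
    added = removed = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


# Invented sample; with the SDK this would be result["diff"].
sample_diff = "\n".join([
    "--- previous",
    "+++ current",
    "@@ -1,2 +1,2 @@",
    " Status: OK",
    "-Latency: 120ms",
    "+Latency: 450ms",
])
print(diff_stats(sample_diff))
```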

Brand

Extract brand identity (colors, fonts, logos) from a URL.

result = client.brand("https://example.com")
print(result.data)  # dict with brand identity fields

Research

Deep research that searches, reads, and synthesizes information from multiple sources. It runs as an asynchronous job on the server: the SDK starts the job and polls until it completes, so the call blocks until the report is ready.

# Blocks until research completes (up to 600s, or 1200s with deep=True)
result = client.research(
    "How do modern web crawlers handle JavaScript rendering?",
    max_sources=15,
    deep=True,
    topic="tech",
)

print(result.report)
print(result.iterations)
print(result.elapsed_ms)

for source in result.sources:
    print(source["url"], source["title"])

To check status without blocking:

status = client.get_research_status("job-id-here")
print(status.status)  # "running" | "completed" | "failed"

Parameters: query (str), deep (bool, default False), max_sources (int, optional), max_iterations (int, optional), topic (str, optional).
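
When the job id is already known, status checks can be wrapped in a polling loop with its own time budget. This is a sketch, not the SDK's internal polling: poll_research is a hypothetical helper, it raises Python's built-in TimeoutError, and FakeClient only mimics get_research_status with the three documented status values. How a job id is obtained is not covered here.

```python
import time
from types import SimpleNamespace


def poll_research(client, job_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_research_status until the job leaves the "running" state."""
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_research_status(job_id)
        if status.status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"research job {job_id} still running after {timeout}s")
        sleep(interval)


class FakeClient:
    """Offline stand-in that reports "running" twice, then "completed"."""

    def __init__(self):
        self.calls = 0

    def get_research_status(self, job_id):
        self.calls += 1
        state = "completed" if self.calls >= 3 else "running"
        return SimpleNamespace(status=state)


fake = FakeClient()
final = poll_research(fake, "job-id-here", sleep=lambda seconds: None)
print(final.status, fake.calls)
```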

Crawl

Start an async crawl that follows links from a seed URL.

job = client.crawl(
    "https://example.com",
    max_depth=3,
    max_pages=100,
    use_sitemap=True,
)

# Poll until complete (default timeout 300s)
status = job.wait(interval=2.0, timeout=300.0)

print(status.total, status.completed, status.errors)
for page in status.pages:
    print(page.url, len(page.markdown or ""))

Check status without waiting:

status = job.get_status()
print(status.status)  # "running" | "completed" | "failed"

Async variant (client is an AsyncWebclaw instance, and the calls run inside a coroutine):

job = await client.crawl("https://example.com", max_depth=2)
status = await job.wait()

Watch

Monitor URLs for content changes with automatic periodic checks.

Create a watch:

watch = client.watch_create(
    "https://example.com/pricing",
    name="Pricing page monitor",
    interval_minutes=60,
    webhook_url="https://hooks.example.com/webclaw",
)
print(watch.id, watch.status)

List all watches:

result = client.watch_list(limit=50, offset=0)
for w in result.watches:
    print(w.id, w.url, w.name, w.last_checked)
print(result.total)

Get a single watch:

watch = client.watch_get("watch-id-here")
print(watch.url, watch.interval_minutes)

Delete a watch:

client.watch_delete("watch-id-here")

Trigger an immediate check:

check = client.watch_check("watch-id-here")
print(check.has_changed)  # bool
print(check.diff)         # str | None
print(check.checked_at)   # ISO timestamp
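
With more watches than fit in one page, limit and offset can drive a simple pagination loop. The sketch below assumes result.total is the total number of watches across all pages, which the source does not state explicitly; iter_watches is a hypothetical helper and FakeClient is an offline stand-in for watch_list.

```python
from types import SimpleNamespace


def iter_watches(client, page_size=50):
    """Yield every watch by walking watch_list pages with limit/offset."""
    offset = 0
    while True:
        page = client.watch_list(limit=page_size, offset=offset)
        if not page.watches:
            return
        yield from page.watches
        offset += len(page.watches)
        if offset >= page.total:
            return


class FakeClient:
    """Offline stand-in that serves slices of an in-memory watch list."""

    def __init__(self, count):
        self._all = [SimpleNamespace(id=f"w{i}") for i in range(count)]

    def watch_list(self, limit, offset):
        return SimpleNamespace(watches=self._all[offset:offset + limit], total=len(self._all))


watch_ids = [w.id for w in iter_watches(FakeClient(5), page_size=2)]
print(watch_ids)
```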

Error Handling

All errors inherit from WebclawError, which carries the HTTP status code when available.

from webclaw import (
    WebclawError,
    AuthenticationError,
    NotFoundError,
    RateLimitError,
    TimeoutError,  # note: shadows Python's built-in TimeoutError in this module
)

try:
    result = client.scrape("https://example.com")
except AuthenticationError:
    print("Invalid or missing API key")
except RateLimitError:
    print("Too many requests, slow down")
except NotFoundError:
    print("Resource not found")
except TimeoutError as e:
    print(f"Operation timed out: {e}")
except WebclawError as e:
    print(f"API error (status {e.status_code}): {e}")

| Exception | HTTP Status | When |
|---|---|---|
| AuthenticationError | 401 / 403 | Invalid or missing API key |
| NotFoundError | 404 | Resource does not exist |
| RateLimitError | 429 | Too many requests |
| TimeoutError | -- | Crawl/research polling exceeded timeout |
| WebclawError | Any | Base class for all other API errors |
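
Rate-limit errors are usually transient, so one common pattern is to retry with exponential backoff. The sketch below is generic and runs offline: with_retries is a hypothetical helper, FakeRateLimit stands in for RateLimitError, and the delays are illustrative rather than values recommended by the API.

```python
import time


def with_retries(call, retry_on, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retry_on exception, back off (1x, 2x, 4x, ...) and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


class FakeRateLimit(Exception):
    """Stand-in for the SDK's RateLimitError."""


delays = []
calls = {"n": 0}


def flaky():
    """Fail twice with the stand-in error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit("429")
    return "ok"


outcome = with_retries(flaky, retry_on=FakeRateLimit, sleep=delays.append)
print(outcome, delays)
```

With the SDK, the same helper could wrap lambda: client.scrape(url) with retry_on=RateLimitError.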

Configuration

import os
from webclaw import Webclaw

client = Webclaw(
    os.environ["WEBCLAW_API_KEY"],
    base_url="https://api.webclaw.io",  # default
    timeout=60.0,                        # seconds, default 30
)
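
Indexing os.environ directly raises a bare KeyError when the variable is missing. A small wrapper can give a clearer message; require_api_key is a hypothetical helper, and only the WEBCLAW_API_KEY name comes from the example above.

```python
import os


def require_api_key(env=os.environ, name="WEBCLAW_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating a client")
    return key


# Demonstrated with a plain dict so the snippet does not depend on the real environment.
print(require_api_key({"WEBCLAW_API_KEY": " wc-YOUR_API_KEY "}))
```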

Both Webclaw and AsyncWebclaw support context managers for automatic cleanup:

# Sync
with Webclaw("wc-YOUR_API_KEY") as client:
    result = client.scrape("https://example.com")

# Async (inside a coroutine)
async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com")

Async Usage

Every endpoint is available on AsyncWebclaw with identical parameters. Use await on all method calls and async with for the context manager.

import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        # Run multiple scrapes concurrently
        results = await asyncio.gather(
            client.scrape("https://a.com", formats=["markdown"]),
            client.scrape("https://b.com", formats=["markdown"]),
            client.scrape("https://c.com", formats=["markdown"]),
        )
        for r in results:
            print(r.url, len(r.markdown or ""))

asyncio.run(main())
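
For longer URL lists, an unbounded gather can open many requests at once. The sketch below caps in-flight scrapes with asyncio.Semaphore; scrape_all and its limit parameter are hypothetical, and FakeAsyncClient is an offline stand-in that mimics only an awaitable scrape method.

```python
import asyncio
from types import SimpleNamespace


async def scrape_all(client, urls, limit=5):
    """Scrape urls concurrently, with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:
            return await client.scrape(url, formats=["markdown"])

    # return_exceptions=True keeps one failure from cancelling the rest.
    return await asyncio.gather(*(one(u) for u in urls), return_exceptions=True)


class FakeAsyncClient:
    """Offline stand-in that records the peak number of concurrent calls."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def scrape(self, url, formats=None):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0)  # yield control, as a real network call would
        self.active -= 1
        return SimpleNamespace(url=url, markdown="# " + url)


fake = FakeAsyncClient()
urls = [f"https://example.com/{i}" for i in range(6)]
results = asyncio.run(scrape_all(fake, urls, limit=2))
print(len(results), fake.peak)
```

Results come back in the same order as the input URLs; any entry may be an exception instance instead of a response, so check before reading its fields.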

Type Support

This package ships with a py.typed marker (PEP 561). Type checkers like mypy and pyright will pick up all type annotations automatically. All response types are dataclasses importable from the top-level package:

from webclaw import ScrapeResponse, CrawlStatus, MapResponse, ExtractResponse, EndpointsResponse

License

MIT
