webclaw

Python SDK for the Webclaw web extraction API

Note: The webclaw Cloud API is public. Create an API key at webclaw.io or use the open-source CLI/MCP for local extraction.

Installation

pip install webclaw

Requires Python 3.9+. The only dependency is httpx.

Quick Start

Sync

from webclaw import Webclaw

client = Webclaw("wc-YOUR_API_KEY")

result = client.scrape("https://example.com", formats=["markdown"])
print(result.markdown)

Async

import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        result = await client.scrape("https://example.com", formats=["markdown"])
        print(result.markdown)

asyncio.run(main())

Both clients support identical method signatures. Every sync method has an async equivalent. The examples below use the sync client for brevity.

Endpoints

Scrape

Extract content from a single URL. Supports multiple output formats: "markdown", "text", "llm", "json".

result = client.scrape(
    "https://example.com",
    formats=["markdown", "text", "llm"],
    include_selectors=["article", ".content"],
    exclude_selectors=["nav", "footer"],
    only_main_content=True,
    no_cache=True,
)

result.url        # str
result.markdown   # str | None
result.text       # str | None
result.llm        # str | None
result.json_data  # Any | None
result.metadata   # dict
result.cache      # CacheInfo | None  (.status: "hit" | "miss" | "bypass")
result.warning    # str | None
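
Since each format field is typed `str | None`, downstream code usually needs a fallback order. The following is a minimal sketch that relies only on the attribute names listed above. `first_available` is a hypothetical helper, not part of the SDK, and the demo object is a stand-in so the snippet runs offline.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence


def first_available(
    result: Any, order: Sequence[str] = ("markdown", "llm", "text")
) -> Optional[str]:
    """Return the first non-empty format field on a scrape result, or None."""
    for name in order:
        value = getattr(result, name, None)
        if value:
            return value
    return None


# Stand-in object with the same attribute names as the result above.
demo = SimpleNamespace(
    url="https://example.com", markdown=None, llm=None, text="Example Domain"
)
print(first_available(demo))
```

With a real client, the same call would be `first_available(client.scrape(url, formats=["markdown", "text"]))`.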

Vertical extractors

Webclaw provides 28 site-specific extractors (GitHub, Reddit, Amazon, YouTube, PyPI, HuggingFace, Trustpilot, and others) that return typed JSON instead of generic markdown. See the catalog for the full list.

# Discover available extractors
catalog = client.list_extractors()
for e in catalog["extractors"]:
    print(e["name"], "-", e["label"])

# Run a specific extractor
pr = client.scrape_vertical(
    "github_pr",
    "https://github.com/rust-lang/rust/pull/123456",
)
print(pr["data"])  # {title, state, author, commits, reviews, ...}

# Amazon product as typed JSON
product = client.scrape_vertical(
    "amazon_product",
    "https://www.amazon.com/dp/B0C6KKQ7ND",
)
print(product["data"]["price"], product["data"]["rating"])

The data field is extractor-specific; call list_extractors() to discover what each returns. Both methods have async equivalents on AsyncWebclaw.
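
It can be useful to confirm that an extractor exists in the catalog before calling scrape_vertical. The sketch below assumes only the catalog shape used above (an "extractors" list of dicts with "name" and "label" keys). `find_extractor` is a hypothetical helper, and the sample labels are made up for illustration.

```python
from typing import Any, Dict, Optional


def find_extractor(catalog: Dict[str, Any], name: str) -> Optional[Dict[str, Any]]:
    """Look up one extractor entry by its "name" key, or return None."""
    for entry in catalog.get("extractors", []):
        if entry.get("name") == name:
            return entry
    return None


# Hand-written stand-in for the dict returned by client.list_extractors();
# the label strings are illustrative, not taken from the real catalog.
sample_catalog = {
    "extractors": [
        {"name": "github_pr", "label": "GitHub pull request"},
        {"name": "amazon_product", "label": "Amazon product"},
    ]
}

entry = find_extractor(sample_catalog, "github_pr")
print(entry["label"] if entry else "extractor not available")
```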

Search

Web search with optional topic filtering.

results = client.search("web scraping tools 2026", num_results=10, topic="tech")

for r in results["results"]:
    print(r["title"], r["url"])

Parameters: query (str), num_results (int, optional), topic (str, optional).

Map

Discover URLs via sitemap.

result = client.map("https://example.com")

print(result.count)
for url in result.urls:
    print(url)
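
A common follow-up is to narrow the discovered URLs to one section of the site before scraping them. This sketch assumes result.urls is a list of absolute URL strings, as the loop above suggests. `urls_under` is a hypothetical helper built on the standard library only.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def urls_under(urls: Iterable[str], prefix: str) -> List[str]:
    """Keep only URLs whose path starts with the given prefix."""
    return [u for u in urls if urlparse(u).path.startswith(prefix)]


# Sample data standing in for result.urls.
sample_urls = [
    "https://example.com/",
    "https://example.com/blog/hello",
    "https://example.com/blog/world?ref=home",
    "https://example.com/pricing",
]
print(urls_under(sample_urls, "/blog/"))
```

The filtered list could then be passed to client.batch.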

Batch

Scrape multiple URLs in parallel.

result = client.batch(
    ["https://a.com", "https://b.com", "https://c.com"],
    formats=["markdown"],
    concurrency=5,
)

for item in result.results:
    print(item.url, item.markdown, item.error or "ok")

Parameters: urls (list[str]), formats (optional), concurrency (int, default 5).
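
Because each batch item carries its own error field, one failed URL need not invalidate the rest. Below is a minimal sketch that splits items into successes and failures. It assumes only the url, markdown, and error attributes shown above. `partition_batch` is a hypothetical helper, and the demo items are stand-ins.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def partition_batch(items: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Split batch items into (succeeded, failed) based on item.error."""
    succeeded: List[Any] = []
    failed: List[Any] = []
    for item in items:
        (failed if getattr(item, "error", None) else succeeded).append(item)
    return succeeded, failed


# Stand-ins mimicking the url/markdown/error attributes shown above.
demo_items = [
    SimpleNamespace(url="https://a.com", markdown="# A", error=None),
    SimpleNamespace(url="https://b.com", markdown=None, error="timeout"),
]
ok, bad = partition_batch(demo_items)
print([i.url for i in ok], [i.url for i in bad])
```

With a real client this would be `partition_batch(result.results)`, and the failed URLs could be retried in a second batch call.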

Endpoints

Discover the API endpoints a page calls at runtime by scanning its inline JavaScript and external <script src> bundles. This surfaces the routes a single-page app hits that map (sitemap-based) can't see: relative paths, absolute URLs, GraphQL operations, and WebSocket endpoints.

result = client.endpoints(
    "https://app.example.com",
    include_third_party=False,  # default: skip analytics/CDN hosts
    max_bundles=20,             # default & server max: external scripts to scan
)

print(result.bundles_scanned, result.endpoint_count, result.truncated)
print(result.hosts)  # list[str] of hosts seen across endpoints

for e in result.endpoints:
    print(e.kind, e.value, "first-party" if e.first_party else "third-party", "via", e.source)

Each endpoint's kind is one of "relative_path", "absolute_url", "graph_ql", "web_socket" (the values in EndpointKind). truncated is True when more bundles existed than max_bundles allowed.

Parameters: url (str), include_third_party (bool, default False), max_bundles (int, default 20, capped at 20).
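
To turn the flat endpoint list into something easier to scan, the results can be grouped by kind. The sketch assumes only the kind, value, and first_party attributes used above. `group_first_party` is a hypothetical helper, and the demo entries are invented sample data.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Set


def group_first_party(endpoints: Iterable[Any]) -> Dict[str, List[str]]:
    """Map each kind to the sorted, de-duplicated first-party endpoint values."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for e in endpoints:
        if e.first_party:
            grouped[e.kind].add(e.value)
    return {kind: sorted(values) for kind, values in grouped.items()}


# Invented entries standing in for result.endpoints.
demo_endpoints = [
    SimpleNamespace(kind="relative_path", value="/api/users", first_party=True),
    SimpleNamespace(kind="relative_path", value="/api/users", first_party=True),
    SimpleNamespace(kind="graph_ql", value="GetViewer", first_party=True),
    SimpleNamespace(kind="absolute_url", value="https://cdn.example.net/t.js", first_party=False),
]
grouped = group_first_party(demo_endpoints)
for kind, values in grouped.items():
    print(kind, values)
```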

Extract

LLM-powered structured data extraction. Use either a JSON schema or a natural language prompt.

# Schema-based extraction
result = client.extract(
    "https://example.com/pricing",
    schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            }
        },
    },
)
print(result.data)  # dict matching your schema

# Prompt-based extraction
result = client.extract(
    "https://example.com/pricing",
    prompt="Extract all pricing tiers with names and monthly prices",
)
print(result.data)

Summarize

Summarize page content with an optional sentence limit.

result = client.summarize("https://example.com", max_sentences=3)
print(result.summary)

Diff

Detect content changes at a URL since the last check.

result = client.diff("https://example.com/status")

print(result["has_changed"])  # bool
print(result["diff"])         # str, unified diff of changes

Brand

Extract brand identity (colors, fonts, logos) from a URL.

result = client.brand("https://example.com")
print(result.data)  # dict with brand identity fields

Research

Deep research that searches, reads, and synthesizes information from multiple sources. This is an async job: the SDK starts it and polls until completion.

# Blocks until research completes (up to 600s, or 1200s with deep=True)
result = client.research(
    "How do modern web crawlers handle JavaScript rendering?",
    max_sources=15,
    deep=True,
    topic="tech",
)

print(result.report)
print(result.iterations)
print(result.elapsed_ms)

for source in result.sources:
    print(source["url"], source["title"])

To check status without blocking:

status = client.get_research_status("job-id-here")
print(status.status)  # "running" | "completed" | "failed"

Parameters: query (str), deep (bool, default False), max_sources (int, optional), max_iterations (int, optional), topic (str, optional).
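
For callers who want their own polling loop around get_research_status, a minimal sketch follows. It assumes only that the status object exposes a status string with the three values shown above. How a job ID is obtained without blocking is not covered here, so the ID is a placeholder. `poll_research` is a hypothetical helper that raises Python's built-in TimeoutError, not the SDK's exception, and the fake client exists only so the snippet runs offline.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable


def poll_research(
    client: Any,
    job_id: str,
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll until the job is no longer "running" or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_research_status(job_id)
        if status.status != "running":
            return status  # "completed" or "failed"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"research job {job_id} still running after {timeout}s")
        sleep(interval)


class _FakeClient:
    """Stand-in that reports "running" twice, then "completed"."""

    def __init__(self) -> None:
        self.calls = 0

    def get_research_status(self, job_id: str) -> SimpleNamespace:
        self.calls += 1
        return SimpleNamespace(status="running" if self.calls < 3 else "completed")


fake = _FakeClient()
final = poll_research(fake, "job-id-here", interval=0.0)
print(final.status, fake.calls)
```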

Crawl

Start an async crawl that follows links from a seed URL.

job = client.crawl(
    "https://example.com",
    max_depth=3,
    max_pages=100,
    use_sitemap=True,
)

# Poll until complete (default timeout 300s)
status = job.wait(interval=2.0, timeout=300.0)

print(status.total, status.completed, status.errors)
for page in status.pages:
    print(page.url, len(page.markdown or ""))

Check status without waiting:

status = job.get_status()
print(status.status)  # "running" | "completed" | "failed"

Async variant:

# Inside a coroutine, with client being an AsyncWebclaw instance
job = await client.crawl("https://example.com", max_depth=2)
status = await job.wait()

Watch

Monitor URLs for content changes with automatic periodic checks.

Create a watch:

watch = client.watch_create(
    "https://example.com/pricing",
    name="Pricing page monitor",
    interval_minutes=60,
    webhook_url="https://hooks.example.com/webclaw",
)
print(watch.id, watch.status)

List all watches:

result = client.watch_list(limit=50, offset=0)
for w in result.watches:
    print(w.id, w.url, w.name, w.last_checked)
print(result.total)

Get a single watch:

watch = client.watch_get("watch-id-here")
print(watch.url, watch.interval_minutes)

Delete a watch:

client.watch_delete("watch-id-here")

Trigger an immediate check:

check = client.watch_check("watch-id-here")
print(check.has_changed)  # bool
print(check.diff)         # str | None
print(check.checked_at)   # ISO timestamp

Error Handling

All errors inherit from WebclawError, which carries the HTTP status code when available.

from webclaw import (
    WebclawError,
    AuthenticationError,
    NotFoundError,
    RateLimitError,
    TimeoutError,  # note: this name shadows Python's built-in TimeoutError in this module
)

try:
    result = client.scrape("https://example.com")
except AuthenticationError:
    print("Invalid or missing API key")
except RateLimitError:
    print("Too many requests, slow down")
except NotFoundError:
    print("Resource not found")
except TimeoutError as e:
    print(f"Operation timed out: {e}")
except WebclawError as e:
    print(f"API error (status {e.status_code}): {e}")

Exception	HTTP Status	When
`AuthenticationError`	401 / 403	Invalid or missing API key
`NotFoundError`	404	Resource does not exist
`RateLimitError`	429	Too many requests
`TimeoutError`	--	Crawl/research polling exceeded timeout
`WebclawError`	Any	Base class for all other API errors
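
A rate-limit error is usually worth retrying after a pause. The sketch below shows generic exponential backoff. It is not SDK behavior, and whether the SDK retries internally is not stated here. `with_retries` is a hypothetical helper, and `FakeRateLimit` stands in for RateLimitError so the snippet runs without the package or network access.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")


class FakeRateLimit(Exception):
    """Stand-in for webclaw.RateLimitError."""


calls = {"n": 0}


def flaky() -> str:
    """Fail twice with the stand-in error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit("429")
    return "ok"


delays: List[float] = []
outcome = with_retries(flaky, (FakeRateLimit,), sleep=delays.append)
print(outcome, delays)
```

With the real SDK, the call would look like `with_retries(lambda: client.scrape(url), (RateLimitError,))`.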

Configuration

import os
from webclaw import Webclaw

client = Webclaw(
    os.environ["WEBCLAW_API_KEY"],
    base_url="https://api.webclaw.io",  # default
    timeout=60.0,                        # seconds, default 30
)

Both Webclaw and AsyncWebclaw support context managers for automatic cleanup:

# Sync
with Webclaw("wc-YOUR_API_KEY") as client:
    result = client.scrape("https://example.com")

# Async
async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com")

Async Usage

Every endpoint is available on AsyncWebclaw with identical parameters. Use await on all method calls and async with for the context manager.

import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        # Run multiple scrapes concurrently
        results = await asyncio.gather(
            client.scrape("https://a.com", formats=["markdown"]),
            client.scrape("https://b.com", formats=["markdown"]),
            client.scrape("https://c.com", formats=["markdown"]),
        )
        for r in results:
            print(r.url, len(r.markdown or ""))

asyncio.run(main())
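
asyncio.gather starts every coroutine at once, which can be too aggressive for long URL lists. One way to cap the number of in-flight requests is an asyncio.Semaphore, sketched below. `gather_limited` is a hypothetical helper, and `fake_scrape` stands in for client.scrape so the snippet runs offline.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_limited(
    fn: Callable[[str], Awaitable[T]], urls: Iterable[str], limit: int = 5
) -> List[T]:
    """Await fn(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> T:
        async with semaphore:
            return await fn(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(u) for u in urls))


stats = {"in_flight": 0, "peak": 0}


async def fake_scrape(url: str) -> str:
    """Stand-in for client.scrape that records peak concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return url.upper()


demo_results = asyncio.run(gather_limited(fake_scrape, ["a", "b", "c", "d"], limit=2))
print(demo_results, stats["peak"])
```

With a real client, `fn` could be `lambda u: client.scrape(u, formats=["markdown"])`, called inside the `async with AsyncWebclaw(...)` block.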

Type Support

This package ships with a py.typed marker (PEP 561). Type checkers like mypy and pyright will pick up all type annotations automatically. All response types are dataclasses importable from the top-level package:

from webclaw import ScrapeResponse, CrawlStatus, MapResponse, ExtractResponse, EndpointsResponse

License

MIT

Release history

0.4.0 (this version): May 20, 2026
0.2.1: May 3, 2026
0.1.0: Apr 14, 2026

Download files

Source Distribution

webclaw-0.4.0.tar.gz (61.2 kB), uploaded May 20, 2026, Source

Built Distribution

webclaw-0.4.0-py3-none-any.whl (21.4 kB), uploaded May 20, 2026, Python 3

File details

Details for the file webclaw-0.4.0.tar.gz.

File metadata

Download URL: webclaw-0.4.0.tar.gz
Upload date: May 20, 2026
Size: 61.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for webclaw-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`0e93435beb6d91cdefef8b8d0c63aec8719c78c081df025fbd6ae5b59aa2a19a`
MD5	`0f7550ae6077b6b438db0f74489d8a34`
BLAKE2b-256	`03b9fcaa282d2ee7577142294fcb5b42c3ce69c7eaf15573d7bdb9dd3b18dccd`

File details

Details for the file webclaw-0.4.0-py3-none-any.whl.

File metadata

Download URL: webclaw-0.4.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 21.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for webclaw-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`73e100c2f338a5ce45c7b40d0ee38023ce066f14d2262654fbbb59b41b1b5b19`
MD5	`131734b5cb179b8b95e90d3f05df5b6e`
BLAKE2b-256	`cae44bbef766058cc54c944e27606ab9ec6c64f5730c606c8509e9cb004b94c7`
