
webclaw

Python SDK for the Webclaw web extraction API



Note: The webclaw Cloud API is public. Create an API key at webclaw.io or use the open-source CLI/MCP for local extraction.


Installation

pip install webclaw

Requires Python 3.9+. The only dependency is httpx.

Quick Start

Sync

from webclaw import Webclaw

client = Webclaw("wc-YOUR_API_KEY")

result = client.scrape("https://example.com", formats=["markdown"])
print(result.markdown)

Async

from webclaw import AsyncWebclaw

async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com", formats=["markdown"])
    print(result.markdown)

The sync and async clients expose identical method signatures; every sync method has an async equivalent. The examples below use the sync client for brevity.

Endpoints

Scrape

Extract content from a single URL. Supports multiple output formats: "markdown", "text", "llm", "json".

result = client.scrape(
    "https://example.com",
    formats=["markdown", "text", "llm"],
    include_selectors=["article", ".content"],
    exclude_selectors=["nav", "footer"],
    only_main_content=True,
    no_cache=True,
)

result.url        # str
result.markdown   # str | None
result.text       # str | None
result.llm        # str | None
result.json_data  # Any | None
result.metadata   # dict
result.cache      # CacheInfo | None  (.status: "hit" | "miss" | "bypass")
result.warning    # str | None

Vertical extractors

28 site-specific extractors that return typed JSON (GitHub, Reddit, Amazon, YouTube, PyPI, HuggingFace, Trustpilot, etc.) instead of generic markdown. See the catalog for the full list.

# Discover available extractors
catalog = client.list_extractors()
for e in catalog["extractors"]:
    print(e["name"], "-", e["label"])

# Run a specific extractor
pr = client.scrape_vertical(
    "github_pr",
    "https://github.com/rust-lang/rust/pull/123456",
)
print(pr["data"])  # {title, state, author, commits, reviews, ...}

# Amazon product as typed JSON
product = client.scrape_vertical(
    "amazon_product",
    "https://www.amazon.com/dp/B0C6KKQ7ND",
)
print(product["data"]["price"], product["data"]["rating"])

The data field is extractor-specific; call list_extractors() to discover what each returns. Both methods have async equivalents on AsyncWebclaw.

Search

Web search with optional topic filtering.

results = client.search("web scraping tools 2026", num_results=10, topic="tech")

for r in results["results"]:
    print(r["title"], r["url"])

Parameters: query (str), num_results (int, optional), topic (str, optional).

Map

Discover URLs via sitemap.

result = client.map("https://example.com")

print(result.count)
for url in result.urls:
    print(url)
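Map pairs naturally with Batch: discover a site's URLs, then scrape them in fixed-size chunks to stay within request limits. A minimal sketch; the `chunked` helper and `scrape_site` wrapper are ours, while `client.map` and `client.batch` are the documented calls above.

```python
from typing import Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size chunks from a list of URLs."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def scrape_site(client, seed_url: str, chunk_size: int = 20):
    """Map a site, then batch-scrape its URLs chunk by chunk."""
    mapped = client.map(seed_url)
    pages = []
    for chunk in chunked(mapped.urls, chunk_size):
        batch = client.batch(chunk, formats=["markdown"])
        pages.extend(batch.results)
    return pages
```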

Batch

Scrape multiple URLs in parallel.

result = client.batch(
    ["https://a.com", "https://b.com", "https://c.com"],
    formats=["markdown"],
    concurrency=5,
)

for item in result.results:
    if item.error:
        print(item.url, "failed:", item.error)
    else:
        print(item.url, "ok")

Parameters: urls (list[str]), formats (optional), concurrency (int, default 5).

Extract

LLM-powered structured data extraction. Use either a JSON schema or a natural language prompt.

# Schema-based extraction
result = client.extract(
    "https://example.com/pricing",
    schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            }
        },
    },
)
print(result.data)  # dict matching your schema

# Prompt-based extraction
result = client.extract(
    "https://example.com/pricing",
    prompt="Extract all pricing tiers with names and monthly prices",
)
print(result.data)

Summarize

Summarize page content with an optional sentence limit.

result = client.summarize("https://example.com", max_sentences=3)
print(result.summary)

Diff

Detect content changes at a URL since the last check.

result = client.diff("https://example.com/status")

print(result["has_changed"])  # bool
print(result["diff"])         # str, unified diff of changes
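A change-notification script needs only the two documented keys. A sketch of a helper (ours) that turns a diff result into a short message, or None when nothing changed:

```python
from typing import Optional


def change_summary(diff_result: dict, max_lines: int = 5) -> Optional[str]:
    """Return the first few diff lines when content changed, else None."""
    if not diff_result["has_changed"]:
        return None
    lines = diff_result["diff"].splitlines()
    return "\n".join(lines[:max_lines])


# summary = change_summary(client.diff("https://example.com/status"))
# if summary:
#     notify(summary)  # notify() is hypothetical
```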

Brand

Extract brand identity (colors, fonts, logos) from a URL.

result = client.brand("https://example.com")
print(result.data)  # dict with brand identity fields

Research

Deep research that searches, reads, and synthesizes information from multiple sources. This is an async job: the SDK starts it and polls until completion.

# Blocks until research completes (up to 600s, or 1200s with deep=True)
result = client.research(
    "How do modern web crawlers handle JavaScript rendering?",
    max_sources=15,
    deep=True,
    topic="tech",
)

print(result.report)
print(result.iterations)
print(result.elapsed_ms)

for source in result.sources:
    print(source["url"], source["title"])

To check status without blocking:

status = client.get_research_status("job-id-here")
print(status.status)  # "running" | "completed" | "failed"

Parameters: query (str), deep (bool, default False), max_sources (int, optional), max_iterations (int, optional), topic (str, optional).

Crawl

Start an async crawl that follows links from a seed URL.

job = client.crawl(
    "https://example.com",
    max_depth=3,
    max_pages=100,
    use_sitemap=True,
)

# Poll until complete (default timeout 300s)
status = job.wait(interval=2.0, timeout=300.0)

print(status.total, status.completed, status.errors)
for page in status.pages:
    print(page.url, len(page.markdown or ""))

Check status without waiting:

status = job.get_status()
print(status.status)  # "running" | "completed" | "failed"

Async variant:

job = await client.crawl("https://example.com", max_depth=2)
status = await job.wait()

Watch

Monitor URLs for content changes with automatic periodic checks.

Create a watch:

watch = client.watch_create(
    "https://example.com/pricing",
    name="Pricing page monitor",
    interval_minutes=60,
    webhook_url="https://hooks.example.com/webclaw",
)
print(watch.id, watch.status)

List all watches:

result = client.watch_list(limit=50, offset=0)
for w in result.watches:
    print(w.id, w.url, w.name, w.last_checked)
print(result.total)

Get a single watch:

watch = client.watch_get("watch-id-here")
print(watch.url, watch.interval_minutes)

Delete a watch:

client.watch_delete("watch-id-here")

Trigger an immediate check:

check = client.watch_check("watch-id-here")
print(check.has_changed)  # bool
print(check.diff)         # str | None
print(check.checked_at)   # ISO timestamp

Error Handling

All errors inherit from WebclawError, which carries the HTTP status code when available.

from webclaw import (
    WebclawError,
    AuthenticationError,
    NotFoundError,
    RateLimitError,
    TimeoutError,
)

try:
    result = client.scrape("https://example.com")
except AuthenticationError:
    print("Invalid or missing API key")
except RateLimitError:
    print("Too many requests, slow down")
except NotFoundError:
    print("Resource not found")
except TimeoutError as e:
    print(f"Operation timed out: {e}")
except WebclawError as e:
    print(f"API error (status {e.status_code}): {e}")

Exception            HTTP status   When
AuthenticationError  401 / 403     Invalid or missing API key
NotFoundError        404           Resource does not exist
RateLimitError       429           Too many requests
TimeoutError         --            Crawl/research polling exceeded the timeout
WebclawError         any           Base class for all other API errors
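A simple way to handle RateLimitError is to retry with exponential backoff. A generic sketch (ours, not part of the SDK); the exception class is passed in so the same helper works with `webclaw.RateLimitError`:

```python
import time


def with_backoff(fn, retry_on=Exception, attempts=4, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on the given exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))


# result = with_backoff(lambda: client.scrape("https://example.com"),
#                       retry_on=RateLimitError)
```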

Configuration

import os
from webclaw import Webclaw

client = Webclaw(
    os.environ["WEBCLAW_API_KEY"],
    base_url="https://api.webclaw.io",  # default
    timeout=60.0,                        # seconds, default 30
)

Both Webclaw and AsyncWebclaw support context managers for automatic cleanup:

# Sync
with Webclaw("wc-YOUR_API_KEY") as client:
    result = client.scrape("https://example.com")

# Async
async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com")

Async Usage

Every endpoint is available on AsyncWebclaw with identical parameters. Use await on all method calls and async with for the context manager.

import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        # Run multiple scrapes concurrently
        results = await asyncio.gather(
            client.scrape("https://a.com", formats=["markdown"]),
            client.scrape("https://b.com", formats=["markdown"]),
            client.scrape("https://c.com", formats=["markdown"]),
        )
        for r in results:
            print(r.url, len(r.markdown or ""))

asyncio.run(main())
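asyncio.gather launches every request at once; for larger URL lists you may want to cap the number in flight. A sketch (ours) using asyncio.Semaphore around the documented `client.scrape`:

```python
import asyncio


async def bounded_scrape(client, urls, limit=5, **kwargs):
    """Scrape many URLs concurrently, at most `limit` in flight at a time."""
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:  # blocks while `limit` scrapes are already running
            return await client.scrape(url, **kwargs)

    return await asyncio.gather(*(one(u) for u in urls))
```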

Type Support

This package ships with a py.typed marker (PEP 561). Type checkers like mypy and pyright will pick up all type annotations automatically. All response types are dataclasses importable from the top-level package:

from webclaw import ScrapeResponse, CrawlStatus, MapResponse, ExtractResponse

License

MIT
