webclaw

Python SDK for the Webclaw web extraction API

Note: The Webclaw Cloud API is currently in closed beta. Request early access or use the open-source CLI/MCP for local extraction.


Installation

pip install webclaw

Requires Python 3.9+. The only dependency is httpx.

Quick Start

Sync

from webclaw import Webclaw

client = Webclaw("wc-YOUR_API_KEY")

result = client.scrape("https://example.com", formats=["markdown"])
print(result.markdown)

Async

from webclaw import AsyncWebclaw

async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com", formats=["markdown"])
    print(result.markdown)

Both clients support identical method signatures. Every sync method has an async equivalent. The examples below use the sync client for brevity.

Endpoints

Scrape

Extract content from a single URL. Supports multiple output formats: "markdown", "text", "llm", "json".

result = client.scrape(
    "https://example.com",
    formats=["markdown", "text", "llm"],
    include_selectors=["article", ".content"],
    exclude_selectors=["nav", "footer"],
    only_main_content=True,
    no_cache=True,
)

result.url        # str
result.markdown   # str | None
result.text       # str | None
result.llm        # str | None
result.json_data  # Any | None
result.metadata   # dict
result.cache      # CacheInfo | None  (.status: "hit" | "miss" | "bypass")
result.warning    # str | None
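
The "json" format fills result.json_data instead of a text field. A minimal sketch; the exact shape of json_data depends on the page and is not fixed by the SDK:

result = client.scrape("https://example.com", formats=["json"])
if result.json_data is not None:
    print(result.json_data)  # structured output; shape is page-dependent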

Search

Web search with optional topic filtering.

results = client.search("web scraping tools 2026", num_results=10, topic="tech")

for r in results["results"]:
    print(r["title"], r["url"])

Parameters: query (str), num_results (int, optional), topic (str, optional).

Map

Discover URLs via sitemap.

result = client.map("https://example.com")

print(result.count)
for url in result.urls:
    print(url)
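
Map pairs naturally with Batch (below): discover URLs first, then scrape a slice of them. A sketch, with the [:10] cap only to keep the example small:

mapped = client.map("https://example.com")
pages = client.batch(mapped.urls[:10], formats=["markdown"])
for item in pages.results:
    print(item.url, item.error or "ok")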

Batch

Scrape multiple URLs in parallel.

result = client.batch(
    ["https://a.com", "https://b.com", "https://c.com"],
    formats=["markdown"],
    concurrency=5,
)

for item in result.results:
    print(item.url, item.markdown, item.error or "ok")

Parameters: urls (list[str]), formats (optional), concurrency (int, default 5).
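
Failed URLs carry an error instead of content, so collecting them for a retry is straightforward. A sketch, assuming item.error is None for successful pages as in the example above:

failed = [item.url for item in result.results if item.error]
if failed:
    result = client.batch(failed, formats=["markdown"])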

Extract

LLM-powered structured data extraction. Use either a JSON schema or a natural language prompt.

# Schema-based extraction
result = client.extract(
    "https://example.com/pricing",
    schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            }
        },
    },
)
print(result.data)  # dict matching your schema

# Prompt-based extraction
result = client.extract(
    "https://example.com/pricing",
    prompt="Extract all pricing tiers with names and monthly prices",
)
print(result.data)
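
If you want typed objects downstream, you can map the returned dict onto your own class. A sketch for the schema-based call above; Plan is your own code, not part of the SDK:

from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    price: str

plans = [Plan(**p) for p in result.data.get("plans", [])]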

Summarize

Summarize page content with an optional sentence limit.

result = client.summarize("https://example.com", max_sentences=3)
print(result.summary)

Diff

Detect content changes at a URL since the last check.

result = client.diff("https://example.com/status")

print(result["has_changed"])  # bool
print(result["diff"])         # str, unified diff of changes

Brand

Extract brand identity (colors, fonts, logos) from a URL.

result = client.brand("https://example.com")
print(result.data)  # dict with brand identity fields

Agent Scrape

AI-guided scraping that navigates a page to achieve a specified goal.

result = client.agent_scrape(
    "https://example.com/dashboard",
    goal="Find the monthly active users count",
)

print(result["result"])
print(result["steps"])

Parameters: url (str), goal (str), plus optional keyword arguments forwarded to the API.
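
Extra keyword arguments are passed through to the API unchanged. In the sketch below, max_steps is purely illustrative and not a documented parameter; substitute whatever option the API actually accepts:

result = client.agent_scrape(
    "https://example.com/dashboard",
    goal="Find the monthly active users count",
    max_steps=10,  # hypothetical pass-through option, not part of the SDK's signature
)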

Research

Deep research that searches, reads, and synthesizes information from multiple sources. This is an async job: the SDK starts it and polls until completion.

# Blocks until research completes (up to 600s, or 1200s with deep=True)
result = client.research(
    "How do modern web crawlers handle JavaScript rendering?",
    max_sources=15,
    deep=True,
    topic="tech",
)

print(result.report)
print(result.iterations)
print(result.elapsed_ms)

for source in result.sources:
    print(source["url"], source["title"])

To check status without blocking:

status = client.get_research_status("job-id-here")
print(status.status)  # "running" | "completed" | "failed"

Parameters: query (str), deep (bool, default False), max_sources (int, optional), max_iterations (int, optional), topic (str, optional).

Crawl

Start an async crawl that follows links from a seed URL.

job = client.crawl(
    "https://example.com",
    max_depth=3,
    max_pages=100,
    use_sitemap=True,
)

# Poll until complete (default timeout 300s)
status = job.wait(interval=2.0, timeout=300.0)

print(status.total, status.completed, status.errors)
for page in status.pages:
    print(page.url, len(page.markdown or ""))

Check status without waiting:

status = job.get_status()
print(status.status)  # "running" | "completed" | "failed"
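
If you prefer to drive the polling loop yourself rather than calling job.wait(), a minimal sketch:

import time

job = client.crawl("https://example.com", max_depth=2)
while True:
    status = job.get_status()
    if status.status in ("completed", "failed"):
        break
    time.sleep(2)  # re-check every two seconds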

Async variant:

job = await client.crawl("https://example.com", max_depth=2)
status = await job.wait()

Watch

Monitor URLs for content changes with automatic periodic checks.

Create a watch:

watch = client.watch_create(
    "https://example.com/pricing",
    name="Pricing page monitor",
    interval_minutes=60,
    webhook_url="https://hooks.example.com/webclaw",
)
print(watch.id, watch.status)

List all watches:

result = client.watch_list(limit=50, offset=0)
for w in result.watches:
    print(w.id, w.url, w.name, w.last_checked)
print(result.total)

Get a single watch:

watch = client.watch_get("watch-id-here")
print(watch.url, watch.interval_minutes)

Delete a watch:

client.watch_delete("watch-id-here")

Trigger an immediate check:

check = client.watch_check("watch-id-here")
print(check.has_changed)  # bool
print(check.diff)         # str | None
print(check.checked_at)   # ISO timestamp
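
These calls compose; for example, you can sweep every registered watch and report only the ones that changed. A sketch using the methods above:

result = client.watch_list(limit=50, offset=0)
for w in result.watches:
    check = client.watch_check(w.id)
    if check.has_changed:
        print(w.url, "changed at", check.checked_at)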

Error Handling

All errors inherit from WebclawError, which carries the HTTP status code when available.

from webclaw import (
    WebclawError,
    AuthenticationError,
    NotFoundError,
    RateLimitError,
    TimeoutError,
)

try:
    result = client.scrape("https://example.com")
except AuthenticationError:
    print("Invalid or missing API key")
except RateLimitError:
    print("Too many requests, slow down")
except NotFoundError:
    print("Resource not found")
except TimeoutError as e:
    print(f"Operation timed out: {e}")
except WebclawError as e:
    print(f"API error (status {e.status_code}): {e}")

Exception            HTTP Status   When
AuthenticationError  401 / 403     Invalid or missing API key
NotFoundError        404           Resource does not exist
RateLimitError       429           Too many requests
TimeoutError         --            Crawl/research polling exceeded timeout
WebclawError         Any           Base class for all other API errors
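
A RateLimitError usually just means back off and try again. A minimal retry sketch; the attempt count and exponential wait are one reasonable policy, not part of the SDK:

import time
from webclaw import Webclaw, RateLimitError

client = Webclaw("wc-YOUR_API_KEY")

result = None
for attempt in range(3):
    try:
        result = client.scrape("https://example.com")
        break
    except RateLimitError:
        time.sleep(2 ** attempt)  # wait 1s, 2s, then 4s between attempts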

Configuration

import os
from webclaw import Webclaw

client = Webclaw(
    os.environ["WEBCLAW_API_KEY"],
    base_url="https://api.webclaw.io",  # default
    timeout=60.0,                        # seconds, default 30
)

Both Webclaw and AsyncWebclaw support context managers for automatic cleanup:

# Sync
with Webclaw("wc-YOUR_API_KEY") as client:
    result = client.scrape("https://example.com")

# Async
async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com")

Async Usage

Every endpoint is available on AsyncWebclaw with identical parameters. Use await on all method calls and async with for the context manager.

import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        # Run multiple scrapes concurrently
        results = await asyncio.gather(
            client.scrape("https://a.com", formats=["markdown"]),
            client.scrape("https://b.com", formats=["markdown"]),
            client.scrape("https://c.com", formats=["markdown"]),
        )
        for r in results:
            print(r.url, len(r.markdown or ""))

asyncio.run(main())
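
For larger URL lists you may want to cap the number of in-flight requests; an asyncio.Semaphore works well here. A sketch, not an SDK feature:

import asyncio
from webclaw import AsyncWebclaw

async def scrape_all(urls, limit=5):
    sem = asyncio.Semaphore(limit)  # at most `limit` scrapes in flight
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        async def one(url):
            async with sem:
                return await client.scrape(url, formats=["markdown"])
        return await asyncio.gather(*(one(u) for u in urls))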

Type Support

This package ships with a py.typed marker (PEP 561). Type checkers like mypy and pyright will pick up all type annotations automatically. All response types are dataclasses importable from the top-level package:

from webclaw import ScrapeResponse, CrawlStatus, MapResponse, ExtractResponse
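
Annotations flow through your own helpers as well; a sketch, assuming scrape() returns a ScrapeResponse:

from webclaw import Webclaw, ScrapeResponse

def fetch_markdown(client: Webclaw, url: str) -> ScrapeResponse:
    # mypy/pyright will check this signature against the SDK's annotations
    return client.scrape(url, formats=["markdown"])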

License

MIT
