Python SDK for the Webclaw web extraction API
Note: the Webclaw Cloud API is public. Create an API key at webclaw.io, or use the open-source CLI/MCP for local extraction.
Installation
pip install webclaw
Requires Python 3.9+. The only dependency is httpx.
Quick Start
Sync
from webclaw import Webclaw
client = Webclaw("wc-YOUR_API_KEY")
result = client.scrape("https://example.com", formats=["markdown"])
print(result.markdown)
Async
from webclaw import AsyncWebclaw
async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com", formats=["markdown"])
    print(result.markdown)
Both clients support identical method signatures. Every sync method has an async equivalent. The examples below use the sync client for brevity.
Endpoints
Scrape
Extract content from a single URL. Supports multiple output formats: "markdown", "text", "llm", "json".
result = client.scrape(
"https://example.com",
formats=["markdown", "text", "llm"],
include_selectors=["article", ".content"],
exclude_selectors=["nav", "footer"],
only_main_content=True,
no_cache=True,
)
result.url # str
result.markdown # str | None
result.text # str | None
result.llm # str | None
result.json_data # Any | None
result.metadata # dict
result.cache # CacheInfo | None (.status: "hit" | "miss" | "bypass")
result.warning # str | None
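The cache and warning fields are useful for logging and debugging. A small sketch of how they might be handled, using only the fields listed above (the logging setup is illustrative):

import logging

from webclaw import Webclaw

client = Webclaw("wc-YOUR_API_KEY")
result = client.scrape("https://example.com", formats=["markdown"])

# Surface any warning the API attached to the response
if result.warning:
    logging.warning("Scrape warning for %s: %s", result.url, result.warning)

# cache is None when no cache info was reported
if result.cache and result.cache.status == "hit":
    print("served from cache")

print(result.markdown or "")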
Vertical extractors
Webclaw provides 28 site-specific extractors that return typed JSON (GitHub, Reddit, Amazon, YouTube, PyPI, HuggingFace, Trustpilot, etc.) instead of generic markdown. See the catalog for the full list.
# Discover available extractors
catalog = client.list_extractors()
for e in catalog["extractors"]:
    print(e["name"], "-", e["label"])
# Run a specific extractor
pr = client.scrape_vertical(
"github_pr",
"https://github.com/rust-lang/rust/pull/123456",
)
print(pr["data"]) # {title, state, author, commits, reviews, ...}
# Amazon product as typed JSON
product = client.scrape_vertical(
"amazon_product",
"https://www.amazon.com/dp/B0C6KKQ7ND",
)
print(product["data"]["price"], product["data"]["rating"])
The data field is extractor-specific; call list_extractors() to discover what each returns. Both methods have async equivalents on AsyncWebclaw.
Search
Web search with optional topic filtering.
results = client.search("web scraping tools 2026", num_results=10, topic="tech")
for r in results["results"]:
    print(r["title"], r["url"])
Parameters: query (str), num_results (int, optional), topic (str, optional).
Map
Discover URLs via sitemap.
result = client.map("https://example.com")
print(result.count)
for url in result.urls:
    print(url)
Batch
Scrape multiple URLs in parallel.
result = client.batch(
["https://a.com", "https://b.com", "https://c.com"],
formats=["markdown"],
concurrency=5,
)
for item in result.results:
    print(item.url, item.markdown, item.error or "ok")
Parameters: urls (list[str]), formats (optional), concurrency (int, default 5).
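Map and Batch compose naturally: discover a site's URLs with map(), then scrape them in one batch call. A minimal sketch, assuming result.urls is a plain list of URL strings as shown in the Map section; the slice of 10 URLs is arbitrary:

site_map = client.map("https://example.com")
batch = client.batch(site_map.urls[:10], formats=["markdown"], concurrency=5)
for item in batch.results:
    print(item.url, item.error or f"{len(item.markdown or '')} chars")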
Extract
LLM-powered structured data extraction. Use either a JSON schema or a natural language prompt.
# Schema-based extraction
result = client.extract(
"https://example.com/pricing",
schema={
"type": "object",
"properties": {
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "string"},
},
},
}
},
},
)
print(result.data) # dict matching your schema
# Prompt-based extraction
result = client.extract(
"https://example.com/pricing",
prompt="Extract all pricing tiers with names and monthly prices",
)
print(result.data)
Summarize
Summarize page content with an optional sentence limit.
result = client.summarize("https://example.com", max_sentences=3)
print(result.summary)
Diff
Detect content changes at a URL since the last check.
result = client.diff("https://example.com/status")
print(result["has_changed"]) # bool
print(result["diff"]) # str, unified diff of changes
Brand
Extract brand identity (colors, fonts, logos) from a URL.
result = client.brand("https://example.com")
print(result.data) # dict with brand identity fields
Research
Deep research that searches, reads, and synthesizes information from multiple sources. This is an async job: the SDK starts it and polls until completion.
# Blocks until research completes (up to 600s, or 1200s with deep=True)
result = client.research(
"How do modern web crawlers handle JavaScript rendering?",
max_sources=15,
deep=True,
topic="tech",
)
print(result.report)
print(result.iterations)
print(result.elapsed_ms)
for source in result.sources:
    print(source["url"], source["title"])
To check status without blocking:
status = client.get_research_status("job-id-here")
print(status.status) # "running" | "completed" | "failed"
Parameters: query (str), deep (bool, default False), max_sources (int, optional), max_iterations (int, optional), topic (str, optional).
Crawl
Start an async crawl that follows links from a seed URL.
job = client.crawl(
"https://example.com",
max_depth=3,
max_pages=100,
use_sitemap=True,
)
# Poll until complete (default timeout 300s)
status = job.wait(interval=2.0, timeout=300.0)
print(status.total, status.completed, status.errors)
for page in status.pages:
    print(page.url, len(page.markdown or ""))
Check status without waiting:
status = job.get_status()
print(status.status) # "running" | "completed" | "failed"
Async variant:
job = await client.crawl("https://example.com", max_depth=2)
status = await job.wait()
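If you prefer explicit control over polling instead of job.wait(), you can loop on job.get_status() yourself (shown here with the sync client). The poll interval and deadline below are arbitrary choices:

import time

job = client.crawl("https://example.com", max_depth=2, max_pages=50)

deadline = time.monotonic() + 300  # give up after five minutes
while time.monotonic() < deadline:
    status = job.get_status()
    if status.status in ("completed", "failed"):
        break
    time.sleep(2.0)

print(status.status, status.completed, "/", status.total)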
Watch
Monitor URLs for content changes with automatic periodic checks.
Create a watch:
watch = client.watch_create(
"https://example.com/pricing",
name="Pricing page monitor",
interval_minutes=60,
webhook_url="https://hooks.example.com/webclaw",
)
print(watch.id, watch.status)
List all watches:
result = client.watch_list(limit=50, offset=0)
for w in result.watches:
    print(w.id, w.url, w.name, w.last_checked)
print(result.total)
Get a single watch:
watch = client.watch_get("watch-id-here")
print(watch.url, watch.interval_minutes)
Delete a watch:
client.watch_delete("watch-id-here")
Trigger an immediate check:
check = client.watch_check("watch-id-here")
print(check.has_changed) # bool
print(check.diff) # str | None
print(check.checked_at) # ISO timestamp
Error Handling
All errors inherit from WebclawError, which carries the HTTP status code when available.
from webclaw import (
WebclawError,
AuthenticationError,
NotFoundError,
RateLimitError,
TimeoutError,
)
try:
    result = client.scrape("https://example.com")
except AuthenticationError:
    print("Invalid or missing API key")
except RateLimitError:
    print("Too many requests, slow down")
except NotFoundError:
    print("Resource not found")
except TimeoutError as e:
    print(f"Operation timed out: {e}")
except WebclawError as e:
    print(f"API error (status {e.status_code}): {e}")
| Exception | HTTP Status | When |
|---|---|---|
| AuthenticationError | 401 / 403 | Invalid or missing API key |
| NotFoundError | 404 | Resource does not exist |
| RateLimitError | 429 | Too many requests |
| TimeoutError | -- | Crawl/research polling exceeded timeout |
| WebclawError | Any | Base class for all other API errors |
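Rate limits are a natural candidate for retries with backoff. A minimal sketch using only the scrape() call and exceptions documented above; the attempt count and delays are arbitrary:

import time

from webclaw import RateLimitError, Webclaw

client = Webclaw("wc-YOUR_API_KEY")

def scrape_with_retry(url, attempts=3):
    for attempt in range(attempts):
        try:
            return client.scrape(url, formats=["markdown"])
        except RateLimitError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...

result = scrape_with_retry("https://example.com")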
Configuration
import os
from webclaw import Webclaw
client = Webclaw(
os.environ["WEBCLAW_API_KEY"],
base_url="https://api.webclaw.io", # default
timeout=60.0, # seconds, default 30
)
Both Webclaw and AsyncWebclaw support context managers for automatic cleanup:
# Sync
with Webclaw("wc-YOUR_API_KEY") as client:
    result = client.scrape("https://example.com")
# Async
async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
    result = await client.scrape("https://example.com")
Async Usage
Every endpoint is available on AsyncWebclaw with identical parameters. Use await on all method calls and async with for the context manager.
import asyncio
from webclaw import AsyncWebclaw
async def main():
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        # Run multiple scrapes concurrently
        results = await asyncio.gather(
            client.scrape("https://a.com", formats=["markdown"]),
            client.scrape("https://b.com", formats=["markdown"]),
            client.scrape("https://c.com", formats=["markdown"]),
        )
        for r in results:
            print(r.url, len(r.markdown or ""))
asyncio.run(main())
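For larger URL lists you may want to cap the number of in-flight requests instead of launching everything at once. A sketch with asyncio.Semaphore, assuming only the scrape() signature shown above; the limit of 5 is arbitrary:

import asyncio

from webclaw import AsyncWebclaw

async def scrape_all(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)
    async with AsyncWebclaw("wc-YOUR_API_KEY") as client:
        async def scrape_one(url):
            async with semaphore:  # at most `limit` concurrent requests
                return await client.scrape(url, formats=["markdown"])
        return await asyncio.gather(*(scrape_one(u) for u in urls))

results = asyncio.run(scrape_all(["https://a.com", "https://b.com", "https://c.com"]))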
Type Support
This package ships with a py.typed marker (PEP 561). Type checkers like mypy and pyright will pick up all type annotations automatically. All response types are dataclasses importable from the top-level package:
from webclaw import ScrapeResponse, CrawlStatus, MapResponse, ExtractResponse
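The annotations make it easy to type your own helpers. A small sketch, under the assumption that scrape() returns a ScrapeResponse with the fields shown in the Scrape section:

from webclaw import ScrapeResponse, Webclaw

def markdown_or_empty(response: ScrapeResponse) -> str:
    # markdown is typed `str | None`, so type checkers require the fallback
    return response.markdown or ""

with Webclaw("wc-YOUR_API_KEY") as client:
    result = client.scrape("https://example.com", formats=["markdown"])
    print(markdown_or_empty(result))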
License
MIT