Lightweight cloud SDK for Crawl4AI - mirrors the OSS API
Crawl4AI Cloud SDK
The fastest way to turn any URL into markdown, screenshots, structured data, or a full site crawl.
Install
```bash
pip install crawl4ai-cloud-sdk
```
Get your API key at api.crawl4ai.com.
Quick Start
```python
import asyncio
from crawl4ai_cloud import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(api_key="sk_live_...") as crawler:
        # Get clean markdown from any page
        md = await crawler.markdown("https://example.com")
        print(md.markdown)

        # Take a full-page screenshot
        ss = await crawler.screenshot("https://example.com")
        # ss.screenshot is base64-encoded PNG

        # Extract structured data with natural language
        data = await crawler.extract(
            "https://news.ycombinator.com",
            query="get each story title, URL, and points",
        )
        print(data.data)  # list of dicts

        # Discover all URLs on a domain
        sitemap = await crawler.map("https://docs.python.org")
        for u in sitemap.urls[:10]:
            print(u.url)

        # Crawl an entire site (async, returns job_id)
        job = await crawler.crawl_site(
            "https://docs.example.com",
            max_pages=50,
            wait=True,
        )
        print(f"Done: {job.discovered_urls} pages crawled")

asyncio.run(main())
```
Wrapper API Reference
| Method | What it does | Endpoint |
|---|---|---|
| `markdown(url)` | Returns clean markdown (with optional fit/pruning) | POST /v1/markdown |
| `screenshot(url)` | Returns base64 screenshot (PNG) and optional PDF | POST /v1/screenshot |
| `extract(url, query=...)` | Extracts structured data (auto/LLM/CSS schema) | POST /v1/extract |
| `map(url)` | Simple URL discovery on a domain (always sync) | POST /v1/map |
| `scan(url, criteria=...)` | AI-assisted URL discovery with plain-English criteria + map/deep routing | POST /v1/scan |
| `crawl_site(url, criteria=..., extract=...)` | AI-assisted full site crawl — LLM generates scan config + extraction schema | POST /v1/crawl/site |
Each method returns a typed response object (MarkdownResponse, ScreenshotResponse, ExtractResponse, MapResponse, ScanResult, SiteCrawlResponse) with .success, .duration_ms, and .usage fields.
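For example, a minimal sketch of reading those common fields (the exact shape of `.usage` is not documented here, so treat it as an opaque usage/credits object):

```python
md = await crawler.markdown("https://example.com")
if md.success:
    print(f"Fetched in {md.duration_ms} ms")
    print(md.usage)  # usage / credit metadata for this request
```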
AI-assisted flows (v0.4.0)
Pass a plain-English `criteria` string and let the backend LLM pick the scan mode, URL patterns, filters, and scorers. Pair it with `extract` on `crawl_site()` to also auto-generate a CSS extraction schema from a sample URL. The generated config is echoed back so you can see and reuse it.
```python
async with AsyncWebCrawler(api_key="sk_live_...") as crawler:
    # AI-assisted scan — LLM picks map vs deep + generates patterns/query/threshold
    result = await crawler.scan(
        "https://docs.crawl4ai.com",
        criteria="API reference and core docs pages",
        max_urls=50,
    )
    print(f"Mode: {result.mode_used}")   # "map" or "deep"
    print(f"Found: {result.total_urls} URLs")
    if result.generated_config:
        print(f"AI: {result.generated_config.reasoning}")

    # Explicit deep scan with async polling
    job = await crawler.scan(
        "https://directory.example.com",
        criteria="company profile pages",
        scan={"mode": "deep", "max_depth": 3},
        wait=True,          # block until done
        poll_interval=3.0,
    )

    # Flagship: crawl whole site + auto-extract structured data
    job = await crawler.crawl_site(
        "https://books.toscrape.com",
        criteria="all book listing pages",
        max_pages=50,
        strategy="http",
        extract={
            "query": "book title, price, rating",
            "json_example": {"title": "...", "price": "£0.00", "rating": 0},
            "method": "auto",  # picks CSS schema vs LLM
        },
        include=["links"],  # drop markdown — extract-only
    )
    print(f"Generated schema: {bool(job.schema_used)}")
    print(f"Method: {job.extraction_method_used}")  # "css_schema" or "llm"

    # Unified polling — one endpoint for scan + crawl phases
    while True:
        status = await crawler.get_site_crawl_job(job.job_id)
        print(f"{status.phase}: {status.progress.urls_crawled}/{status.progress.total}")
        if status.is_complete:
            print(f"Download: {status.download_url}")
            break
        await asyncio.sleep(3)
```
Config objects (optional — both scan and extract accept plain dicts or typed dataclasses):
```python
from crawl4ai_cloud import SiteScanConfig, SiteExtractConfig

scan_cfg = SiteScanConfig(
    mode="auto",  # "auto" | "map" | "deep"
    patterns=["*/docs/*", "*/guide/*"],
    scorers={"keywords": ["auth", "oauth"], "optimal_depth": 2},
    max_depth=3,
)

extract_cfg = SiteExtractConfig(
    query="book title, price, rating",
    json_example={"title": "...", "price": "£0.00", "rating": 0},
    method="auto",
)

job = await crawler.crawl_site(
    "https://books.toscrape.com",
    criteria="book listings",
    scan=scan_cfg,
    extract=extract_cfg,
)
```
Drop markdown with `include`: if you pass `include=["links", "media"]` without `"markdown"`, the worker strips markdown from every result, which saves bandwidth on extract-only crawls.
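A minimal extract-only sketch, reusing the `extract_cfg` defined above (whether you also need `"media"` depends on your crawl):

```python
# Markdown is stripped from every result because "markdown" is not in include.
job = await crawler.crawl_site(
    "https://books.toscrape.com",
    criteria="book listings",
    extract=extract_cfg,
    include=["links", "media"],
    wait=True,
)
```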
Async / Batch
Every wrapper method has a _many variant for processing multiple URLs as an async job.
```python
async with AsyncWebCrawler(api_key="sk_live_...") as crawler:
    # Batch markdown (fire-and-forget)
    job = await crawler.markdown_many(
        ["https://a.com", "https://b.com", "https://c.com"],
    )
    print(f"Job {job.job_id} started, {job.urls_count} URLs queued")

    # Batch markdown (wait for results)
    job = await crawler.markdown_many(urls, wait=True, timeout=120)

    # Batch screenshots
    job = await crawler.screenshot_many(urls, full_page=True, wait=True)

    # Batch extraction (note: method must be "llm" or "schema", not "auto")
    job = await crawler.extract_many(
        urls, method="llm", query="get product name and price", wait=True,
    )

    # Site crawl is always async — prefer criteria + extract over legacy discovery flag
    site = await crawler.crawl_site(
        "https://docs.example.com",
        criteria="all API reference pages",
        max_pages=100,
        wait=True,
    )
```
Job Management
Each wrapper namespace has its own job management methods.
```python
# Markdown jobs
job = await crawler.get_markdown_job(job_id)
jobs = await crawler.list_markdown_jobs(status="completed", limit=10)
await crawler.cancel_markdown_job(job_id)

# Screenshot jobs
job = await crawler.get_screenshot_job(job_id)
jobs = await crawler.list_screenshot_jobs()
await crawler.cancel_screenshot_job(job_id)

# Extract jobs
job = await crawler.get_extract_job(job_id)
jobs = await crawler.list_extract_jobs()
await crawler.cancel_extract_job(job_id)

# Scan jobs (AI-assisted deep scans)
job = await crawler.get_scan_job(job_id)   # unified status + URLs-so-far
await crawler.cancel_scan_job(job_id)      # preserves partial results

# Site crawl jobs (unified scan + crawl polling)
job = await crawler.get_site_crawl_job(job_id)  # phase: scan|crawl|done
# Cancel delegates to the underlying deep crawl job:
await crawler.cancel_deep_crawl(job_id)

# Core crawl jobs (from run_many / deep_crawl)
job = await crawler.get_job(job_id)
jobs = await crawler.list_jobs(status="running")
await crawler.cancel_job(job_id)
url = await crawler.download_url(job_id)  # presigned S3 ZIP
```
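As an example, a minimal polling sketch for a fire-and-forget batch job. This assumes the job object exposes a `status` field with values such as `"completed"` (the same value used by the `status=` filter above); the other terminal states shown are assumptions:

```python
import asyncio

job = await crawler.markdown_many(["https://a.com", "https://b.com"])
while True:
    job = await crawler.get_markdown_job(job.job_id)
    if job.status in ("completed", "failed", "cancelled"):  # assumed terminal states
        break
    await asyncio.sleep(3)
```

In most cases `wait=True` covers this; manual polling is mainly useful when you want to interleave other work between checks.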
Power User: Config Passthrough
All wrapper methods accept crawler_config and browser_config dicts for full control. These are the same fields you would pass to the core /v1/crawl endpoint.
```python
md = await crawler.markdown(
    "https://example.com",
    strategy="browser",
    fit=True,
    include=["links", "media", "tables"],
    crawler_config={
        "css_selector": "article",
        "exclude_external_links": True,
        "wait_for": ".content-loaded",
        "js_code": "window.scrollTo(0, document.body.scrollHeight)",
    },
    browser_config={
        "viewport_width": 1920,
        "viewport_height": 1080,
        "headers": {"Accept-Language": "en-US"},
    },
    proxy="residential",
)
```
Works the same way for screenshot(), extract(), map(), and crawl_site().
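For instance, a minimal sketch of the same passthrough on `screenshot()` (the URL and selector values here are illustrative; `full_page` mirrors the parameter used with `screenshot_many()` above):

```python
ss = await crawler.screenshot(
    "https://app.example.com/dashboard",
    full_page=True,
    crawler_config={
        "wait_for": ".chart-rendered",  # wait until the page finishes drawing
        "js_code": "document.querySelector('.cookie-banner')?.remove()",
    },
    browser_config={
        "viewport_width": 1440,
        "viewport_height": 900,
    },
    proxy="datacenter",
)
```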
Full Power Mode
For advanced use cases where you need full control over the crawl pipeline, the core methods give you direct access to the /v1/crawl endpoint with every configuration option.
Single URL
```python
from crawl4ai_cloud import CrawlerRunConfig, BrowserConfig

config = CrawlerRunConfig(
    screenshot=True,
    word_count_threshold=10,
    exclude_external_links=True,
    process_iframes=True,
    css_selector="article",
)

browser_config = BrowserConfig(
    viewport_width=1920,
    viewport_height=1080,
)

result = await crawler.run(
    "https://example.com",
    config=config,
    browser_config=browser_config,
    proxy="datacenter",
)

print(result.markdown.raw_markdown)
print(result.screenshot)  # base64
```
Batch Crawl
```python
job = await crawler.run_many(
    ["https://a.com", "https://b.com"],
    config=config,
    wait=True,
    priority=1,
)

# Results available via download
url = await crawler.download_url(job.id)
```
Deep Crawl
```python
result = await crawler.deep_crawl(
    "https://docs.example.com",
    strategy="bfs",  # bfs, dfs, best_first, map
    max_depth=3,
    max_urls=100,
    include_patterns=["docs", "api"],
    exclude_patterns=["download"],
    wait=True,
)
```
Domain Scan
```python
scan = await crawler.scan("https://example.com", mode="deep", max_urls=200)
for url_info in scan.urls:
    print(f"{url_info.url} (score: {url_info.relevance_score})")
```
Full reference: Cloud API Docs
Configuration
CrawlerRunConfig
Controls what gets extracted and how pages are processed.
```python
from crawl4ai_cloud import CrawlerRunConfig

config = CrawlerRunConfig(
    css_selector="main",              # target specific elements
    excluded_tags=["nav", "footer"],
    word_count_threshold=10,
    screenshot=True,
    wait_for=".loaded",               # wait for CSS selector
    js_code="document.querySelector('.show-more').click()",
    magic=True,                       # anti-bot mode
)
```
BrowserConfig
Controls the browser environment.
```python
from crawl4ai_cloud import BrowserConfig

browser = BrowserConfig(
    viewport_width=1920,
    viewport_height=1080,
    user_agent="MyBot/1.0",
    headers={"Authorization": "Bearer token"},
    cookies=[{"name": "session", "value": "abc", "domain": "example.com"}],
    profile_id="my-saved-profile",  # cloud browser profile
)
```
ProxyConfig
```python
from crawl4ai_cloud import ProxyConfig

# Shorthand (works on all methods)
result = await crawler.markdown(url, proxy="datacenter")
result = await crawler.markdown(url, proxy="residential")

# Full config
proxy = ProxyConfig(mode="residential", country="US", sticky_session=True)
result = await crawler.markdown(url, proxy=proxy)
```
Proxy modes: "none" (direct, 1x credits), "datacenter" (fast, 2x), "residential" (premium, 5x), "auto" (smart selection).
Environment Variables
```bash
export CRAWL4AI_API_KEY=sk_live_...
```

```python
# API key auto-loaded from the environment
async with AsyncWebCrawler() as crawler:
    md = await crawler.markdown("https://example.com")
```
Error Handling
```python
from crawl4ai_cloud import (
    CloudError,
    AuthenticationError,
    RateLimitError,
    QuotaExceededError,
    NotFoundError,
    ValidationError,
    TimeoutError,
    ServerError,
)

try:
    result = await crawler.markdown(url)
except AuthenticationError:
    print("Invalid API key")
except RateLimitError as e:
    print(f"Rate limited. Retry after {e.retry_after}s")
except QuotaExceededError as e:
    print(f"Quota exceeded ({e.quota_type})")
except TimeoutError:
    print("Request timed out")
except ValidationError:
    print("Invalid request parameters")
except ServerError:
    print("Server error, try again later")
except CloudError as e:
    print(f"[{e.status_code}] {e.message}")
```
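A minimal retry sketch building on the example above; it assumes only the `RateLimitError.retry_after` attribute shown there (in seconds), and the attempt count and fallback delay are arbitrary:

```python
import asyncio

async def markdown_with_retry(crawler, url, attempts=3):
    for attempt in range(attempts):
        try:
            return await crawler.markdown(url)
        except RateLimitError as e:
            if attempt == attempts - 1:
                raise
            # Honor the server-provided delay, falling back to a short pause.
            await asyncio.sleep(e.retry_after or 2)
```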
Claude Code Plugin
Use Crawl4AI directly inside Claude Code with 9 built-in tools.
```
/plugin marketplace add unclecode/crawl4ai-cloud-sdk
/plugin install crawl4ai@crawl4ai-claude-plugins
```
See plugin README for details.
Links
- Cloud Dashboard - Sign up and manage your API key
- Cloud API Docs - Full API reference
- PyPI - Package page
- GitHub - Source code
- OSS Crawl4AI - Self-hosted option
- Discord - Community and support
License
Apache 2.0