Official Python SDK for crawlbrulee - web-scraping API.
crawlbrulee
The official Python SDK for the crawlbrulee web-scraping API.
- Hand-written, fully typed (ships py.typed).
- Sync and async clients (Crawlbrulee / AsyncCrawlbrulee).
- One runtime dependency: httpx.
- Python 3.10+.
Status: v0.1.0 (beta). The API surface is stabilizing — expect minor breaking changes between 0.x releases.
Install
pip install crawlbrulee
# or: uv add crawlbrulee
Quickstart
from crawlbrulee import Crawlbrulee, ScrapeExtract
client = Crawlbrulee(api_key="cble_…")
# or read CRAWLBRULEE_API_KEY from the environment:
client = Crawlbrulee.from_env()
page = client.scrape(
    url="https://example.com",
    extract=ScrapeExtract(markdown=True, links=True),
)
print(page.markdown)
print(len(page.links or []), "links found")
Async
The async client mirrors the sync one method-for-method:
import asyncio
from crawlbrulee import AsyncCrawlbrulee
async def main() -> None:
    async with AsyncCrawlbrulee.from_env() as client:
        page = await client.scrape(url="https://example.com")
        print(page.markdown)

asyncio.run(main())
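Because the async client exposes scrape as a coroutine, several URLs can be fetched concurrently. The following is a minimal sketch and not part of the SDK: scrape_many and max_concurrency are hypothetical names, and client is assumed to be any object with an awaitable scrape(url=...) method, such as AsyncCrawlbrulee.

```python
import asyncio
from typing import Any, Iterable


async def scrape_many(
    client: Any, urls: Iterable[str], max_concurrency: int = 5
) -> list[Any]:
    """Scrape several URLs concurrently, at most max_concurrency at a time.

    Hypothetical helper: `client` is assumed to expose an awaitable
    scrape(url=...) method, as in the async example above.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url: str) -> Any:
        async with semaphore:  # cap the number of in-flight requests
            return await client.scrape(url=url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(scrape_one(url) for url in urls))
```

Inside an async with AsyncCrawlbrulee.from_env() as client: block this would be called as pages = await scrape_many(client, urls). A sensible max_concurrency depends on the concurrency allowance of your plan, which this sketch does not check.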
Configuration
| Option | Default | Description |
|---|---|---|
| api_key | (none) | Sent as Authorization: Bearer …. Required unless you use from_env(). |
| base_url | https://api.crawlbrulee.com | Override the target host (local dev / staging). Trailing slashes are stripped. |
| timeout | None (no timeout) | Per-request timeout in seconds. A per-call timeout= overrides it. |
Crawlbrulee.from_env(**overrides) reads the key from CRAWLBRULEE_API_KEY and
forwards any other option through.
Both clients support context managers (with / async with) and expose
close() / aclose() to release the connection pool.
Request inputs
Top-level request fields are plain keyword arguments. Nested structures are typed
dataclasses (importable from crawlbrulee) — or plain dicts, if you prefer:
from crawlbrulee import ScrapeExtract, ScreenshotRequest
client.scrape(
    url="https://news.example.com/article-1",
    extract=ScrapeExtract(
        markdown=True,
        metadata=True,
        links=True,
        screenshot=ScreenshotRequest(type="full_page", device_mode="desktop"),
    ),
    require_js=True,
    proxy="advanced",
    exclude_selectors=["nav", "footer"],
    cache={"max_age": 3600},  # dataclass or dict, your call
    location={"country": "US"},
)
None-valued options are omitted from the request entirely, so the server's
defaults apply.
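To make the None-dropping behaviour concrete, here is a minimal sketch of how such a request body could be assembled. It is an illustration only, not the SDK's actual implementation: build_payload and ExampleExtract are hypothetical names, and dropping None inside nested structures as well is an assumption.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Optional


def build_payload(**options: Any) -> dict[str, Any]:
    """Build a JSON-ready dict, omitting every None-valued option."""

    def clean(value: Any) -> Any:
        # Dataclass instances become plain dicts first.
        if is_dataclass(value) and not isinstance(value, type):
            value = asdict(value)
        # Dicts are filtered recursively; other values pass through unchanged.
        if isinstance(value, dict):
            return {k: clean(v) for k, v in value.items() if v is not None}
        return value

    return clean(options)


@dataclass
class ExampleExtract:  # hypothetical stand-in for a nested request dataclass
    markdown: Optional[bool] = None
    links: Optional[bool] = None


payload = build_payload(
    url="https://example.com",
    extract=ExampleExtract(markdown=True),
    proxy=None,  # omitted from the payload, so the server default would apply
)
print(payload)
```

Under these assumptions, the payload contains only url and extract.markdown; proxy and extract.links never reach the wire.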
API reference
Every method returns a typed dataclass and accepts a per-call timeout= (seconds).
Scraping
| Method | Description |
|---|---|
| scrape(url, **opts) | Scrape a URL synchronously; blocks until done. |
| scrape_async(url, **opts) | Submit a background job; returns { job_id } immediately. |
| get_scrape_status(job_id) | Current job state: pending / running / done / failed. |
| get_scrape_result(job_id) | Result of a completed job (raises if not finished). |
| wait_for_scrape(job_id, interval=2.0, timeout=300.0) | Poll until terminal, then return the result. |
job = client.scrape_async(url="https://example.com")
page = client.wait_for_scrape(job.job_id, interval=2.0, timeout=300.0)
wait_for_scrape raises a CrawlbruleeError with error_name="job_failed" if the
job fails, or error_name="request_timeout" if the wait expires (timeout=0 waits
forever).
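For readers who want custom behaviour between polls (logging, cancellation), the same loop can be written by hand on top of get_scrape_status and get_scrape_result. This is a rough sketch, not the SDK's own code: poll_scrape is a hypothetical name, the .status attribute on the status object is an assumption, and built-in exceptions stand in for CrawlbruleeError.

```python
import time
from typing import Any


def poll_scrape(
    client: Any, job_id: str, interval: float = 2.0, timeout: float = 300.0
) -> Any:
    """Poll a background job until it reaches a terminal state.

    Assumes the status object has a `.status` attribute holding one of
    pending / running / done / failed (the attribute name is a guess).
    """
    # timeout=0 means wait forever, matching the behaviour described above.
    deadline = None if timeout == 0 else time.monotonic() + timeout
    while True:
        state = client.get_scrape_status(job_id).status
        if state == "done":
            return client.get_scrape_result(job_id)
        if state == "failed":
            raise RuntimeError(f"job {job_id} failed")
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {timeout}s")
        time.sleep(interval)
```

In most cases wait_for_scrape is the simpler choice; a hand-rolled loop is only worth it when something has to happen between polls.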
Mapping
result = client.map(
    url="https://example.com",
    sitemap_only=False,
    types={"internal": True, "external": False},
    max_urls=5_000,
    page=1,
    limit=1_000,
)
print(len(result.links), "of", result.meta.pagination.total_pages, "pages")
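Since map results are paginated, collecting every link means walking the pages. The sketch below shows one way to do that; iter_map_links is a hypothetical helper, and it assumes that page is 1-based and that result.meta.pagination.total_pages is the total page count for the query.

```python
from typing import Any, Iterator


def iter_map_links(
    client: Any, url: str, limit: int = 1_000, **opts: Any
) -> Iterator[Any]:
    """Yield links from every page of a map() result.

    Assumes `page` starts at 1 and that total_pages reflects the whole
    result set; both are assumptions, not documented guarantees.
    """
    page = 1
    while True:
        result = client.map(url=url, page=page, limit=limit, **opts)
        yield from result.links
        if page >= result.meta.pagination.total_pages:
            break
        page += 1
```

Usage would look like all_links = list(iter_map_links(client, "https://example.com")). Because it is a generator, pages are only requested as the links are consumed.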
Account
| Method | Description |
|---|---|
| usage() | Current billing-cycle snapshot: credits, quota %, concurrency, reset time. |
| whoami() | Organization + token identity behind the API key. |
Errors
Every failure raised by the SDK subclasses CrawlbruleeError:
| Class | When |
|---|---|
| AuthenticationError | 401 / 403 (missing, invalid, or unauthorized key). |
| RateLimitError | 429. Exposes retry_after_ms and limited_by when provided. |
| UsageAllocationError | Plan limit hit. Exposes reason and usage. |
| ValidationError | Bad request (invalid_url, url_too_long, blocked_url, …). |
| NotFoundError | 404 (e.g. unknown async job_id). |
| TransportError | Network failure, timeout, or non-JSON response. |
| CrawlbruleeError | Base class: any other API error. Always has status, error_name, message. |
import time
from crawlbrulee import Crawlbrulee, RateLimitError, UsageAllocationError
client = Crawlbrulee.from_env()
try:
    client.scrape(url="https://example.com")
except RateLimitError as err:
    time.sleep((err.retry_after_ms or 1000) / 1000)
    # retry…
except UsageAllocationError as err:
    print("Plan limit hit:", err.reason, err.usage)
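The retry step hinted at above can be factored into a small wrapper. This is a sketch under stated assumptions rather than an SDK feature: call_with_retry and its parameters are hypothetical, and the exception type is passed in so the helper works with RateLimitError or any other class that may carry retry_after_ms.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: type[Exception],
    max_attempts: int = 3,
    default_delay_ms: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying when it raises retry_on.

    Waits for the error's retry_after_ms when present, otherwise
    default_delay_ms. The last failure is re-raised unchanged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == max_attempts:
                raise
            delay_ms = getattr(err, "retry_after_ms", None) or default_delay_ms
            sleep(delay_ms / 1000)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With the SDK this might be called as call_with_retry(lambda: client.scrape(url="https://example.com"), retry_on=RateLimitError). The sleep parameter exists so tests can avoid real delays.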
For exhaustive branching, switch on err.error_name.
Notes on the wire format
The SDK mirrors the API's JSON shapes faithfully. The one exception: the async job
status response uses camelCase on the wire (jobId, createdAt); the SDK
exposes Pythonic job_id / created_at on AsyncJobStatusResponse.
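The renaming described here amounts to a camelCase to snake_case conversion of the response keys. A minimal sketch of that idea follows; camel_to_snake and snake_case_keys are hypothetical helpers rather than the SDK's internals, and the sample values are made up.

```python
import re
from typing import Any

# Zero-width match before each uppercase letter, except at the start.
_CAMEL_BOUNDARY = re.compile(r"(?<!^)(?=[A-Z])")


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as jobId to snake_case (job_id)."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


def snake_case_keys(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a flat JSON object with snake_case keys."""
    return {camel_to_snake(key): value for key, value in payload.items()}


# Sample wire payload with invented values, for illustration only.
wire = {"jobId": "job-123", "createdAt": "2024-01-01T00:00:00Z"}
print(snake_case_keys(wire))
```

This only handles top-level keys, which is enough for the two fields named above.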
Development
uv sync # or: pip install -e ".[dev]"
ruff check . && ruff format --check .
pyright
pytest
The SDK keeps a single runtime dependency (httpx) on purpose — please keep it that
way when contributing.
License