crawlbrulee

The official Python SDK for the crawlbrulee web-scraping API.

  • Hand-written, fully typed (ships py.typed).
  • Sync and async clients (Crawlbrulee / AsyncCrawlbrulee).
  • One runtime dependency: httpx.
  • Python 3.10+.

Status: v0.1.0 (beta). The API surface is stabilizing — expect minor breaking changes between 0.x releases.


Install

pip install crawlbrulee
# or: uv add crawlbrulee

Quickstart

from crawlbrulee import Crawlbrulee, ScrapeExtract

client = Crawlbrulee(api_key="cble_…")
# or read CRAWLBRULEE_API_KEY from the environment:
client = Crawlbrulee.from_env()

page = client.scrape(
    url="https://example.com",
    extract=ScrapeExtract(markdown=True, links=True),
)

print(page.markdown)
print(len(page.links or []), "links found")

Async

The async client mirrors the sync one method-for-method:

import asyncio
from crawlbrulee import AsyncCrawlbrulee

async def main() -> None:
    async with AsyncCrawlbrulee.from_env() as client:
        page = await client.scrape(url="https://example.com")
        print(page.markdown)

asyncio.run(main())
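
Because the async client's scrape is a coroutine, several URLs can be fetched concurrently with standard asyncio tools. The following is a minimal sketch, assuming only that client.scrape(url=...) is awaitable as shown above. The helper name scrape_many and its max_concurrency parameter are illustrative and not part of the SDK.

```python
import asyncio
from typing import Any


async def scrape_many(client: Any, urls: list[str], max_concurrency: int = 5) -> list[Any]:
    """Scrape several URLs concurrently, capping the number of in-flight requests.

    `client` is any object with an awaitable `scrape(url=...)` method,
    for example an AsyncCrawlbrulee instance. Hypothetical helper.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch(url: str) -> Any:
        # At most `max_concurrency` coroutines get past this point at once.
        async with semaphore:
            return await client.scrape(url=url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch(u) for u in urls))


# Usage (inside `async with AsyncCrawlbrulee.from_env() as client:`):
#     pages = await scrape_many(client, ["https://example.com", "https://example.org"])
```

A sensible value for max_concurrency probably depends on the concurrency allowance of your plan.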

Configuration

| Option | Default | Description |
|---|---|---|
| api_key | | Sent as Authorization: Bearer …. Required unless you use from_env(). |
| base_url | https://api.crawlbrulee.com | Override the target host (local dev / staging). Trailing slashes are stripped. |
| timeout | None (no timeout) | Per-request timeout in seconds. A per-call timeout= overrides it. |

Crawlbrulee.from_env(**overrides) reads the key from CRAWLBRULEE_API_KEY and forwards any other option through.
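
To make the option handling concrete, here is a rough sketch of the behaviour described above: read the key from the environment, pass other options through, and strip trailing slashes from base_url. ClientConfig and config_from_env are hypothetical stand-ins written for illustration. They are not SDK classes, and the error raised for a missing key is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ClientConfig:  # hypothetical stand-in, not part of the SDK
    api_key: str
    base_url: str = "https://api.crawlbrulee.com"
    timeout: Optional[float] = None  # None means no timeout

    def __post_init__(self) -> None:
        # Normalise the host so "https://host/" and "https://host" behave the same.
        object.__setattr__(self, "base_url", self.base_url.rstrip("/"))


def config_from_env(environ: Optional[Mapping[str, str]] = None, **overrides: Any) -> ClientConfig:
    """Build a config from CRAWLBRULEE_API_KEY plus any explicit overrides."""
    env = os.environ if environ is None else environ
    api_key = env.get("CRAWLBRULEE_API_KEY")
    if not api_key:
        # Assumption: the real SDK may raise a different error type here.
        raise ValueError("CRAWLBRULEE_API_KEY is not set")
    return ClientConfig(api_key=api_key, **overrides)
```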

Both clients support context managers (with / async with) and expose close() / aclose() to release the connection pool.


Request inputs

Top-level request fields are plain keyword arguments. Nested structures are typed dataclasses (importable from crawlbrulee) — or plain dicts, if you prefer:

from crawlbrulee import ScrapeExtract, ScreenshotRequest

client.scrape(
    url="https://news.example.com/article-1",
    extract=ScrapeExtract(
        markdown=True,
        metadata=True,
        links=True,
        screenshot=ScreenshotRequest(type="full_page", device_mode="desktop"),
    ),
    require_js=True,
    proxy="advanced",
    exclude_selectors=["nav", "footer"],
    cache={"max_age": 3600},          # dataclass or dict, your call
    location={"country": "US"},
)

None-valued options are omitted from the request entirely, so the server's defaults apply.
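
As an illustration of what omitting None-valued options means for the request body, the sketch below prunes None entries recursively from a mix of dicts and dataclasses. It is not the SDK's actual serializer, and prune_none is a hypothetical helper name.

```python
from dataclasses import asdict, is_dataclass
from typing import Any


def prune_none(value: Any) -> Any:
    """Return a JSON-ready copy of `value` with None-valued keys removed."""
    # Dataclass instances (not classes) are converted to plain dicts first.
    if is_dataclass(value) and not isinstance(value, type):
        value = asdict(value)
    if isinstance(value, dict):
        return {k: prune_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, (list, tuple)):
        return [prune_none(v) for v in value]
    # Falsy values such as False or 0 are kept: only None means "unset".
    return value
```

Under this scheme, passing proxy=None is equivalent to not passing proxy at all, while markdown=False is still sent explicitly.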


API reference

Every method returns a typed dataclass and accepts a per-call timeout= (seconds).

Scraping

| Method | Description |
|---|---|
| scrape(url, **opts) | Scrape a URL synchronously; blocks until done. |
| scrape_async(url, **opts) | Submit a background job; returns { job_id } immediately. |
| get_scrape_status(job_id) | Current job state: pending, running, done, or failed. |
| get_scrape_result(job_id) | Result of a completed job (raises if not finished). |
| wait_for_scrape(job_id, interval=2.0, timeout=300.0) | Poll until terminal, then return the result. |

job = client.scrape_async(url="https://example.com")
page = client.wait_for_scrape(job.job_id, interval=2.0, timeout=300.0)

wait_for_scrape raises a CrawlbruleeError with error_name="job_failed" if the job fails, or error_name="request_timeout" if the wait expires (timeout=0 waits forever).
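
The polling semantics described above (fixed interval, overall deadline, timeout=0 meaning no deadline) can be pictured with a generic loop. This is a sketch, not the SDK's implementation: poll_until_terminal is a hypothetical function, the state is supplied by a caller-provided callable, and the built-in TimeoutError stands in for the CrawlbruleeError that the SDK raises.

```python
import time
from typing import Callable

TERMINAL_STATES = {"done", "failed"}


def poll_until_terminal(
    get_state: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_state() every `interval` seconds until it returns a terminal state.

    timeout=0 disables the deadline. Raises TimeoutError when the wait expires.
    """
    deadline = None if timeout == 0 else clock() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout}s")
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without real waiting.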

Mapping

result = client.map(
    url="https://example.com",
    sitemap_only=False,
    types={"internal": True, "external": False},
    max_urls=5_000,
    page=1,
    limit=1_000,
)
print(len(result.links), "links on this page;", result.meta.pagination.total_pages, "pages in total")
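
Since map results are paginated, collecting every link means walking the pages. A minimal sketch follows, assuming pages are numbered from 1 (as the page=1 argument above suggests) and that total_pages reflects the chosen limit. iter_all_links is a hypothetical helper, not an SDK method.

```python
from typing import Any, Iterator


def iter_all_links(client: Any, url: str, limit: int = 1_000, **opts: Any) -> Iterator[Any]:
    """Yield links from every page of a map() result, one request per page."""
    page = 1
    while True:
        result = client.map(url=url, page=page, limit=limit, **opts)
        yield from result.links
        # Stop once the last page reported by the server has been consumed.
        if page >= result.meta.pagination.total_pages:
            break
        page += 1


# Usage:
#     for link in iter_all_links(client, "https://example.com", max_urls=5_000):
#         print(link)
```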

Account

| Method | Description |
|---|---|
| usage() | Current billing-cycle snapshot: credits, quota %, concurrency, reset time. |
| whoami() | Organization + token identity behind the API key. |

Errors

Every failure raised by the SDK subclasses CrawlbruleeError:

| Class | When |
|---|---|
| AuthenticationError | 401 / 403 (missing, invalid, or unauthorized key). |
| RateLimitError | 429. Exposes retry_after_ms and limited_by when provided. |
| UsageAllocationError | Plan limit hit. Exposes reason and usage. |
| ValidationError | Bad request (invalid_url, url_too_long, blocked_url, …). |
| NotFoundError | 404 (e.g. unknown async job_id). |
| TransportError | Network failure, timeout, or non-JSON response. |
| CrawlbruleeError | Base class, raised for any other API error. Always has status, error_name, message. |

import time
from crawlbrulee import Crawlbrulee, RateLimitError, UsageAllocationError

client = Crawlbrulee.from_env()
try:
    client.scrape(url="https://example.com")
except RateLimitError as err:
    time.sleep((err.retry_after_ms or 1000) / 1000)
    # retry…
except UsageAllocationError as err:
    print("Plan limit hit:", err.reason, err.usage)
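
The "# retry…" step above can be wrapped in a small helper. The sketch below is one possible shape, not something the SDK ships: call_with_retry is a hypothetical function that takes the exception class to retry on as a parameter, so it can be used with RateLimitError without importing the SDK here. It honours retry_after_ms when the error carries it and falls back to a fixed delay otherwise.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Type[Exception],
    max_attempts: int = 3,
    default_delay_ms: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on `retry_on` up to `max_attempts` times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Prefer the server's hint; fall back to the default delay.
            delay_ms = getattr(err, "retry_after_ms", None) or default_delay_ms
            sleep(delay_ms / 1000)
    raise AssertionError("unreachable")


# Usage:
#     page = call_with_retry(lambda: client.scrape(url="https://example.com"),
#                            retry_on=RateLimitError)
```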

For exhaustive branching, switch on err.error_name.


Notes on the wire format

The SDK mirrors the API's JSON shapes faithfully. The one exception: the async job status response uses camelCase on the wire (jobId, createdAt); the SDK exposes Pythonic job_id / created_at on AsyncJobStatusResponse.
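
For readers curious what that renaming amounts to, the sketch below converts camelCase keys to snake_case. It only illustrates the mapping (jobId to job_id, createdAt to created_at) and is not how the SDK is necessarily implemented. camel_to_snake and snake_keys are hypothetical helpers.

```python
import re
from typing import Any

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def camel_to_snake(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


def snake_keys(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a flat JSON object with snake_case keys."""
    return {camel_to_snake(key): value for key, value in payload.items()}
```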


Development

uv sync                 # or: pip install -e ".[dev]"
ruff check . && ruff format --check .
pyright
pytest

The SDK keeps a single runtime dependency (httpx) on purpose — please keep it that way when contributing.

License

AGPL-3.0-only
