Skip to main content

Self-hostable crawling, scraping, mapping, search, and extraction infrastructure.

Project description

CandleCrawl

Open-source crawling, scraping, mapping, search, and extraction infrastructure for developer-heavy research systems.

CI status Security scan status Release artifact status Latest release License

OpenAPI contract Python 3.12 FastAPI Playwright MIT license

CandleCrawl is a self-hostable web ingestion service for teams that need more than a thin scrape endpoint. It combines:

  • direct HTTP retrieval with retry, backoff, and Retry-After handling,
  • Playwright rendering for JS-heavy pages and action-driven workflows,
  • depth-aware crawling with politeness controls and export paths,
  • Firecrawl-style v1 and v2 compatibility surfaces,
  • provider-backed search and extraction helpers,
  • a retired Hermes compatibility surface kept out of package artifacts,
  • a repo structure intended to remain useful as a standalone system or as a substrate inside larger research stacks.

The target audience here is not "people who want a demo". It is engineers, infra-minded researchers, and platform builders who need a service they can inspect, adapt, and embed into higher-level intelligence systems.

Table Of Contents

What CandleCrawl Is For

CandleCrawl exists for cases where "fetch me a page" is too small an abstraction and "run a full browser farm with a huge control plane" is too much overhead.

Typical uses:

  • building research agents that need repeatable scrape, map, crawl, and extract primitives,
  • standing up a local or internal alternative to managed crawl/scrape APIs,
  • feeding downstream retrieval, RAG, indexing, or dossier-generation systems,
  • giving another system a stable HTTP boundary over page acquisition, browser rendering, and crawl policy.

In practice the repo currently serves three adjacent roles:

  1. A standalone crawl/scrape/search/extract API.
  2. A Firecrawl-style compatibility layer for v1 / v2 request shapes.
  3. A provider-enabled substrate used by Hermes through service and SDK boundaries.

Capability Snapshot

Area Status What exists now Notes
Single-page scrape ๐ŸŸข POST /v1/scrape, POST /v2/scrape HTTP-first, Playwright when required
Batch scrape ๐ŸŸข POST /v1/scrape/bulk, POST /v2/batch/scrape Concurrent and cancelable
Site map discovery ๐ŸŸข POST /v1/map, POST /v2/map Includes sitemap probing and link discovery
Async crawl jobs ๐ŸŸข POST /v1/crawl, POST /v2/crawl In-memory or Redis-backed job state
Crawl export ๐ŸŸข /v1/crawl/{id}/export JSONL export path exists today
Search aggregation ๐ŸŸข POST /v2/search Serper-backed web/news/image search
Structured extraction ๐ŸŸข POST /v2/extract Current implementation is scrape-first, structure-second
JS rendering ๐ŸŸข Playwright-backed Startup preflight now surfaces browser readiness
Crawl politeness ๐ŸŸข robots.txt, crawl delay, path rules, budgets Designed for extension, not just happy path demos
Provider abstraction ๐ŸŸข Serper, Scrape.do, OpenRouter Useful both standalone and for Hermes integration
Cost telemetry ๐ŸŸข /v1/hermes/costs/* Research-job oriented, not generic billing
Hermes BCAS research โšช retired /v1/hermes/* compatibility routes Historical bridge code is quarantined under legacy/ and is not packaged
Public contract discipline ๐ŸŸก contracts/openapi-v1.yaml Draft contract, versioned intentionally
Distributed frontier ๐Ÿ”ต MemoryFrontier today Redis/Ray-ready direction documented, not fully shipped

Quick Start

Local Development

git clone https://github.com/kmccleary3301/candlecrawl.git
cd candlecrawl

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ".[service,browser,pdf,ocr,test]"

python -m playwright install chromium
candlecrawl serve --host 0.0.0.0 --port 3010

Package-Oriented Development

CandleCrawl also exposes a lightweight installable package surface for SDK, CLI, and Hermes integration work:

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ".[service,browser,pdf,ocr,test]"

candlecrawl version
candlecrawl doctor
candlecrawl serve --host 0.0.0.0 --port 3010

For base SDK-only usage, install without service extras:

pip install -e .
python -c "import candlecrawl, candlecrawl.client; print(candlecrawl.__version__)"

The package CLI is the integration boundary for downstream systems. Service runtime code is packaged under the private candlecrawl._server namespace; top-level app is kept only in the repository as a development compatibility shim for older tests and scripts.

SDK smoke:

from candlecrawl import AsyncCandleCrawlClient
from candlecrawl.schemas import ScrapeRequest

async with AsyncCandleCrawlClient(base_url="http://127.0.0.1:3010") as client:
    result = await client.scrape(
        ScrapeRequest(url="https://example.com", formats=["markdown", "links"])
    )
    print(result.data)

Health check:

curl http://127.0.0.1:3010/health | jq

Expected shape:

{
  "status": "healthy",
  "version": "1.0.0",
  "browserReady": true,
  "browserError": null
}

Docker

docker build -t candlecrawl:local .
docker run --rm -p 3010:3010 candlecrawl:local

First Useful Request

curl -sS http://127.0.0.1:3010/v2/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": true
  }' | jq

Installation And Setup

Requirements

Requirement Why it matters
Python 3.12 Matches CI and tested runtime
Playwright Chromium runtime Required for browser-rendered paths
Optional Redis Used for queue/state when available
Optional provider keys Required only for provider-backed search/extract/BCAS paths

Environment Variables

Core runtime settings come from app/config.py and are loaded via .env or process environment.

Variable Default Purpose
HOST 0.0.0.0 API bind host
PORT 3002 API port
REDIS_URL local Redis URL Crawl queue/job persistence when Redis is available
RATE_LIMIT_REQUESTS 100 Rate limit numerator
RATE_LIMIT_WINDOW 60 Rate limit window in seconds
DEFAULT_TIMEOUT 30 Default HTTP timeout
MAX_CONCURRENT_REQUESTS 5 Concurrency cap for internal async work
USER_AGENT Chrome-like UA Default fetch/browser UA
RETRY_MAX_ATTEMPTS 3 HTTP retry ceiling
BACKOFF_BASE_MS 200 Retry backoff base
BACKOFF_MAX_MS 3000 Retry backoff cap
CACHE_ENABLED true Cache toggle for Firecrawl-style semantics
CACHE_DEFAULT_MAX_AGE_MS 172800000 Default cache TTL in milliseconds

Optional provider configuration:

Variable Used by Needed for
SERPER_DEV_API_KEY v2/search, Hermes search flows Provider-backed search
SCRAPE_DO_API_KEY external fallback routes Secondary fetch/render fallback
OPENROUTER_API_KEY Hermes compose/BCAS flows LLM-backed research synthesis
OPENAI_API_KEY optional extract-related work Future extraction integrations

Browser Runtime Preflight

The service now checks Playwright Chromium availability at startup. If the browser runtime is missing, /health degrades cleanly and reports a concrete fix:

Playwright Chromium runtime missing; run python -m playwright install chromium

That is deliberate. A broken browser runtime should be obvious before you discover it mid-scrape.

For more operational detail, see docs/GETTING_STARTED.md.

Configuration

Request Shaping

Scrape and crawl requests support:

  • format selection: markdown, html, rawHtml, links, screenshot,
  • tag inclusion/exclusion,
  • main-content filtering,
  • custom headers,
  • wait/timeout controls,
  • mobile emulation,
  • optional browser actions for v2/scrape,
  • crawl include/exclude paths,
  • subdomain/external-link policy,
  • dedupe and query-parameter handling,
  • time/byte/page/concurrency budgets.

Storage And Queue Behavior

Mode Behavior
Redis available Crawl job metadata and queues use Redis/RQ
Redis unavailable Service falls back to in-memory job storage
Browser unavailable Browser-backed routes degrade and health exposes the cause

Compatibility Positioning

Surface Intent
v1/* Lightweight Firecrawl-style compatibility and existing callers
v2/* Cleaner evolving contract surface
/v1/hermes/* Retired compatibility surface; returns 410

API Surface

Core Endpoints

Endpoint Method Purpose Notes
/health GET Service health Includes browser runtime readiness
/v1/scrape POST Scrape one URL Classic Firecrawl-style request
/v2/scrape POST Scrape one URL with richer v2 semantics Supports actions and Firecrawl-style payload mapping
/v1/scrape/bulk POST Bulk scrape Returns per-URL responses
/v2/batch/scrape POST Async batch scrape Poll, cancel, inspect errors
/v1/map POST Discover links from a root URL Lightweight mapping surface
/v2/map POST Discover links with v2 payload Search/filter aware
/v1/crawl POST Start crawl job Async crawl lifecycle
/v2/crawl POST Start crawl job with v2 contract Includes idempotency support
/v2/search POST Provider-backed search Web, news, image search today
/v2/extract POST Extract structured docs from URLs Current implementation is URL scrape aggregation

Crawl Lifecycle Endpoints

Endpoint Method Purpose
/v1/crawl/{job_id} GET Poll crawl status
/v1/crawl/{job_id}/cancel POST Cancel crawl
/v1/crawl/{job_id}/export GET Export results
/v2/crawl/{job_id} GET Poll v2 crawl job
/v2/crawl/{job_id} DELETE Cancel v2 crawl job
/v2/crawl/{job_id}/errors GET Error inspection
/v2/batch/scrape/{job_id} GET Poll batch scrape
/v2/batch/scrape/{job_id} DELETE Cancel batch scrape
/v2/batch/scrape/{job_id}/errors GET Error inspection

Hermes Bridge Endpoints

These are not the main public story of CandleCrawl. The former Hermes bridge routes are retired and return a 410 compatibility envelope.

Endpoint Method Purpose
/v1/hermes/* varied Retired compatibility surface; use Hermes-owned APIs for higher-level research workflows

Contract Artifact

The evolving draft contract lives at:

That file is part of CI, release discipline, and consumer-compatibility signaling. It is not filler.

Usage Examples

1. Minimal v2 scrape

curl -sS http://127.0.0.1:3010/v2/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html", "links"],
    "onlyMainContent": true,
    "timeout": 15000
  }'

2. Action-driven browser scrape

curl -sS http://127.0.0.1:3010/v2/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      {
        "type": "evaluate",
        "script": "document.body.insertAdjacentHTML(\"beforeend\", \"<p>runtime-marker</p>\");"
      }
    ]
  }'

3. Start an async crawl

curl -sS http://127.0.0.1:3010/v2/crawl \
  -H 'Content-Type: application/json' \
  -H 'X-Idempotency-Key: demo-crawl-001' \
  -d '{
    "url": "https://example.com",
    "limit": 25,
    "maxDepth": 2,
    "includeSubdomains": false,
    "allowExternalLinks": false
  }'

Poll it:

curl -sS http://127.0.0.1:3010/v2/crawl/<job_id> | jq

4. Discover URLs with map

curl -sS http://127.0.0.1:3010/v2/map \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 250,
    "search": "blog"
  }'

5. Provider-backed search

curl -sS http://127.0.0.1:3010/v2/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "site:openai.com reasoning models",
    "limit": 5,
    "sources": [{"type": "web"}, {"type": "news"}]
  }'

6. Multi-URL extract

curl -sS http://127.0.0.1:3010/v2/extract \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": [
      "https://example.com",
      "https://www.iana.org/domains/reserved"
    ],
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    }
  }'

7. Retired Hermes BCAS research call

curl -sS http://127.0.0.1:3010/v1/hermes/research \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Map recent memory-augmented model work associated with Ali Behrouz",
    "tier": "TARGETED",
    "max_searches": 4,
    "model": "openai/gpt-5-nano",
    "use_preplanning": true
  }'

Python Example

import httpx

payload = {
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": True,
}

with httpx.Client(base_url="http://127.0.0.1:3010", timeout=30.0) as client:
    response = client.post("/v2/scrape", json=payload)
    response.raise_for_status()
    data = response.json()

print(data["success"])
print(data["data"]["metadata"]["title"])
print(data["data"]["markdown"][:200])

Architecture Notes

At a high level, CandleCrawl is built around a deliberately small set of moving parts:

  1. FastAPI request layer
    • request validation,
    • compatibility shims,
    • rate limiting,
    • job lifecycle endpoints.
  2. Scraping service
    • HEAD/GET probing,
    • HTML/file routing,
    • Playwright escalation,
    • markdown/html/link extraction.
  3. Crawl frontier
    • currently in-memory,
    • intentionally shaped for Redis-backed replacement.
  4. Provider adapters
    • Serper for search,
    • Scrape.do for fallback scraping,
    • OpenRouter for Hermes composition/BCAS paths.
  5. Operational layers
    • browser runtime preflight,
    • queue fallback when Redis is absent,
    • cost tracking,
    • contract and CI discipline.

Request flow, simplified:

Client
  -> FastAPI endpoint
    -> request normalization / rate limiting
      -> scrape | map | crawl | search | extract path
        -> HTTP client and/or Playwright
          -> extraction + normalization
            -> response envelope / job persistence / export

The service intentionally keeps the frontier, provider clients, and browser rendering logic separate enough that you can swap internals without changing the HTTP contract every week.

For deeper discussion, see:

Repository Map

candlecrawl/
โ”œโ”€โ”€ app/
โ”‚   โ”œโ”€โ”€ main.py                  # FastAPI app, compatibility routes, crawl lifecycle
โ”‚   โ”œโ”€โ”€ scraper.py               # HTTP-first scraping + Playwright escalation
โ”‚   โ”œโ”€โ”€ models.py                # Pydantic request/response contracts
โ”‚   โ”œโ”€โ”€ frontier.py              # Crawl frontier abstraction; in-memory implementation today
โ”‚   โ”œโ”€โ”€ http_client.py           # Retry/backoff HTTP client wrapper
โ”‚   โ”œโ”€โ”€ config.py                # Environment-backed runtime settings
โ”‚   โ”œโ”€โ”€ metrics.py               # Prometheus metric primitives
โ”‚   โ”œโ”€โ”€ cost_tracking.py         # Provider and stage-level cost accounting
โ”‚   โ”œโ”€โ”€ cost_endpoints.py        # Hermes cost telemetry API
โ”‚   โ”œโ”€โ”€ model_pricing.py         # Model cost estimation helpers
โ”‚   โ”œโ”€โ”€ chunking.py              # Search/research-oriented chunking helpers
โ”‚   โ”œโ”€โ”€ providers/
โ”‚   โ”‚   โ”œโ”€โ”€ base.py              # Shared provider exception types
โ”‚   โ”‚   โ”œโ”€โ”€ serper.py            # Search/news/image provider client
โ”‚   โ”‚   โ”œโ”€โ”€ scrapedo.py          # Scrape.do fallback client
โ”‚   โ”‚   โ””โ”€โ”€ openrouter.py        # LLM provider client for BCAS and compose flows
โ”œโ”€โ”€ legacy/
โ”‚   โ””โ”€โ”€ hermes_bcas.py           # quarantined historical BCAS bridge; not packaged
โ”‚   โ””โ”€โ”€ scripts/
โ”‚       โ””โ”€โ”€ provider_smoketests.py
โ”œโ”€โ”€ contracts/
โ”‚   โ””โ”€โ”€ openapi-v1.yaml          # Draft public contract artifact tracked by CI
โ”œโ”€โ”€ docs/
โ”‚   โ”œโ”€โ”€ GETTING_STARTED.md       # Installation, env, smoke checks, troubleshooting
โ”‚   โ”œโ”€โ”€ ARCHITECTURE.md          # Internal structure, request flow, queue model
โ”‚   โ”œโ”€โ”€ API_AND_OPERATIONS.md    # Endpoint catalog, examples, operational behaviors
โ”‚   โ””โ”€โ”€ BRANCH_PROTECTION_POLICY.md
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ test_api.py              # Endpoint and response-shape coverage
โ”‚   โ”œโ”€โ”€ test_frontier.py         # Frontier behavior
โ”‚   โ”œโ”€โ”€ test_http_client.py      # Retry/backoff semantics
โ”‚   โ”œโ”€โ”€ test_crawl_policies.py   # Crawl policy rules
โ”‚   โ”œโ”€โ”€ test_crawl_extreme.py    # Stress-ish edge cases
โ”‚   โ”œโ”€โ”€ test_hermes_endpoints.py # Hermes bridge routes
โ”‚   โ”œโ”€โ”€ test_providers.py        # Provider client coverage
โ”‚   โ”œโ”€โ”€ test_performance.py      # Performance-oriented checks
โ”‚   โ””โ”€โ”€ test_scraper_runtime.py  # Browser runtime preflight and error normalization
โ”œโ”€โ”€ .github/workflows/
โ”‚   โ”œโ”€โ”€ ci.yml
โ”‚   โ”œโ”€โ”€ security.yml
โ”‚   โ”œโ”€โ”€ release-artifact.yml
โ”‚   โ””โ”€โ”€ consumer-compat-dispatch.yml
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ QueryLake_Integration_Plan.md
โ”œโ”€โ”€ CHANGELOG.md
โ”œโ”€โ”€ CONTRIBUTING.md
โ”œโ”€โ”€ VERSIONING.md
โ””โ”€โ”€ README.md

Testing, CI, And Release Discipline

Local Test Commands

pytest -q

Focused commands:

pytest -q tests/test_api.py tests/test_frontier.py tests/test_http_client.py
pytest -q tests/test_scraper_runtime.py

Optional provider smoke tests:

python -m app.scripts.provider_smoketests

CI And Policy Surface

Check Purpose
CandleCrawl CI / unit-and-api Core unit/API behavior
CandleCrawl CI / contract-validation Contract artifact presence and required path validation
CandleCrawl Security Scan / pip-audit Dependency audit
CandleCrawl Release Artifact Image build + artifact export on tags
CandleCrawl Consumer Compatibility Dispatch Optional downstream compatibility signaling

Branch protection policy is documented at docs/BRANCH_PROTECTION_POLICY.md.

Versioning

CandleCrawl follows semantic versioning. See VERSIONING.md.

Documentation Index

Document What it covers
docs/GETTING_STARTED.md Setup, env vars, first boot, smoke checks
docs/ARCHITECTURE.md Runtime design, module responsibilities, queue and render model
docs/API_AND_OPERATIONS.md Endpoint catalog, request examples, health and troubleshooting
docs/RELEASE_RUNBOOK.md Package, contract, clean-install, and Hermes candidate release gates
QueryLake_Integration_Plan.md Planned QueryLake/Ray alignment
CONTRIBUTING.md Contribution workflow and contract-change discipline
CHANGELOG.md Release history
VERSIONING.md Versioning policy

Current Design Boundaries

Some constraints are deliberate:

  • CandleCrawl is not pretending to be a full browser-farm orchestration platform.
  • The current frontier implementation is intentionally simple and forward-compatible rather than prematurely distributed.
  • v2/extract is useful today, but it is still closer to scrape aggregation than a final-form schema-first extraction engine.
  • Hermes compatibility routes exist because they are operationally useful, not because they define the whole repo.
  • The contract artifact is still marked draft; compatibility matters, but the surface is still evolving.

That is the right tradeoff for the current project stage. The system is already useful, inspectable, and extensible without pretending it is finished.

Contributing

Start with CONTRIBUTING.md. The short version:

  • branch from main,
  • add tests for behavior changes,
  • keep contract-impacting changes explicit,
  • run the relevant test suite before opening a PR.

If your change affects endpoint semantics, update contracts/openapi-v1.yaml and note the compatibility implications.

License

CandleCrawl is released under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

candlecrawl-0.1.0a1.tar.gz (101.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

candlecrawl-0.1.0a1-py3-none-any.whl (87.2 kB view details)

Uploaded Python 3

File details

Details for the file candlecrawl-0.1.0a1.tar.gz.

File metadata

  • Download URL: candlecrawl-0.1.0a1.tar.gz
  • Upload date:
  • Size: 101.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for candlecrawl-0.1.0a1.tar.gz
Algorithm Hash digest
SHA256 67dd272007d0935a5b07297a19a785f24d951415a8c3a5f60e558b83e2496a6e
MD5 2566bb53c7bb694480128effbd98ced4
BLAKE2b-256 0973595730844e6ed92384717c5edf27e34ac91ebded15ff354420f4eb736c88

See more details on using hashes here.

File details

Details for the file candlecrawl-0.1.0a1-py3-none-any.whl.

File metadata

  • Download URL: candlecrawl-0.1.0a1-py3-none-any.whl
  • Upload date:
  • Size: 87.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for candlecrawl-0.1.0a1-py3-none-any.whl
Algorithm Hash digest
SHA256 ada5662e7dac3eba84da2451c841ae647cbc921f7a00caac319de85a2b30a985
MD5 6f39aae3c7ffcffc5c2f558e000e608a
BLAKE2b-256 d29d38ff4356ddc30541e2276a007b360c444364467b4a2a1347c60f6859f083

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page