A structured, resumable web crawling framework for AI-ready datasets.


Ladon

CI · Lint · Python 3.11+ · License: AGPL-3.0-only

A Python framework for building structured, resumable web crawlers — designed for domains where data quality matters.

What is Ladon?

Ladon enforces typed domain objects at every stage of the crawl pipeline through the SES protocol (Source / Expander / Sink). The difference from Scrapy — a proven, mature tool — is structural: instead of weakly typed scrapy.Item fields, you define typed dataclasses at the protocol level (e.g. a CommentRecord with enforced field types). The output is structured and typed without a post-processing step. This matters when the destination is an LLM training pipeline or any domain where schema correctness is not optional.
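A sketch of what such a typed record can look like — the field names below are illustrative, not Ladon's actual CommentRecord schema:

```python
from dataclasses import dataclass

# Illustrative typed record in the spirit of the CommentRecord example.
# Field names and types are assumptions, not Ladon's real schema.
@dataclass(frozen=True)
class CommentRecord:
    story_id: int
    comment_id: int
    author: str
    text: str
    depth: int
```

Because the record is a frozen dataclass, every field is declared with a type and the object is immutable once constructed — the schema is enforced at creation time rather than patched up after the crawl.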

The built-in HTTP layer handles retries, exponential back-off, per-domain rate limiting, circuit breaking, and robots.txt enforcement — so adapter authors focus on domain logic, not infrastructure.
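The retry behaviour follows the standard exponential back-off pattern. A minimal, self-contained sketch of that pattern — the base delay, factor, and cap here are illustrative, not Ladon's actual defaults:

```python
# Generic exponential back-off schedule: each retry waits longer than the
# last, up to a cap. Values are illustrative, not Ladon's configuration.
def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 5):
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor

# list(backoff_delays()) → [0.5, 1.0, 2.0, 4.0, 8.0]
```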

Quick start

The canonical example is ladon-hackernews — an adapter that crawls the HN top-stories list and writes comments to DuckDB:

# Install Ladon core (until ladon-crawl lands on PyPI, install from source)
pip install git+https://github.com/MoonyFringers/ladon.git
pip install git+https://github.com/MoonyFringers/ladon-hackernews.git
ladon-hackernews --top 30 --out hn.db

Once on PyPI (v0.0.1): pip install ladon-crawl (or pip install ladon-crawl ladon-hackernews for the HN example)

No authentication. No external server. 30 stories and their comments in under a minute.

The LLM training pipeline

ladon-hackernews --top 500 --out hn.db
    → export_parquet("hn.db", "hn.parquet")
        → training pipeline

HN comments are structured, human-authored, and high signal-to-noise. The full pipeline from install to Parquet takes under five minutes. Each run writes a ladon_runs audit table to the DuckDB file — re-running skips stories already marked done, giving you resumable crawls for free.

from ladon_hackernews import export_parquet
export_parquet("hn.db", "hn.parquet")
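The skip-if-done behaviour described above follows a common audit-table pattern. A self-contained sketch using stdlib sqlite3 — Ladon itself writes to DuckDB, and the ladon_runs column names below are assumptions for illustration:

```python
import sqlite3

# Stand-in for the DuckDB audit table; schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ladon_runs (story_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO ladon_runs VALUES (1, 'done'), (2, 'done')")

def should_crawl(story_id: int) -> bool:
    # Skip any story already marked done in the audit table.
    row = conn.execute(
        "SELECT 1 FROM ladon_runs WHERE story_id = ? AND status = 'done'",
        (story_id,),
    ).fetchone()
    return row is None
```

On a re-run, stories 1 and 2 are skipped while any new story ID is crawled — this is what makes interrupted crawls cheap to resume.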

Writing your own adapter

ladon-hackernews is the canonical reference for building an adapter. Adapters implement the SES protocol structurally — no inheritance from any Ladon base class is required. The three components to implement are:

  • Source — discovers the list of root references to crawl
  • Expander — maps a reference to a domain record and child references
  • Sink — receives each leaf record for persistence or downstream use

See the adapter authoring guide and ADR-003 for the full protocol specification. The ladon-hackernews source is the worked example.
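Structural typing means any class with the right method shapes satisfies the protocol, with no Ladon imports in your adapter. A minimal sketch of the SES shape using typing.Protocol — the method names and signatures here are illustrative; ADR-003 defines the real ones:

```python
from typing import Iterable, Protocol, runtime_checkable

# Illustrative SES shapes; real signatures are specified in ADR-003.
@runtime_checkable
class Source(Protocol):
    """Discovers the list of root references to crawl."""
    def discover(self) -> Iterable[str]: ...

@runtime_checkable
class Expander(Protocol):
    """Maps a reference to a domain record and child references."""
    def expand(self, ref: str) -> tuple[object, list[str]]: ...

@runtime_checkable
class Sink(Protocol):
    """Receives each leaf record for persistence or downstream use."""
    def write(self, record: object) -> None: ...

# No inheritance from any base class: this satisfies Source purely by
# having a matching `discover` method.
class TopStories:
    def discover(self) -> Iterable[str]:
        return ["https://news.ycombinator.com/item?id=1"]
```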

CLI reference

ladon info
ladon run --plugin MODULE:CLASS --ref URL [--respect-robots-txt]
ladon --version
command           description
ladon info        Print Ladon version, Python version, and platform
ladon run         Run a crawl using a dynamically loaded plugin class
ladon --version   Print the installed version

ladon run flags:

flag                   required   description
--plugin MODULE:CLASS  yes        Dotted import path to the CrawlPlugin class
--ref URL              yes        Top-level reference URL passed to the plugin
--respect-robots-txt   no         Honour Disallow rules and Crawl-delay directives

Exit codes: 0 success · 1 fatal error · 2 partial failures · 3 data not ready (retry later)
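A caller can branch on these exit codes, for example to schedule a retry on code 3. A minimal shell sketch — the handler below is illustrative and does not invoke the real CLI:

```shell
# Map Ladon's documented exit codes to actions. Illustrative only:
# replace the argument with "$?" after a real `ladon run` invocation.
handle_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "fatal error" ;;
    2) echo "partial failures" ;;
    3) echo "data not ready - retry later" ;;
  esac
}

handle_exit 3
```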

ladon run uses default HttpClientConfig settings. For retries, rate limiting, circuit breaking, or a persistence layer, call run_crawl() directly from Python — see ladon-hackernews — Use as a library for a full example.

Status

v0.0.1 — alpha. The SES protocol and HTTP layer are stable. One reference adapter (ladon-hackernews) is available as open source and tested against the real HN API.

What is in v0.0.1:

  • SES protocol (Source / Expander / Sink) with structural typing
  • run_crawl() runner with leaf isolation and RunResult summary
  • HttpClient with retries, back-off, rate limiting, circuit breaker, robots.txt
  • Storage protocol with LocalFileStorage
  • Repository and RunAudit persistence protocols with NullRepository
  • ladon run / ladon info CLI

What is coming in v0.1.0:

  • RunResult counter semantics redesign (issue #62)
  • Structured logging baseline (ADR-009)

Contributing

The plugin protocol is settled — contributions are welcome. Please read the documentation for design context (ADRs, plugin authoring guide) before sending a pull request.

A CLA signature is required for external contributors. The cla-assistant bot will prompt you on your first PR.

License

Ladon is released under the GNU Affero General Public License v3.0 only (AGPL-3.0-only). See LICENSE for the full text.

AGPL was chosen to ensure that improvements to the core framework — including when deployed as a networked service — remain open and available to the community. A commercial licence is available for organisations that cannot accept the AGPL terms — see LICENSE-COMMERCIAL.

ladon-hackernews is separately licensed under Apache-2.0.
