# Ladon
A Python framework for building structured, resumable web crawlers — designed for domains where data quality matters.
## What is Ladon?
Ladon enforces typed domain objects at every stage of the crawl pipeline
through the SES protocol (Source / Expander / Sink). The difference from
Scrapy — a proven, mature tool — is structural: instead of weakly typed
scrapy.Item fields, you define typed dataclasses at the protocol level
(e.g. a CommentRecord with enforced field types). The output is structured
and typed without a post-processing step. This matters when the destination
is an LLM training pipeline or any domain where schema correctness is not optional.
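The idea can be sketched with a plain dataclass. The field names below are illustrative, not the actual `CommentRecord` definition:

```python
from dataclasses import dataclass

# Illustrative only: the real CommentRecord is defined by the adapter,
# and its exact fields may differ.
@dataclass(frozen=True)
class CommentRecord:
    comment_id: int
    story_id: int
    author: str
    text: str

# frozen=True prevents mutation after construction; the type annotations
# give static checkers (and downstream consumers) a stable schema.
record = CommentRecord(comment_id=1, story_id=42, author="alice", text="hello")
```

Because the record type is declared at the protocol level, there is no separate schema to keep in sync with the crawler output.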
The built-in HTTP layer handles retries, exponential back-off, per-domain rate limiting, circuit breaking, and robots.txt enforcement — so adapter authors focus on domain logic, not infrastructure.
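The retry behaviour described above can be illustrated with a minimal sketch of exponential back-off with full jitter. This is not Ladon's actual implementation, only the standard technique:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential back-off with full jitter: the delay window doubles each
    attempt (base * 2**attempt), is capped at `cap` seconds, and a random
    point inside the window is chosen to avoid synchronized retry storms."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# The (maximum) delay window grows geometrically, then saturates at the cap.
windows = [min(30.0, 0.5 * 2 ** n) for n in range(8)]
```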
## Quick start
The canonical example is [ladon-hackernews](https://github.com/MoonyFringers/ladon-hackernews) — an adapter that crawls the HN top-stories list and writes comments to DuckDB:
```bash
# Install Ladon core (until ladon-crawl lands on PyPI, install from source)
pip install git+https://github.com/MoonyFringers/ladon.git
pip install git+https://github.com/MoonyFringers/ladon-hackernews.git

ladon-hackernews --top 30 --out hn.db
```
Once on PyPI (v0.0.1):

```bash
pip install ladon-crawl
# or, for the HN example:
pip install ladon-crawl ladon-hackernews
```
No authentication. No external server. 30 stories and their comments in under a minute.
## The LLM training pipeline
```
ladon-hackernews --top 500 --out hn.db
  → export_parquet("hn.db", "hn.parquet")
    → training pipeline
```
HN comments are structured, human-authored, and high signal-to-noise. The
full pipeline from install to Parquet takes under five minutes. Each run
writes a ladon_runs audit table to the DuckDB file — re-running skips
stories already marked done, giving you resumable crawls for free.
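The resume behaviour amounts to a skip-if-done query against the audit table. In this sketch, `sqlite3` stands in for DuckDB so the example is self-contained; the table name `ladon_runs` comes from the text above, but its schema here is a guess:

```python
import sqlite3

# Illustrative schema -- the real ladon_runs audit table in DuckDB may
# differ; only the skip-if-done pattern is the point here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ladon_runs (story_id INTEGER PRIMARY KEY, status TEXT)")
con.executemany("INSERT INTO ladon_runs VALUES (?, ?)",
                [(1, "done"), (2, "failed")])

def pending(story_ids):
    """Return only the stories not already marked done in the audit table."""
    done = {row[0] for row in con.execute(
        "SELECT story_id FROM ladon_runs WHERE status = 'done'")}
    return [s for s in story_ids if s not in done]

print(pending([1, 2, 3]))  # story 1 was done, so only 2 and 3 remain
```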
```python
from ladon_hackernews import export_parquet

export_parquet("hn.db", "hn.parquet")
```
## Writing your own adapter
ladon-hackernews is the canonical reference for building an adapter.
Adapters implement the SES protocol structurally — no inheritance from
any Ladon base class is required. The three components to implement are:
- Source — discovers the list of root references to crawl
- Expander — maps a reference to a domain record and child references
- Sink — receives each leaf record for persistence or downstream use
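"Structural" here means `typing.Protocol`-style matching: a class satisfies `Source` simply by having the right methods, with no Ladon import anywhere in its class hierarchy. The method names below are illustrative; the real SES signatures are specified in ADR-003:

```python
from typing import Iterable, Protocol, runtime_checkable

# Illustrative method names -- consult ADR-003 for the real signatures.
@runtime_checkable
class Source(Protocol):
    def discover(self) -> Iterable[str]: ...

@runtime_checkable
class Expander(Protocol):
    def expand(self, ref: str) -> tuple[object, list[str]]: ...

@runtime_checkable
class Sink(Protocol):
    def write(self, record: object) -> None: ...

class TopStories:
    """Satisfies Source structurally -- note: no Ladon base class."""
    def discover(self) -> Iterable[str]:
        return ["https://news.ycombinator.com/item?id=1"]

print(isinstance(TopStories(), Source))  # True: structural match
```

Note that `runtime_checkable` `isinstance` checks only verify that the methods exist, not their signatures; a static type checker enforces the full shape.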
See the adapter authoring guide and ADR-003 for the full protocol specification. The ladon-hackernews source is the worked example.
## CLI reference
```bash
ladon info
ladon run --plugin MODULE:CLASS --ref URL [--respect-robots-txt]
ladon --version
```
| command | description |
|---|---|
| `ladon info` | Print Ladon version, Python version, and platform |
| `ladon run` | Run a crawl using a dynamically loaded plugin class |
| `ladon --version` | Print the installed version |

`ladon run` flags:

| flag | required | description |
|---|---|---|
| `--plugin MODULE:CLASS` | yes | Dotted import path to the `CrawlPlugin` class |
| `--ref URL` | yes | Top-level reference URL passed to the plugin |
| `--respect-robots-txt` | no | Honour `Disallow` rules and `Crawl-delay` directives |
Exit codes: 0 success · 1 fatal error · 2 partial failures · 3 data not ready (retry later)
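For scheduler scripting, the flags and exit codes above can be driven from Python. `my_adapter:Plugin` is a hypothetical placeholder, not a real module:

```python
import subprocess

# Hypothetical plugin path -- substitute your own MODULE:CLASS.
cmd = [
    "ladon", "run",
    "--plugin", "my_adapter:Plugin",
    "--ref", "https://example.com",
    "--respect-robots-txt",
]

def classify(returncode: int) -> str:
    """Map the documented exit codes to a scheduler action."""
    return {0: "success", 1: "fatal", 2: "partial", 3: "retry-later"}.get(
        returncode, "fatal")

# result = subprocess.run(cmd)        # requires ladon on PATH
# print(classify(result.returncode))
```

Code 3 ("data not ready") is the one worth special-casing: it signals a clean retry rather than a failure.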
`ladon run` uses default `HttpClientConfig` settings. For retries, rate limiting, circuit breaking, or a persistence layer, call `run_crawl()` directly from Python — see "ladon-hackernews — Use as a library" for a full example.
## Status
v0.0.1 — alpha. The SES protocol and HTTP layer are stable. One reference
adapter (ladon-hackernews) is available as open source and tested against
the real HN API.
What is in v0.0.1:
- SES protocol (Source / Expander / Sink) with structural typing
- `run_crawl()` runner with leaf isolation and `RunResult` summary
- `HttpClient` with retries, back-off, rate limiting, circuit breaker, robots.txt
- `Storage` protocol with `LocalFileStorage`
- `Repository` and `RunAudit` persistence protocols with `NullRepository`
- `ladon run` / `ladon info` CLI
What is coming in v0.1.0:
- RunResult counter semantics redesign (issue #62)
- Structured logging baseline (ADR-009)
## Contributing
The plugin protocol is settled — contributions are welcome. Please read the documentation for design context (ADRs, plugin authoring guide) before sending a pull request.
A CLA signature is required for external contributors. The cla-assistant bot will prompt you on your first PR.
## License
Ladon is released under the GNU Affero General Public License v3.0 only
(AGPL-3.0-only). See LICENSE for the full text.
AGPL was chosen to ensure that improvements to the core framework — including
when deployed as a networked service — remain open and available to the
community. A commercial licence is available for organisations that cannot
accept the AGPL terms — see LICENSE-COMMERCIAL.
ladon-hackernews is separately licensed under Apache-2.0.