# ladon-hackernews
Hacker News adapter for the Ladon crawler framework.
Crawls the HN top-stories list, expands each story into its direct comments, and persists everything to a DuckDB database — ready to export as Parquet for LLM training pipelines or downstream analysis.
## Quick start

```shell
pip install ladon-hackernews
ladon-hackernews --top 30 --out hn.db
```

No authentication. No external server.
## What you get

Each run writes two tables to `hn.db`:

`hn_comments` — one row per HN comment (the leaf record):
| column | type | description |
|---|---|---|
| `id` | INTEGER | HN comment ID |
| `story_id` | INTEGER | Parent story ID |
| `parent_id` | INTEGER | Immediate parent (story or comment) |
| `by` | TEXT | Author username |
| `text` | TEXT | Raw HTML comment body |
| `time` | TIMESTAMPTZ | Comment timestamp (UTC) |
| `run_id` | TEXT | UUID of the crawl run that wrote this row |
`ladon_runs` — one row per story crawled (upserted twice: at start and at finish):
| column | type | description |
|---|---|---|
| `run_id` | TEXT | UUID for this crawl run |
| `plugin_name` | TEXT | Always `"hackernews"` |
| `top_ref` | TEXT | HN item URL that was the root of this run |
| `started_at` | TIMESTAMPTZ | When the run started (UTC) |
| `finished_at` | TIMESTAMPTZ | When the run finished; NULL while running |
| `status` | TEXT | `done`, `partial`, `not_ready`, `failed`, or `running` |
| `leaves_consumed` | INTEGER | Comments for which `sink.consume()` succeeded |
| `leaves_persisted` | INTEGER | Comments successfully written to `hn_comments` |
| `leaves_failed` | INTEGER | Comments that failed to fetch or persist |
| `branch_errors` | INTEGER | Expander-level errors (branch could not be expanded) |
| `errors` | TEXT | JSON array of error message strings |
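The run table can be queried the same way as the comments. As a sketch against the schema documented above, this lists stories whose most recent crawl did not finish cleanly:

```sql
-- Stories whose crawl was not fully successful
SELECT top_ref, status, leaves_failed, branch_errors
FROM ladon_runs
WHERE status <> 'done'
ORDER BY started_at DESC;
```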
## Sample DuckDB query

```sql
-- Top commenters across all crawled stories
SELECT "by", COUNT(*) AS comments
FROM hn_comments
GROUP BY "by"
ORDER BY comments DESC
LIMIT 10;
```
HN comments are structured, human-authored, and have a high signal-to-noise ratio — a useful corpus for instruction tuning and dialogue modelling. A typical pipeline looks like:

```text
ladon-hackernews --top 500 --out hn.db
  → export_parquet("hn.db", "hn.parquet")
  → training pipeline
```
## Export to Parquet

```python
from ladon_hackernews import export_parquet

export_parquet("hn.db", "hn.parquet")
```
## CLI reference

```shell
ladon-hackernews [--top N] [--out PATH] [--verbose]
```

| flag | default | description |
|---|---|---|
| `--top N` | 30 | Number of top stories to crawl (range: 1–500) |
| `--out PATH` | `hn.db` | Output DuckDB database path |
| `--verbose`, `-v` | off | Show DEBUG-level framework messages (leaf warnings, HTTP timings) |
In default mode the terminal shows one progress line per story and a summary at the end. Framework-level noise (leaf unavailable, expander branch failed) is suppressed; partial stories print a `↳` hint instead. Pass `--verbose` to expose the raw framework log messages.
## Use as a library

The CLI is the simplest way to run a crawl, but `HNPlugin` and `HNDuckDBRepository` can be used directly as a library — useful when you need custom scheduling, batching, or integration into a larger pipeline.
```python
import uuid
from datetime import datetime, timezone

from ladon.networking.client import HttpClient
from ladon.networking.config import HttpClientConfig
from ladon.persistence import RunAudit, RunRecord
from ladon.plugins.errors import ExpansionNotReadyError
from ladon.plugins.models import Ref
from ladon.runner import RunConfig, run_crawl

from ladon_hackernews import HNPlugin, HNDuckDBRepository

plugin = HNPlugin(top=10)
config = RunConfig()
client_config = HttpClientConfig(user_agent="my-bot/1.0")

with HNDuckDBRepository("hn.db") as repo, HttpClient(client_config) as client:
    for story_ref in plugin.source.discover(client):
        if not isinstance(story_ref, Ref):
            raise TypeError(f"unexpected type {type(story_ref).__name__}")

        run_id = str(uuid.uuid4())
        run = RunRecord(
            run_id=run_id,
            plugin_name=plugin.name,
            top_ref=story_ref.url,
            started_at=datetime.now(tz=timezone.utc),
            status="running",
        )
        if isinstance(repo, RunAudit):
            repo.record_run(run)

        try:
            result = run_crawl(
                story_ref, plugin, client, config,
                # Default-argument capture binds run_id to each lambda.
                on_leaf=lambda rec, _, _id=run_id: repo.write_leaf(rec, _id),
            )
            run.branch_errors = sum(
                1 for e in result.errors if e.startswith("expander branch")
            )
            run.status = (
                "partial"
                if result.leaves_failed
                or result.leaves_consumed > result.leaves_persisted
                or run.branch_errors
                else "done"
            )
            run.leaves_consumed = result.leaves_consumed
            run.leaves_persisted = result.leaves_persisted
            run.leaves_failed = result.leaves_failed
            run.errors = result.errors
        except ExpansionNotReadyError:
            run.status = "not_ready"
        except Exception as exc:
            run.status = "failed"
            run.errors = (str(exc),)
        finally:
            run.finished_at = datetime.now(tz=timezone.utc)
            if isinstance(repo, RunAudit):
                repo.record_run(run)
```
## How it works

This adapter implements the Ladon SES (Source / Expander / Sink) protocol against the HN Firebase API.
### Pipeline

```mermaid
flowchart TB
    subgraph plugin ["SES Plugin"]
        direction LR
        SRC["HNSource\ndiscover()"] -- "Ref × N" --> EXP["HNExpander\nexpand()"] -- "StoryRecord\nRef × M" --> SNK["HNSink\nconsume()"]
    end
    subgraph persistence ["Persistence"]
        direction LR
        REPO["HNDuckDBRepository"] --> DB[("hn.db")] -- "export_parquet()" --> PQ[("hn.parquet")]
    end
    SNK -- "CommentRecord" --> REPO
```
### Domain records

```mermaid
flowchart LR
    SR["StoryRecord\n─────────────\nid · title · url\nby · score · time\ndescendants · comment_ids"]
    CR["CommentRecord\n─────────────\nid · story_id · parent_id\nby · text · time"]
    SR -- "expands into" --> CR
```
### SES class map

| Layer | Class | HN API call |
|---|---|---|
| Source | `HNSource` | `GET /v0/topstories.json` → story ID list |
| Expander | `HNExpander` | `GET /v0/item/{story_id}.json` → comment refs |
| Sink | `HNSink` | `GET /v0/item/{comment_id}.json` → `CommentRecord` |
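The three calls in the table all target the public HN Firebase API. A minimal sketch of the endpoint construction (the helper names here are illustrative, not part of the package):

```python
# Base URL of the public Hacker News Firebase API.
HN_API = "https://hacker-news.firebaseio.com/v0"


def topstories_url() -> str:
    # Source: the list of current top story IDs.
    return f"{HN_API}/topstories.json"


def item_url(item_id: int) -> str:
    # Expander and Sink: a single story or comment by numeric ID.
    return f"{HN_API}/item/{item_id}.json"
```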
`HNDuckDBRepository` implements both `Repository` (leaf persistence) and `RunAudit` (run history) from `ladon.persistence` — structurally, with no Ladon base class imported.

Each story is one independent run. The run audit trail lets you resume from the last successful crawl:

```python
last = repo.get_last_run("hackernews")  # most recent "done" run
```
## Writing your own adapter

`ladon-hackernews` is the canonical reference for building a Ladon adapter. See the Ladon documentation and ADR-003 for the full adapter authoring guide.

Key pattern: adapters implement Ladon protocols structurally — no inheritance from any Ladon base class is required. Only `RunRecord` needs to be imported for `RunAudit` implementations.
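This structural pattern can be illustrated with `typing.Protocol`: a repository satisfies a Ladon-style protocol purely by having the right methods, with no inheritance. The protocol and method below are illustrative, modelled on the `isinstance(repo, RunAudit)` check shown earlier:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RunAudit(Protocol):
    # Structural interface: any class with this method matches.
    def record_run(self, run: object) -> None: ...


class MyRepository:
    # No Ladon base class — it matches RunAudit purely by shape.
    def __init__(self) -> None:
        self.runs: list[object] = []

    def record_run(self, run: object) -> None:
        self.runs.append(run)


repo = MyRepository()
assert isinstance(repo, RunAudit)  # structural match, no inheritance
```

Note that `@runtime_checkable` `isinstance` checks only verify that the methods exist, not their signatures.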
## Development

```shell
git clone https://github.com/MoonyFringers/ladon-hackernews
cd ladon-hackernews
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```

See `CONTRIBUTING.md` for full guidelines.
## License

Apache-2.0 — see `LICENSE`.

The Ladon core framework is AGPL-3.0-only. `ladon-hackernews` is Apache-2.0 but has a runtime dependency on Ladon core; review the AGPL terms if you plan to distribute or run this as a network service.