
ladon-hackernews

Requires Python 3.11+. Licensed under Apache-2.0.

Hacker News adapter for the Ladon crawler framework.

Crawls the HN top-stories list, expands each story into its direct comments, and persists everything to a DuckDB database — ready to export as Parquet for LLM training pipelines or downstream analysis.

Quick start

```shell
pip install ladon-hackernews
ladon-hackernews --top 30 --out hn.db
```

No authentication. No external server.

What you get

Each run writes two tables to hn.db:

hn_comments — one row per HN comment (the leaf record):

| column | type | description |
|---|---|---|
| `id` | INTEGER | HN comment ID |
| `story_id` | INTEGER | Parent story ID |
| `parent_id` | INTEGER | Immediate parent (story or comment) |
| `by` | TEXT | Author username |
| `text` | TEXT | Raw HTML comment body |
| `time` | TIMESTAMPTZ | Comment timestamp (UTC) |
| `run_id` | TEXT | UUID of the crawl run that wrote this row |
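Because `text` holds the raw HTML body, you will usually want to strip tags before feeding comments into a training pipeline. A minimal sketch using only the standard library (the `strip_html` helper is ours, not part of the package):

```python
from html.parser import HTMLParser
from io import StringIO


class _TextExtractor(HTMLParser):
    """Collects character data, discarding tags and resolving entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._buf = StringIO()

    def handle_data(self, data):
        self._buf.write(data)

    def text(self):
        return self._buf.getvalue()


def strip_html(raw: str) -> str:
    """Strip tags from an HN comment body and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.text().split())
```

For example, `strip_html("<p>Hello <i>world</i> &amp; friends</p>")` yields `"Hello world & friends"`.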

ladon_runs — one row per story crawled (upserted twice: at start and finish):

| column | type | description |
|---|---|---|
| `run_id` | TEXT | UUID for this crawl run |
| `plugin_name` | TEXT | Always "hackernews" |
| `top_ref` | TEXT | HN item URL that was the root of this run |
| `started_at` | TIMESTAMPTZ | When the run started (UTC) |
| `finished_at` | TIMESTAMPTZ | When the run finished; NULL while running |
| `status` | TEXT | `done`, `partial`, `not_ready`, `failed`, or `running` |
| `leaves_consumed` | INTEGER | Comments for which `sink.consume()` succeeded |
| `leaves_persisted` | INTEGER | Comments successfully written to `hn_comments` |
| `leaves_failed` | INTEGER | Comments that failed to fetch or persist |
| `branch_errors` | INTEGER | Expander-level errors (branch could not be expanded) |
| `errors` | TEXT | JSON array of error message strings |
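Because `ladon_runs` records a per-story outcome, crawl health is one query away. A sketch using the column names documented above:

```sql
-- Runs that did not finish cleanly, most recent first
SELECT run_id, top_ref, status, leaves_failed, branch_errors
FROM ladon_runs
WHERE status <> 'done'
ORDER BY started_at DESC;
```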

Sample DuckDB query

```sql
-- Top commenters across all crawled stories
SELECT "by", COUNT(*) AS comments
FROM hn_comments
GROUP BY "by"
ORDER BY comments DESC
LIMIT 10;
```

HN comments are structured, human-authored, and high signal-to-noise — a useful corpus for instruction tuning and dialogue modelling. A typical pipeline looks like:

```text
ladon-hackernews --top 500 --out hn.db
    → export_parquet("hn.db", "hn.parquet")
        → training pipeline
```

Export to Parquet

```python
from ladon_hackernews import export_parquet

export_parquet("hn.db", "hn.parquet")
```
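If you prefer to stay in SQL, DuckDB can produce the same Parquet file with a `COPY` statement — a sketch of an equivalent command, not necessarily how `export_parquet` is implemented:

```sql
COPY hn_comments TO 'hn.parquet' (FORMAT PARQUET);
```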

CLI reference

```shell
ladon-hackernews [--top N] [--out PATH] [--verbose]
```

| flag | default | description |
|---|---|---|
| `--top N` | 30 | Number of top stories to crawl (range: 1–500) |
| `--out PATH` | hn.db | Output DuckDB database path |
| `--verbose`, `-v` | off | Show DEBUG-level framework messages (leaf warnings, HTTP timings) |

In default mode the terminal shows one progress line per story and a summary at the end. Framework-level noise (leaf unavailable, expander branch failed) is suppressed; partial stories print a hint instead. Pass --verbose to expose the raw framework log messages.


Use as a library

The CLI is the simplest way to run a crawl, but HNPlugin and HNDuckDBRepository can be used directly as a library — useful when you need custom scheduling, batching, or integration into a larger pipeline.

```python
import uuid
from datetime import datetime, timezone

from ladon.networking.client import HttpClient
from ladon.networking.config import HttpClientConfig
from ladon.persistence import RunAudit, RunRecord
from ladon.plugins.errors import ExpansionNotReadyError
from ladon.plugins.models import Ref
from ladon.runner import RunConfig, run_crawl
from ladon_hackernews import HNPlugin, HNDuckDBRepository

plugin = HNPlugin(top=10)
config = RunConfig()
client_config = HttpClientConfig(user_agent="my-bot/1.0")

with HNDuckDBRepository("hn.db") as repo, HttpClient(client_config) as client:
    for story_ref in plugin.source.discover(client):
        if not isinstance(story_ref, Ref):
            raise TypeError(f"unexpected type {type(story_ref).__name__}")
        run_id = str(uuid.uuid4())
        run = RunRecord(
            run_id=run_id,
            plugin_name=plugin.name,
            top_ref=story_ref.url,
            started_at=datetime.now(tz=timezone.utc),
            status="running",
        )
        if isinstance(repo, RunAudit):
            repo.record_run(run)

        try:
            result = run_crawl(
                story_ref, plugin, client, config,
                # Default-argument capture binds run_id to each lambda.
                on_leaf=lambda rec, _, _id=run_id: repo.write_leaf(rec, _id),
            )
            run.branch_errors = sum(
                1 for e in result.errors if e.startswith("expander branch")
            )
            run.status = (
                "partial"
                if result.leaves_failed
                or result.leaves_consumed > result.leaves_persisted
                or run.branch_errors
                else "done"
            )
            run.leaves_consumed = result.leaves_consumed
            run.leaves_persisted = result.leaves_persisted
            run.leaves_failed = result.leaves_failed
            run.errors = result.errors
        except ExpansionNotReadyError:
            run.status = "not_ready"
        except Exception as exc:
            run.status = "failed"
            run.errors = (str(exc),)
        finally:
            run.finished_at = datetime.now(tz=timezone.utc)
            if isinstance(repo, RunAudit):
                repo.record_run(run)
```
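The `_id=run_id` default argument in the `on_leaf` lambda is worth pausing on: Python closures bind variables late, so without it every lambda created in the loop would see the last `run_id` value. A standalone illustration of the pitfall and the fix:

```python
# Late binding: all three closures share the loop variable i,
# which is 2 by the time any of them is called.
late = [lambda: i for i in range(3)]
assert [f() for f in late] == [2, 2, 2]

# Default-argument capture: each lambda snapshots i at definition time.
snap = [lambda _i=i: _i for i in range(3)]
assert [f() for f in snap] == [0, 1, 2]
```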

How it works

This adapter implements the Ladon SES (Source / Expander / Sink) protocol against the HN Firebase API.

Pipeline

```mermaid
flowchart TB
    subgraph plugin ["SES Plugin"]
        direction LR
        SRC["HNSource\ndiscover()"] -- "Ref × N" --> EXP["HNExpander\nexpand()"] -- "StoryRecord\nRef × M" --> SNK["HNSink\nconsume()"]
    end
    subgraph persistence ["Persistence"]
        direction LR
        REPO["HNDuckDBRepository"] --> DB[("hn.db")] -- "export_parquet()" --> PQ[("hn.parquet")]
    end
    SNK -- "CommentRecord" --> REPO
```

Domain records

```mermaid
flowchart LR
    SR["StoryRecord\n─────────────\nid · title · url\nby · score · time\ndescendants · comment_ids"]
    CR["CommentRecord\n─────────────\nid · story_id · parent_id\nby · text · time"]

    SR -- "expands into" --> CR
```

SES class map

| Layer | Class | HN API call |
|---|---|---|
| Source | `HNSource` | GET /v0/topstories.json → story ID list |
| Expander | `HNExpander` | GET /v0/item/{story_id}.json → comment refs |
| Sink | `HNSink` | GET /v0/item/{comment_id}.json → `CommentRecord` |

HNDuckDBRepository implements both Repository (leaf persistence) and RunAudit (run history) from ladon.persistence — structurally, with no Ladon base class imported.

Each story is one independent run. The run audit trail lets you resume from the last successful crawl:

```python
last = repo.get_last_run("hackernews")  # most recent "done" run
```

Writing your own adapter

ladon-hackernews is the canonical reference for building a Ladon adapter. See the Ladon documentation and ADR-003 for the full adapter authoring guide.

Key pattern: adapters implement Ladon protocols structurally — no inheritance from any Ladon base class is required. Only RunRecord needs to be imported for RunAudit implementations.
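Structural implementation here means satisfying a `typing.Protocol`-style interface: any object with the right method signatures qualifies, with no subclassing. A generic illustration — the `Repository` shape below is invented for the example, not Ladon's actual interface:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Repository(Protocol):
    """Illustrative protocol; not Ladon's real definition."""

    def write_leaf(self, record: dict, run_id: str) -> None: ...


class InMemoryRepo:
    """Satisfies Repository structurally: no base class inherited."""

    def __init__(self):
        self.rows = []

    def write_leaf(self, record: dict, run_id: str) -> None:
        self.rows.append((record, run_id))


repo = InMemoryRepo()
assert isinstance(repo, Repository)  # structural check via @runtime_checkable
```

Note that `@runtime_checkable` only verifies method presence at `isinstance` time; signatures are checked statically by type checkers such as mypy.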

Development

```shell
git clone https://github.com/MoonyFringers/ladon-hackernews
cd ladon-hackernews
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```

See CONTRIBUTING.md for full guidelines.

License

Apache-2.0 — see LICENSE.

The Ladon core framework is AGPL-3.0-only. ladon-hackernews is Apache-2.0 but has a runtime dependency on Ladon core; review the AGPL terms if you plan to distribute or run this as a network service.
