
ladon-hackernews

Requires Python 3.11+. Licensed under Apache-2.0.

Hacker News adapter for the Ladon crawler framework.

Crawls the HN top-stories list, expands each story into its direct comments, and persists everything to a DuckDB database — ready to export as Parquet for LLM training pipelines or downstream analysis.

Quick start

```shell
pip install ladon-hackernews
ladon-hackernews --top 30 --out hn.db
```

No authentication. No external server.

What you get

Each run writes two tables to hn.db:

hn_comments — one row per HN comment (the leaf record):

| column | type | description |
|---|---|---|
| `id` | INTEGER | HN comment ID |
| `story_id` | INTEGER | Parent story ID |
| `parent_id` | INTEGER | Immediate parent (story or comment) |
| `by` | TEXT | Author username |
| `text` | TEXT | Raw HTML comment body |
| `time` | TIMESTAMPTZ | Comment timestamp (UTC) |
| `run_id` | TEXT | UUID of the crawl run that wrote this row |
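Because `text` holds the raw HTML body, you will usually want to strip tags before feeding comments into a training pipeline. A minimal sketch using only the standard library (the `strip_html` helper is ours, not part of the package):

```python
from html.parser import HTMLParser
from io import StringIO


class _TextExtractor(HTMLParser):
    """Collects character data, discarding tags and resolving entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._buf = StringIO()

    def handle_data(self, data):
        self._buf.write(data)

    def text(self):
        return self._buf.getvalue()


def strip_html(raw: str) -> str:
    """Strip tags from an HN comment body and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.text().split())
```

For example, `strip_html("<p>Hello <i>world</i> &amp; friends</p>")` yields `"Hello world & friends"`.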

ladon_runs — one row per story crawled (upserted twice: at start and finish):

| column | type | description |
|---|---|---|
| `run_id` | TEXT | UUID for this crawl run |
| `plugin_name` | TEXT | Always "hackernews" |
| `top_ref` | TEXT | HN item URL that was the root of this run |
| `started_at` | TIMESTAMPTZ | When the run started (UTC) |
| `finished_at` | TIMESTAMPTZ | When the run finished; NULL while running |
| `status` | TEXT | `done`, `partial`, `not_ready`, `failed`, or `running` |
| `leaves_consumed` | INTEGER | Comments for which `sink.consume()` succeeded |
| `leaves_persisted` | INTEGER | Comments successfully written to `hn_comments` |
| `leaves_failed` | INTEGER | Comments that failed to fetch or persist |
| `branch_errors` | INTEGER | Expander-level errors (branch could not be expanded) |
| `errors` | TEXT | JSON array of error message strings |
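Because `ladon_runs` records a per-story outcome, crawl health is one query away. A sketch using the column names documented above:

```sql
-- Runs that did not finish cleanly, most recent first
SELECT run_id, top_ref, status, leaves_failed, branch_errors
FROM ladon_runs
WHERE status <> 'done'
ORDER BY started_at DESC;
```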

Sample DuckDB query

```sql
-- Top commenters across all crawled stories
SELECT "by", COUNT(*) AS comments
FROM hn_comments
GROUP BY "by"
ORDER BY comments DESC
LIMIT 10;
```

HN comments are structured, human-authored, and high signal-to-noise — a useful corpus for instruction tuning and dialogue modelling. A typical pipeline looks like:

```text
ladon-hackernews --top 500 --out hn.db
    → export_parquet("hn.db", "hn.parquet")
        → training pipeline
```

Export to Parquet

```python
from ladon_hackernews import export_parquet

export_parquet("hn.db", "hn.parquet")
```
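If you prefer to stay in SQL, DuckDB can produce the same Parquet file with a `COPY` statement — a sketch of an equivalent command, not necessarily how `export_parquet` is implemented:

```sql
COPY hn_comments TO 'hn.parquet' (FORMAT PARQUET);
```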

CLI reference

```shell
ladon-hackernews [--top N] [--out PATH] [--verbose]
```

| flag | default | description |
|---|---|---|
| `--top N` | 30 | Number of top stories to crawl (range: 1–500) |
| `--out PATH` | hn.db | Output DuckDB database path |
| `--verbose`, `-v` | off | Show DEBUG-level framework messages (leaf warnings, HTTP timings) |

In default mode the terminal shows one progress line per story and a summary at the end. Framework-level noise (leaf unavailable, expander branch failed) is suppressed; partial stories print a hint instead. Pass --verbose to expose the raw framework log messages.


Use as a library

The CLI is the simplest way to run a crawl, but HNPlugin and HNDuckDBRepository can be used directly as a library — useful when you need custom scheduling, batching, or integration into a larger pipeline.

```python
import uuid
from datetime import datetime, timezone

from ladon.networking.client import HttpClient
from ladon.networking.config import HttpClientConfig
from ladon.persistence import RunAudit, RunRecord
from ladon.plugins.errors import ExpansionNotReadyError
from ladon.plugins.models import Ref
from ladon.runner import RunConfig, run_crawl
from ladon_hackernews import HNPlugin, HNDuckDBRepository

plugin = HNPlugin(top=10)
config = RunConfig()
client_config = HttpClientConfig(user_agent="my-bot/1.0")

with HNDuckDBRepository("hn.db") as repo, HttpClient(client_config) as client:
    for story_ref in plugin.source.discover(client):
        if not isinstance(story_ref, Ref):
            raise TypeError(f"unexpected type {type(story_ref).__name__}")
        run_id = str(uuid.uuid4())
        run = RunRecord(
            run_id=run_id,
            plugin_name=plugin.name,
            top_ref=story_ref.url,
            started_at=datetime.now(tz=timezone.utc),
            status="running",
        )
        if isinstance(repo, RunAudit):
            repo.record_run(run)

        try:
            result = run_crawl(
                story_ref, plugin, client, config,
                # Default-argument capture binds run_id to each lambda.
                on_leaf=lambda rec, _, _id=run_id: repo.write_leaf(rec, _id),
            )
            run.branch_errors = sum(
                1 for e in result.errors if e.startswith("expander branch")
            )
            run.status = (
                "partial"
                if result.leaves_failed
                or result.leaves_consumed > result.leaves_persisted
                or run.branch_errors
                else "done"
            )
            run.leaves_consumed = result.leaves_consumed
            run.leaves_persisted = result.leaves_persisted
            run.leaves_failed = result.leaves_failed
            run.errors = result.errors
        except ExpansionNotReadyError:
            run.status = "not_ready"
        except Exception as exc:
            run.status = "failed"
            run.errors = (str(exc),)
        finally:
            run.finished_at = datetime.now(tz=timezone.utc)
            if isinstance(repo, RunAudit):
                repo.record_run(run)
```
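The `_id=run_id` default argument in the `on_leaf` lambda is worth pausing on: Python closures bind variables late, so without it every lambda created in the loop would see the last `run_id` value. A standalone illustration of the pitfall and the fix:

```python
# Late binding: all three closures share the loop variable i,
# which is 2 by the time any of them is called.
late = [lambda: i for i in range(3)]
assert [f() for f in late] == [2, 2, 2]

# Default-argument capture: each lambda snapshots i at definition time.
snap = [lambda _i=i: _i for i in range(3)]
assert [f() for f in snap] == [0, 1, 2]
```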

How it works

This adapter implements the Ladon SES (Source / Expander / Sink) protocol against the HN Firebase API.

Pipeline

```mermaid
flowchart TB
    subgraph plugin ["SES Plugin"]
        direction LR
        SRC["HNSource\ndiscover()"] -- "Ref × N" --> EXP["HNExpander\nexpand()"] -- "StoryRecord\nRef × M" --> SNK["HNSink\nconsume()"]
    end
    subgraph persistence ["Persistence"]
        direction LR
        REPO["HNDuckDBRepository"] --> DB[("hn.db")] -- "export_parquet()" --> PQ[("hn.parquet")]
    end
    SNK -- "CommentRecord" --> REPO
```

Domain records

```mermaid
flowchart LR
    SR["StoryRecord\n─────────────\nid · title · url\nby · score · time\ndescendants · comment_ids"]
    CR["CommentRecord\n─────────────\nid · story_id · parent_id\nby · text · time"]

    SR -- "expands into" --> CR
```

SES class map

| Layer | Class | HN API call |
|---|---|---|
| Source | `HNSource` | GET /v0/topstories.json → story ID list |
| Expander | `HNExpander` | GET /v0/item/{story_id}.json → comment refs |
| Sink | `HNSink` | GET /v0/item/{comment_id}.json → `CommentRecord` |

HNDuckDBRepository implements both Repository (leaf persistence) and RunAudit (run history) from ladon.persistence — structurally, with no Ladon base class imported.

Each story is one independent run. The run audit trail lets you resume from the last successful crawl:

```python
last = repo.get_last_run("hackernews")  # most recent "done" run
```

Writing your own adapter

ladon-hackernews is the canonical reference for building a Ladon adapter. See the Ladon documentation and ADR-003 for the full adapter authoring guide.

Key pattern: adapters implement Ladon protocols structurally — no inheritance from any Ladon base class is required. Only RunRecord needs to be imported for RunAudit implementations.
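Structural implementation here means satisfying a `typing.Protocol`-style interface: any object with the right method signatures qualifies, with no subclassing. A generic illustration — the `Repository` shape below is invented for the example, not Ladon's actual interface:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Repository(Protocol):
    """Illustrative protocol; not Ladon's real definition."""

    def write_leaf(self, record: dict, run_id: str) -> None: ...


class InMemoryRepo:
    """Satisfies Repository structurally: no base class inherited."""

    def __init__(self):
        self.rows = []

    def write_leaf(self, record: dict, run_id: str) -> None:
        self.rows.append((record, run_id))


repo = InMemoryRepo()
assert isinstance(repo, Repository)  # structural check via @runtime_checkable
```

Note that `@runtime_checkable` only verifies method presence at `isinstance` time; signatures are checked statically by type checkers such as mypy.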

Development

```shell
git clone https://github.com/MoonyFringers/ladon-hackernews
cd ladon-hackernews
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```

See CONTRIBUTING.md for full guidelines.

License

Apache-2.0 — see LICENSE.

The Ladon core framework is AGPL-3.0-only. ladon-hackernews is Apache-2.0 but has a runtime dependency on Ladon core; review the AGPL terms if you plan to distribute or run this as a network service.
