Async Wikipedia adapter for Ladon — mathematical finance corpus for LLM fine-tuning.

ladon-mimir

License: Apache 2.0 | Python 3.11+

Async Wikipedia adapter for the Ladon crawler framework.

Crawls a Wikipedia category tree via async BFS, fetches full article text through the MediaWiki API, and persists everything to a DuckDB database — ready to export as Parquet for LLM fine-tuning pipelines or downstream analysis.

Built as a first-party reference adapter for Category:Mathematical_finance, but works with any Wikipedia category.

Quick start

pip install ladon-mimir
ladon-mimir --category "Mathematical finance" --out mimir.db

No authentication. No external server. Wikipedia's API is public.

Re-running against the same --out file resumes automatically — already-stored article page IDs are skipped.

What you get

Each run writes two tables to mimir.db:

mimir_articles — one row per article (upserted on page_id):

column         type         description
run_id         TEXT         UUID of the crawl run that last wrote this row
page_id        INTEGER      Wikipedia page ID (primary key)
title          TEXT         Article title
summary        TEXT         First paragraph of the article
full_text      TEXT         Full article text (extract)
categories     TEXT         JSON array of category names
last_modified  TIMESTAMPTZ  Last edit timestamp (UTC)
word_count     INTEGER      Word count of full text
url            TEXT         Canonical Wikipedia URL
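
The "upserted on page_id" behaviour can be illustrated with a small sketch. To stay dependency-free it uses Python's built-in sqlite3 module as a stand-in for DuckDB, with a trimmed three-column table. The SQL statement, the helper name upsert_article, and the sample values are assumptions for illustration, not ladon-mimir's actual code.

```python
import sqlite3

# Stand-in for mimir.db: an in-memory SQLite database with a trimmed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mimir_articles ("
    " page_id INTEGER PRIMARY KEY,"
    " run_id TEXT,"
    " title TEXT)"
)


def upsert_article(page_id: int, run_id: str, title: str) -> None:
    """Insert a row, or overwrite the existing row with the same page_id."""
    conn.execute(
        "INSERT INTO mimir_articles (page_id, run_id, title) VALUES (?, ?, ?) "
        "ON CONFLICT(page_id) DO UPDATE SET "
        "run_id = excluded.run_id, title = excluded.title",
        (page_id, run_id, title),
    )


# Two hypothetical runs write the same page: the table keeps a single row,
# and run_id reflects whichever run wrote it last.
upsert_article(1, "run-a", "Example article")
upsert_article(1, "run-b", "Example article")
rows = conn.execute("SELECT page_id, run_id FROM mimir_articles").fetchall()
print(rows)
```

Keying on page_id rather than title means a renamed article updates its existing row instead of creating a duplicate.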

ladon_runs — one row per crawl run:

column            type         description
run_id            TEXT         UUID for this crawl run
category          TEXT         Root category name
started_at        TIMESTAMPTZ  When the run started (UTC)
finished_at       TIMESTAMPTZ  When the run finished; NULL while running
status            TEXT         running, done, or failed
articles_fetched  INTEGER      Articles successfully saved
articles_failed   INTEGER      Articles that failed to fetch or parse

Sample DuckDB query

-- Longest articles in the corpus
SELECT title, word_count, url
FROM mimir_articles
ORDER BY word_count DESC
LIMIT 10;

Export to Parquet

ladon-mimir --category "Mathematical finance" --out mimir.db --sync

Or from Python:

from ladon_mimir import export_parquet

count = export_parquet("mimir.db", "mimir.parquet")
print(f"Exported {count} articles")

Note: The categories column is exported as a JSON-encoded VARCHAR. Parse it with json.loads in Python, or with DuckDB's JSON functions such as json_extract.

CLI reference

ladon-mimir --category NAME [options]

flag                     default        description
--category NAME          required       Wikipedia category name without the Category: prefix
--out PATH               mimir.db       Output DuckDB database path
--concurrency N          10             Maximum concurrent article fetches
--depth N                2              BFS depth for sub-category traversal
--limit N                0 (unlimited)  Maximum articles to fetch
--exclude-category NAME                 Sub-category to prune from BFS (repeatable)
--sync                   off            Export to <out>.parquet after crawl
--dry-run                off            Print what would be done without crawling
--verbose, -v            off            Show DEBUG-level framework messages
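
To make the flag semantics concrete, here is a hypothetical argparse reconstruction of the table above. It is not the real CLI source. It only shows how the documented defaults and the repeatable --exclude-category flag could be modelled, and the function name build_parser is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction from the flag table, not ladon-mimir's own parser.
    p = argparse.ArgumentParser(prog="ladon-mimir")
    p.add_argument("--category", required=True, metavar="NAME")
    p.add_argument("--out", default="mimir.db", metavar="PATH")
    p.add_argument("--concurrency", type=int, default=10, metavar="N")
    p.add_argument("--depth", type=int, default=2, metavar="N")
    p.add_argument("--limit", type=int, default=0, metavar="N")  # 0 = unlimited
    # action="append" collects every occurrence of a repeatable flag into a list.
    p.add_argument("--exclude-category", action="append", default=[], metavar="NAME")
    p.add_argument("--sync", action="store_true")
    p.add_argument("--dry-run", action="store_true")
    p.add_argument("--verbose", "-v", action="store_true")
    return p


args = build_parser().parse_args(
    [
        "--category", "Mathematical finance",
        "--exclude-category", "Stubs",
        "--exclude-category", "Mathematical finance stubs",
    ]
)
print(args.depth, args.exclude_category)
```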

Examples

# Crawl with depth 3, 5 concurrent fetches, cap at 500 articles
ladon-mimir --category "Mathematical finance" --depth 3 --concurrency 5 --limit 500

# Crawl and immediately export to Parquet
ladon-mimir --category "Mathematical finance" --out mimir.db --sync

# Exclude noisy sub-categories
ladon-mimir --category "Mathematical finance" \
    --exclude-category "Stubs" \
    --exclude-category "Mathematical finance stubs"

# Preview what would run without touching the network
ladon-mimir --category "Mathematical finance" --dry-run

Use as a library

import asyncio
from ladon import async_run_crawl
from ladon.networking import AsyncHttpClient
from ladon.networking.config import HttpClientConfig
from ladon.runner import RunConfig

from ladon_mimir import MimirPlugin, export_parquet
from ladon_mimir.models import ArticleRecord, CategoryRecord
from ladon_mimir.repository import MimirRepository

async def crawl(category: str, db_path: str) -> None:
    config = RunConfig(async_concurrency=10)
    client_config = HttpClientConfig(
        user_agent="my-bot/1.0",
        min_request_interval_seconds=0.2,
    )

    with MimirRepository(db_path) as repo:
        existing_ids = repo.get_existing_page_ids()
        repo.start_run(category)

        plugin = MimirPlugin(
            category=category,
            max_depth=2,
            skip_page_ids=existing_ids,  # resume: skip already-stored articles
        )

        async def on_leaf(record: object, parent: object) -> None:
            if isinstance(record, ArticleRecord) and isinstance(parent, CategoryRecord):
                repo.save_article(record, parent)

        try:
            async with AsyncHttpClient(client_config) as client:
                root_refs = await plugin.source.discover(client)
                result = await async_run_crawl(
                    root_refs[0], plugin, client, config, on_leaf=on_leaf
                )
        except Exception:
            repo.finish_run("failed")  # do not leave the run marked as "running"
            raise
        repo.finish_run("done")

        print(f"Saved {result.leaves_persisted} articles, {result.leaves_failed} failed")

asyncio.run(crawl("Mathematical finance", "mimir.db"))
export_parquet("mimir.db", "mimir.parquet")

How it works

ladon-mimir implements the Ladon SES (Source / Expander / Sink) protocol against the MediaWiki Action API.

Pipeline

flowchart TB
    subgraph plugin ["Async SES Plugin"]
        direction LR
        SRC["WikiCategorySource\ndiscover()"] -- "CategoryRef × 1" --> EXP["WikiCategoryExpander\nexpand() BFS"] -- "CategoryRecord\nArticleRef × N" --> SNK["WikiArticleSink\nconsume()"]
    end
    subgraph persistence ["Persistence"]
        direction LR
        REPO["MimirRepository"] --> DB[("mimir.db")] -- "export_parquet()" --> PQ[("mimir.parquet")]
    end
    SNK -- "ArticleRecord" --> REPO

SES class map

Layer     Class                 MediaWiki API call
Source    WikiCategorySource    none (returns the root CategoryRef directly)
Expander  WikiCategoryExpander  action=query&list=categorymembers (BFS, paginated)
Sink      WikiArticleSink       action=query&prop=extracts|categories|info

The expander performs async BFS with asyncio.gather at each depth level — all sub-categories at a given depth are fetched concurrently. Article refs are deduplicated by page_id across the entire traversal; already-stored IDs are skipped for resume.
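
The traversal described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not WikiCategoryExpander's implementation: an in-memory GRAPH stands in for the API, fetch_members is a hypothetical stub, and the root category is counted as depth 0.

```python
import asyncio

# Toy category graph standing in for the MediaWiki API (hypothetical data).
# Each category maps to (sub-categories, article page IDs).
GRAPH: dict[str, tuple[list[str], list[int]]] = {
    "Root": (["A", "B"], [1, 2]),
    "A": (["C"], [2, 3]),
    "B": ([], [4]),
    "C": ([], [5]),
}


async def fetch_members(category: str) -> tuple[list[str], list[int]]:
    """Pretend network call: return (sub-categories, page IDs) of a category."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return GRAPH.get(category, ([], []))


async def bfs(root: str, max_depth: int, skip_page_ids: set[int]) -> list[int]:
    seen_pages = set(skip_page_ids)  # already-stored IDs count as seen (resume)
    seen_categories = {root}
    frontier = [root]
    found: list[int] = []
    for _depth in range(max_depth + 1):  # assumption: the root is depth 0
        if not frontier:
            break
        # Fetch every category at this depth concurrently.
        results = await asyncio.gather(*(fetch_members(c) for c in frontier))
        next_frontier: list[str] = []
        for subcats, page_ids in results:
            for pid in page_ids:
                if pid not in seen_pages:  # dedupe across the whole traversal
                    seen_pages.add(pid)
                    found.append(pid)
            for sub in subcats:
                if sub not in seen_categories:
                    seen_categories.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return found


print(asyncio.run(bfs("Root", max_depth=1, skip_page_ids={4})))
```

Seeding the seen set with the stored IDs lets one membership check handle both resume and in-run deduplication.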

Development

git clone https://github.com/MoonyFringers/ladon-mimir
cd ladon-mimir
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

License

Apache-2.0 — see LICENSE.

The Ladon core framework is AGPL-3.0-only. ladon-mimir is Apache-2.0 but has a runtime dependency on Ladon core; review the AGPL terms if you plan to distribute or run this as a network service.
