Async Wikipedia adapter for Ladon — mathematical finance corpus for LLM fine-tuning.

ladon-mimir

License: Apache 2.0 | Python 3.11+

Async Wikipedia adapter for the Ladon crawler framework.

Crawls a Wikipedia category tree via async BFS, fetches full article text through the MediaWiki API, and persists the articles and per-run metadata to a DuckDB database that can be exported as Parquet for LLM fine-tuning pipelines or downstream analysis.

Built as a first-party reference adapter for Category:Mathematical_finance, but works with any Wikipedia category.

Quick start

pip install ladon-mimir
ladon-mimir --category "Mathematical finance" --out mimir.db

No authentication. No external server. Wikipedia's API is public.

Re-running against the same --out file resumes automatically — already-stored article page IDs are skipped.

What you get

Each run writes two tables to mimir.db:

mimir_articles — one row per article (upserted on page_id):

| column | type | description |
|---|---|---|
| run_id | TEXT | UUID of the crawl run that last wrote this row |
| page_id | INTEGER | Wikipedia page ID (primary key) |
| title | TEXT | Article title |
| summary | TEXT | First paragraph of the article |
| full_text | TEXT | Full article text (extract) |
| categories | TEXT | JSON array of category names |
| last_modified | TIMESTAMPTZ | Last edit timestamp (UTC) |
| word_count | INTEGER | Word count of full text |
| url | TEXT | Canonical Wikipedia URL |
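
The upsert on page_id can be illustrated with a small sketch. It uses Python's built-in sqlite3 as a stand-in for DuckDB (an assumption made only so the example runs without extra dependencies) and a reduced three-column table with made-up values; the adapter's actual SQL and full schema may differ.

```python
import sqlite3

# In-memory SQLite database as a stand-in for mimir.db (assumption: DuckDB
# is what the adapter really uses; SQLite is only for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mimir_articles ("
    "page_id INTEGER PRIMARY KEY, run_id TEXT, title TEXT)"
)


def upsert_article(page_id: int, run_id: str, title: str) -> None:
    """Insert a row, or overwrite the existing row with the same page_id."""
    conn.execute(
        "INSERT INTO mimir_articles (page_id, run_id, title) VALUES (?, ?, ?) "
        "ON CONFLICT (page_id) DO UPDATE SET "
        "run_id = excluded.run_id, title = excluded.title",
        (page_id, run_id, title),
    )


# Two runs write the same (hypothetical) page: one row remains,
# and run_id reflects the run that wrote it last.
upsert_article(1, "run-a", "Example article")
upsert_article(1, "run-b", "Example article")

rows = conn.execute(
    "SELECT page_id, run_id, title FROM mimir_articles"
).fetchall()
print(rows)
```

Because page_id is the primary key, a second write for the same article replaces the row instead of duplicating it, which is why run_id is described as the run that last wrote the row.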

ladon_runs — one row per crawl run:

| column | type | description |
|---|---|---|
| run_id | TEXT | UUID for this crawl run |
| category | TEXT | Root category name |
| started_at | TIMESTAMPTZ | When the run started (UTC) |
| finished_at | TIMESTAMPTZ | When the run finished; NULL while running |
| status | TEXT | running, done, or failed |
| articles_fetched | INTEGER | Articles successfully saved |
| articles_failed | INTEGER | Articles that failed to fetch or parse |

Sample DuckDB query

-- Longest articles in the corpus
SELECT title, word_count, url
FROM mimir_articles
ORDER BY word_count DESC
LIMIT 10;

Export to Parquet

ladon-mimir --category "Mathematical finance" --out mimir.db --sync

Or from Python:

from ladon_mimir import export_parquet

count = export_parquet("mimir.db", "mimir.parquet")
print(f"Exported {count} articles")

Note: The categories column is exported as a JSON-encoded VARCHAR. Parse it with json.loads in Python, or with DuckDB's JSON functions such as json_extract or from_json.

CLI reference

ladon-mimir --category NAME [options]
| flag | default | description |
|---|---|---|
| --category NAME | required | Wikipedia category name without the Category: prefix |
| --out PATH | mimir.db | Output DuckDB database path |
| --concurrency N | 10 | Maximum concurrent article fetches |
| --depth N | 2 | BFS depth for sub-category traversal |
| --limit N | 0 (unlimited) | Maximum articles to fetch |
| --exclude-category NAME | | Sub-category to prune from BFS (repeatable) |
| --sync | off | Export to <out>.parquet after crawl |
| --dry-run | off | Print what would be done without crawling |
| --verbose, -v | off | Show DEBUG-level framework messages |

Examples

# Crawl with depth 3, 5 concurrent fetches, cap at 500 articles
ladon-mimir --category "Mathematical finance" --depth 3 --concurrency 5 --limit 500

# Crawl and immediately export to Parquet
ladon-mimir --category "Mathematical finance" --out mimir.db --sync

# Exclude noisy sub-categories
ladon-mimir --category "Mathematical finance" \
    --exclude-category "Stubs" \
    --exclude-category "Mathematical finance stubs"

# Preview what would run without touching the network
ladon-mimir --category "Mathematical finance" --dry-run
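
For readers who want to see how the flags and defaults fit together, here is a rough argparse sketch that mirrors the CLI reference. It is an illustration only: the real ladon-mimir entry point may be built differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="ladon-mimir")
    parser.add_argument("--category", required=True, metavar="NAME")
    parser.add_argument("--out", default="mimir.db", metavar="PATH")
    parser.add_argument("--concurrency", type=int, default=10, metavar="N")
    parser.add_argument("--depth", type=int, default=2, metavar="N")
    # 0 means "no limit", as in the CLI reference.
    parser.add_argument("--limit", type=int, default=0, metavar="N")
    # action="append" is what makes a flag repeatable.
    parser.add_argument(
        "--exclude-category", action="append", default=[], metavar="NAME"
    )
    parser.add_argument("--sync", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    return parser


args = build_parser().parse_args(
    [
        "--category", "Mathematical finance",
        "--exclude-category", "Stubs",
        "--exclude-category", "Mathematical finance stubs",
        "--depth", "3",
    ]
)
print(args.category, args.depth, args.exclude_category)
```

Each --exclude-category occurrence appends to the same list, so the two exclusions above end up in args.exclude_category in the order given.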

Use as a library

import asyncio
from ladon import async_run_crawl
from ladon.networking import AsyncHttpClient
from ladon.networking.config import HttpClientConfig
from ladon.runner import RunConfig

from ladon_mimir import MimirPlugin, export_parquet
from ladon_mimir.models import ArticleRecord, CategoryRecord
from ladon_mimir.repository import MimirRepository

async def crawl(category: str, db_path: str) -> None:
    config = RunConfig(async_concurrency=10)
    client_config = HttpClientConfig(
        user_agent="my-bot/1.0",
        min_request_interval_seconds=0.2,
    )

    with MimirRepository(db_path) as repo:
        existing_ids = repo.get_existing_page_ids()
        repo.start_run(category)

        plugin = MimirPlugin(
            category=category,
            max_depth=2,
            skip_page_ids=existing_ids,  # resume: skip already-stored articles
        )

        async def on_leaf(record: object, parent: object) -> None:
            if isinstance(record, ArticleRecord) and isinstance(parent, CategoryRecord):
                repo.save_article(record, parent)

        async with AsyncHttpClient(client_config) as client:
            root_refs = await plugin.source.discover(client)
            result = await async_run_crawl(
                root_refs[0], plugin, client, config, on_leaf=on_leaf
            )
            repo.finish_run("done")

        print(f"Saved {result.leaves_persisted} articles, {result.leaves_failed} failed")

asyncio.run(crawl("Mathematical finance", "mimir.db"))
export_parquet("mimir.db", "mimir.parquet")
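
The async_concurrency setting above caps how many article fetches run at once. One common way to get that behavior in asyncio is a semaphore, sketched below with a fake fetch function. This is not taken from Ladon's source; fetch_all, fake_fetch, and the counters are hypothetical names used only to demonstrate the idea.

```python
import asyncio

in_flight = 0  # fetches currently running
peak = 0       # highest number of simultaneous fetches observed


async def fake_fetch(page_id: int) -> str:
    """Stand-in for a network call; records how many calls overlap."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate I/O latency
    in_flight -= 1
    return f"article-{page_id}"


async def fetch_all(page_ids, fetch, concurrency: int = 10) -> list[str]:
    """Run fetch() for every page ID with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(page_id: int) -> str:
        async with semaphore:
            return await fetch(page_id)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(bounded(pid) for pid in page_ids))


results = asyncio.run(fetch_all(range(20), fake_fetch, concurrency=5))
print(len(results), "fetched; peak concurrency:", peak)
```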

How it works

ladon-mimir implements the Ladon SES (Source / Expander / Sink) protocol against the MediaWiki Action API.

Pipeline

flowchart TB
    subgraph plugin ["Async SES Plugin"]
        direction LR
        SRC["WikiCategorySource\ndiscover()"] -- "CategoryRef × 1" --> EXP["WikiCategoryExpander\nexpand() BFS"] -- "CategoryRecord\nArticleRef × N" --> SNK["WikiArticleSink\nconsume()"]
    end
    subgraph persistence ["Persistence"]
        direction LR
        REPO["MimirRepository"] --> DB[("mimir.db")] -- "export_parquet()" --> PQ[("mimir.parquet")]
    end
    SNK -- "ArticleRecord" --> REPO

SES class map

| Layer | Class | MediaWiki API call |
|---|---|---|
| Source | WikiCategorySource | none (returns the root CategoryRef directly) |
| Expander | WikiCategoryExpander | action=query&list=categorymembers (BFS, paginated) |
| Sink | WikiArticleSink | action=query&prop=extracts\|categories\|info |

The expander performs async BFS with asyncio.gather at each depth level — all sub-categories at a given depth are fetched concurrently. Article refs are deduplicated by page_id across the entire traversal; already-stored IDs are skipped for resume.
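
The traversal described above can be sketched as a level-by-level BFS. The example below replaces the MediaWiki API with a hypothetical in-memory tree (TREE, fetch_members, and bfs are illustrative names, not ladon-mimir APIs), and it assumes that depth 0 means "root category only"; the real expander may count depth differently and also handles pagination, which is omitted here.

```python
import asyncio

# Hypothetical category tree standing in for the categorymembers API.
TREE = {
    "Root": {"subcats": ["A", "B"], "pages": [1, 2]},
    "A": {"subcats": ["C"], "pages": [2, 3]},
    "B": {"subcats": [], "pages": [4]},
    "C": {"subcats": [], "pages": [5]},
}


async def fetch_members(category: str) -> tuple[list[str], list[int]]:
    """Pretend API call: return (sub-categories, article page IDs)."""
    await asyncio.sleep(0)  # yield to the event loop, like real I/O would
    node = TREE[category]
    return node["subcats"], node["pages"]


async def bfs(root: str, max_depth: int, skip_page_ids=frozenset()) -> list[int]:
    """Collect new article page IDs, one BFS level at a time."""
    seen_categories = {root}
    seen_pages = set(skip_page_ids)  # already-stored IDs are never re-emitted
    articles: list[int] = []
    frontier = [root]
    for _depth in range(max_depth + 1):
        if not frontier:
            break
        # All categories at this depth are fetched concurrently.
        results = await asyncio.gather(*(fetch_members(c) for c in frontier))
        next_frontier: list[str] = []
        for subcats, pages in results:
            for page_id in pages:
                if page_id not in seen_pages:  # dedupe across the whole crawl
                    seen_pages.add(page_id)
                    articles.append(page_id)
            for sub in subcats:
                if sub not in seen_categories:
                    seen_categories.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return articles


fresh = asyncio.run(bfs("Root", max_depth=2))
resumed = asyncio.run(bfs("Root", max_depth=2, skip_page_ids={1, 4}))
print(fresh, resumed)
```

Seeding the seen-pages set with stored IDs is what makes a resumed crawl skip articles that are already in the database, while still walking the full category tree.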

Development

git clone https://github.com/MoonyFringers/ladon-mimir
cd ladon-mimir
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

License

Apache-2.0 — see LICENSE.

The Ladon core framework is AGPL-3.0-only. ladon-mimir is Apache-2.0 but has a runtime dependency on Ladon core; review the AGPL terms if you plan to distribute or run this as a network service.
