Async Wikipedia adapter for Ladon — mathematical finance corpus for LLM fine-tuning.
ladon-mimir
Async Wikipedia adapter for the Ladon crawler framework.
Crawls a Wikipedia category tree via async BFS, fetches full article text through the MediaWiki API, and persists everything to a DuckDB database — ready to export as Parquet for LLM fine-tuning pipelines or downstream analysis.
Built as a first-party reference adapter for Category:Mathematical_finance,
but works with any Wikipedia category.
Quick start
pip install ladon-mimir
ladon-mimir --category "Mathematical finance" --out mimir.db
No authentication. No external server. Wikipedia's API is public.
Re-running against the same --out file resumes automatically — already-stored
article page IDs are skipped.
What you get
Each run writes two tables to mimir.db:
mimir_articles — one row per article (upserted on page_id):
| column | type | description |
|---|---|---|
| `run_id` | TEXT | UUID of the crawl run that last wrote this row |
| `page_id` | INTEGER | Wikipedia page ID (primary key) |
| `title` | TEXT | Article title |
| `summary` | TEXT | First paragraph of the article |
| `full_text` | TEXT | Full article text (extract) |
| `categories` | TEXT | JSON array of category names |
| `last_modified` | TIMESTAMPTZ | Last edit timestamp (UTC) |
| `word_count` | INTEGER | Word count of full text |
| `url` | TEXT | Canonical Wikipedia URL |
ladon_runs — one row per crawl run:
| column | type | description |
|---|---|---|
| `run_id` | TEXT | UUID for this crawl run |
| `category` | TEXT | Root category name |
| `started_at` | TIMESTAMPTZ | When the run started (UTC) |
| `finished_at` | TIMESTAMPTZ | When the run finished; NULL while running |
| `status` | TEXT | `running`, `done`, or `failed` |
| `articles_fetched` | INTEGER | Articles successfully saved |
| `articles_failed` | INTEGER | Articles that failed to fetch or parse |
Sample DuckDB query
-- Longest articles in the corpus
SELECT title, word_count, url
FROM mimir_articles
ORDER BY word_count DESC
LIMIT 10;
Export to Parquet
ladon-mimir --category "Mathematical finance" --out mimir.db --sync
Or from Python:
from ladon_mimir import export_parquet
count = export_parquet("mimir.db", "mimir.parquet")
print(f"Exported {count} articles")
Note: The `categories` column is exported as a JSON-encoded `VARCHAR`. Parse it with `json.loads` or DuckDB's `json_extract` / `json_array_elements`.
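As a minimal illustration of the note above, the sketch below decodes one `categories` value in Python. The sample string is made up for the example; real values come from the `categories` column of the exported file.

```python
import json

# Hypothetical value as it might appear in the exported `categories` column:
# a JSON array serialized to a string.
raw_categories = '["Mathematical finance", "Stochastic calculus"]'

# json.loads turns the JSON text back into a Python list of category names.
categories = json.loads(raw_categories)

for name in categories:
    print(name)
```

The same call can be mapped over the whole column after loading the Parquet file with the reader of your choice.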
CLI reference
ladon-mimir --category NAME [options]
| flag | default | description |
|---|---|---|
| `--category NAME` | required | Wikipedia category name without the `Category:` prefix |
| `--out PATH` | `mimir.db` | Output DuckDB database path |
| `--concurrency N` | `10` | Maximum concurrent article fetches |
| `--depth N` | `2` | BFS depth for sub-category traversal |
| `--limit N` | `0` (unlimited) | Maximum articles to fetch |
| `--exclude-category NAME` | none | Sub-category to prune from BFS (repeatable) |
| `--sync` | off | Export to `<out>.parquet` after crawl |
| `--dry-run` | off | Print what would be done without crawling |
| `--verbose`, `-v` | off | Show DEBUG-level framework messages |
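To make the flag semantics concrete, here is a sketch of how the table above could be declared with `argparse`. It is not taken from the ladon-mimir source: `build_parser` is a hypothetical helper, and only the flag names and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="ladon-mimir")
    parser.add_argument("--category", required=True, metavar="NAME")
    parser.add_argument("--out", default="mimir.db", metavar="PATH")
    parser.add_argument("--concurrency", type=int, default=10, metavar="N")
    parser.add_argument("--depth", type=int, default=2, metavar="N")
    # 0 is documented as "unlimited".
    parser.add_argument("--limit", type=int, default=0, metavar="N")
    # action="append" lets the flag be repeated, collecting values in a list.
    parser.add_argument(
        "--exclude-category", action="append", default=[], metavar="NAME"
    )
    parser.add_argument("--sync", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    return parser


# Parse a sample command line instead of sys.argv so the sketch is runnable.
args = build_parser().parse_args(
    ["--category", "Mathematical finance", "--exclude-category", "Stubs", "--dry-run"]
)
print(args.category, args.depth, args.exclude_category, args.dry_run)
```

Flags that are omitted fall back to the defaults listed in the table, and `--exclude-category` accumulates one entry per occurrence.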
Examples
# Crawl with depth 3, 5 concurrent fetches, cap at 500 articles
ladon-mimir --category "Mathematical finance" --depth 3 --concurrency 5 --limit 500
# Crawl and immediately export to Parquet
ladon-mimir --category "Mathematical finance" --out mimir.db --sync
# Exclude noisy sub-categories
ladon-mimir --category "Mathematical finance" \
    --exclude-category "Stubs" \
    --exclude-category "Mathematical finance stubs"
# Preview what would run without touching the network
ladon-mimir --category "Mathematical finance" --dry-run
Use as a library
import asyncio

from ladon import async_run_crawl
from ladon.networking import AsyncHttpClient
from ladon.networking.config import HttpClientConfig
from ladon.runner import RunConfig
from ladon_mimir import MimirPlugin, export_parquet
from ladon_mimir.models import ArticleRecord, CategoryRecord
from ladon_mimir.repository import MimirRepository


async def crawl(category: str, db_path: str) -> None:
    config = RunConfig(async_concurrency=10)
    client_config = HttpClientConfig(
        user_agent="my-bot/1.0",
        min_request_interval_seconds=0.2,
    )

    with MimirRepository(db_path) as repo:
        existing_ids = repo.get_existing_page_ids()
        repo.start_run(category)

        plugin = MimirPlugin(
            category=category,
            max_depth=2,
            skip_page_ids=existing_ids,  # resume: skip already-stored articles
        )

        async def on_leaf(record: object, parent: object) -> None:
            if isinstance(record, ArticleRecord) and isinstance(parent, CategoryRecord):
                repo.save_article(record, parent)

        async with AsyncHttpClient(client_config) as client:
            root_refs = await plugin.source.discover(client)
            result = await async_run_crawl(
                root_refs[0], plugin, client, config, on_leaf=on_leaf
            )

        repo.finish_run("done")
        print(f"Saved {result.leaves_persisted} articles, {result.leaves_failed} failed")


asyncio.run(crawl("Mathematical finance", "mimir.db"))
export_parquet("mimir.db", "mimir.parquet")
How it works
ladon-mimir implements the Ladon SES (Source / Expander / Sink) protocol against the MediaWiki Action API.
Pipeline
flowchart TB
    subgraph plugin ["Async SES Plugin"]
        direction LR
        SRC["WikiCategorySource\ndiscover()"] -- "CategoryRef × 1" --> EXP["WikiCategoryExpander\nexpand() BFS"] -- "CategoryRecord\nArticleRef × N" --> SNK["WikiArticleSink\nconsume()"]
    end
    subgraph persistence ["Persistence"]
        direction LR
        REPO["MimirRepository"] --> DB[("mimir.db")] -- "export_parquet()" --> PQ[("mimir.parquet")]
    end
    SNK -- "ArticleRecord" --> REPO
SES class map
| Layer | Class | MediaWiki API call |
|---|---|---|
| Source | `WikiCategorySource` | none (returns the root `CategoryRef` directly) |
| Expander | `WikiCategoryExpander` | `action=query&list=categorymembers` (BFS, paginated) |
| Sink | `WikiArticleSink` | `action=query&prop=extracts\|categories\|info` |
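The two API calls in the table can be sketched as plain URL builders. This is an illustration only, not the adapter's code: the helper names, the English Wikipedia endpoint, and the extra parameters (`cmtype`, `cmlimit`, `explaintext`, `format`) are assumptions, chosen from the standard MediaWiki query parameters.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category: str, cmcontinue: Optional[str] = None) -> str:
    """Build one list=categorymembers request (one page of results)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page|subcat",  # articles and sub-categories
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue is not None:
        # Continuation token taken from the previous response, for pagination.
        params["cmcontinue"] = cmcontinue
    return f"{API_URL}?{urlencode(params)}"


def article_url(page_id: int) -> str:
    """Build one prop=extracts|categories|info request for a single page."""
    params = {
        "action": "query",
        "prop": "extracts|categories|info",
        "pageids": str(page_id),
        "explaintext": "1",  # plain text instead of HTML
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


members_url = category_members_url("Mathematical finance")
page_url = article_url(12345)  # 12345 is a placeholder page ID
print(members_url)
print(page_url)
```

`urlencode` percent-encodes the `|` separators and the `Category:` prefix, so the printed URLs look different from the short forms in the table but carry the same parameters.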
The expander performs async BFS with asyncio.gather at each depth level —
all sub-categories at a given depth are fetched concurrently. Article refs are
deduplicated by page_id across the entire traversal; already-stored IDs are
skipped for resume.
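The traversal described above can be sketched against an in-memory tree. Everything here is a stand-in: `FAKE_WIKI`, `fetch_members`, and `bfs_articles` are hypothetical names, and treating the root category as depth 0 is an assumption about how the depth limit is counted.

```python
import asyncio
from typing import Dict, List, Set, Tuple

# (sub-category names, [(page_id, title), ...]) for one category
Members = Tuple[List[str], List[Tuple[int, str]]]

# Hypothetical category tree standing in for the MediaWiki API.
FAKE_WIKI: Dict[str, Members] = {
    "Root": (["A", "B"], [(1, "Article one")]),
    "A": (["C"], [(2, "Article two"), (1, "Article one")]),
    "B": ([], [(3, "Article three")]),
    "C": ([], [(4, "Article four")]),
}


async def fetch_members(category: str) -> Members:
    """Stand-in for one categorymembers request."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return FAKE_WIKI.get(category, ([], []))


async def bfs_articles(
    root: str, max_depth: int, skip_page_ids: Set[int]
) -> List[Tuple[int, str]]:
    seen_pages = set(skip_page_ids)  # already-stored IDs count as seen
    seen_categories = {root}
    frontier = [root]
    found: List[Tuple[int, str]] = []
    for _depth in range(max_depth + 1):  # root is treated as depth 0
        if not frontier:
            break
        # Fetch every category at this depth concurrently.
        results = await asyncio.gather(*(fetch_members(c) for c in frontier))
        next_frontier: List[str] = []
        for subcats, pages in results:
            for page_id, title in pages:
                if page_id not in seen_pages:  # dedupe across the traversal
                    seen_pages.add(page_id)
                    found.append((page_id, title))
            for sub in subcats:
                if sub not in seen_categories:
                    seen_categories.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return found


articles = asyncio.run(bfs_articles("Root", max_depth=1, skip_page_ids={3}))
print(articles)
```

In this run page 1 appears under two categories but is returned once, page 3 is skipped as already stored, and category `C` is never fetched because it lies below the depth limit.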
Development
git clone https://github.com/MoonyFringers/ladon-mimir
cd ladon-mimir
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
License
Apache-2.0 — see LICENSE.
The Ladon core framework is
AGPL-3.0-only. ladon-mimir is Apache-2.0 but has a runtime dependency on
Ladon core; review the AGPL terms if you plan to distribute or run this as a
network service.