Utility that converts Wikipedia pages into GitHub-flavored Markdown.
GoodWiki
GoodWiki is a Python package that carefully converts Wikipedia pages into GitHub-flavored Markdown. Converted pages preserve layout features like lists, code blocks, math, and block quotes.
This package is used to generate the GoodWiki Dataset.
Installation
This package supports Python 3.11+.
- Install via pip.
pip install goodwiki
- Install pandoc v2.19.2. Follow instructions here.
Usage
Initializing Client
import asyncio
from goodwiki import GoodwikiClient
client = GoodwikiClient()
You can optionally provide your own user agent (the default is goodwiki/1.0 (https://euirim.org)):
client = GoodwikiClient("goodwiki/1.0 (bob@gmail.com)")
Getting Single Page
page = asyncio.run(client.get_page("Usain Bolt"))
You can also include styling syntax, such as bold text, in the final Markdown:
page = asyncio.run(client.get_page("Usain Bolt", with_styling=True))
You can access the resulting data via properties. For example:
print(page.markdown)
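Since get_page is run through asyncio.run above, several pages can plausibly be fetched within one event loop. The following is a minimal sketch, not part of the package: fetch_markdown and max_concurrent are hypothetical names, and it assumes that a single client tolerates overlapping get_page calls, which this README does not state. It relies only on get_page and the markdown property shown above.

```python
import asyncio


async def fetch_markdown(client, titles, max_concurrent=3):
    """Return {title: markdown} for each title (hypothetical helper)."""
    # Cap the number of in-flight requests to stay polite to the API.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(title):
        async with semaphore:
            page = await client.get_page(title)
            return title, page.markdown

    # gather preserves input order, so the dict follows the order of titles.
    pairs = await asyncio.gather(*(fetch_one(t) for t in titles))
    return dict(pairs)
```

With the client created earlier, this could be called as asyncio.run(fetch_markdown(client, ["Usain Bolt"])). Lowering max_concurrent to 1 makes the requests strictly sequential if concurrent use turns out to be unsupported.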
Getting Category Pages
To get a list of page titles associated with a Wikipedia category, run the following:
client.get_category_pages("Category:Good_articles")
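The call above is shown without asyncio.run, and this README does not say whether get_category_pages is a coroutine. As a hedged sketch, the hypothetical helper below (list_category_titles and limit are illustrative names, not package API) accepts either a plain list or an awaitable that resolves to one, and optionally truncates the result.

```python
import asyncio
import inspect


async def list_category_titles(client, category, limit=None):
    """Return page titles for a category (hypothetical helper)."""
    result = client.get_category_pages(category)
    # Assumption: the method returns either a list or an awaitable of a list.
    if inspect.isawaitable(result):
        result = await result
    titles = list(result)
    return titles if limit is None else titles[:limit]
```

For example, asyncio.run(list_category_titles(client, "Category:Good_articles", limit=10)) would return at most ten titles under these assumptions.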
Converting Existing Raw Wikitext
If you've already downloaded raw wikitext from Wikipedia, you can convert it to Markdown by running:
client.get_page_from_wikitext(
    raw_wikitext="RAW_WIKITEXT",
    # The rest of the fields are meant for populating the final WikiPage object
    title="Usain Bolt",
    pageid=123,
    revid=123,
)
Methodology
Full details are available in this package's GitHub repo README.