
Utility that converts Wikipedia pages into GitHub-flavored Markdown.


GoodWiki

GoodWiki is a Python package that carefully converts Wikipedia pages into GitHub-flavored Markdown. Converted pages preserve layout features like lists, code blocks, math, and block quotes.

This package is used to generate the GoodWiki Dataset.

Installation

This package supports Python 3.11+.

  1. Install via pip:
     pip install goodwiki
  2. Install pandoc v2.19.2 by following the official pandoc installation instructions.
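
Because the pandoc requirement names a specific version, it can be useful to check what is installed before converting pages. The following is a minimal sketch and not part of GoodWiki: `parse_pandoc_version` and `installed_pandoc_version` are hypothetical helper names, and the code assumes that the first line of `pandoc --version` looks like `pandoc 2.19.2`.

```python
from __future__ import annotations

import re
import shutil
import subprocess


def parse_pandoc_version(version_output: str) -> tuple[int, ...] | None:
    """Extract a version tuple from the first line of `pandoc --version` output."""
    lines = version_output.splitlines()
    if not lines:
        return None
    # Assumption: the first line has the form "pandoc 2.19.2".
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", lines[0])
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_pandoc_version() -> tuple[int, ...] | None:
    """Return the installed pandoc version, or None if pandoc is not on PATH."""
    executable = shutil.which("pandoc")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout)


if __name__ == "__main__":
    version = installed_pandoc_version()
    if version is None:
        print("pandoc was not found on PATH")
    elif version != (2, 19, 2):
        print("found pandoc", ".".join(map(str, version)), "but 2.19.2 is expected")
    else:
        print("pandoc 2.19.2 is installed")
```

Keeping the parsing separate from the subprocess call means the version check can be exercised without pandoc being present.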

Usage

Initializing Client

import asyncio
from goodwiki import GoodwikiClient

client = GoodwikiClient()

You can optionally provide your own user agent (the default is goodwiki/1.0 (https://euirim.org)):

client = GoodwikiClient("goodwiki/1.0 (bob@gmail.com)")

Getting Single Page

page = asyncio.run(client.get_page("Usain Bolt"))

You can optionally keep styling syntax, such as bold text, in the final Markdown:

page = asyncio.run(client.get_page("Usain Bolt", with_styling=True))

You can access the resulting data via properties. For example:

print(page.markdown)
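
Each `asyncio.run` call starts a fresh event loop, so when several pages are needed it may be more convenient to fetch them inside a single coroutine. The sketch below is an illustration rather than a GoodWiki feature: `fetch_pages` is a hypothetical helper, it assumes only that `client.get_page(title)` returns an awaitable (as the calls above suggest), and the limit of 5 concurrent requests is an arbitrary choice. Whether the client is intended for concurrent use is not confirmed here, so a limit of 1 is the conservative setting.

```python
import asyncio


async def fetch_pages(client, titles, max_concurrency=5):
    """Fetch several pages in one event loop, limiting concurrent requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(title):
        # The semaphore caps how many get_page calls are in flight at once.
        async with semaphore:
            return await client.get_page(title)

    # gather returns results in the same order as `titles`.
    return await asyncio.gather(*(fetch_one(title) for title in titles))
```

A call could then look like `pages = asyncio.run(fetch_pages(client, ["Usain Bolt"]))`, after which each result can be read through its properties as shown above.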

Getting Category Pages

To get a list of page titles associated with a Wikipedia category, run the following:

client.get_category_pages("Category:Good_articles")

Converting Existing Raw Wikitext

If you've already downloaded raw wikitext from Wikipedia, you can convert it to Markdown by running:

client.get_page_from_wikitext(
    raw_wikitext="RAW_WIKITEXT",
    # The rest of the fields are meant for populating the final WikiPage object
    title="Usain Bolt",
    pageid=123,
    revid=123,
)
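
Once a page has been converted, a common next step is writing the Markdown to disk. The following is a rough sketch under stated assumptions: `safe_filename` and `save_page` are hypothetical helpers, `page.markdown` is the property shown earlier, and `page.title` is assumed to exist because `title` is passed in to populate the resulting object.

```python
import re
from pathlib import Path


def safe_filename(title: str) -> str:
    """Turn a page title into a filesystem-friendly Markdown file name."""
    # Replace runs of characters other than word characters and hyphens.
    slug = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return f"{slug or 'untitled'}.md"


def save_page(page, output_dir) -> Path:
    """Write page.markdown into output_dir and return the file path."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumption: the page object exposes `title` alongside `markdown`.
    path = directory / safe_filename(page.title)
    path.write_text(page.markdown, encoding="utf-8")
    return path
```

Different titles can map to the same file name under this scheme (for example, titles that differ only in punctuation), so a real pipeline might add the page ID to the name.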

Methodology

Full details are available in this package's GitHub repo README.

Download files


Source Distribution

goodwiki-1.0.1.tar.gz (31.5 kB)


Built Distribution

goodwiki-1.0.1-py3-none-any.whl (15.3 kB)
