
mwlib - MediaWiki Parser and Utility Library

Overview

mwlib is a versatile library designed for parsing MediaWiki articles and converting them to various output formats. A notable application of mwlib is in Wikipedia's "Print/export" feature, where it is used to create PDF documents from Wikipedia articles.

Getting Started

Prerequisites

To build mwlib, ensure you have the following software installed:

  • Python (version 3.11 or 3.12)
  • Ploticus
  • re2c
  • Perl
  • Pillow (the PIL imaging library fork)
  • ImageMagick
  • uv (Python package installer, faster alternative to pip)

Set up a virtual environment for Python 3.11 or 3.12 and activate it.
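
For example, using uv (this creates a .venv directory in the project root; one common way to do it, not the only one):

uv venv --python 3.12
source .venv/bin/activate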

Installing uv

If you don't have uv installed, you can install it by following the instructions in uv's official documentation.

For example, on Unix-like systems:

curl -LsSf https://astral.sh/uv/install.sh | sh

Or using pip:

pip install uv

Installing mwlib

To install all dependencies and the project, run:

$ make install

This will use uv to install all required dependencies.

To build the C extensions and install mwlib in development mode:

$ make build
$ make develop
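
As a quick smoke test, you can verify that the package imports and that the command-line tools referenced later in this document (such as mw-zip) are on your PATH:

python -c "import mwlib"
mw-zip --help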

Documentation

Please visit http://mwlib.readthedocs.org/en/latest/index.html for detailed documentation.

Configuration

OAuth2 Configuration

mwlib supports the OAuth2 client_credentials flow for Wikipedia API access. This allows authenticated access to MediaWiki APIs that require OAuth2 authentication, while maintaining compatibility with wikis that don't use OAuth2.

OAuth2 Configuration Options

The following configuration options are available for OAuth2:

  • oauth2.client_id: OAuth2 client ID
  • oauth2.client_secret: OAuth2 client secret
  • oauth2.token_url: Token endpoint URL
  • oauth2.enabled: Whether to use OAuth2 (true/false)

HTTP/2 Configuration Options

mwlib also supports HTTP/2 for improved performance:

  • http2.enabled: Whether to use HTTP/2 (default: True)
  • http2.auto_detect: Whether to auto-detect HTTP/2 support (default: True)

These configuration options can be set either through environment variables or in a configuration file (mwlib.ini or ~/.mwlibrc). The following table shows the mapping between configuration file options and their corresponding environment variables:

Config File Option     Environment Variable         Description
oauth2.client_id       MWLIB_OAUTH2_CLIENT_ID       OAuth2 client ID
oauth2.client_secret   MWLIB_OAUTH2_CLIENT_SECRET   OAuth2 client secret
oauth2.token_url       MWLIB_OAUTH2_TOKEN_URL       Token endpoint URL
oauth2.enabled         MWLIB_OAUTH2_ENABLED         Enable OAuth2 (true/false)
http2.enabled          MWLIB_HTTP2_ENABLED          Enable HTTP/2 (true/false)
http2.auto_detect      MWLIB_HTTP2_AUTO_DETECT      Auto-detect HTTP/2 support

Example configuration file (mwlib.ini):

[oauth2]
enabled=true
client_id=your-client-id
client_secret=your-client-secret

[http2]
auto_detect=true

Example Usage

You can also set the config parameters directly when instantiating MwApi:

from mwlib.network.sapi import MwApi

# Using OAuth2
api = MwApi(
    apiurl="https://en.wikipedia.org/w/api.php",
    use_oauth2=True
)

The recommended approach, however, is to configure OAuth2 through environment variables:

export MWLIB_OAUTH2_CLIENT_ID="your_client_id"
export MWLIB_OAUTH2_CLIENT_SECRET="your_client_secret"
export MWLIB_OAUTH2_ENABLED="true"
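
With these set, the OAuth2 parameters no longer need to be passed explicitly. A minimal sketch, assuming mwlib picks up the MWLIB_OAUTH2_* variables when the client is created:

from mwlib.network.sapi import MwApi

# credentials and the enabled flag come from the environment
api = MwApi(apiurl="https://en.wikipedia.org/w/api.php")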

BigQuery Lookup for Image Description Pages

mwlib can use Google BigQuery to look up image description pages (namespace 6) instead of fetching them from the remote MediaWiki API. This significantly reduces the number of API requests and bypasses Wikipedia rate limits for configured domains.

The data originates from the Wikimedia Enterprise API snapshot endpoint, which provides Wikipedia page data as NDJSON files. These snapshots are loaded into a BigQuery table containing pre-extracted metadata: templates (used for license checking), image dimensions, content URLs, and license information.
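
Conceptually, the lookup is a single batched query against that table. The sketch below is illustrative only: the column names (title, templates, width, height, content_url, license) are hypothetical, inferred from the metadata listed above, while the dataset and table names are the documented defaults. It uses the google-cloud-bigquery client installed via the bigquery extra (see Prerequisites below):

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# batched lookup of image description pages by title
query = """
    SELECT title, templates, width, height, content_url, license
    FROM `my-gcp-project.wme_snapshots.file_pages`
    WHERE title IN UNNEST(@titles)
"""
job = client.query(query, job_config=bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter(
        "titles", "STRING",
        ["File:Example_one.jpg", "File:Example_two.png"])]
))
for row in job.result():
    print(row.title, row.license)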

How it works

  1. During ZIP creation (mw-zip), image description page titles for configured domains (default: en.wikipedia.org) are batched and queried from BigQuery (see the example after this list).
  2. For pages found in BigQuery, templates are stored locally and an early license check is performed — images that fail the license filter are never downloaded.
  3. Pages not found in BigQuery fall back to the remote MediaWiki API.
  4. Non-configured domains (e.g., Commons) always use the remote API.
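
A typical end-to-end run with BigQuery enabled might look like this (the article title and output filename are illustrative; :en is the usual shortcut for en.wikipedia.org):

export MWLIB_BIGQUERY_ENABLED="true"
export MWLIB_BIGQUERY_PROJECT="my-gcp-project"
mw-zip -c :en -o example.zip "Acadia National Park"

Description pages for images referenced by the article are then resolved through BigQuery where possible, and only images that pass the license check are downloaded.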

Prerequisites

The BigQuery client (and the wme-ingest CLI that loads snapshots into it) is shipped behind the bigquery extra so default installs stay slim. Install it via the extra rather than pinning google-cloud-bigquery directly:

# pip
pip install "mwlib[bigquery]"

# uv (project install)
uv pip install "mwlib[bigquery]"

# uv (developer checkout)
uv sync --extra bigquery

The bigquery extra is also pulled in by the dev dependency group, so uv sync (default groups) installs it for development and test runs.

Set up GCP authentication by pointing to a service account JSON file:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
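
Alternatively, if the gcloud CLI is installed, Application Default Credentials can be used instead of a service account file (standard Google Cloud tooling, not specific to mwlib):

gcloud auth application-default login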

BigQuery Configuration Options

Config File Option   Environment Variable     Default            Description
bigquery.enabled     MWLIB_BIGQUERY_ENABLED   false              Master switch to enable BigQuery lookups
bigquery.project     MWLIB_BIGQUERY_PROJECT   (required)         GCP project ID
bigquery.dataset     MWLIB_BIGQUERY_DATASET   wme_snapshots      BigQuery dataset name
bigquery.table       MWLIB_BIGQUERY_TABLE     file_pages         BigQuery table name
bigquery.timeout     MWLIB_BIGQUERY_TIMEOUT   30                 Query timeout in seconds
bigquery.domains     MWLIB_BIGQUERY_DOMAINS   en.wikipedia.org   Comma-separated domains to route through BigQuery

Example environment variable configuration:

export MWLIB_BIGQUERY_ENABLED="true"
export MWLIB_BIGQUERY_PROJECT="my-gcp-project"
export MWLIB_BIGQUERY_DATASET="wikipedia"
export MWLIB_BIGQUERY_TABLE="file_pages"
export MWLIB_BIGQUERY_DOMAINS="en.wikipedia.org"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Or in a configuration file (mwlib.ini):

[bigquery]
enabled=true
project=my-gcp-project
dataset=wikipedia
table=file_pages
domains=en.wikipedia.org

When BigQuery is disabled or unavailable (missing credentials, network error, etc.), mwlib falls back to the remote API automatically with no change in behavior.

Loading Data into BigQuery

Use the wme-ingest command to download a Wikimedia Enterprise namespace 6 snapshot and load it into BigQuery:

# Full pipeline: download snapshot + load into BigQuery
wme-ingest --project my-gcp-project --dataset wikipedia

# Load from an already-downloaded tarball
wme-ingest -i /path/to/enwiki_ns6.tar.gz --project my-gcp-project --dataset wikipedia

# List available snapshots
wme-ingest --list

This requires Wikimedia Enterprise API credentials:

export WME_USERNAME="your-username"
export WME_PASSWORD="your-password"

Docker Compose Setup

For users interested in setting up mwlib with Docker Compose, detailed instructions are available in the project's Docker Compose documentation.

License

Copyright (c) 2007-2012 PediaPress GmbH

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of PediaPress GmbH nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
