A collection of scripts and utilities to support the stream-processing of MediaWiki data.

Project description

A set of utilities for stream-processing MediaWiki data.


mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities


Generates token persistence statistics using revision JSON blobs with diff information.


Converts an XML dump to a stream of revision JSON blobs.


Computes and adds a “diff” field to a stream of revision JSON blobs.


Aggregates token persistence statistics into revision statistics.


Converts a Wikihadoop-processed stream of XML pages to JSON blobs.

General utilities


Converts a stream of JSON blobs to tab-separated values based on a set of field names.
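As a rough illustration of the JSON-to-TSV conversion described above, here is a minimal sketch. It assumes one JSON blob per input line; the function names, the `NULL` placeholder for missing fields, and the escaping rules are hypothetical choices for this example and are not taken from the package itself.

```python
import json


def encode(value):
    """Render a single value as one TSV cell (illustrative rules only)."""
    if value is None:
        return "NULL"  # hypothetical placeholder for a missing field
    if isinstance(value, str):
        # Escape characters that would break the row/column layout.
        return (value.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n"))
    return json.dumps(value)  # numbers, booleans, nested structures


def json_lines_to_tsv(lines, fieldnames):
    """Yield one tab-separated row per JSON blob, keeping only `fieldnames`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        doc = json.loads(line)
        yield "\t".join(encode(doc.get(name)) for name in fieldnames)


if __name__ == "__main__":
    sample = ['{"id": 1, "page": "Foo", "minor": true}', '{"id": 2}']
    for row in json_lines_to_tsv(sample, ["id", "page", "minor"]):
        print(row)
```

Processing the input lazily, one line at a time, keeps memory use flat regardless of how large the stream is.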


Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.


Validates JSON against a provided schema.


Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters and adds a boolean ‘truncated’ field (addresses content dump vandalism issues).
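The truncation behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the package's implementation: the function names and the `max_chars` parameter are hypothetical, and the input is assumed to be one JSON blob per line. Python's `str` slicing counts Unicode code points, which matches the "length in Unicode characters" wording.

```python
import json
import sys


def truncate_text(doc, max_chars):
    """Return a copy of `doc` whose 'text' has at most `max_chars` characters.

    A boolean 'truncated' field records whether anything was cut.
    """
    doc = dict(doc)  # shallow copy so the caller's blob is left untouched
    text = doc.get("text")
    if isinstance(text, str) and len(text) > max_chars:
        doc["text"] = text[:max_chars]
        doc["truncated"] = True
    else:
        doc["truncated"] = False
    return doc


def truncate_stream(lines, max_chars, out=sys.stdout):
    """Read JSON blobs line by line and write the truncated blobs to `out`."""
    for line in lines:
        if line.strip():
            blob = truncate_text(json.loads(line), max_chars)
            out.write(json.dumps(blob) + "\n")


if __name__ == "__main__":
    truncate_stream(['{"id": 1, "text": "a very long revision text"}'], 6)
```

Capping the text length guards downstream tools against revisions where vandalism has inflated the page content to an extreme size.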


pip install mwstreaming

Source distribution: mwstreaming-0.5.0.tar.gz (10.7 kB)
