A collection of scripts and utilities to support the stream-processing of MediaWiki data.

Project description

A set of utilities for stream-processing MediaWiki data.


mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities

Generates token persistence statistics using revision JSON blobs with diff information.
Converts an XML dump to a stream of revision JSON blobs.
Computes diffs directly from an XML dump.
Computes and adds a “diff” field to a stream of revision JSON blobs.
Mends diffs that were computed in chunks and out of order.
Aggregates token persistence statistics into revision statistics.
Converts a Wikihadoop-processed stream of XML pages to JSON blobs.
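
To make the idea of adding a “diff” field to a stream of revision JSON blobs concrete, here is a minimal Python sketch. It is not the implementation shipped in mwstreaming: the function name `add_diffs`, the whitespace tokenizer, the use of `difflib`, the operation layout, and the assumption that each blob carries `page.id` and `text` fields are all illustrative choices. It also assumes that revisions for a page arrive in order, which is exactly the assumption that breaks when diffs are computed in chunks.

```python
import difflib


def add_diffs(docs):
    """Yield each revision blob with an added 'diff' field (illustrative only).

    Assumptions (not taken from mwstreaming itself):
      * each blob is a dict with doc["page"]["id"] and doc["text"]
      * revisions of the same page arrive in chronological order
      * tokens are whitespace-separated words
    """
    last_tokens = {}  # page id -> tokens of the most recent revision seen

    for doc in docs:
        page_id = doc["page"]["id"]
        tokens = (doc.get("text") or "").split()
        previous = last_tokens.get(page_id, [])

        # Compare the previous revision's tokens with the current ones.
        matcher = difflib.SequenceMatcher(a=previous, b=tokens, autojunk=False)
        operations = [
            {"name": tag, "a1": a1, "a2": a2, "b1": b1, "b2": b2}
            for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        ]

        out = dict(doc)  # leave the caller's blob untouched
        out["diff"] = {"ops": operations}
        last_tokens[page_id] = tokens
        yield out
```

Because the function is a generator, it holds only the latest token list per page in memory, which is the property that makes this style of processing suitable for streams.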

General utilities

Converts a stream of JSON blobs to tab-separated values based on a set of field names.
Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.
Validates JSON against a provided schema.
Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters and adds a boolean ‘truncated’ field. This addresses content dump vandalism issues.
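
The general utilities share one pattern: read a JSON blob per line, transform it, and write a line out. The Python sketch below illustrates truncation and TSV conversion under that pattern. It is an assumption-laden illustration, not mwstreaming's code: the names `truncate_text`, `json_to_tsv_row`, and `process` are hypothetical, as are the escaping rules and the one-blob-per-line input format. Only the ‘text’ and ‘truncated’ field names come from the descriptions above.

```python
import json


def truncate_text(doc, max_chars):
    """Return a copy of doc with 'text' cut to max_chars characters
    and a boolean 'truncated' field recording whether a cut happened."""
    out = dict(doc)
    text = out.get("text") or ""
    out["truncated"] = len(text) > max_chars
    if out["truncated"]:
        out["text"] = text[:max_chars]
    return out


def json_to_tsv_row(doc, fieldnames):
    """Render the selected fields of one blob as a tab-separated line."""
    cells = []
    for name in fieldnames:
        value = doc.get(name)
        if value is None:
            cell = ""
        elif isinstance(value, str):
            cell = value
        else:
            cell = json.dumps(value)  # numbers, booleans, nested values
        # Escape characters that would break the one-row-per-line layout.
        cell = cell.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
        cells.append(cell)
    return "\t".join(cells)


def process(lines, fieldnames, max_chars):
    """Lazily turn JSON lines into truncated, tab-separated rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = truncate_text(json.loads(line), max_chars)
        yield json_to_tsv_row(doc, fieldnames)
```

In a pipeline, `process(sys.stdin, ["id", "text", "truncated"], 1000)` could be iterated and each row printed, so that no more than one blob is held in memory at a time.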


pip install mwstreaming


Files for mwstreaming, version 0.5.5
mwstreaming-0.5.5.tar.gz (12.5 kB), file type: Source, Python version: None
(23.3 kB), file type: Source, Python version: None
