
A collection of scripts and utilities to support the stream-processing of MediaWiki data.

Project description

A set of utilities for stream-processing MediaWiki data.

Usage

mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities

diffs2persistence

Generates token persistence statistics using revision JSON blobs with diff information.

dump2json

Converts an XML dump to a stream of revision JSON blobs.
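The conversion described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the function name dump2json, the helper local, the sample XML, and the output field names (page, id, timestamp, text) are all hypothetical and chosen only to show the streaming idea.

```python
import io
import json
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag such as '{ns}revision'."""
    return tag.rsplit("}", 1)[-1]


def dump2json(xml_file):
    """Yield one JSON string per <revision> without loading the whole dump."""
    page_title = None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            # The page title closes before any of the page's revisions.
            page_title = elem.text
        elif name == "revision":
            # Only direct children are read; nested elements are ignored.
            fields = {local(child.tag): child.text for child in elem}
            yield json.dumps({
                "page": {"title": page_title},   # hypothetical field layout
                "id": int(fields["id"]),
                "timestamp": fields.get("timestamp"),
                "text": fields.get("text") or "",
            })
            elem.clear()  # release the revision's children once emitted


# Hypothetical two-revision sample standing in for a real dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Foo</title>
    <revision><id>1</id><timestamp>2014-01-01T00:00:00Z</timestamp><text>first</text></revision>
    <revision><id>2</id><timestamp>2014-01-02T00:00:00Z</timestamp><text>second</text></revision>
  </page>
</mediawiki>"""

blobs = [json.loads(line) for line in dump2json(io.BytesIO(SAMPLE.encode("utf-8")))]
```

Because revisions are emitted as they are parsed, memory use stays roughly proportional to a single revision rather than to the whole dump.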

json2diffs

Computes and adds a “diff” field to a stream of revision JSON blobs.
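As a rough illustration of adding a “diff” field, the sketch below compares each revision's whitespace-separated tokens with those of the previous revision. It uses Python's difflib rather than whatever diff algorithm the utility actually applies, and it assumes the input is the ordered revision list of a single page. The function name add_diffs and the layout of the “diff” value are hypothetical.

```python
import difflib


def add_diffs(revisions):
    """Yield each revision dict with a 'diff' field against the previous text."""
    previous_tokens = []
    for rev in revisions:
        tokens = (rev.get("text") or "").split()
        matcher = difflib.SequenceMatcher(
            None, previous_tokens, tokens, autojunk=False
        )
        # One operation per opcode: 'equal', 'insert', 'delete' or 'replace'.
        ops = [
            {"name": tag, "a1": a1, "a2": a2, "b1": b1, "b2": b2,
             "tokens": tokens[b1:b2]}
            for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        ]
        out = dict(rev)
        out["diff"] = {"ops": ops}  # hypothetical field layout
        yield out
        previous_tokens = tokens


revs = [{"id": 1, "text": "a b c"}, {"id": 2, "text": "a b d c"}]
with_diffs = list(add_diffs(revs))
```

Each output blob keeps its original fields, so later stages in a pipeline can read both the text and the operations that produced it.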

persistence2stats

Aggregates token persistence statistics into revision statistics.

wikihadoop2json

Converts a Wikihadoop-processed stream of XML pages to JSON blobs.

General utilities

json2tsv

Converts a stream of JSON blobs to tab-separated values based on a set of field names.
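The idea can be sketched in a few lines. This is not the utility's code: the function name json2tsv, the escaping rules, and the choice to emit an empty cell for a missing field are assumptions made for the example.

```python
import json


def json2tsv(lines, fieldnames):
    """Yield one tab-separated row per JSON line, keeping only `fieldnames`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        doc = json.loads(line)
        cells = []
        for name in fieldnames:
            value = doc.get(name)
            if value is None:
                cells.append("")  # assumption: missing fields become empty cells
            elif isinstance(value, str):
                # Escape characters that would break the row/column layout.
                cells.append(
                    value.replace("\\", "\\\\")
                         .replace("\t", "\\t")
                         .replace("\n", "\\n")
                )
            else:
                cells.append(json.dumps(value))
        yield "\t".join(cells)


stream = ['{"id": 1, "page": {"title": "Foo"}, "comment": "a\\tb"}', '{"id": 2}']
rows = list(json2tsv(stream, ["id", "comment"]))
```

Escaping tabs and newlines inside string values keeps every blob on exactly one output line, which matters when the TSV is consumed by line-oriented tools.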

normalize

Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.

validate

Validates JSON against a provided schema.

truncate_text

Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters and adds a boolean ‘truncated’ field. This addresses content dump vandalism issues.
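A minimal sketch of this behaviour follows. The function name truncate_text and its max_chars parameter are hypothetical; only the ‘text’ and ‘truncated’ field names come from the description above.

```python
def truncate_text(doc, max_chars):
    """Return a copy of `doc` with 'text' cut to `max_chars` code points."""
    out = dict(doc)
    text = doc.get("text")
    if isinstance(text, str) and len(text) > max_chars:
        # Python str slicing counts Unicode code points, not bytes.
        out["text"] = text[:max_chars]
        out["truncated"] = True
    else:
        out["truncated"] = False
    return out


short = truncate_text({"id": 1, "text": "héllo"}, 10)
long = truncate_text({"id": 2, "text": "x" * 50}, 10)
```

Recording the ‘truncated’ flag lets downstream consumers tell a genuinely short revision apart from one that was cut.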

Installation

pip install mwstreaming

Download files


Source Distributions

mwstreaming-0.5.0.zip (19.6 kB, source)

mwstreaming-0.5.0.tar.gz (10.7 kB, source)

File details

Details for the file mwstreaming-0.5.0.zip.

File metadata

  • Download URL: mwstreaming-0.5.0.zip
  • Size: 19.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwstreaming-0.5.0.zip
Algorithm Hash digest
SHA256 f4b2bc578766aa8f74edef5b1be60af80a5b2225b9dcc2e10ed5c08f8ae41afb
MD5 a4528454fe515820dc33d440a72061db
BLAKE2b-256 3c79e02468c03532922b6c0ba82c4987a9d43dfc0123981785554f62e12675eb


File details

Details for the file mwstreaming-0.5.0.tar.gz.

File metadata

  • Download URL: mwstreaming-0.5.0.tar.gz
  • Size: 10.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwstreaming-0.5.0.tar.gz
Algorithm Hash digest
SHA256 c4fef8ffb86ee873ef07d98d0090c37851f0e3f73b4cd4839d91088be1b47b20
MD5 c56bb783f879d17e253616892b92f180
BLAKE2b-256 e8f986b557963de3b26d698521577548cc942cfb0f43b7dbce196810e505b5c0
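A downloaded archive can be checked against the SHA256 digests listed above. The sketch below uses only Python's hashlib. The helper names sha256_of and matches are hypothetical, and the temporary empty file exists only so that the example runs without a download.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check with an empty file, whose SHA256 digest is well known.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    pass
empty_ok = matches(
    tmp.name,
    "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
)
os.remove(tmp.name)

# For a real download, pass the archive path and the digest from the listing:
# matches("mwstreaming-0.5.0.tar.gz",
#         "c4fef8ffb86ee873ef07d98d0090c37851f0e3f73b4cd4839d91088be1b47b20")
```

Reading in chunks keeps memory use constant regardless of archive size.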

