# MediaWiki XML
This library contains a collection of utilities for efficiently processing MediaWiki's XML database dumps. It addresses two important concerns of streaming XML parsing: complexity and performance. The library enables memory-efficient stream processing of XML dumps with a simple [iterator](https://pythonhosted.org/mwxml/iteration.html) strategy, and it also implements a distributed processing strategy (see [map()](https://pythonhosted.org/mwxml/map.html)) that enables parallel processing of many XML dump files at the same time.
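The streaming approach described above can be illustrated with the standard library's `xml.etree.ElementTree.iterparse`. This is a minimal sketch of the general technique (not mwxml's actual internals), run here against a tiny in-memory stand-in for a dump file:

```python
import io
import xml.etree.ElementTree as ET

# A small in-memory stand-in for a (potentially huge) MediaWiki dump file.
DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
</mediawiki>"""

revision_ids = []
# iterparse yields each element as its closing tag is seen, so the whole
# document never needs to be held in memory at once.
for _, elem in ET.iterparse(io.StringIO(DUMP), events=("end",)):
    if elem.tag == "revision":
        revision_ids.append(int(elem.findtext("id")))
        elem.clear()  # discard the processed subtree to keep memory flat

print(revision_ids)  # [1, 2]
```

The same pattern scales to multi-gigabyte dumps because memory use is bounded by the largest single element, not the file size.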
* **Installation:** `pip install mwxml`
* **Documentation:** https://pythonhosted.org/mwxml
* **Repository:** https://github.com/mediawiki-utilities/python-mwxml
* **License:** MIT
## Example
```python
>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3
```
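The distributed `map()` strategy mentioned above fans the same per-dump work out across many dump files at once. The general pattern can be sketched with the standard library's `concurrent.futures` (a generic illustration of the pattern, not mwxml's API; the in-memory XML fragments stand in for real dump file paths):

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def revision_ids(xml_text):
    """Extract <id> values from <revision> elements of one dump fragment."""
    ids = []
    for _, elem in ET.iterparse(io.StringIO(xml_text)):
        if elem.tag == "revision":
            ids.append(int(elem.findtext("id")))
    return ids


# Two tiny in-memory "dump files" standing in for real dump paths.
dumps = [
    "<page><revision><id>1</id></revision>"
    "<revision><id>2</id></revision></page>",
    "<page><revision><id>3</id></revision></page>",
]

# Process each dump concurrently and flatten the results, mirroring how a
# map() over many dump files yields values from all of them.
with ThreadPoolExecutor() as pool:
    results = [rid for ids in pool.map(revision_ids, dumps) for rid in ids]

print(results)  # [1, 2, 3]
```

Because each dump file is independent, this kind of work parallelizes cleanly; mwxml's own `map()` applies the same idea to whole dump files on disk.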
## Author
* Aaron Halfaker – https://github.com/halfak
## See also
* http://dumps.wikimedia.org/
* http://community.wikia.com/wiki/Help:Database_download