A simple but fast Python script that reads the XML dump of a wiki and outputs the processed data to a CSV file.
Project description
The full revision history of a MediaWiki wiki can be backed up as an XML file, known as an XML dump. This file records every edit made to the wiki, along with its date, page, author and the full content of the edit.
Very often we just want the metadata of each edit (date, author and page) and do not need the content of the edit, which is by far the largest piece of data.
This script converts this very long XML dump into CSV files that are much smaller and easier to read and work with. It takes care of
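To illustrate the idea of keeping the metadata and discarding the edit content, here is a minimal sketch that streams a MediaWiki XML export with the standard library. It is not the code of this package: the function names and the output columns other than timestamp are assumptions made for the example, and the sample XML is a hand-written fragment.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}revision' -> 'revision'."""
    return tag.rsplit('}', 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for node in elem:
        if local_name(node.tag) == name:
            return node
    return None


def revisions_to_csv(xml_source, csv_target):
    """Write one CSV row per revision; return the number of rows written.

    The column names are illustrative, not the package's real output format.
    """
    writer = csv.writer(csv_target, quotechar='|')
    writer.writerow(['page_title', 'revision_id', 'timestamp', 'contributor'])
    page_title = None
    rows = 0
    # iterparse yields elements as they are closed, so the whole dump
    # never has to be held in memory at once.
    for _event, elem in ET.iterparse(xml_source, events=('end',)):
        name = local_name(elem.tag)
        if name == 'title':
            page_title = elem.text
        elif name == 'revision':
            contributor = child(elem, 'contributor')
            author = None
            if contributor is not None:
                # Registered users have <username>, anonymous edits have <ip>.
                node = child(contributor, 'username')
                if node is None:
                    node = child(contributor, 'ip')
                author = node.text if node is not None else None
            rev_id = child(elem, 'id')
            stamp = child(elem, 'timestamp')
            writer.writerow([
                page_title,
                rev_id.text if rev_id is not None else None,
                stamp.text if stamp is not None else None,
                author,
            ])
            rows += 1
            elem.clear()  # drop the (large) edit text once the row is written
        elif name == 'page':
            elem.clear()
    return rows


# Hand-written sample standing in for a real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Main Page</title>
<revision><id>7</id><timestamp>2020-01-02T03:04:05Z</timestamp>
<contributor><username>Alice</username></contributor>
<text>A very long edit body that we do not keep.</text></revision>
</page></mediawiki>"""

out = io.StringIO()
count = revisions_to_csv(io.BytesIO(SAMPLE), out)
print(out.getvalue())
```

Clearing each revision element after its row is written is what keeps memory use flat on large dumps.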
Usage
Install the package using pip:
pip install wiki_dump_parser
Then, use it directly from the command line:
python -m wiki_dump_parser <dump.xml>
Or from Python code:
import wiki_dump_parser as parser
parser.xml_to_csv('dump.xml')
The output CSV files should be loaded using '|' as the quote character for strings. For example, to load the output file "dump.csv" generated by this script with pandas:
import pandas as pd
df = pd.read_csv('dump.csv', quotechar='|', index_col=False)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%dT%H:%M:%SZ')
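If pandas is not available, the same loading step can be sketched with the standard library alone. This is only an illustration: the sample data and the page_title column are assumptions, while the '|' quote character and the timestamp format are the ones given above.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-column sample; a real output file has more columns.
SAMPLE_CSV = (
    "page_title,timestamp\n"
    "|Help, I need somebody|,2019-05-01T12:30:00Z\n"
)


def load_rows(csv_file):
    """Read rows as dicts, using '|' for quoting and parsing the timestamp."""
    rows = []
    for row in csv.DictReader(csv_file, quotechar='|'):
        row['timestamp'] = datetime.strptime(
            row['timestamp'], '%Y-%m-%dT%H:%M:%SZ'
        )
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]['page_title'], rows[0]['timestamp'])
```

The quoted title shows why the quote character matters: without quotechar='|', the comma inside the title would be read as a field separator.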
Dependencies
python 3
Yes, nothing more.
How to get a wiki history dump
There are several ways to get the wiki dump:
If you have access to the server, follow the instructions in the MediaWiki docs.
For Wikia wikis and many other domains, you can use our in-house developed script made to accomplish this task. It is straightforward to use and very fast.
Wikimedia project wikis: For wikis belonging to the Wikimedia project, a regularly updated repository with all the dumps is available here: http://dumps.wikimedia.org. Select your target wiki from the list, download the complete edit history dump and uncompress it.
For other wikis, like self-hosted wikis, you should use WikiTeam's dumpgenerator.py script. There is a simple tutorial in their wiki. Its usage is very straightforward and the script is well maintained. Remember to use the --xml option to download the full history dump.
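For the uncompress step mentioned for Wikimedia dumps, a small sketch follows. It assumes the downloaded dump is a bz2 archive (other formats need a different tool), and the function name and file names are placeholders chosen for the example.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path):
    """Decompress a .bz2 dump to a plain XML file, streaming in chunks."""
    with bz2.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        # copyfileobj copies in fixed-size blocks, so large dumps
        # are never loaded into memory in one piece.
        shutil.copyfileobj(src, dst)
    return dst_path


# Example call with placeholder file names:
# decompress_bz2('dump.xml.bz2', 'dump.xml')
```

The resulting dump.xml can then be passed to the script as shown in the Usage section.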
Project details
Download files
Source Distribution
File details
Details for the file wiki_dump_parser-2.0.2.tar.gz.
File metadata
- Download URL: wiki_dump_parser-2.0.2.tar.gz
- Upload date:
- Size: 4.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.19.6 CPython/3.5.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 402783a25ff3cca8eb82eec29beb97859be56ee960ff99db9ab5f662ba5f9d0d
MD5 | 8919c75dcbf8726d123d0adba2385df9
BLAKE2b-256 | f8ed2dfd3d67d0991546fe2906498ace6ad29cb75b901ae13b66343925225d79
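A downloaded copy of the source distribution can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library; the function names are made up for the example, and the local file path is whatever you saved the archive as.

```python
import hashlib

# SHA256 digest published for wiki_dump_parser-2.0.2.tar.gz
EXPECTED_SHA256 = (
    '402783a25ff3cca8eb82eec29beb97859be56ee960ff99db9ab5f662ba5f9d0d'
)


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path):
    """True if the file at path has the published SHA256 digest."""
    return sha256_of_file(path) == EXPECTED_SHA256


# Example call with the archive name from the file details:
# matches_published_hash('wiki_dump_parser-2.0.2.tar.gz')
```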