Wikipedia revision history parser for Python

Project description

WikiRevParser

WikiRevParser is a Python library that parses Wikipedia revision histories and allows you to analyse the development of pages on Wikipedia across all language versions.

The library extracts and parses the Wikipedia revision history for a given language and page title, and outputs clean, accessible data for each timestamp in that history. You can use it to trace how a page's references develop, analyse its content or images over time, compare tables of contents across languages, build editor networks, and much more.

The WikiRevParser relies on our forked and modified version of the Python Wikipedia library, which in turn wraps the MediaWiki API for quick and easy access to Wikipedia data. Our modified version of the Wikipedia library extracts and returns the entire revision history of a page.
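
For orientation, the kind of request that such a wrapper ultimately sends to the MediaWiki API might look roughly like the sketch below. It only builds the URL, using the API's standard `action=query` and `prop=revisions` parameters. The helper name `build_revisions_url` is hypothetical, and the sketch is not taken from WikiRevParser's own code.

```python
from urllib.parse import urlencode


def build_revisions_url(language, title, limit=50):
    """Build a MediaWiki API URL that lists recent revisions of one page.

    Hypothetical helper for illustration only; not part of WikiRevParser.
    """
    params = {
        "action": "query",           # generic query module
        "prop": "revisions",         # ask for revision metadata
        "titles": title,             # page title, e.g. "Knitting"
        "rvprop": "timestamp|user",  # fields to return per revision
        "rvlimit": limit,            # revisions per request
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (language, urlencode(params))


url = build_revisions_url("en", "Knitting")
print(url)
```

A single request returns only a limited number of revisions, so retrieving an entire history requires following the API's continuation mechanism, which is the part the library handles for you.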

Installation

To install WikiRevParser, you can clone the repository on GitHub or simply run:

$ pip install wikirevparser

WikiRevParser is compatible with Python 3.4+. Compatibility with earlier versions of Python has not been tested yet.

Example

To get the revision history for the page on knitting on the English Wikipedia, run:

>>> from WikiRevParser.wikirevparser import wikirevparser
>>> parser_instance = wikirevparser.ProcessRevisions("en", "Knitting") 
>>> parser_instance.wikipedia_page()
>>> data = parser_instance.parse_revisions()

You can then answer questions such as the following.

When and by whom were the first and last edits made?

>>> edits = list(data.items())
>>> first_timepoint = edits[-1][0]
>>> first_editor = edits[-1][1]["user"]
>>> last_timepoint = edits[0][0]
>>> last_editor = edits[0][1]["user"]
>>> print("%s first edited the page at %s, \n and it was last edited by %s at %s." % ( first_editor, first_timepoint, last_editor, last_timepoint))
# Janet Davis first edited the page at 2001-04-07T02:39:27Z, 
# and it was last edited by JavaHurricane at 2020-03-18T12:41:39Z.
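
The timestamps are ISO 8601 strings, so they can be parsed with the standard library to measure how long a page has been edited. The following is a minimal sketch that assumes the parsed data is a dictionary keyed by timestamp, as in the example above. The `sample` dictionary is a hypothetical stand-in for `data` (only its first and last entries come from the output shown above), and `parse_timestamp` is an illustrative helper, not part of the library.

```python
from datetime import datetime

# Hypothetical stand-in for `data`: keyed by timestamp, newest first,
# each value holding at least a "user" field.
sample = {
    "2020-03-18T12:41:39Z": {"user": "JavaHurricane"},
    "2010-06-01T08:00:00Z": {"user": "ExampleEditor"},  # made-up entry
    "2001-04-07T02:39:27Z": {"user": "Janet Davis"},
}


def parse_timestamp(stamp):
    """Parse a timestamp such as 2001-04-07T02:39:27Z into a datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


# Iterating over a dict yields its keys, i.e. the timestamp strings.
stamps = [parse_timestamp(stamp) for stamp in sample]
span = max(stamps) - min(stamps)
print("The page history spans %d days." % span.days)
```

Using `max` and `min` rather than the first and last keys means the result does not depend on the order in which revisions are stored.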

Who has edited the page the most?

>>> from collections import Counter
>>> users = Counter()
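>>> for timestamp in data:
...     users[data[timestamp]["user"]] += 1
...
>>> # most_common(1) returns a one-item list holding the top (user, count) pair
>>> most_editing, number_edits = users.most_common(1)[0]
>>> print("%s has edited the page the most, all of %s times!" % (most_editing, number_edits))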
# WillowW has edited the page the most, all of 93 times!

You could also investigate the use of images, track changes in the tables of contents, analyse differences between language versions, and more.
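
As one illustration of a cross-language comparison, the sketch below finds editors who appear in the histories of two language versions. It assumes each parsed history has the same timestamp-to-revision shape as `data` above. The two dictionaries and the `editors` helper are hypothetical and contain made-up user names.

```python
# Hypothetical parsed histories for two language versions, in the same
# timestamp -> {"user": ...} shape as `data` above.
english = {
    "2020-01-03T10:00:00Z": {"user": "Alice"},
    "2019-05-02T09:30:00Z": {"user": "Bob"},
    "2018-07-11T14:45:00Z": {"user": "Alice"},
}
german = {
    "2020-02-14T16:20:00Z": {"user": "Bob"},
    "2017-09-30T11:05:00Z": {"user": "Carla"},
}


def editors(history):
    """Return the set of user names that appear in a parsed history."""
    return {revision["user"] for revision in history.values()}


shared = editors(english) & editors(german)        # in both versions
only_english = editors(english) - editors(german)  # English only
print("Editors active in both versions:", sorted(shared))
print("Editors active only in the English version:", sorted(only_english))
```

In practice, each dictionary would come from its own `ProcessRevisions` instance, created with the relevant language code and the page title in that language.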

Documentation

Read the docs at https://wikirevparser.readthedocs.io/en/latest/.

License

This work is MIT licensed. See the LICENSE file for full details.

Credits

  • @goldsmith for Wikipedia, the Python wrapper for the Wikipedia API.
  • The Wikimedia Foundation and all Wikipedians for creating and maintaining the data.
  • This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 812997.

Download files

Source Distribution

wikirevparser-0.0.2.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

wikirevparser-0.0.2-py3-none-any.whl (6.7 kB)

Uploaded Python 3

File details

Details for the file wikirevparser-0.0.2.tar.gz.

File metadata

  • Download URL: wikirevparser-0.0.2.tar.gz
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for wikirevparser-0.0.2.tar.gz
Algorithm     Hash digest
SHA256        32446678d8a88343533f2cd6018bb04ea1a0c22c244464860fe0d7b3602fddb5
MD5           1ddb2e8047ad796b987079a939fbd93e
BLAKE2b-256   a66e0ce659fce148eb6089af6337bbecd89db7ea6bd157b1bfb13a5ec34e9789

File details

Details for the file wikirevparser-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: wikirevparser-0.0.2-py3-none-any.whl
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for wikirevparser-0.0.2-py3-none-any.whl
Algorithm     Hash digest
SHA256        139873a6228e5318ed5fa7a054878dafbb4135ecdd14b5c44459b41bb3ae97f6
MD5           444e39e9572e9c5e47fcc1dde28ca1f5
BLAKE2b-256   0ac500b485958390483a045fd8eb27cb0c47c41b927d2ce5730252ae66bdcbc9
