
Wikipedia revision history parser for Python

Project description

WikiRevParser

WikiRevParser is a Python library that parses Wikipedia revision histories and allows you to analyse the development of pages on Wikipedia across all language versions.

The library extracts the revision history for a given language and page title pair, parses it, and outputs clean, accessible data for each timestamp in that history. You can use it to trace how a page's references develop, analyse content or images over time, compare tables of contents across languages, build editor networks, and much more.

Example

To get the revision history for the page on Marie Curie on the English Wikipedia, run:

>>> from wikirevparser import wikirevparser
>>> parser_instance = wikirevparser.ProcessRevisions("en", "Marie Curie") 
>>> parser_instance.wikipedia_page()
>>> data = parser_instance.parse_revisions()

You can then access information like the following:

about links:

>>> edits = list(data.items())
>>> first_links = edits[-1][1]["links"]
>>> latest_links = edits[0][1]["links"]
>>> print("Number of links in the first edit: %d." % len(first_links))
Number of links in the first edit: 1. 
>>> print("A link in the first edit: %s." % first_links[0])
A link in the first edit: pierre and marie curie. 
>>> print("Number of links in the latest edit: %d." % len(latest_links))
Number of links in the latest edit: 320. 
>>> print("A link in the latest edit: %s." % latest_links[0])
A link in the latest edit: congress poland.
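
To see which links were added or removed between the first and the latest revision, a set difference is usually enough. The sketch below is not part of WikiRevParser: `link_changes` is a hypothetical helper, and `sample` is a made-up stand-in for the output of `parse_revisions()`, with an assumed timestamp format. It relies only on what the example above shows: revisions ordered newest first, each with a "links" list.

```python
def link_changes(data):
    """Return (added, removed) links between the first and the latest revision.

    Assumes `data` is ordered newest first, as in the example above,
    and that every revision entry has a "links" list.
    """
    revisions = list(data.values())
    first = set(revisions[-1]["links"])
    latest = set(revisions[0]["links"])
    return sorted(latest - first), sorted(first - latest)


# Hypothetical stand-in for the output of parse_revisions().
sample = {
    "2020-01-02T00:00:00Z": {"links": ["congress poland", "radium"], "user": "B"},
    "2001-01-01T00:00:00Z": {"links": ["pierre and marie curie"], "user": "A"},
}

added, removed = link_changes(sample)
print(added)    # ['congress poland', 'radium']
print(removed)  # ['pierre and marie curie']
```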

about editors:

>>> from collections import Counter
>>> editors = Counter()
>>> for timestamp in data:
...     editors[data[timestamp]["user"]] += 1
>>> most_frequent = editors.most_common(1)[0]
>>> editor, n_edits = most_frequent  # n_edits avoids shadowing the `edits` list above
>>> print("%s has edited the page the most, all of %d times (%d percent)!" % (editor, n_edits, (n_edits/len(data)*100)))
Nihil novi has edited the page the most, all of 619 times (13 percent)!
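
The same counting generalises to a ranked list of editors with their share of all revisions. This is a minimal sketch, not part of the library: `top_editors` is a hypothetical helper and `sample` is invented data in the shape shown above, where each revision has a "user" entry.

```python
from collections import Counter


def top_editors(data, n=3):
    """Return the n most active editors as (user, edit count, percent of all edits)."""
    counts = Counter(revision["user"] for revision in data.values())
    total = sum(counts.values())
    return [(user, count, 100 * count / total) for user, count in counts.most_common(n)]


# Hypothetical stand-in for the output of parse_revisions().
sample = {
    "t4": {"user": "A"},
    "t3": {"user": "B"},
    "t2": {"user": "A"},
    "t1": {"user": "C"},
}

print(top_editors(sample, n=1))  # [('A', 2, 50.0)]
```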

You could also investigate how images are used, track changes to the tables of contents, compare language versions of a page, and much more.
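
As one illustration of an editor network, the sketch below links two editors whenever they have both edited the same page and weights the edge by the number of pages they share. It is an assumption about how such a network could be built, not a feature of WikiRevParser: `coediting_network` is a hypothetical helper, and `histories` stands for a dictionary you would assemble yourself by calling `parse_revisions()` once per page.

```python
from collections import Counter
from itertools import combinations


def coediting_network(histories):
    """Count, for every pair of editors, how many pages both of them edited.

    `histories` maps a page title to revision data in the shape returned by
    parse_revisions(), where each revision has a "user" entry.
    """
    edges = Counter()
    for data in histories.values():
        editors = sorted({revision["user"] for revision in data.values()})
        for pair in combinations(editors, 2):
            edges[pair] += 1
    return edges


# Hypothetical revision data for two pages.
histories = {
    "Marie Curie": {"t2": {"user": "A"}, "t1": {"user": "B"}},
    "Pierre Curie": {"t3": {"user": "C"}, "t2": {"user": "B"}, "t1": {"user": "A"}},
}

network = coediting_network(histories)
print(network[("A", "B")])  # 2
print(network[("A", "C")])  # 1
```

The resulting counter can be passed on to a graph library of your choice as a weighted edge list.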

Installation

To install WikiRevParser, you can clone the repository on GitHub or simply run:

$ pip install wikirevparser

Requirements

WikiRevParser requires Python 3.

You'll also need a few common Python libraries as well as our Wikipedia API wrapper (forked from Wikipedia by @goldsmith), which extracts and returns the entire revision history of a Wikipedia page.

Run the following to install all requirements needed:

$ python3 -m pip install -r requirements.txt
$ git clone git@github.com:ajoer/Wikipedia.git

The first command installs all requirements specified in the requirements.txt file, and the second command clones our version of the Wikipedia API wrapper needed for revision history extraction.

Documentation

Read the docs at readthedocs.io

License

This work is MIT licensed. See the LICENSE file for full details.

Credits

  • @goldsmith for the Python Wikipedia API wrapper Wikipedia.
  • The Wikimedia Foundation and all Wikipedians for creating and maintaining the data.
  • This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 812997.

