
Wikipedia revision history parser for Python

Project description

WikiRevParser

WikiRevParser is a Python library that parses Wikipedia revision histories. It allows you to analyse the development of pages on Wikipedia over time and across language versions.

The library takes a language code and a Wikipedia page title as input, extracts the revision history, and parses the noisy, unstructured content into clean, accessible data for each timestamp in the revision history. You can use it to trace how a page's references develop, analyse the content or images over time, compare tables of contents across languages, create editor networks, and much more.

Get Started

Besides the WikiRevParser, you'll need our version of the Wikipedia API wrapper, which extracts and returns the entire revision history of a Wikipedia page. Note that Python 3 is required.

$ pip3 install wikirevparser
$ git clone git@github.com:ajoer/Wikipedia.git

Example

To get the revision history for the page on Marie Curie on the English Wikipedia, run:

>>> from wikirevparser import wikirevparser
>>> parser_instance = wikirevparser.ProcessRevisions("en", "Marie Curie")
>>> parser_instance.wikipedia_page()          # fetch the page and its full revision history
>>> data = parser_instance.parse_revisions()  # parse each revision into structured data

Now you have the revisions of the Marie Curie page in a structured dictionary format, and you can start exploring the data.
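To get a feel for the structure, peek at a single entry: the keys are revision timestamps, and each value is a dictionary of parsed fields (such as the links used below). A minimal sketch, where the exact field names depend on the parser output:

>>> timestamp, fields = next(iter(data.items()))
>>> print(timestamp)
>>> print(sorted(fields.keys()))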

Let's look at the use of links. Are the links on the page the same now as when the page was first created?

>>> edits = list(data.items())   # ordered from the latest revision to the first
>>> first_links = edits[-1][1]["links"]
>>> latest_links = edits[0][1]["links"]
>>> present_now = first_links[0] in latest_links
>>> print("The only link in the first version was '%s'.\nThat link is still present in the current version: %s." % (first_links[0], present_now))
The only link in the first version was 'pierre and marie curie'.
That link is still present in the current version: False.
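For the full picture of what changed, a set difference over the same variables shows which links were added and which were removed between the first and the latest revision:

>>> added = set(latest_links) - set(first_links)
>>> removed = set(first_links) - set(latest_links)
>>> print("%d links added, %d links removed since the first version." % (len(added), len(removed)))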

Okay, but what are the most frequent links on the page now?

>>> from collections import Counter
>>> links = Counter()
>>> for l in latest_links:
...     links[l] += 1
>>> print(links)
Counter({'polonium': 5, 'radium': 5, 'university of paris': 5, 'russian empire': 4, 'gabriel lippmann': 4, 'nobel prize in physics': 4, 'nobel prize in chemistry': 4, ... })
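Since latest_links is a plain list, the same tally can be built in one call, and Counter.most_common returns the top entries directly:

>>> links = Counter(latest_links)
>>> links.most_common(5)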

With the parsed revision history, you could also get answers to questions like these (the first one is sketched right after the list):

  • When was the 'pierre and marie curie' link deleted?
  • Who made that edit?
  • Did that editor also edit the Afrikaans page on Marie Curie?
  • What are the most referenced sources on the page?
  • Which references are used on both the English page and the Arabic one?
  • How many Wikipedians have edited the English page? And the Dutch page?
  • Do all language versions use the same image of Marie Curie as the top image?
  • Where are the Wikipedians located?
  • How frequently is the page edited?
  • Has the English page developed consistently or did editing intensify at one point?
  • How does the editing pattern of the English page match that of the Korean page?
  • ... and many other questions
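As a sketch of the first question, the snippet below walks the revisions from oldest to newest and reports the first timestamp at which the 'pierre and marie curie' link no longer appears. It only uses the timestamp keys and the "links" field from the examples above; the editor-related questions would additionally use per-revision contributor data, whose exact field names depend on the parser output and are not shown here.

>>> target = "pierre and marie curie"
>>> previous_had_link = False
>>> for timestamp, fields in reversed(edits):   # oldest revision first
...     has_link = target in fields["links"]
...     if previous_had_link and not has_link:
...         print("The link was removed in the revision of %s." % timestamp)
...         break
...     previous_had_link = has_link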

Read the documentation for more inspiration and functionality, and check the FAQ or file a bug if you run into issues!

Documentation

Read the docs at wikirevparser.readthedocs.io for more details and use case examples.

License

This work is MIT licensed. See the LICENSE file for full details.

Credits

  • @goldsmith for the Python Wikipedia API wrapper Wikipedia.
  • The Wikimedia Foundation and all Wikipedians for creating and maintaining the data.
  • This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 812997.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikirevparser-0.0.8.tar.gz (6.1 kB)

Uploaded Source

Built Distribution

wikirevparser-0.0.8-py3-none-any.whl (7.1 kB)

Uploaded Python 3

File details

Details for the file wikirevparser-0.0.8.tar.gz.

File metadata

  • Download URL: wikirevparser-0.0.8.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.1 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for wikirevparser-0.0.8.tar.gz
Algorithm Hash digest
SHA256 78adc92496322511a49f39f477e47a0d26ed5735a16940bda7d130010b2b877e
MD5 b197c2905027f63fef7ee17f6e96f786
BLAKE2b-256 5a1288da2c4b52949cdf0f1a8c2df8bcc960d0f46d9528476cfe2933adc1f146

See more details on using hashes here.
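If you want to verify a download against these hashes yourself, a minimal sketch with Python's hashlib (assuming the sdist has been downloaded to the working directory):

>>> import hashlib
>>> expected = "78adc92496322511a49f39f477e47a0d26ed5735a16940bda7d130010b2b877e"
>>> with open("wikirevparser-0.0.8.tar.gz", "rb") as f:
...     digest = hashlib.sha256(f.read()).hexdigest()
...
>>> print(digest == expected)   # True if the file is intact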

File details

Details for the file wikirevparser-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: wikirevparser-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.1 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for wikirevparser-0.0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 91ec9f2e389f3f4924acebfac47396c6d34b1bbea20ca0b121b2070edd5ececa
MD5 394b7500aa3c03ef16d5807d35894ada
BLAKE2b-256 468adcadb40288b3138ba11770296d43ae8c2cf87e034dd02a45d1f45dbcefb9

See more details on using hashes here.
