
A Python tool to pull the complete edit history of a Wikipedia page

Project description

Wikipedia Histories


A tool to pull the complete revision history of a Wikipedia page.

Installation

To install Wikipedia Histories, simply run:

$ pip install wikipedia-histories

Wikipedia Histories is compatible with Python 3.6+.

Usage

The module collects the revision history and metadata of a Wikipedia page as a convenient list of revision objects, which can be converted into a DataFrame. Each revision also includes the article quality rating at that point in time.

>>> import wikipedia_histories

# Generate a list of revisions for a specified page
>>> golden_swallow = wikipedia_histories.get_history('Golden swallow')

# Show the revision IDs for every edit
>>> golden_swallow
# [130805848, 162259515, 167233740, 195388442, ...

# Show the user who made a specific edit
>>> golden_swallow[16].user
# u'Snowmanradio'

# Show the text at the time of a specific edit
>>> golden_swallow[16].content
# u'The Golden Swallow (Tachycineta euchrysea) is a swallow.  The Golden Swallow formerly'...
>>> golden_swallow[200].content
# u'The golden swallow (Tachycineta euchrysea) is a passerine in the swallow family'...

# Get the article rating at the time of the edit
>>> ratings = [revision.rating for revision in golden_swallow]
>>> ratings
# ['NA', 'NA', 'NA', 'NA', 'stub', 'stub', ...

# Get the time of each edit as a datetime object
>>> times = [revision.time for revision in golden_swallow]
>>> times
# [datetime.datetime(2007, 5, 14, 16, 15, 31), datetime.datetime(2007, 10, 4, 15, 36, 29), ...

# Generate a dataframe with text and metadata from the list of revisions
>>> df = wikipedia_histories.to_df(golden_swallow)
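
As a quick follow-up, the attributes shown above can be used directly for simple analysis. For example, counting how many edits each user made (a small sketch, not part of the documented API):

>>> from collections import Counter

# Count the number of edits made by each user
>>> edit_counts = Counter(revision.user for revision in golden_swallow)
>>> edit_counts.most_common(3)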

Additional metadata for the article is also available on each revision object. An example of this workflow is available in tests/demo.py.

Domain level analysis

This module also contains functionality for advanced analysis of large sets of Wikipedia articles by generating social networks based on the editors who edited each article. This functionality can be enabled by installing the optional dependencies:

$ pip install wikipedia_histories[networks]

The toolkit is available at wikipedia_histories.networks.analyze_networks and wikipedia_histories.networks.network_builder.

First, a domain is defined as a dictionary or JSON file, where keys are domain names and values are lists of categories that represent that domain. For example, a set of domains representing "culture" and "politics":

{
  "culture": [
      "Category:Television_in_the_United_States",
      "Category:American_films",
      "Category:American_novels"
   ],
   "politics": [
      "Category:Conservatism",
      "Category:Liberalism"
   ]
}

An example of this format is available in examples/domains.json.

The articles represented by those domains, up to a certain depth of nested categories, can be collected and saved as a CSV, with the category and domain attributes attached, using wikipedia_histories.networks.get_category_articles.find_articles(). Once this set of articles is collected, the articles themselves can be downloaded using wikipedia_histories.get_history(), either with or without revision text. The resulting set of articles can be used to analyze Wikipedia revision behavior across categories or domains.
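
As an illustrative sketch of that collection step (the keyword names passed to find_articles() below are assumptions, not the documented signature; see examples/collect_articles.py for the authoritative usage):

import json

import wikipedia_histories
from wikipedia_histories.networks import get_category_articles

# Load a domain definition like the one shown above
with open("examples/domains.json") as f:
    domains = json.load(f)

# Collect the articles for each domain and save them as a CSV
# (domains=, depth=, and output_path= are assumed parameter names)
articles = get_category_articles.find_articles(
    domains=domains,
    depth=2,
    output_path="articles_metadata.csv",
)

# Download the revision history of each article, assuming find_articles()
# returns an iterable of article titles (the real return type may differ)
histories = [wikipedia_histories.get_history(title) for title in articles]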

Once a set of articles is downloaded using this methodology, it's possible to collect aggregate metadata for those articles, including the number of unique editors, the average number of words added and deleted per edit, the article age, and the total number of edits, and to save that information in a DataFrame using wikipedia_histories.get_metadata().
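
A minimal sketch of that step, assuming get_metadata() accepts a path to the folder of downloaded articles (the argument is an assumption; see examples/collect_articles.py for the real call):

import wikipedia_histories

# Aggregate per-article metadata: unique editors, average words added and
# deleted per edit, article age, and total number of edits
metadata = wikipedia_histories.get_metadata("articles/")

# The result is a DataFrame, so it can be saved directly
metadata.to_csv("metadata.csv", index=False)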

An example of this workflow is available in examples/collect_articles.py.

Social network analysis

It is also possible to build and analyze the networks of users who edited those articles, and to study how domains relate to one another. For this analysis, a set of articles representing categorical domains must first be downloaded as described above and saved to folders organized by domain, and the metadata sheet must be saved as well.

Once this is set up, a set of networks representing connections within a domain or between domains can be generated. A domain may be passed as input to specify which domain should be used to build the networks; if no domain is passed, the generated networks represent connections between categories from different domains.

In each network created, nodes represent articles and weighted edges represent the number of common editors between two articles. The function wikipedia_histories.networks.network_builder.generate_networks() generates a specified number of networks, each with a specified number of nodes. Because the networks are generated by sampling from the downloaded articles, generating many networks amounts to bootstrapping the dataset.

The function call:

networks = wikipedia_histories.networks.network_builder.generate_networks(
    count=1000,
    size=300,
    domain=domain,
    metadata_path=metadata_path,
    articles_path=articles_path,
)

would generate 1000 networks, each with 300 nodes, or 150 nodes from each selected category. If the domain input is None, the two selected categories are drawn from different domains. The metadata_path parameter is the path to the metadata sheet generated by find_articles(), and articles_path is the path to the articles downloaded based on that metadata.

The function returns a list of NetworkX objects. Networks can also be written to disk as .graphml files by setting the write parameter to True and passing an output_folder (note that writing the networks to disk is required for the analysis step).
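
For example, the earlier call with writing enabled might look like the following (the write and output_folder parameters come from the description above; domain=None and the folder name are illustrative):

networks = wikipedia_histories.networks.network_builder.generate_networks(
    count=1000,
    size=300,
    domain=None,                # no domain given: categories come from different domains
    metadata_path=metadata_path,
    articles_path=articles_path,
    write=True,                 # write each network to disk as a .graphml file
    output_folder="networks/",  # illustrative destination folder
)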

Once generated, the networks can be analyzed with the get_network_metadata() function, which returns a DataFrame containing, for each network, purity scores based on the detected Louvain communities and assortativity scores based on the categories represented in the network.
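
A minimal sketch of the analysis step, assuming get_network_metadata() takes the folder the networks were written to (see examples/collect_networks.py for the authoritative usage):

from wikipedia_histories.networks import analyze_networks

# Purity scores (based on detected Louvain communities) and assortativity
# scores for each generated network, returned as a DataFrame
network_metadata = analyze_networks.get_network_metadata("networks/")
print(network_metadata.head())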

An example of this workflow is available in examples/collect_networks.py.

Notes

This package was used for a paper published by the McGill .txtlab: https://txtlab.org/2020/09/do-wikipedia-editors-specialize/.
