A Python tool to pull the complete edit history of a Wikipedia page
Project description
Wikipedia Histories
A tool to pull the complete revision history of a Wikipedia page.
Installation
To install Wikipedia Histories, simply run:
$ pip install wikipedia-histories
Wikipedia Histories is compatible with Python 3.6+.
Usage
The module has basic functionality which allows it to be used to collect the revision history and metadata from a Wikipedia page in a convenient list of objects, which can be converted into a DataFrame. This also includes the article quality from every revision.
>>> import wikipedia_histories
# Generate a list of revisions for a specified page
>>> golden_swallow = wikipedia_histories.get_history('Golden swallow')
# Show the revision IDs for every edit
>>> golden_swallow
# [130805848, 162259515, 167233740, 195388442, ...
# Show the user who made a specific edit
>>> golden_swallow[16].user
# u'Snowmanradio'
# Show the text of at the time of a specific edit
>>> golden_swallow[16].content
# u'The Golden Swallow (Tachycineta euchrysea) is a swallow. The Golden Swallow formerly'...
>>> golden_swallow[200].content
# u'The golden swallow (Tachycineta euchrysea) is a passerine in the swallow family'...
# Get the article rating at the time of the edit
>>> ratings = [revision.rating for revision in golden_swallow]
>>> ratings
# ['NA', 'NA', 'NA', 'NA', 'stub', 'stub', ...
# Get the time of each edit as a datetime object
>>> times = [revision.time for revision in golden_swallow]
>>> times
# [datetime.datetime(2007, 5, 14, 16, 15, 31), datetime.datetime(2007, 10, 4, 15, 36, 29), ...
# Generate a dataframe with text and metadata from a the list of revisions
>>> df = wikipedia_histories.to_df(golden_swallow)
Additional metadata for the article, including
An example of this workflow is available in tests/demo.py
.
Domain level analysis
This module also contains functionality for advanced analysis of large sets of Wikipedia articles by generation social networks based on the editors who edited an article. This functionality can be utilized by installing:
pip install wikipedia_histories[networks]
The toolkit is available at wikipedia_histories.networks.analyze_networks
and wikipedia_histories.networks.network_builder
.
First, a domain is defined as a dictionary
or json
, where keys are domain names and values are lists of categories which represent that domain. For example, a set of domains representing "culture" and "politics":
{
"culture": [
"Category:Television_in_the_United_States",
"Category:American_films",
"Category:American_novels"
],
"politics": [
"Category:Conservatism",
"Category:Liberalism"
]
}
An example of this format is available in examples/domains.json
.
The articles represented by those domains, up to a certain depth of nested categories, can be collected and saved as a csv
, with the category and domain attributes attached using wikipedia_histories.networks.get_category_articles.find_articles()
. Once this set of articles is collected, the articles themselves can be downloaded using wikipedia_histories.get_history()
either with revision text or without. This set of articles can be used for analysis on Wikipedia revision behavior across categories or domains.
Once a set of articles is downloaded using this methodology, it's possible to collect aggregate metadata for those articles, including the number of unique editors, average added words per edit and average deleted words per edit, the article age, and the total number of edits, and save that information into a DataFrame using wikipedia_histories.get_metadata()
.
An example of this workflow is available in examples/collect_articles.py
.
Social network analysis
It is also possible to build and analyze the networks of users who edited those articles, and study how domains relate to one another. For this analysis, first a set of articles representing categorical domains must be downloaded using and saved to folders representing domains and the metadata sheet must be saved.
Once this is set up, a set of networks representing connections within a domain or between domains can be generated. A domain
is passed as input to signify which domain should be used to build the networks, if no domain
is passed as input the networks generated will represent connections between categories from different domains.
In each network created, nodes represent articles and weighted edges represent the number of common editors between two articles. The function wikipedia_histories.networks.network_builder.generate_networks()
allows generation of a certain number of networks with a specific number of nodes and a specific count--because they are generated by sampling from the downloaded articles, generating many networks represents bootstrapping of the dataset.
The function call:
networks = wikipedia_histories.networks.network_builder.generate_networks(
count=1000,
size=300,
domain=domain,
metadata_path=metadata_path,
articles_path=articles_path,
)
would generate 1000 networks, each with 300 nodes, or 150 nodes from each selected category. Because the category input is None
, the two selected categories would be from different domains. The metadata_path
parameter is a path to the metadata sheet generated by the find_articles()
function and the articles_path
parameter is a path to the articles downloaded based on the find_articles()
metadata.
The function returns a list of NetworkX
objects. Networks can be written to the disk as .graphml
files as part of the function by toggling the write
parameter to True
and passing an output_folder
(note that this aspect is necessary for analysis).
Once generated, the networks can be analyzed using the get_network_metadata()
function, which returns a DataFrame containing purity scores based on Louvain communities detected and assortativity scores for each network based on the categories represented by the networks.
An example of this workflow is available in examples/collect_networks.py
.
Wikipedia Histories is compatible with Python 3.6+.
Notes
This package was used for a paper published by the McGill .txtlab: https://txtlab.org/2020/09/do-wikipedia-editors-specialize/.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for wikipedia_histories-1.1.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | c3c5dfb6b5fe5a2b5a1473dc309493ea29ff99805ec4637410bd40398394bd8b |
|
MD5 | e09d95c276791ef2ceb7c029052367b0 |
|
BLAKE2b-256 | f15dcc1201105ad0e975419dce674debd556485c1de1ee58dacaf56ad5df72fd |
Hashes for wikipedia_histories-1.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6bb5798d7280a7e5ebb06185867691010883fd2ee0340d726513a8fb93fc4bb8 |
|
MD5 | 8fe7b32a3362546207c2c1b1882c4316 |
|
BLAKE2b-256 | b9ae9f0d322c3da503d3bd8c623caf331fd956da3ba99e588c46c6f87eacd15b |