Python based Wikidata framework for easy dataframe extraction

Development Status
- 5 - Production/Stable
Intended Audience
License
- OSI Approved :: BSD License
Operating System
- OS Independent
Programming Language


wikirepo is a Python package that provides a framework to easily source and leverage standardized Wikidata information. The goal is to create an intuitive interface so that Wikidata can function as a common read-write repository for public statistics.

See the documentation for a full outline of the package including usage and available data.

Installation
Data
- Query Data
- Upload Data (WIP)
Maps (WIP)
Examples
To-Do

Installation `⇧`

wikirepo can be downloaded from PyPI via pip or sourced directly from this repository:

pip install wikirepo

git clone https://github.com/andrewtavis/wikirepo.git
cd wikirepo
python setup.py install

import wikirepo

Data `⇧`

wikirepo's data structure is built around Wikidata.org. Human-readable access to Wikidata statistics is achieved by converting requests into Wikidata's item identifiers (QIDs) and property identifiers (PIDs), with the Python package wikidata serving as the basis for data loading and indexing. See the documentation for a structured overview of the currently available properties.
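
To make the identifier scheme concrete, the following is a minimal sketch of telling QIDs, PIDs and plain labels apart. The helper classify_wikidata_id is hypothetical and not part of wikirepo; Q183 is the QID for Germany used elsewhere in this readme, and P1082 is Wikidata's population property.

```python
import re

# Hypothetical helper for illustration; not part of wikirepo's API.
_ID_PATTERN = re.compile(r"^([QP])([1-9]\d*)$")


def classify_wikidata_id(identifier: str) -> str:
    """Return 'item' for a QID, 'property' for a PID, else 'label'."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        return "label"  # e.g. a human-readable name such as "Germany"
    return "item" if match.group(1) == "Q" else "property"


print(classify_wikidata_id("Q183"))     # item
print(classify_wikidata_id("P1082"))    # property
print(classify_wikidata_id("Germany"))  # label
```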

Query Data `⇧`

wikirepo's main access function, wikirepo.data.query, returns a pandas.DataFrame of locations and property data across time.

Each query needs the following inputs:

locations: the locations that data should be queried for
- Strings are accepted for Earth, continents, and countries
- Get all country names with wikirepo.data.incl_lctn_lbls(lctn_lvls='country')
- The user can also pass Wikidata QIDs directly
depth: the geographic level of the given locations to query
- A depth of 0 is the locations themselves
- Greater depths correspond to lower geographic levels (states of countries, etc.)
- A dictionary of locations is generated for lower depths (see second example below)
timespan: start and end datetime.date objects defining the period that data should come from
- If not provided, the most recent data is retrieved and annotated with the date it is from
interval: yearly, monthly, weekly, or daily as strings
Further arguments: the names of modules in wikirepo/data directories
- Each is passed to the argument that corresponds to its directory (for example, modules in wikirepo/data/economic are passed to economic_props)
- Data will be queried for these properties for the given locations, depth, timespan and interval, with results merged as dataframe columns
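
As an illustration of how timespan and interval interact, the sketch below lists the yearly points covered by a timespan. yearly_points is a hypothetical helper, not a wikirepo function, and the newest-first ordering is an assumption based on the example output shown in this readme.

```python
from datetime import date


def yearly_points(timespan):
    """List the years covered by a (start, end) pair of dates, newest first."""
    start, end = sorted(timespan)
    return list(range(end.year, start.year - 1, -1))


timespan = (date(2009, 1, 1), date(2010, 1, 1))
print(yearly_points(timespan))  # [2010, 2009]
```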

Queries can also access information in Wikidata sub-pages for locations. For example, if the inflation rate is not found on a location's main page, wikirepo checks the location's economic topic page, because inflation_rate.py is found in wikirepo/data/economic (see Germany and economy of Germany).
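
The fallback idea can be sketched with plain dictionaries. Everything below is a toy stand-in: the page contents are invented, and the names MAIN_PAGE, TOPIC_PAGES, PROP_DIRECTORY and lookup are hypothetical rather than wikirepo internals.

```python
from typing import Optional

# Toy stand-ins for Wikidata pages; the values are made up for illustration.
MAIN_PAGE = {"population": 100}
TOPIC_PAGES = {"economic": {"inflation_rate": 1.5}}

# Hypothetical mapping from a property module to its directory under wikirepo/data.
PROP_DIRECTORY = {"population": "demographic", "inflation_rate": "economic"}


def lookup(prop: str) -> Optional[float]:
    """Check the main page first, then the topic page for the property's directory."""
    if prop in MAIN_PAGE:
        return MAIN_PAGE[prop]
    topic_page = TOPIC_PAGES.get(PROP_DIRECTORY.get(prop, ""), {})
    return topic_page.get(prop)


print(lookup("population"))      # found on the main page
print(lookup("inflation_rate"))  # found on the economic topic page
print(lookup("area"))            # not found anywhere: None
```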

wikirepo further provides a unique dictionary class, EntitiesDict, that stores all loaded Wikidata entities during a query. This speeds up data retrieval, as entities are loaded once and then accessed in the EntitiesDict object for any other needed properties.
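
The caching behavior described here can be sketched as a dictionary that loads each missing entity once. This is an illustration of the general pattern only, not the actual EntitiesDict implementation; CachingEntities and fake_loader are hypothetical names.

```python
class CachingEntities(dict):
    """Dictionary that loads a missing entity once and reuses it afterwards."""

    def __init__(self, loader):
        super().__init__()
        self._loader = loader
        self.load_count = 0  # number of times the loader was actually called

    def __missing__(self, qid):
        # Called by dict lookup only when the key is absent.
        self.load_count += 1
        entity = self._loader(qid)
        self[qid] = entity
        return entity


def fake_loader(qid):
    # Stand-in for a network request to Wikidata.
    return {"qid": qid}


entities = CachingEntities(fake_loader)
entities["Q183"]  # triggers a load
entities["Q183"]  # served from the cache
print(entities.load_count)  # 1
```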

Examples of wikirepo.data.query follow:

Querying Information for Given Countries

import wikirepo
from wikirepo.data import wd_utils
from datetime import date

ents_dict = wd_utils.EntitiesDict()
# Strings must match their Wikidata English page names
countries = ["Germany", "United States of America", "People's Republic of China"]
# countries = ["Q183", "Q30", "Q148"] # we could also pass QIDs
# countries = wikirepo.data.incl_lctn_lbls(lctn_lvls='country') # or all countries
depth = 0
timespan = (date(2009, 1, 1), date(2010, 1, 1))
interval = "yearly"

df = wikirepo.data.query(
    ents_dict=ents_dict,
    locations=countries,
    depth=depth,
    timespan=timespan,
    interval=interval,
    climate_props=None,
    demographic_props=["population", "life_expectancy"],
    economic_props="median_income",
    electoral_poll_props=None,
    electoral_result_props=None,
    geographic_props=None,
    institutional_props="human_dev_idx",
    political_props="executive",
    misc_props=None,
    verbose=True,
)

col_order = [
    "location",
    "qid",
    "year",
    "executive",
    "population",
    "life_exp",
    "human_dev_idx",
    "median_income",
]
df = df[col_order]

df.head(6)

location	qid	year	executive	population	life_exp	human_dev_idx	median_income
Germany	Q183	2010	Angela Merkel	8.1752e+07	79.9878	0.921	33333
Germany	Q183	2009	Angela Merkel	nan	79.8366	0.917	nan
United States of America	Q30	2010	Barack Obama	3.08746e+08	78.5415	0.914	43585
United States of America	Q30	2009	George W. Bush	nan	78.3902	0.91	nan
People's Republic of China	Q148	2010	Wen Jiabao	1.35976e+09	75.236	0.706	nan
People's Republic of China	Q148	2009	Wen Jiabao	nan	75.032	0.694	nan

Querying Information for all US Counties

# Note: >3000 regions, expect a 45 minute runtime
import wikirepo
from wikirepo.data import lctn_utils, wd_utils
from datetime import date

ents_dict = wd_utils.EntitiesDict()
country = "United States of America"
# country = "Q30" # we could also pass its QID
depth = 2  # 2 for counties, 1 for states and territories
sub_lctns = True  # for all
# Only valid sub-locations given the timespan will be queried
timespan = (date(2016, 1, 1), date(2018, 1, 1))
interval = "yearly"

us_counties_dict = lctn_utils.gen_lctns_dict(
    ents_dict=ents_dict,
    locations=country,
    depth=depth,
    sub_lctns=sub_lctns,
    timespan=timespan,
    interval=interval,
    verbose=True,
)

df = wikirepo.data.query(
    ents_dict=ents_dict,
    locations=us_counties_dict,
    depth=depth,
    timespan=timespan,
    interval=interval,
    climate_props=None,
    demographic_props="population",
    economic_props=None,
    electoral_poll_props=None,
    electoral_result_props=None,
    geographic_props="area",
    institutional_props="capital",
    political_props=None,
    misc_props=None,
    verbose=True,
)

df[df["population"].notnull()].head(6)

location	sub_lctn	sub_sub_lctn	qid	year	population	area_km2	capital
United States of America	California	Alameda County	Q107146	2018	1.6602e+06	2127	Oakland
United States of America	California	Contra Costa County	Q108058	2018	1.14936e+06	2078	Martinez
United States of America	California	Marin County	Q108117	2018	263886	2145	San Rafael
United States of America	California	Napa County	Q108137	2018	141294	2042	Napa
United States of America	California	San Mateo County	Q108101	2018	774155	1919	Redwood City
United States of America	California	Santa Clara County	Q110739	2018	1.9566e+06	3377	San Jose
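
The sub_lctn and sub_sub_lctn columns reflect the nesting of the generated locations dictionary. The sketch below flattens a small nested dictionary into such rows; the layout used here (names as keys, empty dictionaries as leaves) is an assumption for illustration and may differ from what gen_lctns_dict actually returns.

```python
def flatten_locations(nested, path=()):
    """Yield one tuple of location names per leaf of a nested dictionary."""
    for name, children in nested.items():
        current = path + (name,)
        if children:
            yield from flatten_locations(children, current)
        else:
            yield current


# Assumed layout: country -> state -> county, with empty dicts as leaves.
us_subset = {
    "United States of America": {
        "California": {"Alameda County": {}, "Marin County": {}},
    }
}

rows = list(flatten_locations(us_subset))
for row in rows:
    print(row)
```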

Upload Data (WIP) `⇧`

wikirepo.data.upload will be the core of the eventual wikirepo upload feature. The goal is to record edits that a user makes to a previously queried or baseline dataframe such that these changes can then be pushed back to Wikidata. With the addition of Wikidata login credentials as a wikirepo feature (WIP), the unique information in the edited dataframe could then be uploaded to Wikidata for all to use.

The same process used to query information from Wikidata could be reversed for uploads. Dataframe columns could be linked to their corresponding Wikidata properties. Whether a time qualifier is a point in time or a span defined by start time and end time could be derived from the variables defined in each module's header, and other qualifiers needed for proper data indexing could also be included. Source information could be added in columns that correspond to the given property edits.

Pseudocode for how this process could function follows:

In the first example, changes are made to a df.copy() of a queried dataframe. pandas is then used to compare the new and original dataframes after the user has added information that they have access to.

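import pandas as pd

import wikirepo
from wikirepo.data import lctn_utils, wd_utils
from datetime import date

# Proposed (not yet implemented): Wikidata login via wikirepo
credentials = wd_utils.login()

ents_dict = wd_utils.EntitiesDict()
country = "Country Name"
depth = 2
sub_lctns = True
timespan = (date(2000,1,1), date(2018,1,1))
interval = 'yearly'

lctns_dict = lctn_utils.gen_lctns_dict()  # arguments omitted in this pseudocode

df = wikirepo.data.query()  # arguments omitted in this pseudocode
df_copy = df.copy()

# The user checks df_copy for NaNs and adds data

# Rows that differ between the two dataframes; keep=False retains both
# the original and the edited version of each changed row
df_edits = pd.concat([df, df_copy]).drop_duplicates(keep=False)

# Proposed (not yet implemented): push the edits to Wikidata
wikirepo.data.upload(df_edits, credentials)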
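
The comparison step can be tried on a toy dataframe without wikirepo. In this sketch the data is invented, and the mask-based variant is one possible way to keep only the edited rows; it is an alternative offered for illustration, not the method wikirepo prescribes. Note that the concat pattern returns both the original and the edited version of a changed row.

```python
import pandas as pd

# Invented data standing in for a queried dataframe.
original = pd.DataFrame(
    {
        "location": ["A", "A", "B"],
        "year": [2009, 2010, 2010],
        "population": [float("nan"), 100.0, 200.0],
    }
)
edited = original.copy()
edited.loc[0, "population"] = 95.0  # the user fills in a missing value

# Pattern from the pseudocode: keeps the old and the new version of each changed row.
both_versions = pd.concat([original, edited]).drop_duplicates(keep=False)

# Alternative: keep only the edited rows, treating NaN == NaN as unchanged.
same = (original == edited) | (original.isna() & edited.isna())
df_edits = edited[~same.all(axis=1)]

print(len(both_versions))  # 2
print(df_edits)
```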

In the next example data.data_utils.gen_base_df is used to create a dataframe with dimensions that match a time series that the user has access to. The data is then added to the column that corresponds to the property to which it should be added. Source information could further be added via a structured dictionary generated for the user.

import wikirepo
from wikirepo.data import data_utils, wd_utils
from datetime import date

credentials = wd_utils.login()

locations = "Country Name"
depth = 0
# The user defines the time parameters based on their data
timespan = (date(1995,1,2), date(2010,1,3)) # (first Monday, last Sunday)
interval = 'weekly'

base_df = data_utils.gen_base_df()
base_df['data'] = data_for_matching_time_series

source_data = wd_utils.gen_source_dict('Source Information')
base_df['data_source'] = [source_data] * len(base_df)

wikirepo.data.upload(base_df, credentials)
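
For a weekly series such as the one in this pseudocode, the dates of a base table could be generated as follows. weekly_dates is a hypothetical helper written with the standard library only; it is not gen_base_df and makes no claim about that function's output.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Return the start date of each week from start up to and including end."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(weeks=1)
    return dates


# First four weeks of 1995, each starting on a Monday.
weeks = weekly_dates(date(1995, 1, 2), date(1995, 1, 29))
print([d.isoformat() for d in weeks])
```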

Put simply, a full-featured wikirepo.data.upload function would realize the potential of a single read-write repository for all public information.

Maps (WIP) `⇧`

wikirepo/maps is a further goal of the project, as it combines wikirepo's focus on easy-to-access open source data with quick, high-level analytics.

• Query Maps

As in wikirepo.data.query, passing the locations, depth, timespan and interval arguments could access GeoJSON files stored on Wikidata, thus providing mapping files in parallel to the user's data. These files could then be leveraged using existing Python plotting libraries to provide detailed presentations of geographic analysis.
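
GeoJSON is plain JSON, so a file retrieved this way could be inspected with the standard library before it is handed to a plotting library. The feature below is a made-up placeholder, not data from Wikidata.

```python
import json

# Placeholder GeoJSON; the name and coordinates are invented for illustration.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Example Region"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)
for feature in collection["features"]:
    # The first ring of a polygon is its outer boundary.
    ring = feature["geometry"]["coordinates"][0]
    print(feature["properties"]["name"], len(ring))
```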

• Upload Maps

Similar to adding statistics through wikirepo.data.upload, GeoJSON map files could also be uploaded to Wikidata using appropriate arguments. Maps indexed by locations, depth, timespan and interval would allow wikirepo users to get the exact mapping file that they need for a given task.

Examples `⇧`

wikirepo can be used as a foundation for countless projects, with its usefulness and practicality only improving as more properties are added and more data is uploaded to Wikidata.

Current usage examples include:

Sample notebooks for the Python package poli-sci-kit show how to use wikirepo as a basis for political election and parliamentary appointment analysis, with those notebooks being found in the examples for poli-sci-kit or on Google Colab
Pull requests with other examples will gladly be accepted

To-Do `⇧`

Please see the contribution guidelines if you are interested in contributing to this project. Work that is in progress or could be implemented includes:

Expanding wikirepo

Creating an outline of the package's structure for the readme (see issue)
Integrating current Python tools with wikirepo structures for uploads to Wikidata
Adding a query of property descriptions to data.data_utils.incl_dir_idxs (see issue)
Adding multiprocessing support to the wikirepo.data.query process and data.lctn_utils.gen_lctns_dict
Potentially converting wikirepo.data.query and data.lctn_utils.gen_lctns_dict over to generated Wikidata SPARQL queries
Optimizing wikirepo.data.query:
- Potentially converting EntitiesDict and LocationsDict to slotted object classes for memory savings
- Deriving and optimizing other slow parts of the query process
Adding access to GeoJSON files for mapping via wikirepo.maps.query
Designing and adding GeoJSON files indexed by time properties to Wikidata
Creating, improving and sharing examples
Improving tests for greater code coverage
Improving code quality by refactoring large functions and checking conventions
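
For the SPARQL item in the list above, a generated query might look like the sketch below. build_population_query is a hypothetical helper, not part of wikirepo; it assumes the Wikidata Query Service prefixes wd: and wdt: and the population property P1082, and it only builds the query text without sending it.

```python
def build_population_query(qids):
    """Build a SPARQL query string selecting the population of each item."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?population WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item wdt:P1082 ?population .\n"
        "}"
    )


query = build_population_query(["Q183", "Q30", "Q148"])
print(query)
```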

Expanding Wikidata

The growth of wikirepo's database relies on that of Wikidata. Through data.wd_utils.dir_to_topic_page wikirepo can access properties on location sub-pages, thus allowing for statistics on any topic to be linked to. Beyond including entries for already existing properties (see this issue), the following are examples of property types that could be added:

Climate statistics could be added to data/climate
- This would allow for easy modeling of global warming and its effects
- Planning would be needed to decide whether intervals shorter than a day are necessary or whether daily averages suffice
Properties for electoral polling and results for locations
- This would allow direct access to all needed election information in a single function call
A property that links political parties and their regions in data/political
- For easy professional presentation of electoral results (ex: loading in party hex colors, abbreviations, and alignments)
data/demographic properties such as:
- age, education, religious, and linguistic diversities across time
data/economic properties such as:
- female workforce participation, workforce industry diversity, wealth diversity, and total working age population across time
Distinct properties for Freedom House and Press Freedom indexes, as well as other descriptive metrics
- These could be added to data/institutional

Similar Projects

Python

JavaScript

Java

https://github.com/Wikidata/Wikidata-Toolkit

Release history

- 1.0.1 (this version): Jul 9, 2022
- 1.0.0: Dec 28, 2021
- 0.1.1.7: Apr 28, 2021
- 0.1.1.6: Mar 30, 2021
- 0.1.1.5: Mar 28, 2021
- 0.1.1.4: Mar 21, 2021
- 0.1.1.3: Mar 21, 2021
- 0.1.1.2: Mar 18, 2021
- 0.1.1.1: Mar 17, 2021
- 0.1.1: Mar 17, 2021
- 0.1.0: Feb 23, 2021
- 0.0.2.8: Jan 27, 2021
- 0.0.2.7: Jan 27, 2021
- 0.0.2.6: Jan 25, 2021
- 0.0.2.5: Dec 12, 2020
- 0.0.2.4: Dec 12, 2020
- 0.0.2.3: Dec 12, 2020
- 0.0.2.2: Dec 11, 2020
- 0.0.2.1: Dec 9, 2020
- 0.0.2: Dec 8, 2020
- 0.0.1: Dec 8, 2020

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

wikirepo-1.0.1-py3-none-any.whl (62.7 kB), uploaded Jul 9, 2022, Python 3

Hashes for wikirepo-1.0.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`746768393f0d165e97711a6da8a137b1f298ee8c6d6056a6e727928e9e639a50`
MD5	`9d7c086a81d112b983e0d972f92502aa`
BLAKE2b-256	`b592d5908b1aa65f7ad3618d3a26b34e988208396b16eb6c1b9ad791e7e37b31`
