
Python based ETL and ELT tools for Wikidata





Jump to: Data • Query Data • Upload Data • Maps • Query Maps • Upload Maps • To-Do

wikirepo is a Python package that provides ETL tools to easily source and leverage standardized Wikidata information. The current focus is to create an intuitive interface so that Wikidata can function as a common repository for social science statistics.

Installation via PyPI

pip install wikirepo
import wikirepo

Data

wikirepo's data structure is built around Wikidata.org. Human-readable access to Wikidata statistics is achieved by converting requests into Wikidata item identifiers (QIDs) and property identifiers (PIDs), with the Python package wikidata serving as the basis for data loading and indexing. The wikirepo community aims to work with Wikidata to derive and add needed statistics, thus playing an integral role in growing the premier free and open online knowledge base.

Query Data

wikirepo's main ETL access function, wikirepo.data.query, returns a pandas.DataFrame of locations and property data across time. wikirepo.data.query accesses data.data_utils.query_repo_dir, with desired statistics coming from the query_property functions of wikirepo/data directory modules; results are then merged across modules and directories.

The query structure streamlines not just data extraction, but also the process of adding new wikirepo properties for all to use. Adding a new property is as simple as adding a module to an appropriate wikirepo/data directory, with most data modules consisting of six defined variables and a single function call. wikirepo is self-indexing, so any property module added is accessible by wikirepo.data.query. See data/demographic/population for the general structure of data modules, and examples/add_property for a quick demo on adding new properties.
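A property module following the "six variables and a single function call" pattern might look roughly like the sketch below. All names here are illustrative rather than the package's actual identifiers; see data/demographic/population in the repository for the real structure.

```python
# Hypothetical sketch of a wikirepo/data property module
# (e.g. something like wikirepo/data/demographic/population.py).
# Variable names are illustrative, not the package's actual API.

pid = "P1082"            # Wikidata property ID ("population")
sub_pid = None           # qualifier PID, if the value needs one
col_name = "population"  # dataframe column the values land in
col_prefix = None        # optional prefix for generated columns
ignore_char = ""         # characters to strip from raw values
span = False             # False: point-in-time values; True: start/end spans
```

A shared query helper would then read these variables to fetch and format the property, which is what makes new modules discoverable by wikirepo.data.query without extra wiring.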

Each query needs the following inputs:

  • locations: the locations that data should be queried for
    • Strings are accepted for Earth, continents, and countries
    • The user can also pass Wikidata QIDs directly
  • depth: the geographic level of the given locations to query
    • A depth of 0 is the locations themselves
    • Greater depths correspond to lower geographic levels (states of countries, etc.)
    • A dictionary of locations is generated for lower depths (see second example below)
  • time_lvl: yearly, monthly, weekly, or daily as strings
    • If not provided, then the most recent data will be retrieved with annotation for when it's from
  • timespan: start and end datetime.date objects to be subsetted based on time_lvl
  • Further arguments: the names of modules in wikirepo/data's directories
    • These are passed to arguments corresponding to their directories
    • Data will be queried for these properties for the given locations, depth, time_lvl and timespan, with results being merged as dataframe columns

Queries are also able to access information in Wikidata sub-pages for locations. For example: if inflation rate is not found on the location's main page, then wikirepo checks the location's economic topic page as inflation.py is found in wikirepo/data/economic (see Germany and economy of Germany).
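The fallback logic just described can be sketched as follows. Real wikirepo code works with loaded Wikidata entities and PIDs; plain dicts and a made-up "inflation_rate" key stand in for them here.

```python
# Minimal sketch of the main-page / topic-sub-page fallback described above.
# Dicts stand in for Wikidata entities; keys stand in for PIDs.

def find_prop(entity, sub_entities, prop):
    """Return a property from the main entity, else from its topic sub-pages."""
    if prop in entity:
        return entity[prop]
    for sub in sub_entities:  # e.g. the "economy of Germany" item
        if prop in sub:
            return sub[prop]
    return None

germany = {"population": 81_751_602}          # main location page
economy_of_germany = {"inflation_rate": 1.1}  # economic topic sub-page

rate = find_prop(germany, [economy_of_germany], "inflation_rate")
```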

wikirepo further provides a unique dictionary class, EntitiesDict, that stores all loaded Wikidata entities during a query. This speeds up data retrieval, as entities are loaded once and then accessed in the EntitiesDict object for any other needed properties.
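The caching idea behind EntitiesDict is simple: load each entity once, then serve repeat requests from memory. A minimal sketch (the real class in wikirepo.data.wd_utils may differ in interface):

```python
# Sketch of the EntitiesDict caching idea: a dict keyed by QID that only
# loads an entity on first access. The loader is any callable that fetches
# an entity by QID (in wikirepo, backed by the wikidata package).

class EntitiesDict(dict):
    """Cache mapping QID -> loaded Wikidata entity."""

    def get_entity(self, qid, loader):
        if qid not in self:
            self[qid] = loader(qid)  # network hit only on a cache miss
        return self[qid]

calls = []
def fake_loader(qid):
    calls.append(qid)
    return {"id": qid}

ents = EntitiesDict()
ents.get_entity("Q183", fake_loader)
ents.get_entity("Q183", fake_loader)  # second call is served from the cache
```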

Examples of wikirepo.data.query follow:

Querying information for given countries

import wikirepo
from wikirepo.data import wd_utils
from datetime import date

ents_dict = wd_utils.EntitiesDict()
countries = ["Germany", "United States of America", "People's Republic of China"]
depth = 0
time_lvl = 'yearly'
timespan = (date(2009,1,1), date(2010,1,1))

df = wikirepo.data.query(ents_dict=ents_dict, 
                         locations=countries, depth=depth,
                         time_lvl=time_lvl, timespan=timespan,
                         demographic_props=['population', 'life_expectancy'], 
                         economic_props=['nom_gdp', 'median_income'], 
                         electoral_poll_props=False, 
                         electoral_result_props=False,
                         geographic_props='continent', 
                         institutional_props='human_dev_idx',
                         political_props='executive',
                         misc_props='country_abbr',
                         verbose=True)

col_order = ['location', 'qid', 'year', 'abbr', 'continent', 'executive', 
             'population', 'life_exp', 'human_dev_idx', 'nom_gdp', 'median_income']
df = df[col_order]

df.head(6)
| location                   | qid  | year | abbr | continent              | executive      | population  | life_exp | human_dev_idx | nom_gdp     | median_income |
|----------------------------|------|------|------|------------------------|----------------|-------------|----------|---------------|-------------|---------------|
| Germany                    | Q183 | 2010 | DE   | Europe                 | Angela Merkel  | 8.1752e+07  | 79.9878  | 0.921         | 3.41709e+12 | 33333         |
| Germany                    | Q183 | 2009 | DE   | Europe                 | Angela Merkel  | nan         | 79.8366  | 0.917         | 3.41801e+12 | nan           |
| United States of America   | Q30  | 2010 | US   | North America, Oceania | Barack Obama   | 3.08746e+08 | 78.5415  | 0.914         | 1.49644e+13 | 43585         |
| United States of America   | Q30  | 2009 | US   | North America, Oceania | George W. Bush | nan         | 78.3902  | 0.91          | 1.44187e+13 | nan           |
| People's Republic of China | Q148 | 2010 | CN   | Asia                   | Wen Jiabao     | 1.35976e+09 | 75.236   | 0.706         | 6.10062e+12 | nan           |
| People's Republic of China | Q148 | 2009 | CN   | Asia                   | Wen Jiabao     | nan         | 75.032   | 0.694         | 5.10995e+12 | nan           |

Querying information for all US counties (≈3000 regions, expect an hour runtime)

import wikirepo
from wikirepo.data import lctn_utils, wd_utils
from datetime import date

ents_dict = wd_utils.EntitiesDict()
depth = 2
country = "United States of America"
sub_lctns = True # for all
time_lvl = 'yearly'
# Only valid sub-locations given the timespan will be queried
timespan = (date(2016,1,1), date(2018,1,1))

us_counties_dict = lctn_utils.gen_lctns_dict(ents_dict=ents_dict,
                                             depth=depth,
                                             locations=country, 
                                             sub_lctns=sub_lctns,
                                             time_lvl=time_lvl, 
                                             timespan=timespan,
                                             verbose=True)

df = wikirepo.data.query(ents_dict=ents_dict, 
                         locations=us_counties_dict, depth=depth,
                         time_lvl=time_lvl, timespan=timespan,
                         demographic_props='population', 
                         economic_props=False, 
                         electoral_poll_props=False, 
                         electoral_result_props=False,
                         geographic_props='area', 
                         institutional_props='capital',
                         political_props=False,
                         misc_props=False,
                         verbose=True)

df[df['population'].notnull()].head(6)
| location                 | sub_lctn   | sub_sub_lctn        | qid     | year | population  | area_km2 | capital      |
|--------------------------|------------|---------------------|---------|------|-------------|----------|--------------|
| United States of America | California | Alameda County      | Q107146 | 2018 | 1.6602e+06  | 2127     | Oakland      |
| United States of America | California | Contra Costa County | Q108058 | 2018 | 1.14936e+06 | 2078     | Martinez     |
| United States of America | California | Marin County        | Q108117 | 2018 | 263886      | 2145     | San Rafael   |
| United States of America | California | Napa County         | Q108137 | 2018 | 141294      | 2042     | Napa         |
| United States of America | California | San Mateo County    | Q108101 | 2018 | 774155      | 1919     | Redwood City |
| United States of America | California | Santa Clara County  | Q110739 | 2018 | 1.9566e+06  | 3377     | San Jose     |

Upload Data

wikirepo.data.upload will be the core of the eventual wikirepo ELT process. The goal is to record edits that a user makes to a previously queried dataframe so that these changes can be pushed back to Wikidata. This process could be as simple as making changes to a df.copy() of a queried dataframe, then using pandas to compare the new and original dataframes after the user has added information they have access to. The unique information in the edited dataframe could then be loaded into Wikidata for all to use.
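The comparison step could be done with pandas directly. A minimal sketch, assuming DataFrame.compare (pandas >= 1.1) and illustrative column names:

```python
import pandas as pd

# Sketch of the edit-detection idea described above: compare an edited copy
# of a queried dataframe against the original to find candidate uploads.

original = pd.DataFrame(
    {"qid": ["Q183", "Q30"], "population": [81_751_602.0, None]}
)

edited = original.copy()
edited.loc[1, "population"] = 308_745_538.0  # user fills in a missing value

# Cells that differ between the two frames are the candidate Wikidata edits
changes = original.compare(edited)
```

`changes` here holds one row (the filled-in cell), with the old and new values side by side, ready to be translated into Wikidata statements.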

The same process used to query information from Wikidata could be reversed for the upload process. wikirepo/data property modules could each have a corresponding upload_property function that links dataframe columns to their Wikidata properties, indicates whether time qualifiers are points in time or spans using start time and end time, and derives any other qualifiers needed for proper data indexing. Importantly, source information could also be added in columns corresponding to the given property edits (querying source columns for all data is a forthcoming feature).

Put simply: a fully featured wikirepo.data.upload function would realize the potential of a single open-source repository for all social science information.

Maps

wikirepo/maps is a further goal of the project, combining wikirepo's focus on easy-to-access open-source data with quick, high-level analytics.

Query Maps

As in wikirepo.data.query, passing the depth, locations, time_lvl and timespan arguments could access GeoJSON files stored on Wikidata, thus providing mapping files in parallel to the user's data. These files could then be leveraged using existing Python plotting libraries to provide detailed presentations of geographic analysis.
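Pairing such files with queried statistics could look like the sketch below, where the GeoJSON structure and the stats dict are illustrative stand-ins for what a maps query might return:

```python
import json

# Sketch: attach queried statistics (keyed by QID) to a GeoJSON
# FeatureCollection of the kind that could be stored on Wikidata, so any
# plotting library (e.g. folium, plotly, geopandas) can style regions by value.

geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"qid": "Q183", "name": "Germany"},
     "geometry": {"type": "Point", "coordinates": [10.0, 51.0]}}
  ]
}
""")

stats = {"Q183": {"population": 81751602}}

for feature in geojson["features"]:
    qid = feature["properties"]["qid"]
    feature["properties"].update(stats.get(qid, {}))
```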

Upload Maps

Similar to the potential of adding statistics through wikirepo.data.upload, GeoJSON map files could also be uploaded to Wikidata using appropriate arguments. The potential exists for a myriad of variable maps given depth, locations, time_lvl and timespan information that would allow all wikirepo users to get the exact mapping file that they need for their given task.

To-Do

Expanding wikirepo's data infrastructure:

The growth of wikirepo's database relies on that of Wikidata. Beyond simply adding entries to already existing properties, the following are examples of property types that could be included:

  • Those for electoral polling and results for locations
    • This would allow direct access to all needed election information in a single function call
    • This data could be added to Wikidata sub-pages for locations
  • A property that links political parties and their regions in data/political
    • For easy professional presentation of electoral results (ex: loading in party hex colors, abbreviations, and alignments)
  • data/demographic properties such as:
    • age, education, religious, and linguistic diversities across time
  • data/economic properties such as:
    • female workforce participation, workforce industry diversity, wealth diversity, and total working age population across time
  • Distinct properties for Freedom House and Press Freedom indexes, as well as other descriptive metrics

Further ways to help:

  • Integrating current Python tools with wikirepo ETL structures for ELT uploads to Wikidata
  • Adding multiprocessing support to wikirepo.data.query and data.lctn_utils.gen_lctns_dict
  • Optimizing wikirepo.data.query:
    • Potentially converting EntitiesDict and LocationsDict to slotted object classes for memory savings
    • Deriving and optimizing other slow parts of the query process
  • Adding the access of GeoJSON files for mapping via wikirepo.maps.query
    • This would realize the potential of quick informative maps across the world
  • Creating and improving examples, as well as sharing them around the web
  • Testing for wikirepo
  • A Read the Docs page

Similar Packages

Python

JavaScript

Java



