Mapping Wikidata and Wikipedia entities to each other

## Project description

This small Python library helps you map Wikipedia page titles to Wikidata IDs (e.g. Manatee to Q42797) and vice versa. This is done by creating an index of these mappings from a Wikipedia SQL dump. Precomputed indices can be found under Precomputed indices. Redirects are taken into account.

## Installation

This package can be installed via pip, the Python package manager.

pip install wikimapper

If all you want is just mapping, then you can also just download wikimapper/mapper.py and add it to your project. It does not have any external dependencies.

## Usage

Mapping requires a precomputed index. It is created from Wikipedia SQL dumps (see Create your own index) or can be downloaded for certain languages (see Precomputed indices). The examples below assume that an index has been either created or downloaded. Using the command line for batch mapping is not recommended, as it opens and closes the database on every call, which slows things down.
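
For batch mapping from Python, one approach is to create the mapper once and reuse it for every title. The sketch below is only illustrative: `map_titles` is a hypothetical helper, and `StubMapper` merely stands in for `WikiMapper` so that the snippet runs without an index file. With a real index you would pass `WikiMapper("index_enwiki-latest.db")` instead.

```python
def map_titles(mapper, titles):
    """Map each title to its Wikidata ID using a single mapper instance."""
    return {title: mapper.title_to_id(title) for title in titles}


class StubMapper:
    """Hypothetical stand-in for WikiMapper, backed by a plain dict."""

    def __init__(self, table):
        self._table = table

    def title_to_id(self, title):
        # Unknown titles yield None in this stub.
        return self._table.get(title)


mapper = StubMapper({"Germany": "Q183", "Manatee": "Q42797"})
result = map_titles(mapper, ["Germany", "Manatee", "No_such_page"])
print(result)
```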

### Map Wikipedia page title to Wikidata id

from wikimapper import WikiMapper

mapper = WikiMapper("index_enwiki-latest.db")
wikidata_id = mapper.title_to_id("Python_(programming_language)")
print(wikidata_id) # Q28865

or from the command line via

$ wikimapper title2id index_enwiki-latest.db Germany
Q183

### Map Wikipedia URL to Wikidata id

from wikimapper import WikiMapper

mapper = WikiMapper("index_enwiki-latest.db")
wikidata_id = mapper.url_to_id("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(wikidata_id) # Q28865

or from the command line via

$ wikimapper url2id index_enwiki-latest.db https://en.wikipedia.org/wiki/Germany
Q183

Note that it is not checked whether the URL originates from the same wiki as the index you created!
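
If you want to guard against that yourself, a check along the following lines could be run before mapping a URL. This is a minimal sketch, not part of the package: `wiki_of_url` and `url_matches_index` are hypothetical helpers, and the sketch assumes simple hosts of the form `<lang>.wikipedia.org` and index names such as `enwiki-latest`.

```python
from urllib.parse import urlparse


def wiki_of_url(url):
    """Return e.g. 'enwiki' for an en.wikipedia.org URL, or None otherwise."""
    parts = urlparse(url).netloc.lower().split(".")
    if len(parts) == 3 and parts[1:] == ["wikipedia", "org"]:
        return parts[0] + "wiki"
    return None


def url_matches_index(url, dump_name):
    """Check whether a URL belongs to the wiki named in e.g. 'enwiki-latest'."""
    # Assumption: the wiki name is everything before the first dash.
    return wiki_of_url(url) == dump_name.split("-")[0]


print(url_matches_index("https://en.wikipedia.org/wiki/Germany", "enwiki-latest"))
print(url_matches_index("https://de.wikipedia.org/wiki/Deutschland", "enwiki-latest"))
```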

### Map Wikidata id to Wikipedia title

from wikimapper import WikiMapper

mapper = WikiMapper("index_enwiki-latest.db")
titles = mapper.id_to_titles("Q183")
print(titles) # Germany, Deutschland, ...

or from the command line via

$ wikimapper id2titles data/index_enwiki-latest.db Q183
Germany
Bundesrepublik_Deutschland
Land_der_Dichter_und_Denker
Jerman
...

Mapping id to title can lead to more than one result, as some pages in Wikipedia are redirects, all linking to the same Wikidata item.

### Create your own index

While some indices are precomputed, it is sometimes useful to create your own. The following section describes the steps needed. Regarding creation speed: The index creation code works, but is not optimized. It takes around 10 minutes on my notebook (T480s) to create it for English Wikipedia if the data is already downloaded.

1. Download the data

The easiest way is to use the command line tool that ships with this package. It can be invoked, for example, by

$ wikimapper download enwiki-latest --dir data

The abbreviation for the wiki of your choice can be found on Wikipedia. Available SQL dumps can be found, for example, on Wikimedia. You need to append the wiki name to the URL, e.g. https://dumps.wikimedia.org/dewiki/ for the German one. If possible, use a different mirror than the default in order to spread the resource usage.
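
To illustrate how the dump locations are put together, the sketch below builds download URLs from a dump name. It is not the package's download code: `dump_urls` is a hypothetical helper, and both the file-naming pattern and the choice of the `page`, `page_props` and `redirect` tables are assumptions based on how the Wikimedia dump site is laid out.

```python
BASE_URL = "https://dumps.wikimedia.org"  # default site; a mirror can be substituted
TABLES = ("page", "page_props", "redirect")  # assumed set of required tables


def dump_urls(dump_name, base_url=BASE_URL):
    """Build SQL dump URLs for a name such as 'dewiki-latest'."""
    # 'dewiki-latest' splits into the wiki name and the dump version.
    wiki, _, version = dump_name.partition("-")
    return [
        f"{base_url}/{wiki}/{version}/{dump_name}-{table}.sql.gz"
        for table in TABLES
    ]


for url in dump_urls("dewiki-latest"):
    print(url)
```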

2. Create the index

The next step is to create an index from the downloaded dump. The easiest way is to use the command line tool that ships with this package. It can be invoked, for example, by

$ wikimapper create enwiki-latest --dumpdir data --target data/index_enwiki-latest.db

This creates an index for the previously downloaded dump and saves it in data/index_enwiki-latest.db. Use wikimapper create --help for a full description of the tool.

## Precomputed indices

Several precomputed indices can be found here.

## Command line interface

This package comes with a command line interface that is automatically available when installing via pip. It can be invoked by wikimapper from the command line.

$ wikimapper

usage: wikimapper [-h] [--version]

positional arguments:
sub-command help
custom index.
title2id            Map a Wikipedia title to a Wikidata ID.
url2id              Map a Wikipedia URL to a Wikidata ID.
id2titles           Map a Wikidata ID to one or more Wikipedia titles.

optional arguments:
-h, --help            show this help message and exit
--version             show program's version number and exit

## Development

The required dependencies are managed by pip. A virtual environment containing all needed packages for development and production can be created and activated by

virtualenv venv --python=python3
source venv/bin/activate
pip install -e ".[test, dev, doc]"

The tests can be run in the current environment by invoking

make test

or in a clean environment via

tox

## FAQ

### How does the parsing of the dump work?

jamesmishra noticed that SQL dumps from Wikipedia look almost like CSV and provides some basic functions to parse INSERT statements into tuples. We use the Wikipedia page dump to get the mapping between title and internal page ID, the page props dump to get the Wikidata ID for a title, and the redirect dump to fill in titles that are only redirects and do not have an entry in the page props table.
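
The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's or mysqldump-to-csv's actual code: `parse_insert` and `build_mapping` are hypothetical names, the parser is a small hand-written state machine, and the rows are reduced to three columns each (real dump tables have more).

```python
def parse_insert(line):
    """Parse the rows of a MySQL 'INSERT INTO ... VALUES (...),(...);' line."""
    rows, row, field, in_quote = [], None, [], False
    i = line.index(" VALUES ") + len(" VALUES ")
    while i < len(line):
        ch = line[i]
        if in_quote:
            if ch == "\\":  # a backslash escapes the next character
                i += 1
                field.append(line[i])
            elif ch == "'":
                in_quote = False
            else:
                field.append(ch)
        elif ch == "'":
            in_quote = True
        elif ch == "(" and row is None:  # a new row starts
            row, field = [], []
        elif ch == "," and row is not None:  # end of a field inside a row
            row.append("".join(field))
            field = []
        elif ch == ")" and row is not None:  # end of the row
            row.append("".join(field))
            rows.append(tuple(row))
            row = None
        elif row is not None:
            field.append(ch)
        i += 1
    return rows


def build_mapping(pages, props, redirects):
    """Combine (id, namespace, title), (id, name, value), (from, namespace, target)."""
    id_to_title = {pid: title for pid, ns, title in pages if ns == "0"}
    title_to_id = {title: pid for pid, title in id_to_title.items()}
    qid = {pid: value for pid, name, value in props if name == "wikibase_item"}
    # Redirect pages without their own entry inherit the ID of their target.
    for rd_from, _ns, target in redirects:
        target_id = title_to_id.get(target)
        if rd_from not in qid and target_id in qid:
            qid[rd_from] = qid[target_id]
    return {title: qid[pid] for pid, title in id_to_title.items() if pid in qid}


pages = parse_insert("INSERT INTO `page` VALUES (1,0,'Germany'),(2,0,'Deutschland');")
props = parse_insert("INSERT INTO `page_props` VALUES (1,'wikibase_item','Q183');")
redirects = parse_insert("INSERT INTO `redirect` VALUES (2,0,'Germany');")
mapping = build_mapping(pages, props, redirects)
print(mapping)
```

Here the redirect page Deutschland has no page props entry of its own, so it receives the Wikidata ID of its target, Germany.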

### Why do you not use the Wikidata SPARQL endpoint for that?

It is possible to query the official Wikidata SPARQL endpoint to do the mapping:

prefix schema: <http://schema.org/>
SELECT * WHERE {
  <https://en.wikipedia.org/wiki/Manatee> schema:about ?item .
}

This has several issues: First, it uses the network, which is slow. Second, I try to use that endpoint as infrequently as possible to save their resources (my use case is to map data sets that easily have tens of thousands of entries). Third, I had coverage issues due to redirects in Wikipedia not being resolved (around 20% of the time for some older data sets). So I created this package to do the mapping offline instead.
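
For comparison, the online lookup that this package avoids could be assembled roughly as follows. This is a sketch under assumptions: `build_query` and `build_request_url` are hypothetical helpers, the title is inserted verbatim (no escaping of special characters), and no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata SPARQL endpoint


def build_query(title, lang="en"):
    """Build a SPARQL query asking which item a Wikipedia article is about."""
    article = f"https://{lang}.wikipedia.org/wiki/{title}"
    return (
        "PREFIX schema: <http://schema.org/>\n"
        "SELECT ?item WHERE {\n"
        f"  <{article}> schema:about ?item .\n"
        "}"
    )


def build_request_url(title, lang="en"):
    """Build the GET URL for the query; one such request is needed per title."""
    params = {"query": build_query(title, lang), "format": "json"}
    return f"{ENDPOINT}?{urlencode(params)}"


print(build_query("Manatee"))
print(build_request_url("Manatee"))
```

Each title costs one network round trip this way, which is what makes the approach slow for data sets with tens of thousands of entries.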

## Acknowledgements

I am very thankful to jamesmishra for providing mysqldump-to-csv. Also, mbugert helped me tremendously in understanding Wikipedia dumps and gave me the idea of how to do the mapping.
