elastic-wikidata

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Elastic Wikidata

Simple CLI tools to load a subset of Wikidata into Elasticsearch. Part of the Heritage Connector project.

Why?

Running text search programmatically on Wikidata means using the MediaWiki query API, either directly or through the Wikidata query service/SPARQL.

There are a couple of reasons you may not want to do this:

- time constraints/large volumes: the APIs are rate-limited, and you can only run one text search per SPARQL query
- better search: Elasticsearch allows more flexible and powerful text search.* We're also using our own Elasticsearch instance to do nearest neighbour search on embeddings.

* CirrusSearch is a MediaWiki extension that enables direct search on Wikidata using Elasticsearch, if you require powerful search and are happy with the rate limit.

Installation

From PyPI: pip install elastic_wikidata

From the repo:

1. Download the repository
2. cd into its root directory
3. pip install -e .

Setup

elastic-wikidata needs the Elasticsearch credentials ELASTICSEARCH_CLUSTER, ELASTICSEARCH_USER and ELASTICSEARCH_PASSWORD to connect to your ES instance. You can set these in one of three ways:

- Using environment variables: export ELASTICSEARCH_CLUSTER=https://... etc
- Using config.ini: pass the -c parameter followed by a path to an ini file containing your Elasticsearch credentials. Example here.
- At runtime: pass each variable using the options --cluster/-c, --user/-u, --password/-p.
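
The three options above could be resolved in code roughly as follows. This is a minimal sketch, not the package's actual implementation: the function name resolve_credentials, the precedence order (runtime options, then ini file, then environment variables) and the [ELASTIC] section name are all assumptions made for illustration.

```python
import configparser
import os

KEYS = ("ELASTICSEARCH_CLUSTER", "ELASTICSEARCH_USER", "ELASTICSEARCH_PASSWORD")


def resolve_credentials(cli_options=None, config_path=None, environ=None):
    """Return a dict with the three credentials, or raise if one is missing.

    Assumed precedence (not confirmed by the project): runtime options,
    then the ini file, then environment variables.
    """
    environ = os.environ if environ is None else environ
    cli_options = cli_options or {}

    from_file = {}
    if config_path is not None:
        # interpolation=None so that '%' in a password is read literally.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(config_path)
        # "ELASTIC" is a hypothetical section name.
        if parser.has_section("ELASTIC"):
            # configparser lower-cases option names, so restore upper case.
            from_file = {k.upper(): v for k, v in parser.items("ELASTIC")}

    resolved = {}
    for key in KEYS:
        value = cli_options.get(key) or from_file.get(key) or environ.get(key)
        if not value:
            raise ValueError(f"Missing Elasticsearch credential: {key}")
        resolved[key] = value
    return resolved
```

Failing early with a clear message when a credential is missing is a design choice of this sketch; it avoids a less obvious connection error later in the load.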

Usage

Once installed, the package is available as the ew command. A call is structured as follows:

ew <task> <options>

<task> is either:

- dump: load data from a Wikidata JSON dump, or
- query: load data from a SPARQL query.

A full list of options can be found with ew --help, but the following are likely to be useful:

- --index/-i: the index name to push to. If not specified at runtime, elastic-wikidata will prompt for it.
- --limit/-l: limit the number of records pushed into ES. You might want to use this for a small trial run before importing the whole thing.
- --properties/-prop: a whitespace-separated list of properties to include in the ES index e.g. 'p31 p21', or the path to a text file containing newline-separated properties e.g. this one.
- --language/-lang: Wikimedia language code. Only one language is supported at this time.
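
To make the two accepted forms of --properties concrete, here is a small sketch of how such a value might be interpreted. The function parse_properties is hypothetical, and the upper-casing and de-duplication of property IDs are assumptions of this example rather than documented behaviour of the tool.

```python
import os


def parse_properties(value):
    """Turn a --properties style value into a list of property IDs.

    `value` is either whitespace-separated IDs (e.g. 'p31 p21') or the path
    to a text file with one ID per line. Upper-casing and de-duplication are
    assumptions of this sketch.
    """
    if os.path.isfile(value):
        with open(value, encoding="utf-8") as fh:
            tokens = fh.read().split()
    else:
        tokens = value.split()

    unique = []
    for token in tokens:
        pid = token.upper()
        if pid not in unique:  # keep first occurrence, preserve order
            unique.append(pid)
    return unique


print(parse_properties("p31 p21"))  # ['P31', 'P21']
```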

Loading from Wikidata dump (.ndjson)

ew dump -p <path_to_json> <other_options>

This is useful if you want to create one or more large subsets of Wikidata in different Elasticsearch indexes (millions of entities).

Time estimate: loading all ~8 million humans into an AWS Elasticsearch index took me about 20 minutes. Creating the humans subset using wikibase-dump-filter took about 3 hours, following its instructions for parallelising.

1. Download the complete Wikidata dump (latest-all.json.gz from here). This is a large file: 87 GB as of July 2020.
2. Use maxlath's wikibase-dump-filter to create a subset of the Wikidata dump. Note: don't use the --simplify flag when running the filter. elastic-wikidata will take care of simplification.
3. Run ew dump with the -p flag pointing to the JSON subset. You might want to test it with a limit (using the -l flag) first.
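
For intuition about what happens to the filtered subset, the sketch below reads newline-delimited JSON entities with an optional limit and reduces each one to an ID, a label and the selected claims. It is illustrative only: iter_entities and simplify are hypothetical helpers, and the package's real simplification and document shape may differ. It assumes the standard Wikidata entity JSON layout (labels keyed by language, claims keyed by property ID).

```python
import io
import itertools
import json


def iter_entities(lines, limit=None):
    """Return an iterator of parsed entities, one per non-empty line."""
    parsed = (json.loads(line) for line in lines if line.strip())
    return itertools.islice(parsed, limit)  # limit=None means no limit


def simplify(entity, properties, lang="en"):
    """Reduce a raw entity to an id, a label and the selected claims."""
    doc = {
        "id": entity["id"],
        "label": entity.get("labels", {}).get(lang, {}).get("value"),
        "claims": {},
    }
    for pid in properties:
        values = []
        for statement in entity.get("claims", {}).get(pid, []):
            datavalue = statement.get("mainsnak", {}).get("datavalue")
            if datavalue is None:  # "novalue"/"somevalue" snaks have none
                continue
            value = datavalue["value"]
            # Item-valued claims hold a dict with an "id"; keep others as-is.
            is_item = isinstance(value, dict) and "id" in value
            values.append(value["id"] if is_item else value)
        if values:
            doc["claims"][pid] = values
    return doc


# Two made-up lines standing in for a filtered .ndjson file.
sample = io.StringIO(
    '{"id": "Q42", "labels": {"en": {"language": "en", "value": "Douglas Adams"}}, '
    '"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}}\n'
    '{"id": "Q1", "labels": {}, "claims": {}}\n'
)
docs = [simplify(e, ["P31"]) for e in iter_entities(sample, limit=1)]
```

Reading the file lazily, line by line, matters at this scale: the subset can hold millions of entities, so loading it into memory in one go is best avoided.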

Loading from SPARQL query

ew query -p <path_to_sparql_query> <other_options>

For smaller collections of Wikidata entities it might be easier to populate an Elasticsearch index directly from a SPARQL query rather than downloading the whole Wikidata dump to take a subset. ew query automatically paginates SPARQL queries so that a heavy query like 'return all the humans' doesn't result in a timeout error.

Time estimate: loading 10,000 entities from Wikidata into an AWS-hosted Elasticsearch index took me about 6 minutes.

1. Write a SPARQL query and save it to a text/.rq file. See example.
2. Run ew query with the -p option pointing to the file containing the SPARQL query. Optionally add --page_size to set the page size for the SPARQL query.
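
The idea behind pagination can be sketched as below. How ew query paginates internally is not described here, so the LIMIT/OFFSET approach, the function paginate_query and the fetch_page callable are all assumptions for illustration. The sketch also assumes the saved query contains no LIMIT or OFFSET of its own.

```python
def paginate_query(query, fetch_page, page_size=100):
    """Yield result rows page by page until a short or empty page arrives.

    `fetch_page` is a hypothetical callable that takes a SPARQL string and
    returns a list of rows; in practice it would call a SPARQL endpoint.
    """
    offset = 0
    while True:
        paged = f"{query.rstrip()}\nLIMIT {page_size} OFFSET {offset}"
        rows = fetch_page(paged)
        yield from rows
        if len(rows) < page_size:  # last page reached
            break
        offset += page_size
```

Each request stays small enough to finish before the endpoint's timeout, at the cost of more round trips. With LIMIT/OFFSET paging, an ORDER BY clause in the query is generally needed for pages to be consistent between requests.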

Temporary side effects

As of version 0.3.1, refreshing the search index is disabled by default for the duration of the load, as recommended by Elasticsearch. Refresh is re-enabled at the default interval of 1s once the load is complete. To disable this behaviour, use the flag --no_disable_refresh/-ndr.
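
The behaviour described above corresponds to setting the index's refresh_interval to -1 before the load and back to 1s afterwards. The sketch below shows one way to express that; it is not taken from the package. The context manager refresh_disabled is hypothetical, and client is assumed to be an elasticsearch-py style object whose indices.put_settings accepts index and body (as in the 7.x client; newer clients may prefer a settings argument).

```python
from contextlib import contextmanager


@contextmanager
def refresh_disabled(client, index, restore_to="1s"):
    """Turn off index refresh during a bulk load, then restore it."""
    client.indices.put_settings(
        index=index, body={"index": {"refresh_interval": "-1"}}
    )
    try:
        yield
    finally:
        # Runs even if the load fails, so refresh is never left disabled.
        client.indices.put_settings(
            index=index, body={"index": {"refresh_interval": restore_to}}
        )
```

A load would then run inside a with refresh_disabled(client, "my-index"): block, where "my-index" is a placeholder name. Documents indexed during the load only become searchable once refresh is re-enabled, which is the temporary side effect this section refers to.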

Release history

- 1.0.1: Feb 3, 2022
- 1.0.0: May 20, 2021
- 0.3.7: Dec 22, 2020
- 0.3.6: Oct 9, 2020
- 0.3.5: Oct 7, 2020
- 0.3.4: Oct 7, 2020
- 0.3.3: Sep 30, 2020
- 0.3.2: Sep 15, 2020
- 0.3.1: Sep 11, 2020
- 0.3.0: Aug 26, 2020
- 0.2.2: Aug 25, 2020
- 0.2.1: Aug 19, 2020
- 0.2.0: Aug 5, 2020
- 0.1.0: Aug 4, 2020

Download files

Source Distribution

elastic-wikidata-1.0.1.tar.gz (15.2 kB)

Uploaded Feb 3, 2022 Source

Built Distribution

elastic_wikidata-1.0.1-py3-none-any.whl (15.4 kB)

Uploaded Feb 3, 2022 Python 3

File details

Details for the file elastic-wikidata-1.0.1.tar.gz.

File metadata

Download URL: elastic-wikidata-1.0.1.tar.gz
Upload date: Feb 3, 2022
Size: 15.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for elastic-wikidata-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`a969b7e36ccf38ba2cede7b206b33fad80a269cc31b3e5ae8755426690ccf65f`
MD5	`87124c5d47c2b7ba0193c49c0f6af22d`
BLAKE2b-256	`d2df440414c35e2f0a8e7a7457f6ba18736ca36f94d788265bd8eaf78adcc6a8`

File details

Details for the file elastic_wikidata-1.0.1-py3-none-any.whl.

File metadata

Download URL: elastic_wikidata-1.0.1-py3-none-any.whl
Upload date: Feb 3, 2022
Size: 15.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for elastic_wikidata-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2d52ebbe03d02157cb51feddddc2a4924338c4685d6a454e994cac039814fe35`
MD5	`6ffe88f5e73c484c19e6a5ccf28ba086`
BLAKE2b-256	`39eaeb204023f7e3eeb90585abb16926285951ddb66c6673af64c304f083a54e`
