
Library to process dumps of knowledge graphs (Wikipedia, DBpedia, Wikidata)


KGData is a library for processing dumps of Wikipedia, Wikidata, and DBpedia.



Usage

Wikidata

Usage: python -m kgdata.wikidata [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  classes              Wikidata classes
  entities             Wikidata entities
  entity_labels        Wikidata entity labels
  entity_redirections  Wikidata entity redirections
  properties           Wikidata properties
  wp2wd                Mapping from Wikipedia articles to Wikidata entities

You need the following dumps:

  1. Entity dump (latest-all.json.bz2): needed to extract entities, classes, and properties.
  2. wikidatawiki-page.sql.gz and wikidatawiki-redirect.sql.gz (available from dumps.wikimedia.org): needed to extract the redirections of old entities.
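
Both dumps are published on dumps.wikimedia.org. As a convenience, below is a minimal sketch (not part of kgdata) that fetches them with the Python standard library; the URLs assume the current layout of dumps.wikimedia.org, and the dated file names on the server may differ, so verify them before downloading.

# Sketch only: fetch the dumps that kgdata needs.
import urllib.request

DUMPS = [
    "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2",
    "https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-page.sql.gz",
    "https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-redirect.sql.gz",
]
for url in DUMPS:
    filename = url.rsplit("/", 1)[-1]
    print("downloading", filename, "...")
    urllib.request.urlretrieve(url, filename)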

Then, execute the following steps:

  1. Download the Wikidata dumps (e.g., latest-all.json.bz2) and put them in the <wikidata_dir>/step_0 folder.
  2. Extract entities, entity labels, and entity redirections:
    • kgdata wikidata entities -d <wikidata_dir> -o <database_directory> -c
    • kgdata wikidata entity_labels -d <wikidata_dir> -o <database_directory> -c
    • kgdata wikidata entity_redirections -d <wikidata_dir> -o <database_directory> -c
  3. Extract ontology:
    • kgdata wikidata classes -d <wikidata_dir> -o <database_directory> -c
    • kgdata wikidata properties -d <wikidata_dir> -o <database_directory> -c

For more commands, see scripts/build.sh. If the compaction step (compacting RocksDB) takes too long, you can run without the -c flag. If you run directly from source, replace the kgdata command with python -m kgdata.

The module kgdata.wikidata.db provides functions that read the databases built in the previous step and return dictionary-like objects. In the same folder, you can find the models of Wikidata entities, classes, and properties.
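
For example, here is a minimal sketch of reading the entities back. The loader name get_entity_db and the database file name are assumptions for illustration; check kgdata.wikidata.db for the exact functions exposed in your version.

from kgdata.wikidata.db import get_entity_db  # hypothetical name; see kgdata.wikidata.db

# Open the database produced by `kgdata wikidata entities`; the file name
# under <database_directory> is an assumption for illustration.
entities = get_entity_db("<database_directory>/wdentities.db")
ent = entities["Q30"]  # dictionary-like access by entity id
print(ent.label)       # entity model shipped in the same package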

Wikipedia

Here is a list of dumps that you need to download, depending on the databases/files you want to build:

  1. Static HTML Dumps: these only dump some namespaces. The namespace you are most likely to use is 0 (main articles). For example, enwiki-NS0-20220420-ENTERPRISE-HTML.json.tar.gz.

Then, execute the following steps:

  1. Extract HTML Dumps:
    • kgdata wikipedia -d <wikipedia_dir> enterprise_html_dumps

Installation

From pip

You need to have gcc in order to install cityhash.

pip install kgdata

From Source

This library uses Apache Spark 3.0.3 (pyspark version 3.0.3). If you use a different Spark version, make sure the version of the pyspark package matches it (see pyproject.toml).

poetry install
mkdir dist; zip -r kgdata.zip kgdata; mv kgdata.zip dist/ # package the application to submit to Spark cluster
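
If you manage the Spark session yourself, one way to make the packaged application available to the executors is PySpark's addPyFile. This is a sketch rather than the project's documented setup, and the app name is arbitrary.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("kgdata")
sc = SparkContext(conf=conf)
sc.addPyFile("dist/kgdata.zip")  # ship the zip built above so executors can import kgdata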

You can also consult the Dockerfile for guidance on installing from scratch.
