Library to process dumps of knowledge graphs (Wikipedia, DBpedia, Wikidata)
KGData is a library to process dumps of knowledge graphs.
Usage
Wikidata
Usage: python -m kgdata.cli wikidata [OPTIONS]
Options:
  -b, --build TEXT      Build database
  -d, --directory TEXT  Wikidata directory
  -o, --output TEXT     Output directory
  -c, --compact         Whether to compact the db after extraction
  --help                Show this message and exit.
You need the following dumps:
- entity dump (latest-all.json.bz2): needed to extract qnodes, classes, and properties.
- wikidatawiki-page.sql.gz and wikidatawiki-redirect.sql.gz: needed to extract redirections between qnodes.
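Before starting a build, it can help to confirm that the dumps are actually in place. The following is a minimal sketch, not part of kgdata: the helper name missing_dumps is hypothetical, and it assumes that all three files sit in one directory under exactly the names listed above. Downloaded files may be named differently (for example with a date in the name), so adjust REQUIRED_DUMPS as needed.

```python
from pathlib import Path

# File names as written in the list of required dumps (an assumption:
# your downloaded files may carry different names).
REQUIRED_DUMPS = (
    "latest-all.json.bz2",
    "wikidatawiki-page.sql.gz",
    "wikidatawiki-redirect.sql.gz",
)


def missing_dumps(dump_dir):
    """Return the required dump file names that are not present in dump_dir."""
    dump_dir = Path(dump_dir)
    return [name for name in REQUIRED_DUMPS if not (dump_dir / name).is_file()]
```

An empty list means every expected file was found; otherwise the list names the files that still need to be downloaded.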
Then, execute the following steps:
- Download the Wikidata dumps (e.g., latest-all.json.bz2) and put them in the <wikidata_dir>/step_0 folder.
- Extract Qnodes, Qnode Labels, and Qnode Redirections:
  kgdata wikidata -d <wikidata_dir> -b qnodes -o <database_directory> -c
  kgdata wikidata -d <wikidata_dir> -b qnode_labels -o <database_directory> -c
  kgdata wikidata -d <wikidata_dir> -b qnode_redirections -o <database_directory> -c
- Extract ontology:
  kgdata wikidata -d <wikidata_dir> -b wdclasses -o <database_directory> -c
  kgdata wikidata -d <wikidata_dir> -b wdprops -o <database_directory> -c
For more commands, see scripts/build.sh.
If the compaction step (compacting RocksDB) takes a long time, you can run the commands without the -c flag.
If you run directly from source, replace the kgdata command with python -m kgdata.
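The build steps above can also be driven from a script. This is a minimal sketch rather than an official wrapper: the function names build_commands and run_all are hypothetical, and it uses only the kgdata wikidata options shown in this section (-d, -b, -o, -c). It assumes the kgdata command is on your PATH.

```python
import subprocess

# Build targets named in the steps above, in the order they are listed.
BUILD_TARGETS = ["qnodes", "qnode_labels", "qnode_redirections", "wdclasses", "wdprops"]


def build_commands(wikidata_dir, database_dir, compact=True):
    """Return one argument list per build target, mirroring the CLI calls above."""
    commands = []
    for target in BUILD_TARGETS:
        cmd = ["kgdata", "wikidata", "-d", wikidata_dir, "-b", target, "-o", database_dir]
        if compact:
            cmd.append("-c")  # leave this flag out if compaction is too slow
        commands.append(cmd)
    return commands


def run_all(wikidata_dir, database_dir, compact=True):
    """Run every build command in order, stopping at the first failure."""
    for cmd in build_commands(wikidata_dir, database_dir, compact):
        subprocess.run(cmd, check=True)
```

Calling run_all with compact=False corresponds to running each command without the -c flag. Keeping command construction separate from execution makes it easy to print the commands first and inspect them before anything runs.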
We provide functions in the module kgdata.wikidata.db that read the databases built in the previous step and return dictionary-like objects. In the same folder, you can find models of Wikidata entities, classes, and properties.
Installation
From pip
You need gcc installed in order to install cityhash.
pip install kgdata
From Source
This library uses Apache Spark 3.0.3 (the pyspark version is 3.0.3). If you use a different Spark version, make sure the version of the pyspark package in pyproject.toml matches it.
poetry install
mkdir dist; zip -r kgdata.zip kgdata; mv kgdata.zip dist/ # package the application to submit to Spark cluster
You can also consult the Dockerfile for guidance on installing from scratch.
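To check the Spark version requirement stated above, you can compare the installed pyspark package against 3.0.3. This is a minimal sketch using only the Python standard library (importlib.metadata, available from Python 3.8). The helper names versions_match and check_pyspark are hypothetical, and the comparison is an assumption: it looks only at the first three version components.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_PYSPARK = "3.0.3"  # version stated in the installation notes


def versions_match(installed, expected=EXPECTED_PYSPARK):
    """Compare the first three dot-separated components, ignoring any suffix."""
    return installed.split(".")[:3] == expected.split(".")[:3]


def check_pyspark():
    """Return a short message describing the installed pyspark version."""
    try:
        installed = version("pyspark")
    except PackageNotFoundError:
        return "pyspark is not installed"
    if versions_match(installed):
        return f"pyspark {installed} matches the expected {EXPECTED_PYSPARK}"
    return f"pyspark {installed} differs from the expected {EXPECTED_PYSPARK}"
```

If you intentionally run a different Spark version, change EXPECTED_PYSPARK to that version so the check reflects your cluster.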