
Library to process dumps of knowledge graphs (Wikipedia, DBpedia, Wikidata)


KGData is a library for processing dumps of Wikipedia, Wikidata, and DBpedia.



Usage

Wikidata

Usage: python -m kgdata.wikidata [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  classes              Wikidata classes
  entities             Wikidata entities
  entity_labels        Wikidata entity labels
  entity_redirections  Wikidata entity redirections
  properties           Wikidata properties
  wp2wd                Mapping from Wikipedia articles to Wikidata entities

You need the following dumps:

  1. Entity dump (latest-all.json.bz2): needed to extract entities, classes, and properties.
  2. wikidatawiki-page.sql.gz and wikidatawiki-redirect.sql.gz (available from dumps.wikimedia.org): needed to extract the redirections of old entities.
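
Both dumps are published on dumps.wikimedia.org. As a convenience, below is a minimal sketch (not part of kgdata) that fetches them with the Python standard library; the URLs assume the current layout of dumps.wikimedia.org, and the dated file names on the server may differ, so verify them before downloading.

# Sketch only: fetch the dumps that kgdata needs.
import urllib.request

DUMPS = [
    "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2",
    "https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-page.sql.gz",
    "https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-redirect.sql.gz",
]
for url in DUMPS:
    filename = url.rsplit("/", 1)[-1]
    print("downloading", filename, "...")
    urllib.request.urlretrieve(url, filename)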

Then, execute the following steps:

  1. Download the Wikidata dumps (e.g., latest-all.json.bz2) and put them in the <wikidata_dir>/step_0 folder.
  2. Extract entities, entity labels, and entity redirections:
    • kgdata wikidata entities -d <wikidata_dir> -o <database_directory> -c
    • kgdata wikidata entity_labels -d <wikidata_dir> -o <database_directory> -c
    • kgdata wikidata entity_redirections -d <wikidata_dir> -o <database_directory> -c
  3. Extract ontology:
    • kgdata wikidata classes -d <wikidata_dir> -o <database_directory> -c
    • kgdata wikidata properties -d <wikidata_dir> -o <database_directory> -c

For more commands, see scripts/build.sh. If the compaction step (compacting RocksDB) takes too long, you can run without the -c flag. If you run directly from source, replace the kgdata command with python -m kgdata.

The module kgdata.wikidata.db provides functions that read the databases built in the previous step and return dictionary-like objects. In the same folder, you can find the models of Wikidata entities, classes, and properties.
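
For example, here is a minimal sketch of reading the entities back. The loader name get_entity_db and the database file name are assumptions for illustration; check kgdata.wikidata.db for the exact functions exposed in your version.

from kgdata.wikidata.db import get_entity_db  # hypothetical name; see kgdata.wikidata.db

# Open the database produced by `kgdata wikidata entities`; the file name
# under <database_directory> is an assumption for illustration.
entities = get_entity_db("<database_directory>/wdentities.db")
ent = entities["Q30"]  # dictionary-like access by entity id
print(ent.label)       # entity model shipped in the same package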

Wikipedia

Here is a list of dumps that you need to download, depending on the databases/files you want to build:

  1. Static HTML Dumps: these only dump some namespaces. The namespace you are most likely to use is 0 (main articles). For example, enwiki-NS0-20220420-ENTERPRISE-HTML.json.tar.gz.

Then, execute the following steps:

  1. Extract HTML Dumps:
    • kgdata wikipedia -d <wikipedia_dir> enterprise_html_dumps

Installation

From pip

You need to have gcc in order to install cityhash.

pip install kgdata

From Source

This library uses Apache Spark 3.0.3 (pyspark version 3.0.3). If you use a different Spark version, make sure the version of the pyspark package matches it (see pyproject.toml).

poetry install
mkdir dist; zip -r kgdata.zip kgdata; mv kgdata.zip dist/ # package the application to submit to Spark cluster
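
If you manage the Spark session yourself, one way to make the packaged application available to the executors is PySpark's addPyFile. This is a sketch rather than the project's documented setup, and the app name is arbitrary.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("kgdata")
sc = SparkContext(conf=conf)
sc.addPyFile("dist/kgdata.zip")  # ship the zip built above so executors can import kgdata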

You can also consult the Dockerfile for guidance on installing from scratch.
