# Subgraph Retrieval Toolkit

A toolkit for semantically relevant subgraph retrieval from large-scale knowledge graphs.

Retrieve subgraphs from Wikidata or Freebase. The method is based on this retrieval work for Freebase.
## Prerequisites

### Install SRTK

```bash
pip install srtk
```
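To verify the installation, check that the `srtk` console script used throughout this README is on your PATH:

```bash
# Should print the available subcommands (e.g. preprocess, train, retrieve, visualize)
srtk --help
```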
### Wikidata

#### Deploy a Wikidata endpoint locally

We use qEndpoint to spin up a Wikidata endpoint that serves a Wikidata Truthy dump.
- Download:

  ```bash
  sudo docker run -p 1234:1234 --name qendpoint-wikidata qacompany/qendpoint-wikidata
  ```

- Run:

  ```bash
  sudo docker start qendpoint-wikidata
  ```

- Add Wikidata prefixes support:

  ```bash
  wget https://raw.githubusercontent.com/the-qa-company/qEndpoint/master/wikibase/prefixes.sparql
  sudo docker cp prefixes.sparql qendpoint-wikidata:/app/qendpoint && rm prefixes.sparql
  ```
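Before moving on, you can sanity-check the local endpoint with a trivial query. A hedged example: it assumes qEndpoint's default SPARQL path `/api/endpoint/sparql` on the port mapped above.

```bash
# Expect a single binding back if the endpoint is serving the dump.
curl -G 'http://localhost:1234/api/endpoint/sparql' \
    --data-urlencode 'query=SELECT ?s WHERE { ?s ?p ?o } LIMIT 1'
```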
Alternatively, you can use an online Wikidata endpoint, e.g. https://query.wikidata.org/sparql.
#### Deploy a REL endpoint for entity linking (only necessary for end-to-end inference)

Please refer to this tutorial for REL endpoint deployment: End-to-End Entity Linking.
### Freebase

#### Deploy a Freebase endpoint locally

Please refer to dki-lab/Freebase-Setup for the setup.
```bash
# Download setup script
git clone https://github.com/dki-lab/Freebase-Setup.git && cd Freebase-Setup
# Download the Virtuoso binary
wget https://kumisystems.dl.sourceforge.net/project/virtuoso/virtuoso/7.2.5/virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
tar -zxvf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz && rm virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
# Point virtuoso.py at the local Virtuoso binary
sed -i 's/\/home\/dki_lab\/tools\/virtuoso\/virtuoso-opensource/\.\/virtuoso-opensource/g' virtuoso.py
# Download the Freebase dump
wget https://www.dropbox.com/s/q38g0fwx1a3lz8q/virtuoso_db.zip
unzip virtuoso_db.zip && rm virtuoso_db.zip
# Start Virtuoso on port 3001
python3 virtuoso.py start 3001 -d virtuoso_db
```
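As with the Wikidata endpoint, a quick hedged smoke test, assuming Virtuoso's default `/sparql` path on the port chosen above:

```bash
# Expect a SPARQL result if Virtuoso has loaded the dump.
curl -G 'http://localhost:3001/sparql' \
    --data-urlencode 'query=SELECT ?s WHERE { ?s ?p ?o } LIMIT 1'
```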
## Retrieve subgraphs with a trained scorer
```bash
srtk retrieve --sparql-endpoint WIKIDATA_ENDPOINT \
    -kg wikidata \
    --scorer-model-path path/to/scorer \
    --input data/ground.jsonl \
    --output-path data/subgraph.jsonl \
    --beam-width 10
```
The `--scorer-model-path` argument can be any Hugging Face pretrained encoder model. If it is a local path, please ensure the tokenizer is also saved along with the model.
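For reference, each line of the input file carries a question and its linked question entities. A minimal sketch, assuming the same field names as the grounded samples shown in the preprocessing section (`answer_entities` are not needed at retrieval time):

```bash
# Hypothetical one-line input file for `srtk retrieve`.
cat > data/ground.jsonl <<'EOF'
{"id": "sample-0", "question": "Which universities did Barack Obama graduate from?", "question_entities": ["Q76"]}
EOF
```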
## Visualize retrieved subgraphs

```bash
srtk visualize --sparql-endpoint WIKIDATA_ENDPOINT \
    --knowledge-graph wikidata \
    --input data/subgraph.jsonl \
    --output-dir ./htmls/
```
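The command writes HTML files under `--output-dir`; to browse them, any static file server works, e.g.:

```bash
# Serve the generated HTML files locally (a convenience step, not part of srtk),
# then open http://localhost:8000 in a browser.
python3 -m http.server 8000 --directory ./htmls
```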
## Train a scorer

A scorer is the model used to navigate the expanding path. At each expansion step, the relations scored highest by the scorer are picked as the relations for the next hop. The score is the embedding similarity between the to-be-expanded relation and the query (the question concatenated with the previous expansion path). The model is trained with distant supervision: given the question entities and the answer entities, the shortest paths between them serve as the supervision signal.
### Preprocess a dataset
- Prepare training samples where the question entities and answer entities are known. The training data should be saved in a jsonl file (e.g. `data/grounded.jsonl`). Each training sample should take the following format:

  ```json
  {
    "id": "sample-id",
    "question": "Which universities did Barack Obama graduate from?",
    "question_entities": ["Q76"],
    "answer_entities": ["Q49122", "Q1346110", "Q4569677"]
  }
  ```
- Preprocess the samples with the `srtk preprocess` command:

  ```bash
  srtk preprocess --sparql-endpoint WIKIDATA_ENDPOINT \
      -kg wikidata \
      --input-file data/grounded.jsonl \
      --output-dir data/retrieved \
      --metric jaccard
  ```
Under the hood, this does four things:

1. Find the shortest paths between the question entities and the answer entities.
2. Score the searched paths by their Jaccard scores with the answers (see the formula below).
3. Negative sampling: at each expansion step, the negative samples are the false relations connected to the tracked entities.
4. Generate the training dataset as a jsonl file.
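For reference, the path score in step 2 is the standard Jaccard index between the set of entities $E_p$ reached by a path $p$ and the answer set $A$:

$$\mathrm{Jaccard}(E_p, A) = \frac{|E_p \cap A|}{|E_p \cup A|}$$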
### Train a sentence encoder

The scorer should be initialized from a pretrained encoder model from the Hugging Face hub. Here we use `intfloat/e5-small`, a lightweight text embedding model.
```bash
srtk train --data-file data/train.jsonl \
    --model-name-or-path intfloat/e5-small \
    --save-model-path artifacts/scorer
```
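Once training finishes, the saved checkpoint can be plugged straight back into retrieval. A sketch reusing the paths from this README:

```bash
# Retrieve with the freshly trained scorer instead of a hub checkpoint.
srtk retrieve --sparql-endpoint WIKIDATA_ENDPOINT \
    -kg wikidata \
    --scorer-model-path artifacts/scorer \
    --input data/ground.jsonl \
    --output-path data/subgraph.jsonl \
    --beam-width 10
```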