
A toolkit for semantically relevant subgraph retrieval from large-scale knowledge graphs.


Subgraph Retrieval Toolkit


Retrieve subgraphs on Wikidata or Freebase. The method is based on prior subgraph-retrieval work for Freebase.

Prerequisites

Install SRTK

pip install srtk

Wikidata

Deploy a Wikidata endpoint locally

We use qEndpoint to spin up a Wikidata endpoint that contains a Wikidata Truthy dump.

  • Pull the image and create the container

    sudo docker run -p 1234:1234 --name qendpoint-wikidata qacompany/qendpoint-wikidata
    
  • Run

    sudo docker start qendpoint-wikidata
    
  • Add Wikidata prefixes support

    wget https://raw.githubusercontent.com/the-qa-company/qEndpoint/master/wikibase/prefixes.sparql
    sudo docker cp prefixes.sparql qendpoint-wikidata:/app/qendpoint && rm prefixes.sparql
    

Alternatively, you can use a public Wikidata endpoint such as https://query.wikidata.org/sparql.
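
To verify that the endpoint is up, you can issue a trivial SPARQL query against it. Below is a minimal sketch in Python, assuming qEndpoint's default SPARQL path /api/endpoint/sparql on the port mapped above; adjust the URL if your setup differs:

# Sanity-check the SPARQL endpoint with a one-triple query.
# The URL is an assumption based on qEndpoint's default configuration.
import requests

ENDPOINT = "http://localhost:1234/api/endpoint/sparql"

response = requests.get(
    ENDPOINT,
    params={"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["results"]["bindings"])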

Deploy a REL endpoint for entity linking (only necessary for end-to-end inference)

Please refer to this tutorial for REL endpoint deployment: End-to-End Entity Linking

Freebase

Deploy a Freebase endpoint locally

Please refer to dki-lab/Freebase-Setup for the setup.

# Download setup script
git clone https://github.com/dki-lab/Freebase-Setup.git && cd Freebase-Setup
# Download virtuoso binary
wget https://kumisystems.dl.sourceforge.net/project/virtuoso/virtuoso/7.2.5/virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
tar -zxvf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz && rm virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
# Replace the virtuoso path in virtuoso.py
sed -i 's/\/home\/dki_lab\/tools\/virtuoso\/virtuoso-opensource/\.\/virtuoso-opensource/g' virtuoso.py
# Download Freebase dump
wget https://www.dropbox.com/s/q38g0fwx1a3lz8q/virtuoso_db.zip
unzip virtuoso_db.zip && rm virtuoso_db.zip
# Start virtuoso
python3 virtuoso.py start 3001 -d virtuoso_db
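
Once Virtuoso is up, a quick smoke test confirms the dump is being served. Here is a minimal sketch, assuming Virtuoso's default /sparql path on port 3001; the MID m.02mjmr (Barack Obama) and the type.object.name predicate are standard Freebase identifiers, used here only as an example:

# Smoke test: look up the name of a well-known Freebase entity.
import requests

ENDPOINT = "http://localhost:3001/sparql"
QUERY = """
SELECT ?name WHERE {
  <http://rdf.freebase.com/ns/m.02mjmr>
      <http://rdf.freebase.com/ns/type.object.name> ?name .
} LIMIT 5
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["name"]["value"])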

Retrieve subgraphs with a trained scorer

srtk retrieve --sparql-endpoint WIKIDATA_ENDPOINT \
    -kg wikidata \
    --scorer-model-path path/to/scorer \
    --input data/ground.jsonl \
    --output-path data/subgraph.jsonl \
    --beam-width 10

The --scorer-model-path argument can point to any Hugging Face pretrained encoder model. If it is a local path, make sure the tokenizer is saved alongside the model.
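
If you save a model locally yourself, the simplest way to keep the directory self-contained is to write the tokenizer and the weights to the same path. A minimal sketch with the transformers library; the directory name path/to/scorer just mirrors the command above:

# Save a Hugging Face encoder together with its tokenizer so that a
# local --scorer-model-path directory contains everything SRTK needs.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("intfloat/e5-small")
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small")

model.save_pretrained("path/to/scorer")
tokenizer.save_pretrained("path/to/scorer")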

Visualize retrieved subgraph

srtk visualize --sparql-endpoint WIKIDATA_ENDPOINT \
    --knowledge-graph wikidata \
    --input data/subgraph.jsonl \
    --output-dir ./htmls/

Train a scorer

A scorer is the model used to navigate the expanding path. At each expansion step, the relations scored highest by the scorer are picked as the relations for the next hop.

The score is based on the embedding similarity between the to-be-expanded relation and the query (the question plus the previously expanded path).

The model is trained in a distantly supervised fashion: given the question entities and the answer entities, the shortest paths between them serve as the supervision signal.
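
For intuition, the sketch below (illustrative only, not SRTK's internal code) ranks candidate relations by the cosine similarity of mean-pooled encoder embeddings; concatenating the question and the previous path with [SEP] is an assumption made for the example:

# Illustrative only: rank candidate relations by cosine similarity of
# mean-pooled encoder embeddings, mirroring the scoring idea above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small")
model = AutoModel.from_pretrained("intfloat/e5-small")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # (batch, seq, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

query = "Which universities did Barack Obama graduate from? [SEP] educated at"
relations = ["educated at", "place of birth", "employer"]
scores = embed([query]) @ embed(relations).T        # cosine similarities
print(sorted(zip(relations, scores[0].tolist()), key=lambda x: -x[1]))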

Preprocess a dataset

  1. Prepare training samples where the question entities and the answer entities are known.

    The training data should be saved in a JSONL file (e.g. data/grounded.jsonl). Each training sample should have the following format:

    {
      "id": "sample-id",
      "question": "Which universities did Barack Obama graduate from?",
      "question_entities": [
        "Q76"
      ],
      "answer_entities": [
        "Q49122",
        "Q1346110",
        "Q4569677"
      ]
    }
    
  2. Preprocess the samples with the srtk preprocess command.

    srtk preprocess --sparql-endpoint WIKIDATA_ENDPOINT \
        -kg wikidata \
        --input-file data/grounded.jsonl \
        --output-dir data/retrieved \
        --metric jaccard
    

    Under the hood, this does four things:

    1. Find the shortest paths between the question entities and the answer entities.
    2. Score each searched path by the Jaccard similarity between the entities it reaches and the answer entities (see the sketch after this list).
    3. Sample negatives: at each expansion step, the negative samples are the false relations connected to the tracked entities.
    4. Generate the training dataset as a JSONL file.
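
As a concrete illustration of step 2, a hypothetical path that reaches two of the three answers would be scored like this (a minimal sketch, not SRTK's code):

# Illustrative Jaccard scoring of a searched path: compare the entities
# a path reaches against the gold answers (step 2 above).
def jaccard(path_answers: set, gold_answers: set) -> float:
    if not path_answers and not gold_answers:
        return 0.0
    return len(path_answers & gold_answers) / len(path_answers | gold_answers)

# Entities reached by one candidate path vs. the annotated answers:
reached = {"Q49122", "Q1346110"}
gold = {"Q49122", "Q1346110", "Q4569677"}
print(jaccard(reached, gold))  # 0.666... -> a good but imperfect path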

Train a sentence encoder

The scorer should be initialized from a pretrained encoder model on the Hugging Face Hub. Here we use intfloat/e5-small, a lightweight text-embedding model.

srtk train --data-file data/train.jsonl \
    --model-name-or-path intfloat/e5-small \
    --save-model-path artifacts/scorer
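
For intuition, the training step can be thought of as a contrastive objective over one positive relation and the sampled negatives at each expansion step. The following is a minimal sketch under that assumption (not SRTK's actual training loop):

# Illustrative contrastive step: push the query embedding toward the
# positive relation and away from the sampled negatives, using a
# softmax cross-entropy over similarity scores.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    # query_emb: (dim,), pos_emb: (dim,), neg_embs: (num_neg, dim)
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)
    logits = (query_emb @ candidates.T) / temperature  # (1 + num_neg,)
    target = torch.zeros(1, dtype=torch.long)          # positive is index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy check with random, normalized embeddings:
dim = 384  # e5-small's embedding size
q = F.normalize(torch.randn(dim), dim=0)
p = F.normalize(torch.randn(dim), dim=0)
negs = F.normalize(torch.randn(4, dim), dim=1)
print(contrastive_loss(q, p, negs))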
