
SRTK: Subgraph Retrieval Toolkit


SRTK is a toolkit for retrieving semantically relevant subgraphs from large-scale knowledge graphs. It currently supports Wikidata, Freebase, and DBpedia.

A minimal walkthrough of the retrieval process:

[figure: retrieve example]

[figure: visualized subgraph]

Prerequisites

Installation

pip install srtk

Local Deployment of Knowledge Graphs

Usage

SRTK has five main subcommands, which cover the whole subgraph-retrieval pipeline.

For retrieval:

  • srtk link: Link entity mentions in text to entities in a knowledge graph. Currently Wikidata and DBpedia are supported out of the box.
  • srtk retrieve: Retrieve semantically relevant subgraphs from a knowledge graph with a trained retriever. It can also be used to evaluate a trained retriever.
  • srtk visualize: Visualize retrieved subgraphs using a graph visualization tool.

For training a retriever:

  • srtk preprocess: Preprocess a dataset for training a subgraph retrieval model.
  • srtk train: Train a subgraph retrieval model on a preprocessed dataset.

Use srtk [subcommand] --help to see the detailed usage of each subcommand.

Walkthrough

Retrieve Subgraphs

Retrieve subgraphs with a trained scorer

srtk retrieve [-h] -i INPUT -o OUTPUT [-e SPARQL_ENDPOINT] -kg {freebase,wikidata}
              -m SCORER_MODEL_PATH [--beam-width BEAM_WIDTH] [--max-depth MAX_DEPTH]
              [--evaluate] [--include-qualifiers]

The SCORER_MODEL_PATH argument accepts any Hugging Face pretrained encoder model. If it is a local path, make sure the tokenizer is saved along with the model.
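
For example, a retrieval run against a local Wikidata endpoint might look like this (the file paths, endpoint URL, and hyperparameter values are illustrative):

srtk retrieve -i data/grounded.jsonl -o artifacts/subgraphs.jsonl \
    -e http://localhost:1234/api/endpoint/sparql -kg wikidata \
    -m artifacts/scorer --beam-width 10 --max-depth 2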

Visualize retrieved subgraphs

srtk visualize [-h] -i INPUT -o OUTPUT_DIR [-e SPARQL_ENDPOINT]
               [-kg {wikidata,freebase}] [--max-output MAX_OUTPUT]
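
For example, to render at most 10 of the retrieved subgraphs (the file paths and endpoint URL are illustrative):

srtk visualize -i artifacts/subgraphs.jsonl -o artifacts/visualizations \
    -e http://localhost:1234/api/endpoint/sparql -kg wikidata --max-output 10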

Train a Retriever

A scorer is the model used to navigate the expanding path. At each expansion step, the relations scored highest by the scorer are picked as the relations for the next hop.

The score is based on the embedding similarity between the candidate relation and the query (the question plus the expansion path so far).

The model is trained with distant supervision: given the question entities and the answer entities, the shortest paths between them serve as the supervision signal.
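
The following sketch illustrates the scoring idea only; it is not SRTK's actual implementation. It embeds the query (the question plus the expansion path so far) and each candidate relation with a pretrained encoder, then ranks the relations by cosine similarity. The query format and relation labels are assumptions made for the example.

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative scorer: mean-pool token embeddings, then compare by cosine.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small")
model = AutoModel.from_pretrained("intfloat/e5-small")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

# Query = question + expansion path so far; candidates = next-hop relations.
query = "Which universities did Barack Obama graduate from? | educated at"
relations = ["educated at", "academic degree", "place of birth", "spouse"]
scores = embed([query]) @ embed(relations).T  # cosine similarities
print(relations[scores.argmax().item()])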

Preprocess a dataset

  1. Prepare training samples where the question entities and answer entities are known.

    The training data should be stored in a jsonl file (e.g. data/grounded.jsonl). Each training sample should have the following format:

    {
      "id": "sample-id",
      "question": "Which universities did Barack Obama graduate from?",
      "question_entities": [
        "Q76"
      ],
      "answer_entities": [
        "Q49122",
        "Q1346110",
        "Q4569677"
      ]
    }
    
  2. Preprocess the samples with the srtk preprocess command (a concrete example invocation follows below).

    srtk preprocess [-h] -i INPUT -o OUTPUT [--intermediate-dir INTERMEDIATE_DIR]
                    -e SPARQL_ENDPOINT -kg {wikidata,freebase} [--search-path]
                    [--metric {jaccard,recall}] [--num-negative NUM_NEGATIVE]
                    [--positive-threshold POSITIVE_THRESHOLD]
    

    Under the hood, it does four things:

    1. Find the shortest paths between the question entities and the answer entities.
    2. Score the searched paths by their Jaccard similarity with the answer entities.
    3. Sample negatives: at each expansion step, the negative samples are the incorrect relations connected to the tracked entities.
    4. Generate the training dataset as a jsonl file.
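
    A typical invocation might look like this (the file paths, endpoint URL, and --num-negative value are illustrative):

    srtk preprocess -i data/grounded.jsonl -o data/train.jsonl \
        -e http://localhost:1234/api/endpoint/sparql -kg wikidata \
        --search-path --metric jaccard --num-negative 15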

Train a sentence encoder

The scorer should be initialized from a pretrained encoder model from the Hugging Face hub. Here we use intfloat/e5-small, a compact BERT-style text encoder.

srtk train --data-file data/train.jsonl \
    --model-name-or-path intfloat/e5-small \
    --save-model-path artifacts/scorer
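
Once trained, the saved scorer can be passed to srtk retrieve via -m. Adding --evaluate makes the same command evaluate the trained retriever, which presumes the input samples contain answer entities (the file paths and endpoint URL are illustrative):

srtk retrieve -i data/grounded.jsonl -o artifacts/subgraphs.jsonl \
    -e http://localhost:1234/api/endpoint/sparql -kg wikidata \
    -m artifacts/scorer --evaluate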

Tutorials

License

This project is licensed under the terms of the MIT license.
