
Inductive Reasoning with Text (IRT)

This code creates benchmark datasets from a given knowledge graph (i.e. triple set) and supplementary text, as described in Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery. The two KGs evaluated in the paper (based on FB15k237 and CoDEx) are available for download below.

Download

We offer two IRT reference datasets: the first - IRT-FB - is based on FB15k237 and the second - IRT-CDE - utilizes CoDEx. Each dataset offers knowledge graph triples for a closed-world (cw) and an open-world (ow) split. The ow split is partitioned into validation and test data. Each entity of the KG is assigned a set of text contexts in which that entity is mentioned.

| Name | Description | Download |
|---|---|---|
| IRT-CDE | Based on CoDEx | Link |
| IRT-FB | Based on FB15k237 | Link |

Installation

Python 3.9 is required. We recommend miniconda for managing Python environments.

conda create -n irt python=3.9
conda activate irt
pip install irt-data

The requirements.txt contains additional packages used for development.
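
After installing, a quick smoke test (using only the import shown throughout this README):

python -c "from irt import Dataset"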

Loading

Simply provide a path to an IRT dataset folder. The data is loaded lazily, which is why construction is fast but the first invocation of .description takes a while.

from irt import Dataset
dataset = Dataset('path/to/irt-fb')
print(dataset.description)
IRT DATASET

IRT GRAPH: irt-fb
  nodes: 14541
  edges: 310116 (237 types)
  degree:
    mean 42.65
    median 26

IRT SPLIT
2389 retained concepts

Config:
  seed: 26041992
  ow split: 0.7
  ow train split: 0.5
  relation threshold: 100
  git: 66fe7bd3c934311bdc3b1aa380b7c6c45fd7cd93
  date: 2021-07-21 17:29:04.339909

Closed World - TRAIN:
  owe: 12164
  entities: 12164
  heads: 11562
  tails: 11252
  triples: 238190

Open World - VALID:
  owe: 1558
  entities: 9030
  heads: 6907
  tails: 6987
  triples: 46503

Open World - TEST:
  owe: 819
  entities: 6904
  heads: 4904
  tails: 5127
  triples: 25423

IRT Text (Mode.CLEAN)
  mean contexts: 28.92
  median contexts: 30.00
  mean mentions: 2.84
  median mentions: 2.00

Data Formats

The data in the provided dataset folders should be quite self-explanatory. Each entity and each relation is assigned a unique integer id (denoted e [entity], h [head], t [tail], and r [relation]). There is a folder containing the full graph data (graph/), a folder containing the open-world/closed-world splits (split/), and a folder containing the textual data (text/).

Graph Data

This concerns both the data in graph/ and split/. Entity and relation identifiers can be translated with graph/entities.txt and graph/relations.txt, respectively. Triple sets come in h t r order; see the sketch after the reference list below. Reference code to load graph data:

  • irt.graph.Graph.load
  • irt.data.dataset.Split.load
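
For illustration, here is a minimal hand-parsing sketch. The file name is hypothetical, and the whitespace-separated integer format is an assumption based on the description above; prefer the reference loaders listed above for real use:

from pathlib import Path

# hypothetical file name; triples are assumed to be
# whitespace-separated integer ids in h t r order
path = Path('path/to/irt-fb/split/cw.train.txt')

triples = [
    tuple(map(int, line.split()))
    for line in path.read_text().splitlines()
    if line.strip()
]

h, t, r = triples[0]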

Text Data

The upstream system that sampled our texts: ecc. All text comes gzipped and can be opened using Python's built-in gzip library; a short reading sketch follows the reference list below. For inspection, you can use zcat, zless, zgrep, etc. (at least on unixoid systems ;)), or extract the files using gunzip. Reference code to load text data:

  • irt.data.dataset.Text.load
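
A minimal reading sketch, assuming one gzipped file per text split (the file name below is illustrative):

import gzip

# open in text mode ('rt') so lines come back as str;
# the path is illustrative - use whatever lies in text/
with gzip.open('path/to/irt-fb/text/cw.train-contexts.txt.gz',
               mode='rt', encoding='utf-8') as fd:
    for line in fd:
        print(line.strip())
        break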

PyKEEN Dataset

For users of pykeen, there are two "views" on the triple sets: closed-world and open-world. Both simply offer pykeen TriplesFactories with an id-mapping to the IRT entity ids.

Closed-World:

from irt import Dataset
from irt import KeenClosedWorld

dataset = Dataset('path/to/dataset')

# 'split' is either a single float, a tuple (for an additional
# test split) or a triple which must sum to 1
kcw = KeenClosedWorld(dataset=dataset, split=.8, seed=1234)

print(kcw.description)
IRT PYKEEN DATASET
irt-cde

  training triples factory:
    entities: 12091
    relations: 51
    triples: 109910

  validation triples factory:
    entities: 12091
    relations: 51
    triples: 27478

It offers .training, .validation, and .testing TriplesFactories, and irt2keen/keen2irt id-mappings.
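
Since these are ordinary pykeen TriplesFactories, they plug straight into the pykeen pipeline. A minimal sketch, assuming kcw was constructed with a three-way split (e.g. split=(.8, .1, .1)) so that .testing is populated; the model and epoch count are illustrative choices:

from pykeen.pipeline import pipeline

# train a baseline link-prediction model on the closed-world factories
result = pipeline(
    training=kcw.training,
    validation=kcw.validation,
    testing=kcw.testing,
    model='TransE',
    epochs=5,
)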

Open-World:

from irt import Dataset
from irt import KeenOpenWorld

dataset = Dataset('path/to/dataset')
kow = KeenOpenWorld(dataset=dataset)

print(kow.description)
IRT PYKEEN DATASET
irt-cde

  closed world triples factory:
    entities: 12091
    relations: 51
    triples: 137388

  open world validation triples factory:
    entities: 15101
    relations: 46
    triples: 41240

  open world testing triples factory:
    entities: 17050
    relations: 48
    triples: 27577

It offers .closed_world, .open_world_valid, and .open_world_test TriplesFactories, and irt2keen/keen2irt id-mappings.

Pytorch Dataset

For users of pytorch and/or pytorch-lightning.

We offer a torch.utils.data.Dataset, a torch.utils.data.DataLoader and a pytorch_lightning.DataModule. The dataset abstracts what a "sample" is and how to collate samples into batches:

from irt import TorchDataset

# given you have loaded a irt.Dataset instance called "dataset"
# 'model_name' is one of huggingface.co/models
torch_dataset = TorchDataset(
    model_name='bert-base-cased',
    dataset=dataset,
    part=dataset.split.closed_world,
)

# a sample is an entity-to-token-index mapping:
torch_dataset[100]
# -> Tuple[int, List[List[int]]]
# (124, [[101, 1130, ...], ...])

# and it offers a collator for batching:
batch = TorchDataset.collate_fn([torch_dataset[0], torch_dataset[1]])
# batch: Tuple[Tuple[int], torch.Tensor]

len(batch[0])   # -> 60
batch[1].shape  # -> 60, 105
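
The collator also plugs into a standard DataLoader; a minimal sketch (batch size and shuffling are illustrative choices):

from torch.utils.data import DataLoader

loader = DataLoader(
    torch_dataset,
    collate_fn=TorchDataset.collate_fn,
    batch_size=2,
    shuffle=True,
)

# ents is a tuple of entity ids, tokens a padded token-index tensor
ents, tokens = next(iter(loader))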

Note: Only the first invocation is slow, because the tokenizer needs to run. The tokenized text is saved to the IRT folder under torch/ and re-used from then on.

Bring Your Own Data

If you want to use this code to create your own open-world/closed-world split, you either need to bring your data into a format readable by the existing code base or extend this code for your own data model. See ipynb/graph.split.ipynb for a step-by-step guide.

Legacy Data

These datasets are either upstream sources or were used in the original experiments for the paper. They are kept here for documentation and to allow reproduction of the original results. You need to go back to this commit in irtm to use them for model training.

| Name | Description | Download |
|---|---|---|
| fb-contexts-v7 | Original dataset (our text) as used in the paper (all modes, all context sizes) | Link |
| fb-owe | Original dataset (Wikidata descriptions provided by shah/OWE) | Link |
| fb-db-contexts-v7 | Our text sampled by ecc for FB | Link |
| cde-contexts-v7 | Original dataset (our text) as used in the paper (all modes, all context sizes) | Link |
| cde-codex.en | Original dataset (texts provided by tsafavi/codex) | Link |
| cde-db-contexts-v7 | Our text sampled by ecc for CDE | Link |

Citation

If this is useful to you, please consider a citation:

@inproceedings{hamann2021open,
  title={Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery},
  author={Hamann, Felix and Ulges, Adrian and Krechel, Dirk and Bergmann, Ralph},
  booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},
  pages={252--264},
  year={2021},
  organization={Springer}
}
