# Inductive Reasoning with Text (IRT) - Benchmarks
This code is used to create benchmark datasets as described in *Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery* from a given knowledge graph (i.e. a triple set) and supplementary text. The two KGs evaluated in the paper (based on FB15k237 and CoDEx) are available for download below.
## Download
We offer two IRT reference datasets: the first - IRT-FB - is based on FB15k237 and the second - IRT-CDE - utilizes CoDEx. Each dataset offers knowledge graph triples for a closed-world (cw) and an open-world (ow) split. The ow split is partitioned into validation and test data. Each entity of the KG is assigned a set of text contexts in which that entity is mentioned.
| Name | Description | Download |
|---|---|---|
| IRT-CDE | Based on CoDEx | Link |
| IRT-FB | Based on FB15k237 | Link |
## Installation

Python 3.9 is required. We recommend miniconda for managing Python environments.

```bash
conda create -n irt python=3.9
conda activate irt
pip install irt-data
```
The `requirements.txt` contains additional packages used for development.
## Loading

Simply provide a path to an IRT dataset folder. The data is loaded lazily - that is why construction is fast, but the first invocation of `.description` takes a while.

```python
from irt import Dataset

dataset = Dataset('path/to/irt-fb')
print(dataset.description)
```
```
IRT DATASET

IRT GRAPH: irt-fb
  nodes: 14541
  edges: 310116 (237 types)
  degree:
    mean 42.65
    median 26

IRT SPLIT
2389 retained concepts

Config:
  seed: 26041992
  ow split: 0.7
  ow train split: 0.5
  relation threshold: 100
  git: 66fe7bd3c934311bdc3b1aa380b7c6c45fd7cd93
  date: 2021-07-21 17:29:04.339909

Closed World - TRAIN:
  owe: 12164
  entities: 12164
  heads: 11562
  tails: 11252
  triples: 238190

Open World - VALID:
  owe: 1558
  entities: 9030
  heads: 6907
  tails: 6987
  triples: 46503

Open World - TEST:
  owe: 819
  entities: 6904
  heads: 4904
  tails: 5127
  triples: 25423

IRT Text (Mode.CLEAN)
  mean contexts: 28.92
  median contexts: 30.00
  mean mentions: 2.84
  median mentions: 2.00
```
## Data Formats

The data in the provided dataset folders should be quite self-explanatory. Each entity and each relation is assigned a unique integer id (denoted `e` [entity], `h` [head], `t` [tail], and `r` [relation]). There is a folder containing the full graph data (`graph/`), a folder containing the open-world/closed-world splits (`split/`), and a folder containing the textual data (`text/`).
### Graph Data

This concerns both the data in `graph/` and `split/`. Entity and relation identifiers can be translated with `graph/entities.txt` and `graph/relations.txt` respectively. Triple sets come in `h t r` order. Reference code to load graph data (a hand-rolled alternative is sketched below):

- `irt.graph.Graph.load`
- `irt.data.dataset.Split.load`
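If you want to poke at the files without the library, a few lines of standard-library Python suffice. This is a minimal sketch, assuming one whitespace-delimited `<id> <name>` pair per line in the id files and one `h t r` triple per line in the triple sets - check the files themselves for the actual layout:

```python
from pathlib import Path

# hypothetical paths; check your dataset folder for the actual file names
root = Path('path/to/irt-fb')

def load_ids(path: Path) -> dict[int, str]:
    # assumes one "<id> <name>" pair per line
    with path.open() as fd:
        return dict(
            (int(idx), name)
            for idx, name in (line.strip().split(maxsplit=1) for line in fd)
        )

def load_triples(path: Path) -> list[tuple[int, int, int]]:
    # triple sets come in h t r order
    with path.open() as fd:
        return [(int(h), int(t), int(r)) for h, t, r in (line.split() for line in fd)]

entities = load_ids(root / 'graph' / 'entities.txt')
relations = load_ids(root / 'graph' / 'relations.txt')
```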
### Text Data

The upstream system that sampled our texts: ecc. All text comes gzipped and can be opened using the built-in Python gzip library. For inspection, you can use `zcat`, `zless`, `zgrep`, etc. (at least on unixoid systems ;)) - or extract them using `gunzip`. Reference code to load text data:

- `irt.data.dataset.Text.load`
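Since the files are plain gzip, peeking inside from Python takes only the standard library. A minimal sketch - the file name used here is illustrative, check your `text/` folder for the actual names:

```python
import gzip

# open one of the gzipped text files in text mode ('rt')
# and print the first few lines; the path is hypothetical
with gzip.open('path/to/irt-fb/text/contexts.txt.gz', mode='rt') as fd:
    for line, _ in zip(fd, range(3)):
        print(line.rstrip())
```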
## PyKEEN Dataset

For users of pykeen. There are two "views" on the triple sets: closed-world and open-world. Both simply offer pykeen `TriplesFactories` with an id-mapping to the IRT entity ids.

### Closed-World

```python
from irt import Dataset
from irt import KeenClosedWorld

dataset = Dataset('path/to/dataset')

# 'split' is either a single float, a tuple (for an additional
# test split), or a triple which must sum to 1
kcw = KeenClosedWorld(dataset=dataset, split=.8, seed=1234)

print(kcw.description)
```
```
IRT PYKEEN DATASET
irt-cde

training triples factory:
  entities: 12091
  relations: 51
  triples: 109910

validation triples factory:
  entities: 12091
  relations: 51
  triples: 27478
```
It offers `.training`, `.validation`, and `.testing` TriplesFactories, and `irt2keen`/`keen2irt` id-mappings.
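These factories plug directly into pykeen's training utilities. A minimal sketch, assuming a current pykeen version where `pykeen.pipeline.pipeline` accepts pre-built triples factories; the model and epoch count are arbitrary choices here:

```python
from pykeen.pipeline import pipeline

# assumes kcw was created with a three-way split, e.g.
# KeenClosedWorld(dataset=dataset, split=(.8, .1, .1), seed=1234),
# so that .testing is populated
result = pipeline(
    training=kcw.training,
    validation=kcw.validation,
    testing=kcw.testing,
    model='DistMult',
    training_kwargs=dict(num_epochs=5),
)
```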
### Open-World

```python
from irt import Dataset
from irt import KeenOpenWorld

dataset = Dataset('path/to/dataset')
kow = KeenOpenWorld(dataset=dataset)

print(kow.description)
```
```
IRT PYKEEN DATASET
irt-cde

closed world triples factory:
  entities: 12091
  relations: 51
  triples: 137388

open world validation triples factory:
  entities: 15101
  relations: 46
  triples: 41240

open world testing triples factory:
  entities: 17050
  relations: 48
  triples: 27577
```
It offers `.closed_world`, `.open_world_valid`, and `.open_world_test` TriplesFactories, and `irt2keen`/`keen2irt` id-mappings.
## Pytorch Dataset

For users of pytorch and/or pytorch-lightning. We offer a `torch.utils.data.Dataset`, a `torch.utils.data.DataLoader`, and a `pytorch_lightning.DataModule`. The dataset abstracts what a "sample" is and how to collate samples to batches:
```python
from irt import TorchDataset

# given you have loaded an irt.Dataset instance called "dataset"
# 'model_name' is one of huggingface.co/models
torch_dataset = TorchDataset(
    model_name='bert-base-cased',
    dataset=dataset,
    part=dataset.split.closed_world,
)

# a sample maps an entity to its tokenized text contexts:
torch_dataset[100]
# -> Tuple[int, List[List[int]]]
# (124, [[101, 1130, ...], ...])

# and it offers a collator for batching:
batch = TorchDataset.collate_fn([torch_dataset[0], torch_dataset[1]])
# batch: Tuple[Tuple[int], torch.Tensor]
len(batch[0])    # -> 60
batch[1].shape   # -> 60, 105
```
Note: Only the first invocation is slow, because the tokenizer needs to run. The tokenized text is saved to the IRT folder under `torch/` and re-used from then on.
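To feed batches into a training loop, the dataset and its collator pair up with a standard `torch.utils.data.DataLoader`. A minimal sketch (batch size and shuffling are arbitrary choices here):

```python
from torch.utils.data import DataLoader

# wrap the TorchDataset from above; collate_fn assembles the
# per-entity token-index lists into a padded tensor batch
loader = DataLoader(
    torch_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=TorchDataset.collate_fn,
)

for entity_ids, token_batch in loader:
    print(len(entity_ids), token_batch.shape)
    break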
## Bring Your Own Data

If you want to use this code to create your own open-world/closed-world split, you either need to bring your data in a format readable by the existing code base or extend this code for your own data model. See `ipynb/graph.split.ipynb` for a step-by-step guide.
## Legacy Data

These datasets are either used as upstream sources or were used in the original experiments for the paper. They are left here for documentation and to allow reproduction of the original results. To use the data for model training, you need to go back to this commit in irtm.

| Name | Description | Download |
|---|---|---|
| fb-contexts-v7 | Original dataset (our text) as used in the paper (all modes, all context sizes) | Link |
| fb-owe | Original dataset (Wikidata descriptions provided by shah/OWE) | Link |
| fb-db-contexts-v7 | Our text sampled by ecc for FB | Link |
| cde-contexts-v7 | Original dataset (our text) as used in the paper (all modes, all context sizes) | Link |
| cde-codex.en | Original dataset (texts provided by tsafavi/codex) | Link |
| cde-db-contexts-v7 | Our text sampled by ecc for CDE | Link |
## Citation

If this is useful to you, please consider a citation:

```bibtex
@inproceedings{hamann2021open,
  title={Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery},
  author={Hamann, Felix and Ulges, Adrian and Krechel, Dirk and Bergmann, Ralph},
  booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},
  pages={252--264},
  year={2021},
  organization={Springer}
}
```