Benchmark datasets for Entity Resolution on Knowledge Graphs containing information about movies, TV shows and persons from IMDB, TMDB and TheTVDB
Dataset License
Due to licensing, we are not allowed to distribute the IMDB datasets (see IMDB's license terms for details). What we can do is let you build the IMDB side of the entity resolution datasets yourself. Please be aware that the mentioned license applies to the IMDB data you produce.
Usage
You can simply install the package via pip:
pip install moviegraphbenchmark
and then run
moviegraphbenchmark
which will create the data in the default data path ~/.data/moviegraphbenchmark/data
You can also define a specific folder if you want with
moviegraphbenchmark --data-path anotherpath
For ease of use in your project, you can also load the data through this library (this will create the data if it is not present):
from moviegraphbenchmark import load_data
ds = load_data()
# by default this will load `imdb-tmdb`
print(ds.attr_triples_1)
# specify the pair and a specific data path
ds = load_data(pair="imdb-tmdb", data_path="anotherpath")
# the dataclass contains all the files loaded as pandas dataframes
print(ds.attr_triples_2)
print(ds.rel_triples_1)
print(ds.rel_triples_2)
print(ds.ent_links)
for fold in ds.folds:
    print(fold)
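The gold standard in ds.ent_links is typically what predicted matches are scored against. The following is a minimal sketch of such a comparison, not part of moviegraphbenchmark: the function name evaluate_links is hypothetical, and it assumes each link is a pair of entity identifiers (one from each knowledge graph).

```python
def evaluate_links(predicted, gold):
    """Return (precision, recall, f1) of predicted links against gold links.

    Both arguments are iterables of (left_id, right_id) pairs.
    Hypothetical helper for illustration; not part of moviegraphbenchmark.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy identifiers, made up for this example.
gold_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
precision, recall, f1 = evaluate_links(predicted_links, gold_links)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

If ds.ent_links is a two-column pandas dataframe, as assumed here, its rows could be turned into pairs with something like map(tuple, ds.ent_links.to_numpy()) before being passed in.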
Dataset structure
There are 3 entity resolution tasks in this repository: imdb-tmdb, imdb-tvdb, tmdb-tvdb, all contained in the data folder.
The data structure follows the structure used in OpenEA.
Each folder contains the information of the knowledge graphs (attr_triples_*, rel_triples_*) and the gold standard of entity links (ent_links). The triples are labeled with 1 and 2, where e.g. for imdb-tmdb 1 refers to imdb and 2 to tmdb. The folder 721_5fold contains pre-split entity link folds with a 70-20-10 ratio for testing, training, validation.
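If you prefer to read a task folder without the library, the layout described above can be parsed with the standard library alone. This is a sketch under stated assumptions: that the files are tab-separated with no header row, as in OpenEA, and that they are named exactly as listed above with no file extension. The helper names read_tsv and load_task_folder are hypothetical.

```python
import csv
from pathlib import Path

# File names as described in the dataset structure section (assumed to have no extension).
FILE_NAMES = [
    "attr_triples_1",
    "attr_triples_2",
    "rel_triples_1",
    "rel_triples_2",
    "ent_links",
]


def read_tsv(path):
    """Read a tab-separated file into a list of tuples, skipping empty lines."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [tuple(row) for row in reader if row]


def load_task_folder(folder):
    """Load all triple and link files of one task folder into a dict of lists."""
    folder = Path(folder)
    return {name: read_tsv(folder / name) for name in FILE_NAMES}
```

For example, load_task_folder(Path.home() / ".data" / "moviegraphbenchmark" / "data" / "imdb-tmdb") would read the imdb-tmdb task, assuming the data was created in the default data path and the task folders sit directly inside it.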
Citing
This dataset was first presented in this paper:
@inproceedings{EAGERKGCW2021,
author = {Daniel Obraczka and
Jonathan Schuchart and
Erhard Rahm},
editor = {David Chaves-Fraga and
Anastasia Dimou and
Pieter Heyvaert and
Freddy Priyatna and
Juan Sequeda},
title = {Embedding-Assisted Entity Resolution for Knowledge Graphs},
booktitle = {Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference (ESWC 2021), Online, June 5, 2021},
series = {{CEUR} Workshop Proceedings},
volume = {2873},
publisher = {CEUR-WS.org},
year = {2021},
url = {http://ceur-ws.org/Vol-2873/paper8.pdf},
}