
Benchmark datasets for entity resolution on knowledge graphs containing information about movies, TV shows, and persons from IMDB, TMDB, and TheTVDB

Project description

!! Update 2024-02-24 (fixed in 1.1.0) !!

We found that ent_links in some cases contained intra-dataset links, which is not immediately noticeable to the user. Another round of clerical review was performed, previously missed (transitive) links were added, and the ent_links files now only contain entity links between the datasets. The 721_5fold directories have been adapted accordingly. The intra-dataset links are now stored in {dataset_name}_intra_ent_links for each of the three datasets. What might also not be immediately obvious is that this dataset can be used as a multi-source entity resolution task. We therefore provide a multi_source_cluster file, with each line consisting of a cluster id and comma-separated cluster members from the three datasets, which can also include multiple entries for a single dataset.
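As a minimal sketch of how such a cluster file could be consumed (assuming each line is plain comma-separated text with the cluster id as the first field — check the actual file for the exact delimiter; the example line below is made up, not taken from the real data):

```python
def parse_cluster_line(line: str) -> tuple[str, list[str]]:
    """Split one cluster line into (cluster_id, members).

    Assumes the first comma-separated field is the cluster id and the
    remaining fields are entity ids from the three datasets.
    """
    fields = line.strip().split(",")
    return fields[0], fields[1:]

# hypothetical example line, not from the actual file
cid, members = parse_cluster_line("0,imdb_e1,tmdb_e7,tvdb_e3")
print(cid, members)  # → 0 ['imdb_e1', 'tmdb_e7', 'tvdb_e3']
```

Because a cluster may contain multiple entries for a single dataset, downstream code should not assume exactly one member per source.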

Dataset License

Due to licensing restrictions we are not allowed to distribute the IMDB datasets (more info on their license can be found here). What we can do is let you build the IMDB side of the entity resolution datasets yourself. Please be aware that the mentioned license applies to the IMDB data you produce.

Usage

You can simply install the package via pip:

pip install moviegraphbenchmark

and then run

moviegraphbenchmark

which will create the data in the default data path ~/.data/moviegraphbenchmark/data

You can also define a specific folder if you want with

moviegraphbenchmark --data-path anotherpath

For ease-of-usage in your project you can also use this library for loading the data (this will create the data if it's not present):

from moviegraphbenchmark import load_data
ds = load_data()
# by default this will load `imdb-tmdb`
print(ds.attr_triples_1)

# specify other pair and specific data path
ds = load_data(pair="imdb-tmdb", data_path="anotherpath")

# the dataclass contains all the files loaded as pandas dataframes
print(ds.attr_triples_2)
print(ds.rel_triples_1)
print(ds.rel_triples_2)
print(ds.ent_links)
for fold in ds.folds:
    print(fold)

# the intra-dataset links are stored as a tuple of dataframes
print(ds.intra_ent_links[0])
print(ds.intra_ent_links[1])
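Since ent_links and the fold link files are plain dataframes, evaluating a set of predicted matches against the gold standard reduces to set comparison. A minimal sketch, assuming the link frames have two columns (left and right entity ids); precision_recall and the toy ids are illustrative, not part of the package:

```python
import pandas as pd

def precision_recall(pred_pairs, gold_links: pd.DataFrame):
    """Compare predicted entity pairs against a gold-standard link frame.

    Assumes gold_links has two columns: left and right entity ids.
    """
    gold = set(map(tuple, gold_links.itertuples(index=False, name=None)))
    pred = set(pred_pairs)
    tp = len(pred & gold)  # true positives: predicted pairs that are gold links
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# toy illustration with made-up ids (not real benchmark entities)
gold = pd.DataFrame({"left": ["i1", "i2"], "right": ["t1", "t2"]})
print(precision_recall([("i1", "t1"), ("i1", "t9")], gold))  # → (0.5, 0.5)
```

The same helper works for a fold's test links when using the 721_5fold splits.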

Alternatively, this dataset (among others) is also available in sylloge.

Dataset structure

There are three entity resolution tasks in this repository: imdb-tmdb, imdb-tvdb, and tmdb-tvdb, all contained in the data folder. The data structure mainly follows the structure used in OpenEA. Each folder contains the information of the knowledge graphs (attr_triples_*, rel_triples_*) and the gold standard of entity links between the datasets (ent_links). The triples are labeled with 1 and 2, where e.g. for imdb-tmdb 1 refers to imdb and 2 to tmdb. The folder 721_5fold contains pre-split entity link folds with a 70-20-10 ratio for testing, training, and validation. Furthermore, there is a file with intra-dataset links for each dataset, called *_intra_ent_links. For the binary cases each dataset has a cluster file in the respective folder. Each line there is a cluster with comma-separated members; this includes intra- and inter-dataset links. For the multi-source setting, you can use the multi_source_cluster file in the data folder. Using sylloge you can also easily load this dataset as a multi-source task:

from sylloge import MovieGraphBenchmark
ds = MovieGraphBenchmark(graph_pair='multi')

Citing

This dataset was first presented in this paper:

@inproceedings{EAGERKGCW2021,
  author    = {Daniel Obraczka and
               Jonathan Schuchart and
               Erhard Rahm},
  editor    = {David Chaves-Fraga and
               Anastasia Dimou and
               Pieter Heyvaert and
               Freddy Priyatna and
               Juan Sequeda},
  title     = {Embedding-Assisted Entity Resolution for Knowledge Graphs},
  booktitle = {Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference (ESWC 2021), Online, June 5, 2021},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {2873},
  publisher = {CEUR-WS.org},
  year      = {2021},
  url       = {http://ceur-ws.org/Vol-2873/paper8.pdf},
}
