
Benchmark datasets for entity resolution on knowledge graphs containing information about movies, TV shows, and persons from IMDB, TMDB, and TheTVDB

Project description

!! Update 2024-02-24 (fixed in 1.1.0) !!

We found that ent_links in some cases contained intra-dataset links, which is not immediately noticeable to the user. Another round of clerical review was performed, previously missed (transitive) links were added, and the ent_links files now only contain entity links between the datasets. The 721_5fold directories have been adapted accordingly. The intra-dataset links are now stored in {dataset_name}_intra_ent_links for each of the three datasets. What might also not be immediately obvious is that this dataset can be used as a multi-source entity resolution task. We therefore provide a multi_source_cluster file, with each line consisting of a cluster id and comma-separated cluster members from the three datasets, which can also include multiple entries for a single dataset.
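As a minimal sketch of how such a cluster file could be consumed (assuming each line is plain comma-separated text with the cluster id as the first field — check the actual file for the exact delimiter; the example line below is made up, not taken from the real data):

```python
def parse_cluster_line(line: str) -> tuple[str, list[str]]:
    """Split one cluster line into (cluster_id, members).

    Assumes the first comma-separated field is the cluster id and the
    remaining fields are entity ids from the three datasets.
    """
    fields = line.strip().split(",")
    return fields[0], fields[1:]

# hypothetical example line, not from the actual file
cid, members = parse_cluster_line("0,imdb_e1,tmdb_e7,tvdb_e3")
print(cid, members)  # → 0 ['imdb_e1', 'tmdb_e7', 'tvdb_e3']
```

Because a cluster may contain multiple entries for a single dataset, downstream code should not assume exactly one member per source.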

Dataset License

Due to licensing restrictions we are not allowed to distribute the IMDB datasets (more info on their license can be found here). What we can do is let you build the IMDB side of the entity resolution datasets yourself. Please be aware that the mentioned license applies to the IMDB data you produce.

Usage

You can simply install the package via pip:

pip install moviegraphbenchmark

and then run

moviegraphbenchmark

which will create the data in the default data path ~/.data/moviegraphbenchmark/data

You can also define a specific folder if you want with

moviegraphbenchmark --data-path anotherpath

For ease-of-usage in your project you can also use this library for loading the data (this will create the data if it's not present):

from moviegraphbenchmark import load_data
ds = load_data()
# by default this will load `imdb-tmdb`
print(ds.attr_triples_1)

# specify other pair and specific data path
ds = load_data(pair="imdb-tmdb", data_path="anotherpath")

# the dataclass contains all the files loaded as pandas dataframes
print(ds.attr_triples_2)
print(ds.rel_triples_1)
print(ds.rel_triples_2)
print(ds.ent_links)
for fold in ds.folds:
    print(fold)

# the intra-dataset links are stored as a tuple of dataframes
print(ds.intra_ent_links[0])
print(ds.intra_ent_links[1])
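Since ent_links and the fold link files are plain dataframes, evaluating a set of predicted matches against the gold standard reduces to set comparison. A minimal sketch, assuming the link frames have two columns (left and right entity ids); precision_recall and the toy ids are illustrative, not part of the package:

```python
import pandas as pd

def precision_recall(pred_pairs, gold_links: pd.DataFrame):
    """Compare predicted entity pairs against a gold-standard link frame.

    Assumes gold_links has two columns: left and right entity ids.
    """
    gold = set(map(tuple, gold_links.itertuples(index=False, name=None)))
    pred = set(pred_pairs)
    tp = len(pred & gold)  # true positives: predicted pairs that are gold links
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# toy illustration with made-up ids (not real benchmark entities)
gold = pd.DataFrame({"left": ["i1", "i2"], "right": ["t1", "t2"]})
print(precision_recall([("i1", "t1"), ("i1", "t9")], gold))  # → (0.5, 0.5)
```

The same helper works for a fold's test links when using the 721_5fold splits.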

Alternatively, this dataset (among others) is also available in sylloge.

Dataset structure

There are three entity resolution tasks in this repository: imdb-tmdb, imdb-tvdb, and tmdb-tvdb, all contained in the data folder. The data structure mainly follows the structure used in OpenEA. Each folder contains the information of the knowledge graphs (attr_triples_*, rel_triples_*) and the gold standard of entity links between the datasets (ent_links). The triples are labeled with 1 and 2, where e.g. for imdb-tmdb 1 refers to imdb and 2 to tmdb. The folder 721_5fold contains pre-split entity link folds with a 70-20-10 ratio for testing, training, and validation. Furthermore, there is a file with intra-dataset links for each dataset, called *_intra_ent_links. For the binary cases each dataset has a cluster file in the respective folder. Each line there is a cluster with comma-separated members; this includes intra- and inter-dataset links. For the multi-source setting, you can use the multi_source_cluster file in the data folder. Using sylloge you can also easily load this dataset as a multi-source task:

from sylloge import MovieGraphBenchmark
ds = MovieGraphBenchmark(graph_pair='multi')

Citing

This dataset was first presented in this paper:

@inproceedings{EAGERKGCW2021,
  author    = {Daniel Obraczka and
               Jonathan Schuchart and
               Erhard Rahm},
  editor    = {David Chaves-Fraga and
               Anastasia Dimou and
               Pieter Heyvaert and
               Freddy Priyatna and
               Juan Sequeda},
  title     = {Embedding-Assisted Entity Resolution for Knowledge Graphs},
  booktitle = {Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference (ESWC 2021), Online, June 5, 2021},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {2873},
  publisher = {CEUR-WS.org},
  year      = {2021},
  url       = {http://ceur-ws.org/Vol-2873/paper8.pdf},
}
