Small library to simplify collecting and loading of entity alignment benchmark datasets

sylloge

This simple library aims to collect entity-alignment benchmark datasets and make them easily available.

Usage

Load benchmark datasets:

>>> from sylloge import OpenEA
>>> ds = OpenEA()
>>> ds
OpenEA(backend=pandas, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
>>> ds.rel_triples_right.head()
                                       head                             relation                                    tail
0   http://www.wikidata.org/entity/Q6176218   http://www.wikidata.org/entity/P27     http://www.wikidata.org/entity/Q145
1   http://www.wikidata.org/entity/Q212675  http://www.wikidata.org/entity/P161  http://www.wikidata.org/entity/Q446064
2   http://www.wikidata.org/entity/Q13512243  http://www.wikidata.org/entity/P840      http://www.wikidata.org/entity/Q84
3   http://www.wikidata.org/entity/Q2268591   http://www.wikidata.org/entity/P31   http://www.wikidata.org/entity/Q11424
4   http://www.wikidata.org/entity/Q11300470  http://www.wikidata.org/entity/P178  http://www.wikidata.org/entity/Q170420
>>> ds.attr_triples_left.head()
                                  head                                          relation                                               tail
0  http://dbpedia.org/resource/E534644                http://dbpedia.org/ontology/imdbId                                            0044475
1  http://dbpedia.org/resource/E340590               http://dbpedia.org/ontology/runtime  6480.0^^<http://www.w3.org/2001/XMLSchema#double>
2  http://dbpedia.org/resource/E840454  http://dbpedia.org/ontology/activeYearsStartYear     1948^^<http://www.w3.org/2001/XMLSchema#gYear>
3  http://dbpedia.org/resource/E971710       http://purl.org/dc/elements/1.1/description                          English singer-songwriter
4  http://dbpedia.org/resource/E022831       http://dbpedia.org/ontology/militaryCommand                     Commandant of the Marine Corps

The gold standard entity links are stored as an [eche](https://github.com/dobraczka/eche) ClusterHelper, which provides convenient functionality:

>>> ds.ent_links.clusters[0]
{'http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186'}
>>> ('http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186') in ds.ent_links
True
>>> ('http://dbpedia.org/resource/E123186', 'http://www.wikidata.org/entity/Q21197') in ds.ent_links
True
>>> ds.ent_links.links('http://www.wikidata.org/entity/Q21197')
'http://dbpedia.org/resource/E123186'
>>> ds.ent_links.all_pairs()
<itertools.chain object at 0x7f92c6287c10>
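
The `itertools.chain` object shown above is a lazy iterator, so pairs are only produced once it is consumed. The following is a minimal, library-independent sketch of how pairwise links could be derived from clusters. The `clusters` list and the `all_pairs` helper are illustrative stand-ins and are not eche's actual implementation.

```python
from itertools import chain, combinations

# Hypothetical clusters for illustration; each set groups entities
# that refer to the same real-world thing.
clusters = [
    {"http://www.wikidata.org/entity/Q21197", "http://dbpedia.org/resource/E123186"},
    {"a1", "b1", "c1"},
]


def all_pairs(clusters):
    """Lazily yield every unordered pair of entities within each cluster."""
    # sorted() makes the pair order deterministic, since sets are unordered
    return chain.from_iterable(combinations(sorted(c), 2) for c in clusters)


pairs = list(all_pairs(clusters))  # materialize the lazy iterator
print(len(pairs))  # 1 pair from the two-element cluster + 3 from the three-element one
```

Wrapping the result in `list(...)` is the simplest way to inspect the pairs, at the cost of holding all of them in memory.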

Most datasets are binary matching tasks, but for example the MovieGraphBenchmark provides a multi-source setting:

>>> from sylloge import MovieGraphBenchmark
>>> ds = MovieGraphBenchmark(graph_pair="multi")
>>> ds
MovieGraphBenchmark(backend=pandas,graph_pair=multi, rel_triples_0=17507, attr_triples_0=20800 rel_triples_1=27903, attr_triples_1=23761 rel_triples_2=15455, attr_triples_2=20902, ent_links=3598, folds=5)
>>> ds.dataset_names
('imdb', 'tmdb', 'tvdb')

Here the PrefixedClusterHelper provides various convenience functions.

Get pairs between specific dataset pairs:

>>> list(ds.ent_links.pairs_in_ds_tuple(("imdb","tmdb")))[0]
('https://www.scads.de/movieBenchmark/resource/IMDB/nm0641721', 'https://www.scads.de/movieBenchmark/resource/TMDB/person1236714')

Get number of intra-dataset pairs:

>>> ds.ent_links.number_of_intra_links
(1, 64, 22663)
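
To make the notion of an intra-dataset pair concrete, here is a small sketch that counts, per dataset, the pairs whose two members share the same dataset prefix. It assumes that membership is decided by URI prefix. The `prefixes` mapping, the example URIs, and both helper functions are hypothetical and do not reflect how PrefixedClusterHelper computes its counts internally.

```python
from itertools import combinations

# Hypothetical prefixes and clusters, for illustration only
prefixes = {
    "imdb": "https://example.org/IMDB/",
    "tmdb": "https://example.org/TMDB/",
}
clusters = [
    {"https://example.org/IMDB/a", "https://example.org/IMDB/b", "https://example.org/TMDB/x"},
    {"https://example.org/IMDB/c", "https://example.org/TMDB/y"},
]


def dataset_of(entity):
    """Return the name of the dataset whose prefix the entity id starts with."""
    for name, prefix in prefixes.items():
        if entity.startswith(prefix):
            return name
    raise ValueError(f"no known prefix for {entity}")


def count_intra_links(clusters):
    """Count, per dataset, the pairs whose two members come from that dataset."""
    counts = dict.fromkeys(prefixes, 0)
    for cluster in clusters:
        for left, right in combinations(sorted(cluster), 2):
            if dataset_of(left) == dataset_of(right):
                counts[dataset_of(left)] += 1
    return counts


intra = count_intra_links(clusters)
print(intra)  # {'imdb': 1, 'tmdb': 0}
```

Only the pair of the two IMDB entities in the first cluster counts as intra-dataset; every other pair links entities from different datasets.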

Every dataset instance has a canonical name, which you can use, for example, to name folders for experiment results:

>>> ds = OpenEA()
>>> ds.canonical_name
'openea_d_w_15k_v1'
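
As a rough sketch of the folder use case, the canonical name can serve directly as a directory name. The results root (a temporary directory here) and the `metrics.txt` file name are assumptions made for illustration. The name is hard-coded so the snippet runs without downloading a dataset; in practice it would come from `ds.canonical_name`.

```python
import tempfile
from pathlib import Path

# In practice this would be ds.canonical_name; a literal keeps the sketch self-contained
canonical_name = "openea_d_w_15k_v1"

# A temporary directory stands in for a real (hypothetical) results root
results_root = Path(tempfile.mkdtemp())
run_dir = results_root / canonical_name
run_dir.mkdir(parents=True, exist_ok=True)

# Write a placeholder file where experiment results for this dataset would go
(run_dir / "metrics.txt").write_text("placeholder\n")
print(run_dir.name)  # openea_d_w_15k_v1
```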

You can use dask as backend for larger datasets:

>>> ds = OpenEA(backend="dask")
>>> ds
OpenEA(backend=dask, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)

This replaces pandas DataFrames with dask DataFrames.

Datasets can be written and read as parquet via to_parquet and read_parquet. After the initial read, datasets are cached in this format. The cache_path can be set explicitly, and caching can be disabled via use_cache=False when initializing a dataset.

Some datasets come with pre-determined splits:

tree ~/.data/sylloge/open_ea/cached/D_W_15K_V1
├── attr_triples_left_parquet
├── attr_triples_right_parquet
├── dataset_names.txt
├── ent_links_parquet
├── folds
│   ├── 1
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 2
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 3
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 4
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   └── 5
│       ├── test_parquet
│       ├── train_parquet
│       └── val_parquet
├── rel_triples_left_parquet
└── rel_triples_right_parquet

Others do not:

tree ~/.data/sylloge/oaei/cached/starwars_swg
├── attr_triples_left_parquet
│   └── part.0.parquet
├── attr_triples_right_parquet
│   └── part.0.parquet
├── dataset_names.txt
├── ent_links_parquet
│   └── part.0.parquet
├── rel_triples_left_parquet
│   └── part.0.parquet
└── rel_triples_right_parquet
    └── part.0.parquet
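
Given the two cache layouts shown above, one simple way to tell whether a cached dataset ships with pre-determined splits is to look for a `folds` directory. This is a minimal sketch on throwaway directories that mimic those layouts. The `list_folds` helper is hypothetical and not part of sylloge.

```python
import tempfile
from pathlib import Path


def list_folds(cache_dir):
    """Return the sorted fold names under cache_dir/folds, or [] if there are none."""
    folds_dir = Path(cache_dir) / "folds"
    if not folds_dir.is_dir():
        return []
    return sorted(p.name for p in folds_dir.iterdir() if p.is_dir())


# Build two throwaway directories that mimic the cached layouts
root = Path(tempfile.mkdtemp())
with_folds = root / "D_W_15K_V1"
for fold in range(1, 6):
    for split in ("train_parquet", "test_parquet", "val_parquet"):
        (with_folds / "folds" / str(fold) / split).mkdir(parents=True)
without_folds = root / "starwars_swg"
(without_folds / "ent_links_parquet").mkdir(parents=True)

print(list_folds(with_folds))     # ['1', '2', '3', '4', '5']
print(list_folds(without_folds))  # []
```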

Installation

pip install sylloge

Datasets

| Dataset family name | Year | # of Datasets | Sources | References |
|---|---|---|---|---|
| OpenEA | 2020 | 16 | DBpedia, Yago, Wikidata | Paper, Repo |
| MED-BBK | 2020 | 1 | Baidu Baike | Paper, Repo |
| MovieGraphBenchmark | 2022 | 3 | IMDB, TMDB, TheTVDB | Paper, Repo |
| OAEI | 2022 | 5 | Fandom wikis | Paper, Website |

Broader statistics are provided in dataset_statistics.csv. You can also get a pandas DataFrame with statistics for specific datasets, for example to create tables for publications:

>>> ds = MovieGraphBenchmark(graph_pair="multi")
>>> from sylloge.base import create_statistics_df
>>> stats_df = create_statistics_df([ds])
>>> stats_df.loc[("MovieGraphBenchmark","moviegraphbenchmark_multi","imdb")]
                                                            Entities  Relation Triples  Attribute Triples  ...  Clusters  Intra-dataset Matches  All Matches
Dataset family      Task Name                 Dataset Name                                                 ...
MovieGraphBenchmark moviegraphbenchmark_multi imdb              5129             17507              20800  ...      3598                      1        31230

[1 rows x 9 columns]
