Skip to main content

Small library to simplify collecting and loading of entity alignment benchmark datasets

Project description

sylloge logo

sylloge

Actions Status Documentation Status Stable python versions Code style: black

This simple library aims to collect entity-alignment benchmark datasets and make them easily available.

Usage

Load benchmark datasets:

>>> from sylloge import OpenEA
>>> ds = OpenEA()
>>> ds
OpenEA(backend=pandas, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
>>> ds.rel_triples_right.head()
                                       head                             relation                                    tail
0   http://www.wikidata.org/entity/Q6176218   http://www.wikidata.org/entity/P27     http://www.wikidata.org/entity/Q145
1   http://www.wikidata.org/entity/Q212675  http://www.wikidata.org/entity/P161  http://www.wikidata.org/entity/Q446064
2   http://www.wikidata.org/entity/Q13512243  http://www.wikidata.org/entity/P840      http://www.wikidata.org/entity/Q84
3   http://www.wikidata.org/entity/Q2268591   http://www.wikidata.org/entity/P31   http://www.wikidata.org/entity/Q11424
4   http://www.wikidata.org/entity/Q11300470  http://www.wikidata.org/entity/P178  http://www.wikidata.org/entity/Q170420
>>> ds.attr_triples_left.head()
                                  head                                          relation                                               tail
0  http://dbpedia.org/resource/E534644                http://dbpedia.org/ontology/imdbId                                            0044475
1  http://dbpedia.org/resource/E340590               http://dbpedia.org/ontology/runtime  6480.0^^<http://www.w3.org/2001/XMLSchema#double>
2  http://dbpedia.org/resource/E840454  http://dbpedia.org/ontology/activeYearsStartYear     1948^^<http://www.w3.org/2001/XMLSchema#gYear>
3  http://dbpedia.org/resource/E971710       http://purl.org/dc/elements/1.1/description                          English singer-songwriter
4  http://dbpedia.org/resource/E022831       http://dbpedia.org/ontology/militaryCommand                     Commandant of the Marine Corps
>>> ds.ent_links.head()
                                  left                                    right
0  http://dbpedia.org/resource/E123186    http://www.wikidata.org/entity/Q21197
1  http://dbpedia.org/resource/E228902  http://www.wikidata.org/entity/Q5909974
2  http://dbpedia.org/resource/E718575   http://www.wikidata.org/entity/Q707008
3  http://dbpedia.org/resource/E469216  http://www.wikidata.org/entity/Q1471945
4  http://dbpedia.org/resource/E649433  http://www.wikidata.org/entity/Q1198381

You can get a canonical name for a dataset instance to use e.g. to create folders to store experiment results:

   >>> ds.canonical_name
   'openea_d_w_15k_v1'

Create id-mapped dataset for embedding-based methods:

>>> from sylloge import IdMappedEADataset
>>> id_mapped_ds = IdMappedEADataset.from_ea_dataset(ds)
>>> id_mapped_ds
IdMappedEADataset(rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, entity_mapping=30000, rel_mapping=417, attr_rel_mapping=990, attr_mapping=138836, folds=5)
>>> id_mapped_ds.rel_triples_right
[[26048   330 16880]
 [19094   293 23348]
 [16554   407 29192]
 ...
 [16480   330 15109]
 [18465   254 19956]
 [26040   290 28560]]

You can use dask as backend for larger datasets:

>>> ds = OpenEA(backend="dask")
>>> ds
OpenEA(backend=dask, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)

Which replaces pandas DataFrames with dask DataFrames.

Datasets can be written/read as parquet via to_parquet or read_parquet. After the initial read datasets are cached using this format. The cache_path can be explicitly set and caching behaviour can be disable via use_cache=False, when initalizing a dataset.

Some datasets come with pre-determined splits:

tree ~/.data/sylloge/open_ea/cached/D_W_15K_V1 
├── attr_triples_left_parquet
├── attr_triples_right_parquet
├── dataset_names.txt
├── ent_links_parquet
├── folds
│   ├── 1      ├── test_parquet
│      ├── train_parquet
│      └── val_parquet
│   ├── 2      ├── test_parquet
│      ├── train_parquet
│      └── val_parquet
│   ├── 3      ├── test_parquet
│      ├── train_parquet
│      └── val_parquet
│   ├── 4      ├── test_parquet
│      ├── train_parquet
│      └── val_parquet
│   └── 5       ├── test_parquet
│       ├── train_parquet
│       └── val_parquet
├── rel_triples_left_parquet
└── rel_triples_right_parquet

some don't:

tree ~/.data/sylloge/oaei/cached/starwars_swg
├── attr_triples_left_parquet
│   └── part.0.parquet
├── attr_triples_right_parquet
│   └── part.0.parquet
├── dataset_names.txt
├── ent_links_parquet
│   └── part.0.parquet
├── rel_triples_left_parquet
│   └── part.0.parquet
└── rel_triples_right_parquet
    └── part.0.parquet

Installation

pip install sylloge 

Datasets

Dataset family name Year # of Datasets Sources References
OpenEA 2020 16 DBpedia, Yago, Wikidata Paper, Repo
MovieGraphBenchmark 2022 3 IMDB, TMDB, TheTVDB Paper, Repo
OAEI 2022 5 Fandom wikis Paper, Website

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sylloge-0.2.1.tar.gz (20.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sylloge-0.2.1-py3-none-any.whl (20.8 kB view details)

Uploaded Python 3

File details

Details for the file sylloge-0.2.1.tar.gz.

File metadata

  • Download URL: sylloge-0.2.1.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.1 CPython/3.10.6 Linux/5.19.0-38-generic

File hashes

Hashes for sylloge-0.2.1.tar.gz
Algorithm Hash digest
SHA256 db612d5bfb28e01e174e1ba5d5d84cd441257aa5b4645e2fc587ce0e0dfba0f4
MD5 822bd64bba505cbaf97465792f06e895
BLAKE2b-256 15e4b7e3444826219cad14d828825bfa33ddbacaa9411f009fd9e75e297b6af8

See more details on using hashes here.

File details

Details for the file sylloge-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: sylloge-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.1 CPython/3.10.6 Linux/5.19.0-38-generic

File hashes

Hashes for sylloge-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c6d65c05579e92f0142bc56ac3517a1ced0cadd22951f87f4dfda820c8c638e9
MD5 d6d377f48a6a0314d6ab56e24ebc891a
BLAKE2b-256 638f2b18dc70866ede0ea302dea8194810fdbd11b393710d26ce500d21f602ea

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page