Small library to simplify collecting and loading of entity alignment benchmark datasets
sylloge
This simple library aims to collect entity-alignment benchmark datasets and make them easily available.
Usage
Load benchmark datasets:
>>> from sylloge import OpenEA
>>> ds = OpenEA()
>>> ds
OpenEA(backend=pandas, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
>>> ds.rel_triples_right.head()
head relation tail
0 http://www.wikidata.org/entity/Q6176218 http://www.wikidata.org/entity/P27 http://www.wikidata.org/entity/Q145
1 http://www.wikidata.org/entity/Q212675 http://www.wikidata.org/entity/P161 http://www.wikidata.org/entity/Q446064
2 http://www.wikidata.org/entity/Q13512243 http://www.wikidata.org/entity/P840 http://www.wikidata.org/entity/Q84
3 http://www.wikidata.org/entity/Q2268591 http://www.wikidata.org/entity/P31 http://www.wikidata.org/entity/Q11424
4 http://www.wikidata.org/entity/Q11300470 http://www.wikidata.org/entity/P178 http://www.wikidata.org/entity/Q170420
>>> ds.attr_triples_left.head()
head relation tail
0 http://dbpedia.org/resource/E534644 http://dbpedia.org/ontology/imdbId 0044475
1 http://dbpedia.org/resource/E340590 http://dbpedia.org/ontology/runtime 6480.0^^<http://www.w3.org/2001/XMLSchema#double>
2 http://dbpedia.org/resource/E840454 http://dbpedia.org/ontology/activeYearsStartYear 1948^^<http://www.w3.org/2001/XMLSchema#gYear>
3 http://dbpedia.org/resource/E971710 http://purl.org/dc/elements/1.1/description English singer-songwriter
4 http://dbpedia.org/resource/E022831 http://dbpedia.org/ontology/militaryCommand Commandant of the Marine Corps
>>> ds.ent_links.head()
left right
0 http://dbpedia.org/resource/E123186 http://www.wikidata.org/entity/Q21197
1 http://dbpedia.org/resource/E228902 http://www.wikidata.org/entity/Q5909974
2 http://dbpedia.org/resource/E718575 http://www.wikidata.org/entity/Q707008
3 http://dbpedia.org/resource/E469216 http://www.wikidata.org/entity/Q1471945
4 http://dbpedia.org/resource/E649433 http://www.wikidata.org/entity/Q1198381
Each dataset instance has a canonical name, which is useful, for example, for naming the folders that store experiment results:
>>> ds.canonical_name
'openea_d_w_15k_v1'
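As a minimal sketch of that use case, the helper below builds a per-dataset results folder from a canonical name. The function name `results_dir`, the `run` label, and the folder layout are hypothetical and not part of sylloge; only the string `'openea_d_w_15k_v1'` comes from the output above.

```python
from pathlib import Path


def results_dir(canonical_name: str, base: Path, run: str = "run_0") -> Path:
    """Create and return <base>/<canonical_name>/<run> (hypothetical layout)."""
    target = Path(base) / canonical_name / run
    # parents=True creates the intermediate folders; exist_ok=True makes reruns safe.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

With a loaded dataset this could be called as `results_dir(ds.canonical_name, Path("experiments"))`, so that results from different dataset variants end up in separate folders.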
Create an id-mapped dataset for embedding-based methods:
>>> from sylloge import IdMappedEADataset
>>> id_mapped_ds = IdMappedEADataset.from_ea_dataset(ds)
>>> id_mapped_ds
IdMappedEADataset(rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, entity_mapping=30000, rel_mapping=417, attr_rel_mapping=990, attr_mapping=138836, folds=5)
>>> id_mapped_ds.rel_triples_right
[[26048 330 16880]
[19094 293 23348]
[16554 407 29192]
...
[16480 330 15109]
[18465 254 19956]
[26040 290 28560]]
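To illustrate what id mapping means here, the following sketch replaces URI strings with consecutive integers. It is not sylloge's implementation: `build_mapping` and `map_triples` are hypothetical helpers, and the ids assigned by `IdMappedEADataset` may follow a different order.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_mapping(labels: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integer ids in order of first appearance."""
    mapping: Dict[str, int] = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return mapping


def map_triples(
    triples: List[Triple],
) -> Tuple[List[Tuple[int, int, int]], Dict[str, int], Dict[str, int]]:
    # Heads and tails share one entity id space; relations get their own.
    entity_mapping = build_mapping(e for h, _, t in triples for e in (h, t))
    rel_mapping = build_mapping(r for _, r, _ in triples)
    mapped = [
        (entity_mapping[h], rel_mapping[r], entity_mapping[t])
        for h, r, t in triples
    ]
    return mapped, entity_mapping, rel_mapping


# Two relation triples taken from the rel_triples_right output shown earlier.
triples = [
    (
        "http://www.wikidata.org/entity/Q6176218",
        "http://www.wikidata.org/entity/P27",
        "http://www.wikidata.org/entity/Q145",
    ),
    (
        "http://www.wikidata.org/entity/Q2268591",
        "http://www.wikidata.org/entity/P31",
        "http://www.wikidata.org/entity/Q11424",
    ),
]
mapped, entity_mapping, rel_mapping = map_triples(triples)
print(mapped)  # [(0, 0, 1), (2, 1, 3)]
```

Integer triples of this kind are what embedding-based methods typically index into their embedding tables, which is why the mapping sizes appear in the `IdMappedEADataset` representation.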
You can use dask as the backend for larger datasets:
>>> ds = OpenEA(backend="dask")
>>> ds
OpenEA(backend=dask, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
This replaces the pandas DataFrames with dask DataFrames.
Datasets can be written to and read from parquet via to_parquet and read_parquet.
After the initial read, datasets are cached in this format. The cache_path can be set explicitly, and caching can be disabled via use_cache=False when initializing a dataset.
Some datasets come with pre-determined splits:
tree ~/.data/sylloge/open_ea/cached/D_W_15K_V1
├── attr_triples_left_parquet
├── attr_triples_right_parquet
├── dataset_names.txt
├── ent_links_parquet
├── folds
│   ├── 1
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 2
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 3
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 4
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   └── 5
│       ├── test_parquet
│       ├── train_parquet
│       └── val_parquet
├── rel_triples_left_parquet
└── rel_triples_right_parquet
Others do not:
tree ~/.data/sylloge/oaei/cached/starwars_swg
├── attr_triples_left_parquet
│   └── part.0.parquet
├── attr_triples_right_parquet
│   └── part.0.parquet
├── dataset_names.txt
├── ent_links_parquet
│   └── part.0.parquet
├── rel_triples_left_parquet
│   └── part.0.parquet
└── rel_triples_right_parquet
    └── part.0.parquet
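Assuming the cache layouts shown above, a short sketch can check whether a cached dataset directory contains folds and list their splits. `list_folds` is a hypothetical helper, not part of sylloge; it relies only on the `folds/<n>/<split>_parquet` naming visible in the tree output.

```python
from pathlib import Path
from typing import Dict, List

SUFFIX = "_parquet"


def list_folds(cache_dir: Path) -> Dict[str, List[str]]:
    """Map each fold name to its sorted split names.

    Returns an empty dict when the directory has no `folds` subfolder,
    as in the second layout above.
    """
    folds_dir = Path(cache_dir) / "folds"
    if not folds_dir.is_dir():
        return {}
    return {
        fold.name: sorted(
            split.name[: -len(SUFFIX)]  # strip the "_parquet" suffix
            for split in fold.iterdir()
            if split.name.endswith(SUFFIX)
        )
        for fold in sorted(folds_dir.iterdir())
        if fold.is_dir()
    }
```

For the first layout this would yield five entries, each with the splits `test`, `train`, and `val`; for the second it would return an empty dict.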
Installation
pip install sylloge
Datasets
| Dataset family name | Year | # of Datasets | Sources | References |
|---|---|---|---|---|
| OpenEA | 2020 | 16 | DBpedia, Yago, Wikidata | Paper, Repo |
| MovieGraphBenchmark | 2022 | 3 | IMDB, TMDB, TheTVDB | Paper, Repo |
| OAEI | 2022 | 5 | Fandom wikis | Paper, Website |
File details
Details for the file sylloge-0.2.1.tar.gz.
File metadata
- Download URL: sylloge-0.2.1.tar.gz
- Upload date:
- Size: 20.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.1 CPython/3.10.6 Linux/5.19.0-38-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | db612d5bfb28e01e174e1ba5d5d84cd441257aa5b4645e2fc587ce0e0dfba0f4 |
| MD5 | 822bd64bba505cbaf97465792f06e895 |
| BLAKE2b-256 | 15e4b7e3444826219cad14d828825bfa33ddbacaa9411f009fd9e75e297b6af8 |
File details
Details for the file sylloge-0.2.1-py3-none-any.whl.
File metadata
- Download URL: sylloge-0.2.1-py3-none-any.whl
- Upload date:
- Size: 20.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.1 CPython/3.10.6 Linux/5.19.0-38-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c6d65c05579e92f0142bc56ac3517a1ced0cadd22951f87f4dfda820c8c638e9 |
| MD5 | d6d377f48a6a0314d6ab56e24ebc891a |
| BLAKE2b-256 | 638f2b18dc70866ede0ea302dea8194810fdbd11b393710d26ce500d21f602ea |