Toolkit for Wiki-Entity-Summarization datasets (WikES) that helps end-users work with WikES datasets.

wiki-entity-summarization-toolkit

A user-friendly toolkit for the Wiki-Entity-Summarization (WikES) and ESBM datasets.

Parent project

This toolkit is part of the Wiki-Entity-Summarization project (WikES).

Installation

pip install wikes-toolkit

Usage

WikES usage (object based)

The object-based API exposes each dataset as a graph object with methods for entities, triples, summaries, and scores. Here is an example:

from wikes_toolkit import WikESToolkit, WikESVersions, WikESGraph

toolkit = WikESToolkit(save_path="./data")  # save_path is optional
G = toolkit.load_graph(
    WikESGraph,
    WikESVersions.V1.WikiCinema.SMALL,
    entity_formatter=lambda e: f"Entity({e.wikidata_label})",
    predicate_formatter=lambda p: f"Predicate({p.label})",
    triple_formatter=lambda t: (
        f"({t.subject_entity.wikidata_label})-[{t.predicate.label}]-> "
        f"({t.object_entity.wikidata_label})"
    ),
)

root_nodes = G.root_entities()
nodes = G.entities()
nx_graph = G._G
edges = G.triples()
labels = G.predicates()
number_of_nodes = G.total_entities()
number_of_directed_edges = G.total_triples()
node = G.fetch_entity('Q303')
node_degree = G.degree('Q303')
neighbors = G.neighbors(node)
ground_truth_summaries = G.ground_truths(root_nodes[0])
G.mark_triple_as_summary(edges[0].subject_entity, edges[0])

for root in G.root_entities():
    print(f"Neighbors of [{root}]:")
    for triple in G.neighbors(root):
        print(triple)

    print("*" * 40)

    print("Ground truth summaries:")
    for summary in G.ground_truths(root):
        print(summary)
    G.mark_triples_as_summaries(root, G.neighbors(root))
    break

f1 = G.f1_score()
f1_5 = G.f1_score(5)
f1_10 = G.f1_score(10)
map_all = G.map_score()  # named map_all to avoid shadowing the built-in map()
map_5 = G.map_score(5)
map_10 = G.map_score(10)
""" Output of the above code:
Neighbors of [Entity(Elvis Presley)]:
(Elvis Presley)-[military unit]-> (32nd Cavalry Regiment)
(Elvis Presley)-[genre]-> (blues)
...
(Jim Morrison)-[influenced by]-> (Elvis Presley)
(Elvis Country – I'm 10,000 Years Old)-[performer]-> (Elvis Presley)
(The King)-[main subject]-> (Elvis Presley)
****************************************
Ground truth summaries:
(Elvis Presley)-[genre]-> (rockabilly)
(Million Dollar Quartet)-[has part(s)]-> (Elvis Presley)
(Jailhouse Rock)-[cast member]-> (Elvis Presley)
...
(Viva Las Vegas)-[cast member]-> (Elvis Presley)
(Elvis Presley)-[genre]-> (rhythm and blues)
(Elvis Presley)-[record label]-> (Sun Records)
(Elvis Presley)-[genre]-> (pop music)
"""

To load all graphs, you can use the load_all_graphs method:

from wikes_toolkit import WikESToolkit, WikESVersions, WikESGraph

for dataset_name, G in WikESToolkit(save_path="./data").load_all_graphs(WikESGraph, WikESVersions.V1):
    print(f"Dataset [{dataset_name}]:")
    print(G.total_entities())
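
The f1_score(k) calls in the object-based example compare the triples marked as summaries with the ground-truth summaries. As a rough illustration of the metric itself, a top-k F1 for a single entity could be sketched as below. This is not the toolkit's implementation, and its exact truncation and averaging rules may differ; the helper name f1_at_k and the plain-tuple representation of triples are assumptions made for this example.

```python
from typing import Hashable, Optional, Sequence, Set


def f1_at_k(predicted: Sequence[Hashable], gold: Set[Hashable], k: Optional[int] = None) -> float:
    """F1 between the first k predicted triples and the gold set (hypothetical helper)."""
    top_k = list(predicted) if k is None else list(predicted)[:k]
    if not top_k or not gold:
        return 0.0
    unique_top_k = set(top_k)
    hits = len(unique_top_k & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(unique_top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy triples written as plain tuples; the gold set here is illustrative only.
gold = {
    ("Elvis Presley", "genre", "rockabilly"),
    ("Elvis Presley", "record label", "Sun Records"),
    ("Elvis Presley", "genre", "pop music"),
}
predicted = [
    ("Elvis Presley", "genre", "rockabilly"),
    ("Elvis Presley", "genre", "blues"),
    ("Elvis Presley", "record label", "Sun Records"),
]

score_all = f1_at_k(predicted, gold)        # 2 of 3 predictions are correct
score_top1 = f1_at_k(predicted, gold, k=1)  # only the first prediction is scored
```

With k=1 precision is perfect but recall drops to one third, which is why the cut-off matters when comparing scores.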

Pandas usage

The toolkit also provides a variant that stores the graph data in pandas DataFrames. To use it, pass PandasWikESGraph as the first argument of load_graph:

from wikes_toolkit import WikESToolkit, PandasWikESGraph, WikESVersions

toolkit = WikESToolkit()
G = toolkit.load_graph(PandasWikESGraph, WikESVersions.V1.WikiLitArt.SMALL, entity_formatter=lambda e: e.identifier)

root_nodes = G.root_entities()
first_root_node = G.root_entity_ids()[0]
nodes = G.entities()
edges = G.triples()
first_edge = G.fetch_triple(edges.iloc[0])
labels = G.predicates()
number_of_nodes = G.total_entities()
number_of_directed_edges = G.total_triples()
node = G.fetch_entity('Q303')
node_degree = G.degree('Q303')
ground_truths = G.ground_truths(node)
neighbors = G.neighbors(node)
ground_truth_summaries = G.ground_truths(first_root_node)
G.mark_triple_as_summary(
    ground_truths.iloc[0].name,
    (ground_truths.iloc[0]['subject'], ground_truths.iloc[0]['predicate'], ground_truths.iloc[0]['object'])
)
G.mark_triples_as_summaries(node, neighbors.iloc[0:2])
f1 = G.f1_score()
f1_5 = G.f1_score(5)
f1_10 = G.f1_score(10)
map_all = G.map_score()  # named map_all to avoid shadowing the built-in map()
map_norel = G.map_score(no_rel=True)
map_5 = G.map_score(5)
map_10 = G.map_score(10)
G.clear_summaries()
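
For readers unfamiliar with the map_score values, the sketch below shows one common way to compute mean average precision over ranked summaries. It is an illustration under stated assumptions, not the toolkit's code: the helper names are hypothetical, the normalisation by min(k, number of gold triples) is one of several conventions, ranked lists are assumed to contain no duplicates, and the no_rel option is not modelled here.

```python
from typing import Dict, Hashable, Optional, Sequence, Set


def average_precision(ranked: Sequence[Hashable], gold: Set[Hashable], k: Optional[int] = None) -> float:
    """Average precision of one ranked list against one gold set (hypothetical helper)."""
    if not gold:
        return 0.0
    top = list(ranked) if k is None else list(ranked)[:k]
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(top, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalise by the number of gold items reachable within the cut-off.
    denominator = len(gold) if k is None else min(k, len(gold))
    return precision_sum / denominator


def mean_average_precision(
    rankings: Dict[str, Sequence[Hashable]],
    golds: Dict[str, Set[Hashable]],
    k: Optional[int] = None,
) -> float:
    """Mean of the per-entity average precision values."""
    scores = [average_precision(rankings.get(entity, []), gold, k) for entity, gold in golds.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two entities with short ranked lists of triple labels.
rankings = {"entity_a": ["a", "x", "b"], "entity_b": ["y", "c"]}
golds = {"entity_a": {"a", "b"}, "entity_b": {"c"}}

map_full = mean_average_precision(rankings, golds)
map_at_1 = mean_average_precision(rankings, golds, k=1)
```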

Export WikESGraphs as CSV files

You can export a graph with the export_as_csv method, which writes it to the given directory as three CSV files:

from wikes_toolkit import WikESToolkit, WikESVersions, PandasWikESGraph

toolkit = WikESToolkit()

for dataset, G in toolkit.load_all_graphs(PandasWikESGraph, WikESVersions.V1):
    G.export_as_csv("./data")

Note: This method is currently implemented only for PandasWikESGraph; support for the other graph classes is not yet implemented.
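
Until export_as_csv is available for the other graph classes, triples can be written out by hand with the standard csv module. The sketch below assumes the triples have already been converted to (subject, predicate, object) string tuples; the helper name write_triples_csv and the column names are hypothetical, and the resulting file is not claimed to match the layout that export_as_csv produces.

```python
import csv
import os
import tempfile
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def write_triples_csv(triples: Iterable[Triple], path: str) -> int:
    """Write triples to a CSV file with a header row; return the number of data rows."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["subject", "predicate", "object"])
        for subject, predicate, obj in triples:
            writer.writerow([subject, predicate, obj])
            count += 1
    return count


# Toy triples; the second label contains a comma, which csv quotes automatically.
triples = [
    ("Elvis Presley", "genre", "blues"),
    ("Elvis Country – I'm 10,000 Years Old", "performer", "Elvis Presley"),
]
out_path = os.path.join(tempfile.mkdtemp(), "triples.csv")
rows_written = write_triples_csv(triples, out_path)

# Read the file back to confirm the round trip.
with open(out_path, newline="", encoding="utf-8") as handle:
    rows_read = list(csv.reader(handle))
```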

ESBM

The toolkit also covers the ESBM datasets, with almost the same functionality and syntax as above. To see how the ESBM datasets were converted to NetworkX format, or to use the GraphML versions of these datasets, see the ESBM-to-nx-format repository.

Here is an example of loading the ESBM and ESBM_Plus datasets with this toolkit:

from wikes_toolkit import WikESToolkit, ESBMGraph, ESBMVersions

toolkit = WikESToolkit()
G = toolkit.load_graph(
    ESBMGraph,
    ESBMVersions.Plus.DBPEDIA_FULL,  # or ESBMVersions.V1Dot2.DBPEDIA_FULL, ESBMVersions.V1Dot2.LMDB_TRAIN_1, etc.
    entity_formatter=lambda e: e.identifier
)

root_nodes = G.root_entities()
first_root_node = G.root_entity_ids()[0]
nodes = G.entities()
edges = G.triples()
labels = G.predicates()
number_of_nodes = G.total_entities()
number_of_directed_edges = G.total_triples()
node = G.fetch_entity('http://dbpedia.org/resource/Adrian_Griffin')
node_degree = G.degree('http://dbpedia.org/resource/Adrian_Griffin')
gold_top5_0 = G.gold_top_5(node, 0)
gold_top10_0 = G.gold_top_10(node, 0)
neighbors = G.neighbors(node)  # or G.neighbors('http://dbpedia.org/resource/Adrian_Griffin')

G.mark_triples_as_summaries(
    "http://dbpedia.org/resource/3WAY_FM",
    [
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://purl.org/dc/terms/subject',
            'http://dbpedia.org/resource/Category:Radio_stations_in_Victoria'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://xmlns.com/foaf/0.1/homepage',
            'http://3wayfm.org.au'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://dbpedia.org/ontology/broadcastArea',
            'http://dbpedia.org/resource/Victoria_(Australia)'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://dbpedia.org/ontology/callsignMeaning',
            '3 - Victoria'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://xmlns.com/foaf/0.1/name',
            '3WAY FM@en'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://www.w3.org/2000/01/rdf-schema#label',
            '3WAY FM@en'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://dbpedia.org/ontology/callsignMeaning',
            'Warrnambool And You'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://dbpedia.org/ontology/broadcastArea',
            'http://dbpedia.org/resource/Warrnambool'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
            'http://schema.org/RadioStation'
        ),
        (
            'http://dbpedia.org/resource/3WAY_FM',
            'http://purl.org/dc/terms/subject',
            'http://dbpedia.org/resource/Category:Radio_stations_established_in_1990'
        )]
)

G.clear_summaries()

G.mark_triples_as_summaries(
    "http://dbpedia.org/resource/3WAY_FM",
    [(
        'http://dbpedia.org/resource/3WAY_FM',
        'http://purl.org/dc/terms/subject',
        'http://dbpedia.org/resource/Category:Radio_stations_in_Victoria'
    )]
)

G.mark_triples_as_summaries(
    "http://dbpedia.org/resource/Adrian_Griffin",
    [
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://dbpedia.org/ontology/activeYearsEndYear',
            '2008'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://purl.org/dc/terms/subject',
            'http://dbpedia.org/resource/Category:Orlando_Magic_assistant_coaches'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://dbpedia.org/ontology/birthDate',
            '1974-07-04'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://dbpedia.org/ontology/height',
            '1.9558'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://dbpedia.org/ontology/weight',
            '98431.2'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://dbpedia.org/ontology/team',
            'http://dbpedia.org/resource/Orlando_Magic'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://purl.org/dc/terms/subject',
            'http://dbpedia.org/resource/Category:Basketball_players_from_Kansas'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
            'http://dbpedia.org/class/yago/BasketballPlayersFromKansas'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
            'http://dbpedia.org/class/yago/BasketballCoach109841955'
        ),
        (
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
            'http://dbpedia.org/class/yago/BasketballPlayer109842047'
        ),
    ]
)

f1_5 = G.f1_score(5)
f1_10 = G.f1_score(10)
map_5 = G.map_score(5)
map_10 = G.map_score(10)
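
The second argument of gold_top_5 and gold_top_10 appears to select one of several gold summaries for the same entity. When an entity has more than one gold summary, a usual way to score a predicted summary is to compute F1 against each gold summary and average the results. The sketch below illustrates that idea only; it makes no claim about how f1_score aggregates internally, and the helper names and toy labels are assumptions.

```python
from typing import Hashable, Iterable, Sequence, Set


def f1(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """Plain F1 between two sets of triples."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_over_golds(predicted: Iterable[Hashable], golds: Sequence[Iterable[Hashable]]) -> float:
    """Average the F1 of one predicted summary over every gold summary (hypothetical helper)."""
    if not golds:
        return 0.0
    predicted_set = set(predicted)
    return sum(f1(predicted_set, set(gold)) for gold in golds) / len(golds)


# Toy labels standing in for triples; these are not actual ESBM annotations.
predicted_summary = ["t1", "t2"]
gold_summaries = [["t1", "t2"], ["t1", "t3"]]

mean_f1 = mean_f1_over_golds(predicted_summary, gold_summaries)
```

Here the prediction matches the first gold summary exactly (F1 of 1.0) and shares one of two triples with the second (F1 of 0.5), so the average is 0.75.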

Using Ranked NT File as Summary (Legacy Support)

The toolkit can mark summaries from ranked .nt files, but we don't recommend this feature because it doesn't align with the toolkit's core approach of marking triples directly.

from wikes_toolkit import WikESToolkit, PandasESBMGraph, ESBMGraph, ESBMVersions

toolkit = WikESToolkit()
G = toolkit.load_graph(
    ESBMGraph, # or PandasESBMGraph
    ESBMVersions.V1Dot2.DBPEDIA_FULL,
    entity_formatter=lambda e: e.identifier,
)

# Mark summaries for '3WAY_FM'
G.mark_nt_file_as_summary(
    1,  # '3WAY_FM' eid
    './result/dbpedia/1/1_rank.nt'  # .nt file path
)

# Mark summaries for 'Adrian_Griffin'
G.mark_nt_file_as_summary(
    'http://dbpedia.org/resource/Adrian_Griffin',  # ESBM entity name
    './result/dbpedia/2/2_rank.nt'  # .nt file path
)

f1_5 = G.f1_score(5)
f1_10 = G.f1_score(10)
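
To make the file format concrete, the sketch below reads a ranked .nt file as a list of triples, assuming one N-Triples statement per line with the highest-ranked triple first. It is a deliberately simplified reader, not the parser used by mark_nt_file_as_summary: the function name parse_ranked_nt is hypothetical, the regular expression handles only IRI subjects and predicates, and literals are kept verbatim without unescaping.

```python
import re
from typing import List, Optional, Tuple

# Simplified N-Triples pattern: IRI subject, IRI predicate, then an IRI or literal object.
_NT_LINE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$")


def parse_ranked_nt(text: str, k: Optional[int] = None) -> List[Tuple[str, str, str]]:
    """Return the triples in file order, optionally truncated to the top k."""
    triples = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = _NT_LINE.match(line)
        if match is None:
            continue  # skip lines this simplified pattern cannot read
        subject, predicate, obj = match.groups()
        if obj.startswith("<") and obj.endswith(">"):
            obj = obj[1:-1]  # strip the angle brackets from IRI objects
        triples.append((subject, predicate, obj))
    return triples if k is None else triples[:k]


# A made-up two-line sample in N-Triples syntax.
sample = """\
<http://dbpedia.org/resource/3WAY_FM> <http://dbpedia.org/ontology/broadcastArea> <http://dbpedia.org/resource/Warrnambool> .
<http://dbpedia.org/resource/3WAY_FM> <http://xmlns.com/foaf/0.1/name> "3WAY FM"@en .
"""

top_1 = parse_ranked_nt(sample, k=1)
all_triples = parse_ranked_nt(sample)
```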

ESBM Pandas usage

For the pandas version of the ESBM datasets, use PandasESBMGraph instead of ESBMGraph:

from wikes_toolkit import WikESToolkit, PandasESBMGraph, ESBMVersions
import pandas as pd

toolkit = WikESToolkit()
G = toolkit.load_graph(PandasESBMGraph, ESBMVersions.V1Dot2.DBPEDIA_FULL, entity_formatter=lambda e: e.identifier)

first_root_node_id = G.root_entity_ids()[0]
root_nodes = G.root_entities()
first_root_node = G.root_entities().iloc[0]
nodes = G.entities()
edges = G.triples()
labels = G.predicates()
number_of_nodes = G.total_entities()
number_of_directed_edges = G.total_triples()
node = G.fetch_entity('http://dbpedia.org/resource/Adrian_Griffin')
node_degree = G.degree('http://dbpedia.org/resource/Adrian_Griffin')
gold_top5_0 = G.gold_top_5(node, 0)
gold_top10_0 = G.gold_top_10(node, 0)
neighbors = G.neighbors(node)
all_golds = G.all_gold_top_k(5)
golds = G.gold_top_5(G.root_entities().iloc[0], 4)

G.mark_triples_as_summaries(
    "http://dbpedia.org/resource/Adrian_Griffin",
    pd.Series(
        [
            'http://dbpedia.org/resource/Adrian_Griffin',
            'http://dbpedia.org/ontology/activeYearsEndYear',
            '2008'
        ],
        index=['subject', 'predicate', 'object']
    )
)
f1_5 = G.f1_score(5)

Citation

If you use this project in your research, please cite the following paper:

@misc{javadi2024wiki,
    title = {Wiki Entity Summarization Benchmark},
    author = {Saeedeh Javadi and Atefeh Moradan and Mohammad Sorkhpar and Klim Zaporojets and Davide Mottin and Ira Assent},
    year = {2024},
    eprint = {2406.08435},
    archivePrefix = {arXiv},
    primaryClass = {cs.IR}
}

If you use the ESBM datasets, please cite their paper as well as ours:

@inproceedings{esbm,
    author = {Qingxia Liu and Gong Cheng and Kalpa Gunaratna and Yuzhong Qu},
    title = {ESBM: An Entity Summarization Benchmark},
    booktitle = {ESWC},
    year = {2020}
}

License

This project is licensed under the CC-BY-4.0 License. See the LICENSE file for details. The ESBM datasets carry their own license; consult the ESBM project for its terms.
