A Unified Library for Entity Linking

entity-linkings

entity-linkings is a unified library for entity linking.

Installation

# from PyPI (TODO)
pip install entity-linkings

# from the source
git clone git@github.com:YuSawan/entity_linkings.git
cd entity_linkings
pip install .

# for uv users
git clone git@github.com:YuSawan/entity_linkings.git
cd entity_linkings
uv sync

Quick Start

entity-linkings provides two interfaces: a command-line interface (CLI) and a Python API.

CLI

The command-line interface can train, evaluate, and run entity linking systems. To create an EL system, you must first build a candidate retriever with entitylinkings-train-retrieval. In this example, the e5bm25 retriever is trained on a custom dataset.

entitylinkings-train-retrieval \
    --retriever_id  e5bm25 \
    --train_file train.jsonl \
    --validation_file validation.jsonl \
    --dictionary_id_or_path dictionary.jsonl \
    --output_dir save_model/ \
    --num_hard_negatives 4 \
    --num_train_epochs 10 \
    --train_batch_size 8 \
    --validation_batch_size 16 \
    --config config.yaml \
    --wandb

Next, Entity Disambiguation (ED) and End-to-End Entity Linking (EL) systems can be trained with entitylinkings-train. This example trains FEVRY with the custom candidate retriever built above.

entitylinkings-train \
    --model_type ed \
    --model_id fevry \
    --model_name_or_path google-bert/bert-base-uncased \
    --retriever_id e5bm25 \
    --retriever_model_name_or_path save_model/ \
    --dictionary_id_or_path dictionary.jsonl \
    --train_file train.jsonl \
    --validation_file validation.jsonl \
    --num_candidates 30 \
    --num_train_epochs 2 \
    --train_batch_size 8 \
    --validation_batch_size 16 \
    --output_dir save_fevry/ \
    --config config.yaml \
    --wandb

Finally, you can evaluate retrievers or EL systems with entitylinkings-eval-retrieval or entitylinkings-eval, respectively.

entitylinkings-eval-retrieval \
    --retriever_id <retriever_id> \
    --model_name_or_path save_model/ \
    --dictionary_id_or_path dictionary.jsonl \
    --test_file test.jsonl \
    --config config.yaml \
    --output_dir result/ \
    --test_batch_size 256 \
    --wandb

entitylinkings-eval \
    --model_type ed \
    --model_id fevry \
    --model_name_or_path save_fevry/ \
    --retriever_id e5bm25 \
    --retriever_model_name_or_path save_model/ \
    --dictionary_id_or_path dictionary.jsonl \
    --test_file test.jsonl \
    --config config.yaml \
    --output_dir result/ \
    --test_batch_size 256 \
    --wandb
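
When the commands above are launched from a script (for example, to evaluate several checkpoints), it can help to assemble the argument list programmatically. The following is a minimal sketch: build_command is a hypothetical helper, not part of entity-linkings, and it only reuses flags that appear in the examples above.

```python
import shlex


def build_command(executable, options):
    """Turn an options mapping into an argv list for one of the CLI tools.

    A value of True emits a bare flag (e.g. --wandb); False or None skips it.
    Hypothetical helper for illustration, not part of entity-linkings.
    """
    argv = [executable]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


eval_cmd = build_command(
    "entitylinkings-eval-retrieval",
    {
        "retriever_id": "e5bm25",
        "model_name_or_path": "save_model/",
        "dictionary_id_or_path": "dictionary.jsonl",
        "wandb": False,  # skipped: logging to wandb is turned off here
        "test_file": "test.jsonl",
        "config": "config.yaml",
        "output_dir": "result/",
        "test_batch_size": 256,
    },
)

# Print the shell-quoted command for inspection before running it.
print(shlex.join(eval_cmd))
```

The resulting list can be passed to subprocess.run(eval_cmd, check=True) once the package is installed and the files exist.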

You can change the arguments (e.g., context length) using a configuration file. A config.yaml with default values can be generated via entitylinkings-gen-config.

entitylinkings-gen-config

Python API

This is an example of running ChatEL with the ZELDA candidate list via the API. Valid IDs for get_retrievers() and get_models() can be found with get_retriever_ids() and get_model_ids(), respectively.

from entity_linkings import get_retrievers, get_models, load_dictionary

# Load Dictionary from dictionary_id or local path
dictionary = load_dictionary('zelda')

# Load Candidate Retriever
retriever_cls = get_retrievers('zeldacl')
retriever = retriever_cls(
    dictionary,
    config=retriever_cls.Config()
)

# Setup ED or EL models
model_cls = get_models('chatel')
model = model_cls(
    task='ed',
    retriever=retriever,
    config=model_cls.Config("gpt-4o")
)

# Prediction
sentence = "NAIST is in Ikoma."
spans = [(0, 5)]
predictions = model.predict(sentence, spans, top_k=1)

print("ID: ", predictions[0][0]["id"])
print("Title: ", predictions[0][0]["prediction"])
print("Score: ",  predictions[0][0]["score"])
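
In the example above, the span (0, 5) appears to be a pair of character offsets with an exclusive end, since sentence[0:5] is "NAIST". Under that assumption, spans for a known mention string could be computed as in the sketch below. find_mention_spans is a hypothetical helper and not part of entity-linkings.

```python
def find_mention_spans(text, mention):
    """Return every (start, end) character span where `mention` occurs in `text`.

    Offsets are assumed to be 0-based with an exclusive end, matching
    the (0, 5) span used for "NAIST" in the example above.
    """
    if not mention:
        return []
    spans = []
    start = text.find(mention)
    while start != -1:
        spans.append((start, start + len(mention)))
        start = text.find(mention, start + 1)
    return spans


sentence = "NAIST is in Ikoma."
spans = find_mention_spans(sentence, "NAIST")
print(spans)  # [(0, 5)]
```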

Available Models

Candidate Retriever

Entity Disambiguation

Entity Dictionary

Available Dictionaries

| dictionary_id | Dataset | Language | Domain |
| --- | --- | --- | --- |
| kilt_wiki | KILT (Petroni et al., 2021) | English | Wikipedia |
| zelda_wiki | ZELDA (Milich and Akbik, 2023) | English | Wikipedia |
| zeshel_wikia | ZeshEL (Logeswaran et al., 2021) | English | Wikia |

  • Please obtain the source data for the entity dictionary from the following link.
  • If you place the data in entity_linkings/entity_dictionary/dictionary_id/, load_dictionary(<dictionary_id>) will automatically convert the data.
  • We plan to support downloading these dictionaries directly via libraries such as HuggingFace Datasets.

Custom Entity Dictionary

If you want to use this package with a custom ontology, you need to convert it to the following format:

{
  "id": "000011",
  "name": "NAIST",
  "description": "NAIST is located in Ikoma."
}
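
Since the CLI examples pass a dictionary.jsonl file, a custom dictionary is presumably stored as JSON Lines, with one entry of the above form per line. The sketch below serializes entries under that assumption. dictionary_to_jsonl is a hypothetical helper, and the second entry is made up for illustration.

```python
import json

REQUIRED_KEYS = ("id", "name", "description")


def dictionary_to_jsonl(entities):
    """Serialize entity entries to JSON Lines, one entry per line.

    Raises ValueError on a missing key or a duplicate id. Hypothetical
    helper for illustration, not part of entity-linkings.
    """
    seen_ids = set()
    lines = []
    for entity in entities:
        missing = [key for key in REQUIRED_KEYS if key not in entity]
        if missing:
            raise ValueError(f"entry {entity!r} is missing keys: {missing}")
        if entity["id"] in seen_ids:
            raise ValueError(f"duplicate entity id: {entity['id']}")
        seen_ids.add(entity["id"])
        # ensure_ascii=False keeps non-ASCII names readable in the file.
        lines.append(json.dumps(entity, ensure_ascii=False))
    return lines


entities = [
    {"id": "000011", "name": "NAIST", "description": "NAIST is located in Ikoma."},
    # Illustrative second entry, not taken from the project.
    {"id": "000012", "name": "Ikoma", "description": "Ikoma is a city in Nara Prefecture."},
]

jsonl_lines = dictionary_to_jsonl(entities)
print("\n".join(jsonl_lines))
```

Writing "\n".join(jsonl_lines) plus a trailing newline to dictionary.jsonl would then give a file that can be passed to --dictionary_id_or_path, assuming the JSON Lines layout is what the tools expect.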

Datasets

Public datasets

| dataset_id | Dataset | Domain | Language | Ontology | Train | Licence |
| --- | --- | --- | --- | --- | --- | --- |
| msnbc | MSNBC (Cucerzan, 2007) | News | English | Wikipedia | | Unknown* |
| aquaint | AQUAINT (Milne and Witten, 2008) | News | English | Wikipedia | | Unknown* |
| ace2004 | ACE2004 (Ratinov et al., 2011) | News | English | Wikipedia | | Unknown* |
| kore50 | KORE50 (Hoffart et al., 2012) | News | English | Wikipedia | | CC BY-SA 3.0 |
| n3-r128 | N3-Reuters-128 (Röder et al., 2014) | News | English | Wikipedia | | GNU AGPL-3.0 |
| n3-r500 | N3-RSS-500 (Röder et al., 2014) | RSS | English | Wikipedia | | GNU AGPL-3.0 |
| derczynski | Derczynski (Derczynski et al., 2015) | Twitter | English | Wikipedia | | CC-BY 4.0 |
| oke-2015 | OKE-2015 (Nuzzolese et al., 2015) | News | English | Wikipedia | | Unknown* |
| oke-2016 | OKE-2016 (Nuzzolese et al., 2015) | News | English | Wikipedia | | Unknown* |
| wned-wiki | WNED-WIKI (Guo and Barbosa, 2018) | Wikipedia | English | Wikipedia | | Unknown |
| wned-cweb | WNED-CWEB (Guo and Barbosa, 2018) | Web | English | Wikipedia | | Apache License 2.0 |
| unseen | WikilinksNED Unseen-Mentions (Onoe and Durrett, 2020) | News | English | Wikipedia | | CC-BY 3.0* |
| tweeki | Tweeki EL (Harandizadeh and Singh, 2020) | Twitter | English | Wikipedia | | Apache License 2.0 |
| reddit-comments | Reddit EL (Botzer et al., 2021) | Reddit | English | Wikipedia | | CC-BY 4.0 |
| reddit-posts | Reddit EL (Botzer et al., 2021) | Reddit | English | Wikipedia | | CC-BY 4.0 |
| shadowlink-shadow | ShadowLink (Provatorova et al., 2021) | Wikipedia | English | Wikipedia | | Unknown* |
| shadowlink-top | ShadowLink (Provatorova et al., 2021) | Wikipedia | English | Wikipedia | | Unknown* |
| shadowlink-tail | ShadowLink (Provatorova et al., 2021) | Wikipedia | English | Wikipedia | | Unknown* |
| zeshel | Zeshel (Logeswaran et al., 2021) | Wikia | English | Wikia | | CC-BY-SA |
| docred | Linked-DocRED (Genest et al., 2023) | News | English | Wikipedia | | CC-BY 4.0 |

  • The original MSNBC dataset (Cucerzan, 2007) is no longer available because the official link has expired. You can download it from the official GERBIL code repository.
  • It is unclear whether ShadowLink and OKE-{2015,2016} may be used publicly, but they are provided in their official repositories.
  • WikilinksNED Unseen-Mentions is created by splitting WikilinksNED, which is derived from the Wikilinks corpus made available under CC-BY 3.0.
  • The following datasets are not publicly available, or their availability is uncertain. If you want to evaluate on these resources, please register with the LDC and convert the datasets to our format.
    • TACKBP-2010 (Ji et al., 2011): You must sign the Text Analysis Conference (TAC) Knowledge Base Population Evaluation License Agreement.

Custom Dataset

If you want to use this package with your own private dataset, you must convert it to the following format:

{
  "id": "doc-001-P1",
  "text": "She graduated from NAIST.",
  "entities": [{"start": 19, "end": 24, "label": ["000011"]}]
}
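
Before training on a converted dataset, it can be worth checking that every span falls inside the text and that every label exists in the entity dictionary. The sketch below assumes that start and end are character offsets with an exclusive end (text[19:24] is "NAIST" in the record above) and that label holds dictionary ids. check_record is a hypothetical helper, not part of entity-linkings.

```python
def check_record(record, known_ids=None):
    """Return a list of problems found in one dataset record (empty if none).

    Assumes 0-based character offsets with an exclusive end. If `known_ids`
    is given, every label must be one of those dictionary ids.
    """
    problems = []
    text = record.get("text", "")
    for index, entity in enumerate(record.get("entities", [])):
        start, end = entity["start"], entity["end"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"entity {index}: span ({start}, {end}) is outside the text")
        if known_ids is not None:
            for label in entity["label"]:
                if label not in known_ids:
                    problems.append(f"entity {index}: unknown label {label!r}")
    return problems


record = {
    "id": "doc-001-P1",
    "text": "She graduated from NAIST.",
    "entities": [{"start": 19, "end": 24, "label": ["000011"]}],
}

print(record["text"][19:24])  # NAIST
print(check_record(record, known_ids={"000011"}))  # []
```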
