Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget
Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget
A blazing fast and lightweight Information Extraction model for Entity Linking and Relation Extraction.
Installation
Installation from PyPI
pip install relik
Other installation options
Install with optional dependencies
Install with all the optional dependencies.
pip install relik[all]
Install with optional dependencies for training and evaluation.
pip install relik[train]
Install with optional dependencies for FAISS.
The FAISS PyPI package is only available for CPU. To use a GPU, install FAISS from source or use the conda package.
For CPU:
pip install relik[faiss]
For GPU:
conda create -n relik python=3.10
conda activate relik
# install pytorch
conda install -y pytorch=2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# GPU
conda install -y -c pytorch -c nvidia faiss-gpu=1.8.0
# or GPU with NVIDIA RAFT
conda install -y -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0
pip install relik
Install with optional dependencies for serving the models with FastAPI and Ray.
pip install relik[serve]
Installation from source
git clone https://github.com/SapienzaNLP/relik.git
cd relik
pip install -e .[all]
Quick Start
ReLiK is a lightweight and fast model for Entity Linking and Relation Extraction. It is composed of two main components: a retriever and a reader. The retriever is responsible for retrieving relevant documents from a large collection of documents, while the reader is responsible for extracting entities and relations from the retrieved documents. ReLiK can be used with the from_pretrained method to load a pre-trained pipeline.
Here is an example of how to use ReLiK for Entity Linking:
from relik import Relik
relik = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large")
relik("Michael Jordan was one of the best players in the NBA.")
and for Relation Extraction:
from relik import Relik
relik = Relik.from_pretrained("sapienzanlp/relik-relation-extraction-large")
relik("Michael Jordan was one of the best players in the NBA.")
The full list of available models can be found on 🤗 Hugging Face.
Retrievers and Readers can be used separately:
from relik import Relik
# If you want to use the retriever only
retriever = Relik.from_pretrained("sapienzanlp/relik-relation-extraction-large", reader=None)
# If you want to use the reader only
reader = Relik.from_pretrained("sapienzanlp/relik-relation-extraction-large", retriever=None)
CLI
ReLiK provides a CLI to perform inference on a text file or a directory of text files. The CLI can be used as follows:
relik inference --help
Usage: relik inference [OPTIONS] MODEL_NAME_OR_PATH INPUT_PATH OUTPUT_PATH
╭─ Arguments ──────────────────────────────────────────────────────────────╮
│ *    model_name_or_path      TEXT  [default: None] [required]            │
│ *    input_path              TEXT  [default: None] [required]            │
│ *    output_path             TEXT  [default: None] [required]            │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────╮
│ --batch-size                       INTEGER  [default: 8]                 │
│ --num-workers                      INTEGER  [default: 4]                 │
│ --device                           TEXT     [default: cuda]              │
│ --precision                        TEXT     [default: fp16]              │
│ --top-k                            INTEGER  [default: 100]               │
│ --window-size                      INTEGER  [default: None]              │
│ --window-stride                    INTEGER  [default: None]              │
│ --annotation-type                  TEXT     [default: char]              │
│ --progress-bar  --no-progress-bar           [default: progress-bar]      │
│ --model-kwargs                     TEXT     [default: None]              │
│ --inference-kwargs                 TEXT     [default: None]              │
│ --help                                      Show this message and exit.  │
╰──────────────────────────────────────────────────────────────────────────╯
For example:
relik inference sapienzanlp/relik-entity-linking-large data.txt output.jsonl
Before You Start
In the following sections, we provide a step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.
Entity Linking
All your data should have the following starting structure:
{
  "doc_id": int,  # Unique identifier for the document
  "doc_text": str,  # Text of the document
  "doc_annotations":  # Char level annotations
    [
      [start, end, label],
      [start, end, label],
      ...
    ]
}
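As a rough illustration of the structure above, the following Python sketch builds one record and checks that each annotation's character offsets fall inside the text. The helper name make_record, the end-exclusive offset convention, and the example labels are assumptions made for this sketch; they are not taken from the ReLiK codebase.

```python
import json


def make_record(doc_id, doc_text, annotations):
    """Build one document dict, validating the character-level spans."""
    checked = []
    for start, end, label in annotations:
        # Assumption: offsets are character indices with an exclusive end.
        if not (0 <= start < end <= len(doc_text)):
            raise ValueError(f"span ({start}, {end}) is outside document {doc_id}")
        checked.append([start, end, label])
    return {"doc_id": doc_id, "doc_text": doc_text, "doc_annotations": checked}


text = "Michael Jordan was one of the best players in the NBA."
nba_start = text.index("NBA")
record = make_record(
    0,
    text,
    [
        [0, len("Michael Jordan"), "Michael Jordan"],
        # Illustrative label only; use whatever identifiers your index contains.
        [nba_start, nba_start + len("NBA"), "National Basketball Association"],
    ],
)

# One JSON object per line is the usual layout of a .jsonl file.
print(json.dumps(record))
```

Validating offsets before preprocessing is a cheap way to catch annotations that drifted after text cleaning.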
We used the BLINK (Wu et al., 2019) and AIDA (Hoffart et al., 2011) datasets for training and evaluation. More specifically, we used the BLINK dataset for pre-training the retriever and the AIDA dataset for fine-tuning the retriever and training the reader.
The BLINK dataset can be downloaded from the GENRE repo.
We used blink-train-kilt.jsonl and blink-dev-kilt.jsonl as training and validation datasets.
Assuming we have downloaded the two files in the data/blink folder, we converted the BLINK dataset to the ReLiK format using the following script:
# Train
python scripts/data/blink/preprocess_genre_blink.py \
data/blink/blink-train-kilt.jsonl \
data/blink/processed/blink-train-kilt-relik.jsonl
# Dev
python scripts/data/blink/preprocess_genre_blink.py \
data/blink/blink-dev-kilt.jsonl \
data/blink/processed/blink-dev-kilt-relik.jsonl
The AIDA dataset is not publicly available, but we provide the file we used without the text field. You can find the file in ReLiK format in the data/aida/processed folder.
The Wikipedia index we used can be downloaded from here.
Relation Extraction
TODO
Retriever
We perform a two-step training process for the retriever. First, we "pre-train" the retriever using the BLINK (Wu et al., 2019) dataset, and then we "fine-tune" it using AIDA (Hoffart et al., 2011).
Data Preparation
The retriever requires a dataset in a format similar to DPR: a jsonl file where each line is a dictionary with the following keys:
{
  "question": "....",
  "positive_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "negative_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "hard_negative_ctxs": [{
    "title": "...",
    "text": "...."
  }]
}
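A minimal sketch of assembling one such line in Python follows. The helper to_dpr_sample and the example titles and texts are hypothetical and only mirror the keys listed above; the conversion scripts in the repository remain the reference for how real samples are built.

```python
import json


def to_dpr_sample(question, positives, negatives=(), hard_negatives=()):
    """Build one DPR-style sample from (title, text) pairs."""

    def ctx(title, text):
        return {"title": title, "text": text}

    return {
        "question": question,
        "positive_ctxs": [ctx(t, x) for t, x in positives],
        "negative_ctxs": [ctx(t, x) for t, x in negatives],
        "hard_negative_ctxs": [ctx(t, x) for t, x in hard_negatives],
    }


sample = to_dpr_sample(
    "Michael Jordan was one of the best players in the NBA.",
    positives=[("Michael Jordan", "American basketball player")],
    hard_negatives=[("Michael B. Jordan", "American actor")],
)

# Each sample becomes one line of the jsonl training file.
print(json.dumps(sample))
```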
The retriever also needs an index to search for the documents. The documents to index can be either a jsonl file or a tsv file similar to DPR:
- jsonl: each line is a JSON object with the following keys: id, text, metadata
- tsv: each line is a tab-separated string with the id and text columns, followed by any other columns, which will be stored in the metadata field
jsonl example:
{
"id": "...",
"text": "...",
"metadata": ["{...}"]
},
...
tsv example:
id \t text \t any other column
...
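The two document layouts carry the same information, so one can be derived from the other. The sketch below converts tsv rows into jsonl lines; the function name is hypothetical, and storing the extra columns as a list under metadata is an assumption based on the example above rather than a documented rule.

```python
import csv
import io
import json


def tsv_to_jsonl_lines(tsv_file):
    """Yield one JSON string per tsv row: id, text, then extra columns."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        doc_id, text, *extra = row
        # Assumption: any remaining columns are kept as a list in "metadata".
        yield json.dumps({"id": doc_id, "text": text, "metadata": extra})


# In-memory stand-in for a real tsv file, with placeholder values.
tsv = io.StringIO("0\tMichael Jordan\textra-a\n1\tNBA\textra-b\n")
lines = list(tsv_to_jsonl_lines(tsv))
print("\n".join(lines))
```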
Entity Linking
BLINK
Once you have the BLINK dataset in the ReLiK format, you can create the windows with the following script:
# train
python scripts/data/create_windows.py \
data/blink/processed/blink-train-kilt-relik.jsonl \
data/blink/processed/blink-train-kilt-relik-windowed.jsonl
# dev
python scripts/data/create_windows.py \
data/blink/processed/blink-dev-kilt-relik.jsonl \
data/blink/processed/blink-dev-kilt-relik-windowed.jsonl
and then convert it to the DPR format:
# train
python scripts/data/blink/convert_to_dpr.py \
data/blink/processed/blink-train-kilt-relik-windowed.jsonl \
data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl
# dev
python scripts/data/blink/convert_to_dpr.py \
data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl
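To give an intuition for what windowing does, here is a minimal sketch that splits a token list into overlapping windows. It is not the implementation used by scripts/data/create_windows.py; the function name, the whitespace tokenization, and the window size and stride values are assumptions chosen for illustration.

```python
def sliding_windows(tokens, window_size, stride):
    """Return (offset, window) pairs covering the whole token list."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window_size, len(tokens))
        windows.append((start, tokens[start:end]))
        if end == len(tokens):
            break  # the last window reached the end of the document
        start += stride
    return windows


tokens = "Michael Jordan was one of the best players in the NBA .".split()
windows = sliding_windows(tokens, window_size=6, stride=3)
for offset, window in windows:
    print(offset, window)
```

With a stride smaller than the window size, consecutive windows overlap, so a mention near a window boundary still appears whole in at least one window.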
AIDA
Since the AIDA dataset is not publicly available, we provide the annotations for the AIDA dataset in the ReLiK format as an example.
Assuming you have the full AIDA dataset in the data/aida folder, you can convert it to the ReLiK format and then create the windows with the following script:
python scripts/data/create_windows.py \
  data/aida/processed/aida-train-relik.jsonl \
  data/aida/processed/aida-train-relik-windowed.jsonl
and then convert it to the DPR format:
python scripts/data/convert_to_dpr.py \
  data/aida/processed/aida-train-relik-windowed.jsonl \
  data/aida/processed/aida-train-relik-windowed-dpr.jsonl
Training the model
The relik retriever train command can be used to train the retriever. It requires the following arguments:
- config_path: The path to the configuration file.
- overrides: A list of overrides to the configuration file, in the format key=value.
Examples of configuration files can be found in the relik/retriever/conf folder.
Entity Linking
The configuration files in relik/retriever/conf are pretrain_iterable_in_batch.yaml and finetune_iterable_in_batch.yaml, which we used to pre-train and fine-tune the retriever, respectively.
For instance, to train the retriever on the AIDA dataset, you can run the following command:
relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
model.language_model=intfloat/e5-base-v2 \
train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl
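The overrides in the command above use dotted keys such as model.language_model to address nested entries of the configuration file. The sketch below illustrates that idea on a plain dictionary. It is only a conceptual model: the function apply_overrides and the starting config are hypothetical, values are kept as strings, and ReLiK's actual override parsing may behave differently.

```python
def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings to a nested dict, in place."""
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Walk (or create) the intermediate levels named by the dotted key.
            node = node.setdefault(name, {})
        node[leaf] = value  # values stay strings in this sketch
    return config


# Hypothetical starting configuration, standing in for a YAML file.
config = {"model": {"language_model": "placeholder"}, "train_dataset_path": None}
apply_overrides(
    config,
    [
        "model.language_model=intfloat/e5-base-v2",
        "train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl",
    ],
)
print(config)
```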
Relation Extraction
TODO
Inference
By passing train.only_test=True to the relik retriever train command, you can skip the training and only evaluate the model.
It also needs the path to the PyTorch Lightning checkpoint and the dataset to evaluate on.
relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  train.only_test=True \
  test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  model.checkpoint_path=path/to/checkpoint
The retriever encoder can be saved from the checkpoint with the following code:
from relik.retriever.lightning_modules.pl_modules import GoldenRetrieverPLModule
checkpoint_path = "path/to/checkpoint"
retriever_folder = "path/to/retriever"
# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"
pl_module = GoldenRetrieverPLModule.load_from_checkpoint(checkpoint_path)
pl_module.model.save_pretrained(retriever_folder, push_to_hub=push_to_hub, repo_id=repo_id)
With push_to_hub=True, the model is pushed to the 🤗 Hugging Face Hub, with repo_id as the id of the repository it is pushed to.
The retriever needs an index to search for the documents. The index can be created using the relik retriever build-index command:
relik retriever build-index --help
Usage: relik retriever build-index [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH
DOCUMENT_PATH OUTPUT_FOLDER
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    question_encoder_name_or_path      TEXT  [default: None] [required]                                     │
│ *    document_path                      TEXT  [default: None] [required]                                     │
│ *    output_folder                      TEXT  [default: None] [required]                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --document-file-type             TEXT     [default: jsonl]                                                   │
│ --passage-encoder-name-or-path   TEXT     [default: None]                                                    │
│ --indexer-class                  TEXT     [default: relik.retriever.indexers.inmemory.InMemoryDocumentIndex] │
│ --batch-size                     INTEGER  [default: 512]                                                     │
│ --num-workers                    INTEGER  [default: 4]                                                       │
│ --passage-max-length             INTEGER  [default: 64]                                                      │
│ --device                         TEXT     [default: cuda]                                                    │
│ --index-device                   TEXT     [default: cpu]                                                     │
│ --precision                      TEXT     [default: fp32]                                                    │
│ --push-to-hub  --no-push-to-hub           [default: no-push-to-hub]                                          │
│ --repo-id                        TEXT     [default: None]                                                    │
│ --help                                    Show this message and exit.                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
With the encoder and the index, the retriever can be loaded from a repo id or a local path:
from relik.retriever import GoldenRetriever
encoder_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"
index_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index"
retriever = GoldenRetriever(
    question_encoder=encoder_name_or_path,
    document_index=index_name_or_path,
    device="cuda",  # or "cpu"
    precision="16",  # or "32", "bf16"
    index_device="cuda",  # or "cpu"
    index_precision="16",  # or "32", "bf16"
)
and then it can be used to retrieve documents:
retriever.retrieve("Michael Jordan was one of the best players in the NBA.", top_k=100)
Reader
The reader is responsible for extracting entities and relations from documents, given a set of candidates (e.g., possible entities or relations).
The reader can be trained for span extraction or triplet extraction.
The RelikReaderForSpanExtraction is used for span extraction, i.e. Entity Linking, while the RelikReaderForTripletExtraction is used for triplet extraction, i.e. Relation Extraction.
Data Preparation
The reader requires the windowed dataset we created in the section Before You Start, augmented with the candidates from the retriever.
The candidates can be added to the dataset using the relik retriever add-candidates command.
relik retriever add-candidates --help
Usage: relik retriever add-candidates [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH
DOCUMENT_NAME_OR_PATH INPUT_PATH
OUTPUT_PATH
╭─ Arguments ──────────────────────────────────────────────────────────────────╮
│ *    question_encoder_name_or_path      TEXT  [default: None] [required]     │
│ *    document_name_or_path              TEXT  [default: None] [required]     │
│ *    input_path                         TEXT  [default: None] [required]     │
│ *    output_path                        TEXT  [default: None] [required]     │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --passage-encoder-name-or-path         TEXT     [default: None]              │
│ --top-k                                INTEGER  [default: 100]               │
│ --batch-size                           INTEGER  [default: 128]               │
│ --num-workers                          INTEGER  [default: 4]                 │
│ --device                               TEXT     [default: cuda]              │
│ --index-device                         TEXT     [default: cpu]               │
│ --precision                            TEXT     [default: fp32]              │
│ --use-doc-topics  --no-use-doc-topics           [default: no-use-doc-topics] │
│ --help                                          Show this message and exit.  │
╰──────────────────────────────────────────────────────────────────────────────╯
Training the model
Similar to the retriever, the relik reader train command can be used to train the reader. It requires the following arguments:
- config_path: The path to the configuration file.
- overrides: A list of overrides to the configuration file, in the format key=value.
Examples of configuration files can be found in the relik/reader/conf folder.
Entity Linking
The configuration files in relik/reader/conf are large.yaml and base.yaml, which we used to train the large and base reader, respectively.
For instance, to train the large reader on the AIDA dataset run:
relik reader train relik/reader/conf/large.yaml \
train_dataset_path=data/aida/processed/aida-train-relik-windowed-candidates.jsonl \
val_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl \
test_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl
Relation Extraction
TODO
Inference
The reader can be saved from the checkpoint with the following code:
from relik.reader.lightning_modules.relik_reader_pl_module import RelikReaderPLModule
checkpoint_path = "path/to/checkpoint"
reader_folder = "path/to/reader"
# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-reader-deberta-v3-large-aida"
# Load the Lightning module from the checkpoint path defined above
pl_model = RelikReaderPLModule.load_from_checkpoint(checkpoint_path)
# Save the underlying reader model to the local folder defined above
pl_model.relik_reader_core_model.save_pretrained(reader_folder, push_to_hub=push_to_hub, repo_id=repo_id)
With push_to_hub=True, the model is pushed to the 🤗 Hugging Face Hub, with repo_id as the id of the repository it is pushed to.
The reader can be loaded from a repo id or a local path:
from relik.reader import RelikReaderForSpanExtraction, RelikReaderForTripletExtraction
# the reader for span extraction
reader_span = RelikReaderForSpanExtraction(
    "sapienzanlp/relik-reader-deberta-v3-large-aida"
)
# the reader for triplet extraction
reader_triplets = RelikReaderForTripletExtraction(
    "sapienzanlp/relik-reader-deberta-v3-large-nyt"
)
and used to extract entities and relations:
# an example of candidates for the reader
candidates = ["Michael Jordan", "NBA", "Chicago Bulls", "Basketball", "United States"]
reader_span.read("Michael Jordan was one of the best players in the NBA.", candidates=candidates)
Performance
Entity Linking
We evaluate the performance of ReLiK on Entity Linking using GERBIL. The following table shows the results (InKB Micro F1) of ReLiK Large and Base:
Model | AIDA-B | MSNBC | Der | K50 | R128 | R500 | OKE15 | OKE16 | AVG | AVG-OOD | Speed (ms) |
---|---|---|---|---|---|---|---|---|---|---|---|
Base | 85.25 | 72.27 | 55.59 | 68.02 | 48.13 | 41.61 | 62.53 | 52.25 | 60.71 | 57.2 | n |
Large | 86.37 | 75.04 | 56.25 | 72.8 | 51.67 | 42.95 | 65.12 | 57.21 | 63.43 | 60.15 | n |
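The two summary columns can be recomputed from the per-dataset scores. The sketch below takes AVG as the mean over all eight datasets and AVG-OOD as the mean over every dataset except AIDA-B. The second reading is an assumption (AIDA being the training dataset), although it does reproduce the numbers in the table.

```python
DATASETS = ["AIDA-B", "MSNBC", "Der", "K50", "R128", "R500", "OKE15", "OKE16"]
SCORES = {
    "Base": [85.25, 72.27, 55.59, 68.02, 48.13, 41.61, 62.53, 52.25],
    "Large": [86.37, 75.04, 56.25, 72.8, 51.67, 42.95, 65.12, 57.21],
}


def averages(scores, in_domain="AIDA-B"):
    """Return (AVG, AVG-OOD), both rounded to two decimals."""
    by_name = dict(zip(DATASETS, scores))
    avg = sum(by_name.values()) / len(by_name)
    # Assumption: "OOD" means every dataset except the in-domain one.
    ood = [value for name, value in by_name.items() if name != in_domain]
    return round(avg, 2), round(sum(ood) / len(ood), 2)


for model, scores in SCORES.items():
    print(model, averages(scores))
```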
To evaluate ReLiK we use the following steps:
- Download the GERBIL server from here.
- Start the GERBIL server:
cd gerbil && ./start.sh
- Start the following services:
cd gerbil-SpotWrapNifWS4Test && mvn clean -Dmaven.tomcat.port=1235 tomcat:run
- Start the ReLiK server for GERBIL, providing the model name as an argument (e.g. sapienzanlp/relik-entity-linking-large):
python relik/reader/utils/gerbil_server.py --relik-model-name sapienzanlp/relik-entity-linking-large
- Open the URL http://localhost:1234/gerbil and:
  - Select A2KB as the experiment type
  - Select "Ma - strong annotation match"
  - In the Name field, write the name you want to give to the experiment
  - In the URI field, write: http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm
  - Select the datasets (we use AIDA-B, MSNBC, Der, K50, R128, R500, OKE15, OKE16)
  - Finally, run the experiment
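For intuition about the reported metric, the sketch below computes a micro-averaged F1 in which a prediction counts as correct only if its start, end, and entity all match a gold annotation exactly, with counts pooled over all documents. This is a simplified stand-in, not GERBIL's implementation: the function name and the toy annotations are hypothetical, and GERBIL applies further processing (such as the InKB filtering) that is not modelled here.

```python
def micro_f1(gold_docs, pred_docs):
    """Micro precision, recall and F1 over exact (start, end, entity) matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)  # exact matches
        fp += len(pred_set - gold_set)  # predicted but not in gold
        fn += len(gold_set - pred_set)  # gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy annotations for two documents; the entity names are illustrative.
gold = [
    [(0, 14, "Michael Jordan"), (50, 53, "National Basketball Association")],
    [(0, 5, "Paris")],
]
pred = [
    [(0, 14, "Michael Jordan"), (50, 53, "NBA Store")],
    [(0, 5, "Paris")],
]
precision, recall, f1 = micro_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Pooling counts across documents before computing the ratios is what makes the score "micro": documents with many annotations weigh more than documents with few.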
Relation Extraction
TODO
Cite this work
If you use any part of this work, please consider citing the paper as follows:
@inproceedings{orlando-etal-2024-relik,
title = "Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget",
author = "Orlando, Riccardo and Huguet Cabot, Pere-Llu{\'\i}s and Barba, Edoardo and Navigli, Roberto",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
}
License
TODO