
Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget


A blazing fast and lightweight Information Extraction model for Entity Linking and Relation Extraction.

๐Ÿ› ๏ธ Installation

Installation from PyPI

pip install relik
Other installation options

Install with optional dependencies

Install with all the optional dependencies.

pip install relik[all]

Install with optional dependencies for training and evaluation.

pip install relik[train]

Install with optional dependencies for FAISS

The FAISS PyPI package is only available for CPU. For GPU, install it from source or use the conda package.

For CPU:

pip install relik[faiss]

For GPU:

conda create -n relik python=3.10
conda activate relik

# install pytorch
conda install -y pytorch=2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# GPU
conda install -y -c pytorch -c nvidia faiss-gpu=1.8.0
# or GPU with NVIDIA RAFT
conda install -y -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0

pip install relik

Install with optional dependencies for serving the models with FastAPI and Ray.

pip install relik[serve]

Installation from source

git clone https://github.com/SapienzaNLP/relik.git
cd relik
pip install -e .[all]

🚀 Quick Start

ReLiK is a lightweight and fast model for Entity Linking and Relation Extraction. It is composed of two main components: a retriever and a reader. The retriever is responsible for retrieving relevant documents from a large collection, while the reader is responsible for extracting entities and relations from the retrieved documents. ReLiK can be used with the from_pretrained method to load a pre-trained pipeline.

Here is an example of how to use ReLiK for Entity Linking:

from relik import Relik
from relik.inference.data.objects import RelikOutput

relik = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large")
relik_out: RelikOutput = relik("Michael Jordan was one of the best players in the NBA.")

Output:

RelikOutput(
  text="Michael Jordan was one of the best players in the NBA.",
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0,
  spans=[
      Span(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
      Span(start=50, end=53, label="National Basketball Association", text="NBA"),
  ],
  triples=[],
  candidates=Candidates(
      span=[
          [
              [
                  {"text": "Michael Jordan", "id": 4484083},
                  {"text": "National Basketball Association", "id": 5209815},
                  {"text": "Walter Jordan", "id": 2340190},
                  {"text": "Jordan", "id": 3486773},
                  {"text": "50 Greatest Players in NBA History", "id": 1742909},
                  ...
              ]
          ]
      ]
  ),
)
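The start and end values in each Span are character offsets into the input text, so they can be used to slice it directly. A quick sanity check of the offsets from the example output above (plain Python, no ReLiK install needed):

```python
# Character-level (start, end, label) spans copied from the example output
text = "Michael Jordan was one of the best players in the NBA."
spans = [
    (0, 14, "Michael Jordan"),
    (50, 53, "National Basketball Association"),
]

# Slicing the text with the span offsets recovers the mention surface form
mentions = [text[start:end] for start, end, label in spans]
# mentions == ["Michael Jordan", "NBA"]
```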

and for Relation Extraction:

from relik import Relik
from relik.inference.data.objects import RelikOutput

relik = Relik.from_pretrained("sapienzanlp/relik-relation-extraction-nyt-large")
relik_out: RelikOutput = relik("Michael Jordan was one of the best players in the NBA.")

Output:

RelikOutput(
  text='Michael Jordan was one of the best players in the NBA.',
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0, 
  spans=[
    Span(start=0, end=14, label='--NME--', text='Michael Jordan'), 
    Span(start=50, end=53, label='--NME--', text='NBA')
  ], 
  triplets=[
    Triplets(
      subject=Span(start=0, end=14, label='--NME--', text='Michael Jordan'), 
      label='company', 
      object=Span(start=50, end=53, label='--NME--', text='NBA'), 
      confidence=1.0
      )
  ], 
  candidates=Candidates(
    span=[], 
    triplet=[
              [
                [
                  {"text": "company", "id": 4, "metadata": {"definition": "company of this person"}}, 
                  {"text": "nationality", "id": 10, "metadata": {"definition": "nationality of this person or entity"}}, 
                  {"text": "child", "id": 17, "metadata": {"definition": "child of this person"}}, 
                  {"text": "founded by", "id": 0, "metadata": {"definition": "founder or co-founder of this organization, religion or place"}}, 
                  {"text": "residence", "id": 18, "metadata": {"definition": "place where this person has lived"}},
                  ...
              ]
          ]
      ]
  ),
)

Models

Models can be found on 🤗 Hugging Face.

Usage

Retrievers and Readers can be used separately. In the case of retriever-only ReLiK, the output will contain the candidates for the input text.

Retriever-only example:

from relik import Relik
from relik.inference.data.objects import RelikOutput

# If you want to use only the retriever
retriever = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large", reader=None)
relik_out: RelikOutput = retriever("Michael Jordan was one of the best players in the NBA.")

Output:

RelikOutput(
  text="Michael Jordan was one of the best players in the NBA.",
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0,
  spans=[],
  triples=[],
  candidates=Candidates(
      span=[
              [
                  {"text": "Michael Jordan", "id": 4484083},
                  {"text": "National Basketball Association", "id": 5209815},
                  {"text": "Walter Jordan", "id": 2340190},
                  {"text": "Jordan", "id": 3486773},
                  {"text": "50 Greatest Players in NBA History", "id": 1742909},
                  ...
              ]
      ],
      triplet=[],
  ),
)

Reader-only example:

from relik import Relik
from relik.inference.data.objects import RelikOutput

# If you want to use only the reader
reader = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large", retriever=None)
candidates = [
    "Michael Jordan",
    "National Basketball Association",
    "Walter Jordan",
    "Jordan",
    "50 Greatest Players in NBA History",
]
text = "Michael Jordan was one of the best players in the NBA."
relik_out: RelikOutput = reader(text, candidates=candidates)

Output:

RelikOutput(
  text="Michael Jordan was one of the best players in the NBA.",
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0,
  spans=[
      Span(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
      Span(start=50, end=53, label="National Basketball Association", text="NBA"),
  ],
  triples=[],
  candidates=Candidates(
      span=[
          [
              [
                  {
                      "text": "Michael Jordan",
                      "id": -731245042436891448,
                  },
                  {
                      "text": "National Basketball Association",
                      "id": 8135443493867772328,
                  },
                  {
                      "text": "Walter Jordan",
                      "id": -5873847607270755146,
                      "metadata": {},
                  },
                  {"text": "Jordan", "id": 6387058293887192208, "metadata": {}},
                  {
                      "text": "50 Greatest Players in NBA History",
                      "id": 2173802663468652889,
                  },
              ]
          ]
      ],
  ),
)

CLI

ReLiK provides a CLI to perform inference on a text file or a directory of text files. The CLI can be used as follows:

relik inference --help

  Usage: relik inference [OPTIONS] MODEL_NAME_OR_PATH INPUT_PATH OUTPUT_PATH

Arguments:
  *  model_name_or_path  TEXT  [default: None] [required]
  *  input_path          TEXT  [default: None] [required]
  *  output_path         TEXT  [default: None] [required]

Options:
  --batch-size                        INTEGER  [default: 8]
  --num-workers                       INTEGER  [default: 4]
  --device                            TEXT     [default: cuda]
  --precision                         TEXT     [default: fp16]
  --top-k                             INTEGER  [default: 100]
  --window-size                       INTEGER  [default: None]
  --window-stride                     INTEGER  [default: None]
  --annotation-type                   TEXT     [default: char]
  --progress-bar / --no-progress-bar           [default: progress-bar]
  --model-kwargs                      TEXT     [default: None]
  --inference-kwargs                  TEXT     [default: None]
  --help                                       Show this message and exit.

For example:

relik inference sapienzanlp/relik-entity-linking-large data.txt output.jsonl

📚 Before You Start

In the following sections, we provide a step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.

Entity Linking

All your data should have the following structure:

{
  "doc_id": int,  # Unique identifier for the document
  "doc_text": txt,  # Text of the document
  "doc_span_annotations": # Char level annotations
    [
      [start, end, label],
      [start, end, label],
      ...
    ]
}
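For instance, a minimal document in this format (offsets are character-level into doc_text; the sentence and labels are illustrative):

```python
import json

doc_text = "Michael Jordan was one of the best players in the NBA."
doc = {
    "doc_id": 0,
    "doc_text": doc_text,
    # [start, end, label] with character-level offsets into doc_text
    "doc_span_annotations": [
        [0, 14, "Michael Jordan"],
        [50, 53, "National Basketball Association"],
    ],
}

# Each document is one JSON object per line in a jsonl file
line = json.dumps(doc)
```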

We used the BLINK (Wu et al., 2019) and AIDA (Hoffart et al., 2011) datasets for training and evaluation. More specifically, we used the BLINK dataset for pre-training the retriever and the AIDA dataset for fine-tuning the retriever and training the reader.

The BLINK dataset can be downloaded from the GENRE repo using this script. We used blink-train-kilt.jsonl and blink-dev-kilt.jsonl as training and validation datasets. Assuming we have downloaded the two files in the data/blink folder, we converted the BLINK dataset to the ReLiK format using the following script:

# Train
python scripts/data/blink/preprocess_genre_blink.py \
  data/blink/blink-train-kilt.jsonl \
  data/blink/processed/blink-train-kilt-relik.jsonl

# Dev
python scripts/data/blink/preprocess_genre_blink.py \
  data/blink/blink-dev-kilt.jsonl \
  data/blink/processed/blink-dev-kilt-relik.jsonl

The AIDA dataset is not publicly available, but we provide the file we used with the text field removed. You can find the file, already in the ReLiK format, in the data/aida/processed folder.

The Wikipedia index we used can be downloaded from here.

Relation Extraction

All your data should have the following structure:

{
  "doc_id": int,  # Unique identifier for the document
  "doc_words": list[txt],  # Tokenized text of the document
  "doc_span_annotations": # Token level annotations of mentions (label is optional)
    [
      [start, end, label],
      [start, end, label],
      ...
    ],
  "doc_triplet_annotations": # Triplet annotations
  [
    {
      "subject": [start, end, label], # label is optional
      "relation": name, # type is optional
      "object": [start, end, label], # label is optional
    },
    {
      "subject": [start, end, label], # label is optional
      "relation": name, # type is optional
      "object": [start, end, label], # label is optional
    },
  ]
}
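Here the offsets index into doc_words rather than characters. A small illustration (assuming end-exclusive token offsets; the sentence and labels are made up for the example):

```python
doc_words = ["Michael", "Jordan", "was", "one", "of", "the", "best",
             "players", "in", "the", "NBA", "."]

# A triplet annotation with token-level offsets (labels left optional)
triplet = {
    "subject": [0, 2, None],   # "Michael Jordan"
    "relation": "company",
    "object": [10, 11, None],  # "NBA"
}

# Assuming end-exclusive offsets, the mention text is recovered by joining tokens
subject_text = " ".join(doc_words[triplet["subject"][0]:triplet["subject"][1]])
object_text = " ".join(doc_words[triplet["object"][0]:triplet["object"][1]])
```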

For Relation Extraction, we provide an example of how to preprocess the NYT dataset from raw_nyt taken from CopyRE. Download the dataset to data/raw_nyt and then run:

python scripts/data/nyt/preprocess_nyt.py data/raw_nyt data/nyt/processed/

Please be aware that, for a fair comparison, we reproduced the preprocessing of previous work, which leads to duplicate triplets due to incorrect handling of repeated surface forms for entity spans. If you want to parse the original data into the ReLiK format correctly, set the flag --legacy-format False. Just be aware that the provided RE NYT models were trained on the legacy format.

🦮 Retriever

We perform a two-step training process for the retriever. First, we "pre-train" the retriever using the BLINK (Wu et al., 2019) dataset, and then we "fine-tune" it using AIDA (Hoffart et al., 2011).

Data Preparation

The retriever requires a dataset in a format similar to DPR: a jsonl file where each line is a dictionary with the following keys:

{
  "question": "....",
  "positive_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "negative_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "hard_negative_ctxs": [{
    "title": "...",
    "text": "...."
  }]
}

The retriever also needs an index to search for the documents. The documents to index can be either a JSONL file or a TSV file similar to DPR:

  • jsonl: each line is a JSON object with the following keys: id, text, metadata
  • tsv: each line is a tab-separated string with the id and text columns, followed by any other column that will be stored in the metadata field

jsonl example:

{
  "id": "...",
  "text": "...",
  "metadata": ["{...}"]
},
...

tsv example:

id \t text \t any other column
...
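Both formats carry the same information; here is a sketch of mapping one tsv row to the jsonl record shape described above (the sample row values are illustrative):

```python
def tsv_row_to_record(row: str) -> dict:
    """Split a tab-separated row into id, text, and extra metadata columns."""
    doc_id, text, *extra = row.rstrip("\n").split("\t")
    return {"id": doc_id, "text": text, "metadata": extra}

record = tsv_row_to_record("123\tMichael Jordan\tAmerican basketball player\n")
# record == {"id": "123", "text": "Michael Jordan",
#            "metadata": ["American basketball player"]}
```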

Entity Linking

BLINK

Once you have the BLINK dataset in the ReLiK format, you can create the windows with the following script:

# train
relik data create-windows \
  data/blink/processed/blink-train-kilt-relik.jsonl \
  data/blink/processed/blink-train-kilt-relik-windowed.jsonl

# dev
relik data create-windows \
  data/blink/processed/blink-dev-kilt-relik.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl

and then convert it to the DPR format:

# train
relik data convert-to-dpr \
  data/blink/processed/blink-train-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

# dev
relik data convert-to-dpr \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json
AIDA

Since the AIDA dataset is not publicly available, we provide only its annotations in the ReLiK format as an example. Assuming you have the full AIDA dataset in the data/aida folder, you can convert it to the ReLiK format and then create the windows with the following script:

relik data create-windows \
  data/aida/processed/aida-train-relik.jsonl \
  data/aida/processed/aida-train-relik-windowed.jsonl

and then convert it to the DPR format:

relik data convert-to-dpr \
  data/aida/processed/aida-train-relik-windowed.jsonl \
  data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

Relation Extraction

NYT

Once you have the NYT dataset in the ReLiK format, you can create the windows with the following script:

relik data create-windows \
  data/data/processed/nyt/train.jsonl \
  data/data/processed/nyt/train-windowed.jsonl \
  --is-split-into-words \
  --window-size none

and then convert it to the DPR format:

relik data convert-to-dpr \
  data/data/processed/nyt/train-windowed.jsonl \
  data/data/processed/nyt/train-windowed-dpr.jsonl

Training the model

The relik retriever train command can be used to train the retriever. It requires the following arguments:

  • config_path: The path to the configuration file.
  • overrides: A list of overrides to the configuration file, in the format key=value.

Examples of configuration files can be found in the relik/retriever/conf folder.

Entity Linking

The configuration files in relik/retriever/conf are pretrain_iterable_in_batch.yaml and finetune_iterable_in_batch.yaml, which we used to pre-train and fine-tune the retriever, respectively.

For instance, to train the retriever on the AIDA dataset, you can run the following command:

relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl

Relation Extraction

The configuration file in relik/retriever/conf is finetune_nyt_iterable_in_batch.yaml, which we used to fine-tune the retriever on the NYT dataset. For cIE (closed Information Extraction) we reuse the retriever pre-trained on BLINK in the previous step.

For instance, to train the retriever on the NYT dataset, you can run the following command:

relik retriever train relik/retriever/conf/finetune_nyt_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/nyt/processed/nyt-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/nyt/processed/nyt-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/nyt/processed/nyt-test-relik-windowed-dpr.jsonl

Inference

By passing train.only_test=True to the relik retriever train command, you can skip training and only evaluate the model. It also needs the path to the PyTorch Lightning checkpoint and the dataset to evaluate on.

relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  train.only_test=True \
  test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  model.checkpoint_path=path/to/checkpoint

The retriever encoder can be saved from the checkpoint with the following snippet:

from relik.retriever.lightning_modules.pl_modules import GoldenRetrieverPLModule

checkpoint_path = "path/to/checkpoint"
retriever_folder = "path/to/retriever"

# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"

pl_module = GoldenRetrieverPLModule.load_from_checkpoint(checkpoint_path)
pl_module.model.save_pretrained(retriever_folder, push_to_hub=push_to_hub, repo_id=repo_id)

With push_to_hub=True the model will be pushed to the 🤗 Hugging Face Hub under the repository id repo_id.

The retriever needs an index to search for the documents. The index can be created using the relik retriever build-index command:

relik retriever build-index --help 

 Usage: relik retriever build-index [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH                                                                   
                                    DOCUMENT_PATH OUTPUT_FOLDER                                                                                                                                              
Arguments:
  *  question_encoder_name_or_path  TEXT  [default: None] [required]
  *  document_path                  TEXT  [default: None] [required]
  *  output_folder                  TEXT  [default: None] [required]

Options:
  --document-file-type              TEXT     [default: jsonl]
  --passage-encoder-name-or-path    TEXT     [default: None]
  --indexer-class                   TEXT     [default: relik.retriever.indexers.inmemory.InMemoryDocumentIndex]
  --batch-size                      INTEGER  [default: 512]
  --num-workers                     INTEGER  [default: 4]
  --passage-max-length              INTEGER  [default: 64]
  --device                          TEXT     [default: cuda]
  --index-device                    TEXT     [default: cpu]
  --precision                       TEXT     [default: fp32]
  --push-to-hub / --no-push-to-hub           [default: no-push-to-hub]
  --repo-id                         TEXT     [default: None]
  --help                                     Show this message and exit.

With the encoder and the index, the retriever can be loaded from a repo id or a local path:

from relik.retriever import GoldenRetriever

encoder_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"
index_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index"

retriever = GoldenRetriever(
  question_encoder=encoder_name_or_path,
  document_index=index_name_or_path,
  device="cuda", # or "cpu"
  precision="16", # or "32", "bf16"
  index_device="cuda", # or "cpu"
  index_precision="16", # or "32", "bf16"
)

and then it can be used to retrieve documents:

retriever.retrieve("Michael Jordan was one of the best players in the NBA.", top_k=100)

🤓 Reader

The reader is responsible for extracting entities and relations from documents from a set of candidates (e.g., possible entities or relations). The reader can be trained for span extraction or triplet extraction. The RelikReaderForSpanExtraction is used for span extraction, i.e. Entity Linking, while the RelikReaderForTripletExtraction is used for triplet extraction, i.e. Relation Extraction.

Data Preparation

The reader requires the windowed dataset we created in the Before You Start section, augmented with the candidates from the retriever. The candidates can be added to the dataset using the relik retriever add-candidates command.

relik retriever add-candidates --help

 Usage: relik retriever add-candidates [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH                                 
                                       DOCUMENT_NAME_OR_PATH INPUT_PATH                                        
                                       OUTPUT_PATH

Arguments:
  *  question_encoder_name_or_path  TEXT  [default: None] [required]
  *  document_name_or_path          TEXT  [default: None] [required]
  *  input_path                     TEXT  [default: None] [required]
  *  output_path                    TEXT  [default: None] [required]

Options:
  --passage-encoder-name-or-path          TEXT     [default: None]
  --relations                             BOOLEAN  [default: False]
  --top-k                                 INTEGER  [default: 100]
  --batch-size                            INTEGER  [default: 128]
  --num-workers                           INTEGER  [default: 4]
  --device                                TEXT     [default: cuda]
  --index-device                          TEXT     [default: cpu]
  --precision                             TEXT     [default: fp32]
  --use-doc-topics / --no-use-doc-topics           [default: no-use-doc-topics]
  --help                                           Show this message and exit.

Entity Linking

We need to add candidates to each window that will be used by the Reader, using our previously trained Retriever. Here is an example using our already trained retriever on AIDA for the train split:

relik retriever add-candidates \
  sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder \
  sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index \
  data/aida/processed/aida-train-relik-windowed.jsonl \
  data/aida/processed/aida-train-relik-windowed-candidates.jsonl

Relation Extraction

The same thing happens for Relation Extraction. If you want to use our trained retriever:

relik retriever add-candidates \
  sapienzanlp/relik-retriever-small-nyt-question-encoder \
  sapienzanlp/relik-retriever-small-nyt-document-index \
  data/nyt/processed/nyt-train-relik-windowed.jsonl \
  data/nyt/processed/nyt-train-relik-windowed-candidates.jsonl

Training the model

Similar to the retriever, the relik reader train command can be used to train the reader. It requires the following arguments:

  • config_path: The path to the configuration file.
  • overrides: A list of overrides to the configuration file, in the format key=value.

Examples of configuration files can be found in the relik/reader/conf folder.

Entity Linking

The configuration files in relik/reader/conf are large.yaml and base.yaml, which we used to train the large and base reader, respectively. For instance, to train the large reader on the AIDA dataset run:

relik reader train relik/reader/conf/large.yaml \
  train_dataset_path=data/aida/processed/aida-train-relik-windowed-candidates.jsonl \
  val_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl \
  test_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl

Relation Extraction

The configuration files in relik/reader/conf are large_nyt.yaml, base_nyt.yaml, and small_nyt.yaml, which we used to train the large, base and small reader, respectively. For instance, to train the large reader on the NYT dataset run:

relik reader train relik/reader/conf/large_nyt.yaml \
  train_dataset_path=data/nyt/processed/nyt-train-relik-windowed-candidates.jsonl \
  val_dataset_path=data/nyt/processed/nyt-dev-relik-windowed-candidates.jsonl \
  test_dataset_path=data/nyt/processed/nyt-test-relik-windowed-candidates.jsonl

Inference

The reader can be saved from the checkpoint with the following snippet:

from relik.reader.lightning_modules.relik_reader_pl_module import RelikReaderPLModule

checkpoint_path = "path/to/checkpoint"
reader_folder = "path/to/reader"

# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-reader-deberta-v3-large-aida"

pl_model = RelikReaderPLModule.load_from_checkpoint(checkpoint_path)
pl_model.relik_reader_core_model.save_pretrained(
    reader_folder, push_to_hub=push_to_hub, repo_id=repo_id
)

With push_to_hub=True the model will be pushed to the 🤗 Hugging Face Hub under the repository id repo_id.

The reader can be loaded from a repo id or a local path:

from relik.reader import RelikReaderForSpanExtraction, RelikReaderForTripletExtraction

# the reader for span extraction
reader_span = RelikReaderForSpanExtraction(
  "sapienzanlp/relik-reader-deberta-v3-large-aida"
)
# the reader for triplet extraction
reader_triplets = RelikReaderForTripletExtraction(
  "sapienzanlp/relik-reader-deberta-v3-large-nyt"
)

and used to extract entities and relations:

# an example of candidates for the reader
candidates = ["Michael Jordan", "NBA", "Chicago Bulls", "Basketball", "United States"]
reader_span.read("Michael Jordan was one of the best players in the NBA.", candidates=candidates)

📊 Performance

Entity Linking

We evaluate the performance of ReLiK on Entity Linking using GERBIL. The following table shows the results (InKB Micro F1) of ReLiK Large and Base:

| Model | AIDA | MSNBC | Der | K50 | R128 | R500 | O15 | O16 | Tot | OOD | AIT (m:s) |
|-------|------|-------|-----|-----|------|------|-----|-----|-----|-----|-----------|
| GENRE | 83.7 | 73.7 | 54.1 | 60.7 | 46.7 | 40.3 | 56.1 | 50.0 | 58.2 | 54.5 | 38:00 |
| EntQA | 85.8 | 72.1 | 52.9 | 64.5 | 54.1 | 41.9 | 61.1 | 51.3 | 60.5 | 56.4 | 20:00 |
| ReLiK Base | 85.3 | 72.3 | 55.6 | 68.0 | 48.1 | 41.6 | 62.5 | 52.3 | 60.7 | 57.2 | 00:29 |
| ReLiK Large | 86.4 | 75.0 | 56.3 | 72.8 | 51.7 | 43.0 | 65.1 | 57.2 | 63.4 | 60.2 | 01:46 |

Comparison systems' evaluation (InKB Micro F1) on the in-domain AIDA test set and the out-of-domain MSNBC (MSN), Derczynski (Der), KORE50 (K50), N3-Reuters-128 (R128), N3-RSS-500 (R500), OKE-15 (O15), and OKE-16 (O16) test sets. Bold indicates the best model. GENRE uses mention dictionaries. The AIT column shows the time in minutes and seconds (m:s) that the systems need to process the whole AIDA test set using an NVIDIA RTX 4090, except for EntQA, which does not fit in 24 GB of GPU memory and for which an A100 is used.
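
InKB Micro F1 pools all predicted and gold annotations across documents before computing precision and recall. A minimal sketch of the metric over (start, end, entity) annotations (for illustration only; the official scores are computed by GERBIL, not by this function):

```python
def micro_f1(gold: set, pred: set) -> float:
    """Micro F1 over pooled (start, end, entity) annotations."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)  # exact span + entity matches
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one of two mentions linked to the expected entity
gold = {(0, 14, "Michael Jordan"), (50, 53, "NBA")}
pred = {(0, 14, "Michael Jordan"), (50, 53, "National Basketball Association")}
print(round(micro_f1(gold, pred), 2))  # -> 0.5
```

The "Ma - strong annotation match" setting selected below corresponds to this strict matching, where both the span boundaries and the linked entity must agree.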

To evaluate ReLiK we use the following steps:

  1. Download the GERBIL server from here.

  2. Start the GERBIL server:

cd gerbil && ./start.sh

  3. Start the following service:

cd gerbil-SpotWrapNifWS4Test && mvn clean -Dmaven.tomcat.port=1235 tomcat:run

  4. Start the ReLiK server for GERBIL, providing the model name as an argument (e.g. sapienzanlp/relik-entity-linking-large):

python relik/reader/utils/gerbil_server.py --relik-model-name sapienzanlp/relik-entity-linking-large

  5. Open the URL http://localhost:1234/gerbil and:
    • Select A2KB as the experiment type
    • Select "Ma - strong annotation match"
    • In the Name field, write the name you want to give to the experiment
    • In the URI field, write: http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm
    • Select the datasets (we use AIDA-B, MSNBC, Der, K50, R128, R500, OKE15, OKE16)
    • Finally, run the experiment

Relation Extraction

The following table shows the results (Micro F1) of ReLiK Large on the NYT dataset:

| Model | NYT | NYT (Pretr.) | AIT (m:s) |
|-------|-----|--------------|-----------|
| REBEL | 93.1 | 93.4 | 01:45 |
| UIE | 93.5 | -- | -- |
| USM | 94.0 | 94.1 | -- |
| ReLiK Large | 95.0 | 94.9 | 00:30 |

To evaluate Relation Extraction, we can directly use the reader with the script relik/reader/trainer/predict_re.py, pointing it at a file with already-retrieved candidates. If you want to use our trained reader:

python relik/reader/trainer/predict_re.py --model_path sapienzanlp/relik-reader-deberta-v3-large-nyt --data_path data/nyt/processed/nyt-test-relik-windowed-candidates.jsonl --is-eval

Be aware that we compute the threshold for predicting relations based on the development set. To compute it while evaluating you can run the following:

python relik/reader/trainer/predict_re.py --model_path sapienzanlp/relik-reader-deberta-v3-large-nyt --data_path data/nyt/processed/nyt-dev-relik-windowed-candidates.jsonl --is-eval --compute-threshold
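
Conceptually, threshold computation sweeps the scores the reader assigns to candidate triplets on the dev set and keeps the cutoff that maximizes micro F1. A toy sketch of the idea (all names and data below are illustrative, not the script's actual logic):

```python
def f1_at_threshold(scored_preds, gold, threshold):
    # Keep only predicted triplets whose score clears the threshold
    kept = {p for p, s in scored_preds if s >= threshold}
    if not kept or not gold:
        return 0.0
    tp = len(kept & gold)
    p, r = tp / len(kept), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(scored_preds, gold):
    # Sweep over the observed scores and return the best-performing cutoff
    return max((s for _, s in scored_preds),
               key=lambda t: f1_at_threshold(scored_preds, gold, t))

# Toy dev-set predictions: (triplet, score)
scored = [(("MJ", "plays_in", "NBA"), 0.9),
          (("MJ", "born_in", "NBA"), 0.3)]
gold = {("MJ", "plays_in", "NBA")}
print(best_threshold(scored, gold))  # -> 0.9
```

The threshold found on the dev set is then reused when scoring the test set, which is why the dev-set run with --compute-threshold should precede the test-set evaluation.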

💽 Cite this work

If you use any part of this work, please consider citing the paper as follows:

@inproceedings{orlando-etal-2024-relik,
    title     = "Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget",
    author    = "Orlando, Riccardo and Huguet Cabot, Pere-Llu{\'\i}s and Barba, Edoardo and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month     = aug,
    year      = "2024",
    address   = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
}

🪪 License

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
