RelBERT
We release the package relbert, which includes the official implementation of Distilling Relation Embeddings from Pre-trained Language Models, accepted to the EMNLP 2021 main conference (check the camera-ready version here).
What's RelBERT?
RelBERT is a state-of-the-art lexical relation embedding model based on large-scale pre-trained masked language models. It establishes a very strong baseline on analogy questions in zero-shot transfer, and even outperforms few-shot models such as GPT-3 and Analogical Proportion (AP).
Model | SAT (full) | SAT | U2 | U4 | Google | BATS
---|---|---|---|---|---|---
GloVe | 48.9 | 47.8 | 46.5 | 39.8 | 96.0 | 68.7
FastText | 49.7 | 47.8 | 43.0 | 40.7 | 96.6 | 72.0
RELATIVE | 24.9 | 24.6 | 32.5 | 27.1 | 62.0 | 39.0
pair2vec | 33.7 | 34.1 | 25.4 | 28.2 | 66.6 | 53.8
GPT-2 (AP) | 41.4 | 35.9 | 41.2 | 44.9 | 80.4 | 63.5
RoBERTa (AP) | 49.6 | 42.4 | 49.1 | 49.1 | 90.8 | 69.7
GPT-2 (tuned AP) | 57.8 | 56.7 | 50.9 | 49.5 | 95.2 | 81.2
RoBERTa (tuned AP) | 55.8 | 53.4 | 58.3 | 57.4 | 93.6 | 78.4
GPT-3 (zero-shot) | 53.7 | - | - | - | - | -
GPT-3 (few-shot) | 53.7 | - | - | - | - | -
RelBERT | 69.5 | 70.6 | 66.2 | 65.3 | 92.4 | 78.8
Please have a look at our paper to learn more about RelBERT, and see the Analogy Tool or the AP paper for more information about the analogy question datasets.
What can we do with relbert?
In this repository, we release a python package relbert to work with RelBERT and its checkpoints via the huggingface model hub and gensim. In brief, what you can do with relbert is summarized below:
- Get a high-quality embedding vector for a given pair of words
- Get similar word pairs (nearest neighbours)
- Reproduce the results of our EMNLP 2021 paper.
Get Started
pip install relbert
Play with RelBERT
RelBERT can give you a high-quality relation embedding vector of a word pair. First, you need to define the model class with a RelBERT checkpoint.
from relbert import RelBERT
model = RelBERT('asahi417/relbert-roberta-large')
As the model checkpoint, we release the following three models on the huggingface model hub:
- asahi417/relbert-roberta-large: RelBERT based on RoBERTa large with a custom prompt (recommended, as this is the best model in our experiments).
- asahi417/relbert-roberta-large-autoprompt: RelBERT based on RoBERTa large with AutoPrompt.
- asahi417/relbert-roberta-large-ptuning: RelBERT based on RoBERTa large with P-tuning.
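Any of these checkpoint names can be passed to the RelBERT class in the same way as above; for example, the P-tuning variant would be loaded like this.
model = RelBERT('asahi417/relbert-roberta-large-ptuning')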
Then, pass a word pair (a list of two words) to the model to get the embedding.
# the returned vector has shape (1024,)
v_tokyo_japan = model.get_embedding(['Tokyo', 'Japan'])
Let's run a quick experiment to check the embedding quality. Given the candidate pairs ['Paris', 'France'], ['music', 'pizza'], and ['London', 'Tokyo'], the pair that shares the same relation with ['Tokyo', 'Japan'] is ['Paris', 'France']. Can the RelBERT embedding capture this with a simple distance in the embedding space?
from relbert import euclidean_distance
v_paris_france, v_music_pizza, v_london_tokyo = model.get_embedding([['Paris', 'France'], ['music', 'pizza'], ['London', 'Tokyo']])
euclidean_distance(v_tokyo_japan, v_paris_france)
>>> 18.8
euclidean_distance(v_tokyo_japan, v_music_pizza)
>>> 100.7
euclidean_distance(v_tokyo_japan, v_london_tokyo)
>>> 67.8
Bravo! The distance between ['Tokyo', 'Japan'] and ['Paris', 'France'] is the smallest among the candidates.
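To make the comparison programmatic, here is a minimal sketch that ranks the candidate pairs by their distance to the query pair, reusing only the get_embedding and euclidean_distance calls shown above.
from relbert import RelBERT, euclidean_distance

model = RelBERT('asahi417/relbert-roberta-large')
query = ['Tokyo', 'Japan']
candidates = [['Paris', 'France'], ['music', 'pizza'], ['London', 'Tokyo']]

# embed the query pair and all candidate pairs
v_query = model.get_embedding(query)
v_candidates = model.get_embedding(candidates)

# sort candidates by Euclidean distance to the query (smaller distance = more similar relation)
ranked = sorted(zip(candidates, v_candidates), key=lambda c: euclidean_distance(v_query, c[1]))
print(ranked[0][0])
>>> ['Paris', 'France']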
Nearest Neighbours of RelBERT
To get similar word pairs in terms of the RelBERT embedding, we convert the RelBERT embeddings into a gensim model file with a fixed vocabulary.
Specifically, we take the vocabulary of the RELATIVE embedding released as part of the Analogy Tool, and generate embeddings for all of its word pairs with RelBERT (asahi417/relbert-roberta-large).
Following the original vocabulary representation, the two words of a pair are joined by __, and multi-token words are combined with _, e.g. New_york__Tokyo.
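As a small illustration of that key format, the helper below (hypothetical, not part of the package) builds a vocabulary key from two words.
def to_gensim_key(head, tail):
    # tokens within a word are joined by "_", and the two words of the pair by "__"
    return '{}__{}'.format(head.replace(' ', '_'), tail.replace(' ', '_'))

to_gensim_key('New york', 'Tokyo')
>>> 'New_york__Tokyo'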
The RelBERT embedding gensim file can be found here. For example, you can get the nearest neighbours as follows.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('gensim_model.bin', binary=True)
model.most_similar('Tokyo__Japan')
>>> [('Moscow__Russia', 0.9997282028198242),
('Cairo__Egypt', 0.9997045993804932),
('Baghdad__Iraq', 0.9997043013572693),
('Helsinki__Finland', 0.9996970891952515),
('Paris__France', 0.999695897102356),
('Damascus__Syria', 0.9996891617774963),
('Bangkok__Thailand', 0.9996803998947144),
('Madrid__Spain', 0.9996673464775085),
('Budapest__Hungary', 0.9996543526649475),
('Beijing__China', 0.9996539354324341)]
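Since this is a standard gensim KeyedVectors file, the usual gensim API also applies; for example, a pairwise similarity query between two relation embeddings (assuming both keys are in the fixed vocabulary described above).
# continuing from the KeyedVectors model loaded above:
# cosine similarity between the relation embeddings of two word pairs
model.similarity('Tokyo__Japan', 'Paris__France')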
Reproduce the Experiments
To reproduce the experimental results of our EMNLP 2021 paper, you first need to clone the repository and install the package from source.
git clone https://github.com/asahi417/relbert
cd relbert
pip install .
First, you need to compute prompts for AutoPrompt and P-tuning.
sh ./examples/experiments/main/prompt.sh
Then, you can train the RelBERT model.
sh ./examples/experiments/main/train.sh
Once models are trained, you can evaluate them.
sh ./examples/experiments/main/evaluate.sh
Citation
If you use any of these resources, please cite the following paper:
@inproceedings{ushio-etal-2021-distilling-relation-embeddings,
title = "{D}istilling {R}elation {E}mbeddings from {P}re-trained {L}anguage {M}odels",
author = "Ushio, Asahi and
Schockaert, Steven and
Camacho-Collados, Jose",
booktitle = "EMNLP 2021",
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
}