# Transformer Embedder
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.
## How to use
Install the library from PyPI:
```bash
pip install transformer-embedder
```
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face 🤗 Transformers library. Here is a quick example:
```python
import transformer_embedder as tre

model = tre.TransformerEmbedder("bert-base-cased", subtoken_pooling="mean", output_layer="sum")
tokenizer = tre.Tokenizer("bert-base-cased")

example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
# {
#   'input_ids': tensor([[ 101, 1188, 1110,  170, 6876, 5650,  102]]),
#   'offsets': tensor([[[1, 1], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]]]),
#   'attention_mask': tensor([[True, True, True, True, True, True, True]]),
#   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
#   'sentence_length': 7  # with special tokens included
# }

outputs = model(**inputs)
# outputs[:, 1:-1].shape  # remove [CLS] and [SEP]
# torch.Size([1, 5, 768])
# len(example.split())
# 5
```
## Info
One of the annoyances of using transformer-based models is that it is not trivial to compute word embeddings from the sub-token embeddings they output. With this library it is as easy as using the 🤗 Transformers API to get word-level embeddings from virtually every transformer model it supports.
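To see what the library abstracts away, here is a minimal, illustrative sketch of offset-based mean pooling. This is not the library's actual implementation; the `mean_pool` helper and the tensor sizes are made up for the example:

```python
import torch

def mean_pool(subtoken_embeddings: torch.Tensor, offsets) -> torch.Tensor:
    # offsets[i] = (start, end) inclusive sub-token span of word i
    word_vectors = [
        subtoken_embeddings[start : end + 1].mean(dim=0) for start, end in offsets
    ]
    return torch.stack(word_vectors)

# e.g. "playing" -> ["play", "##ing"] occupies sub-token positions 2-3
subtokens = torch.randn(5, 768)             # 5 sub-tokens, hidden size 768
offsets = [(0, 0), (1, 1), (2, 3), (4, 4)]  # 4 words
words = mean_pool(subtokens, offsets)
print(words.shape)  # torch.Size([4, 768])
```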
## Model
The `TransformerEmbedder` class offers four ways to retrieve word embeddings, selected with the `subtoken_pooling` parameter:

- `first`: uses only the embedding of the first sub-token of each word
- `last`: uses only the embedding of the last sub-token of each word
- `mean`: computes the mean of the embeddings of the sub-tokens of each word
- `none`: returns the raw output of the transformer model without sub-token pooling
There are also multiple types of output you can get through the `output_layer` parameter:

- `last`: returns the last hidden state of the transformer model
- `concat`: returns the concatenation of the last four hidden states of the transformer model
- `sum`: returns the sum of the last four hidden states of the transformer model
- `pooled`: returns the output of the pooling layer
```python
class TransformerEmbedder(torch.nn.Module):
    def __init__(
        self,
        model_name: str,
        subtoken_pooling: str = "first",
        output_layer: str = "last",
        fine_tune: bool = True,
    )
```
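To make the two parameters concrete, here is an illustrative sketch. The printed shapes are expectations inferred from the quick example above for `bert-base-cased` (hidden size 768), not guaranteed output for every model:

```python
import transformer_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
inputs = tokenizer("This is a sample sentence", return_tensors=True)

# `concat` stacks the last four hidden states along the feature dimension,
# so the embedding size becomes 4 * 768 = 3072 instead of 768.
concat_model = tre.TransformerEmbedder(
    "bert-base-cased", subtoken_pooling="mean", output_layer="concat"
)
print(concat_model(**inputs).shape)  # expected: torch.Size([1, 7, 3072])

# `none` skips sub-token pooling, returning one vector per sub-token.
raw_model = tre.TransformerEmbedder(
    "bert-base-cased", subtoken_pooling="none", output_layer="last"
)
print(raw_model(**inputs).shape)  # expected: torch.Size([1, 7, 768])
```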
## Tokenizer
The `Tokenizer` class provides the `tokenize` method to preprocess the input for the `TransformerEmbedder` layer. You can pass raw sentences, pre-tokenized sentences, and batches of sentences. It preprocesses them and returns a dictionary containing the inputs for the model. By passing `return_tensors=True`, it returns the inputs as `torch.Tensor`.
By default, if you pass text (or a batch) as strings, it splits them on spaces:

```python
text = "This is a sample sentence"
tokenizer(text)

text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
```
You can also use SpaCy to pre-tokenize the inputs into words first, using `use_spacy=True`:

```python
text = "This is a sample sentence"
tokenizer(text, use_spacy=True)

text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text, use_spacy=True)
```
Alternatively, you can pass a pre-tokenized sentence (or a batch of sentences) by setting `is_split_into_words=True`:

```python
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)

text = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True)  # here is_split_into_words is redundant
```
Here are some examples:
```python
import transformer_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")

text = "This is a sample sentence"
tokenizer(text)
# {
#   'input_ids': [101, 1188, 1110, 170, 6876, 5650, 102],
#   'offsets': [(1, 1), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)],
#   'attention_mask': [True, True, True, True, True, True, True],
#   'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
#   'sentence_length': 7
# }

text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
# {
#   'input_ids': [101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102],
#   'offsets': [(1, 1), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14)],
#   'attention_mask': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True],
#   'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
#   'sentence_length': 15
# }
```
```python
batch = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
    ["This", "is", "a", "sample", "sentence", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)

batch_pair = [
    ["This", "is", "a", "sample", "sentence", "pair", "1"],
    ["This", "is", "sample", "sentence", "pair", "2"],
    ["This", "is", "a", "sample", "sentence", "pair", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
```
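Putting the pieces together, a padded batch can be passed directly to the model. A minimal end-to-end sketch (the output shape is indicative, assuming `bert-base-cased`):

```python
import transformer_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformerEmbedder("bert-base-cased", subtoken_pooling="mean", output_layer="last")

batch = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
]
# padding=True pads the shorter sentence so all inputs share one length
inputs = tokenizer(batch, padding=True, return_tensors=True)
outputs = model(**inputs)
print(outputs.shape)  # expected: torch.Size([2, max_len, 768])
```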
### SpaCy Tokenizer
By default, it uses the multilingual model `xx_sent_ud_sm`. You can change it with the `language` parameter when initializing the tokenizer. For example, if you prefer an English tokenizer:

```python
tokenizer = tre.Tokenizer("bert-base-cased", language="en_core_web_sm")
```

For a complete list of languages and models, see the spaCy documentation.
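Note that the selected spaCy model must be installed locally; if it is not, you can fetch it with spaCy's standard download command (shown for the English example above):

```bash
python -m spacy download en_core_web_sm
```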
## To-Do

Future developments:

- Add an optional word tokenizer, maybe using SpaCy
- Add `add_special_tokens` wrapper
- Add logic (like how to pad, etc.) for custom fields
- Make `pad_batch` function more general
## Acknowledgements

Most of the code in the `TransformerEmbedder` class is taken from the AllenNLP library. The pretrained models and the core of the tokenizer are from 🤗 Transformers.