Transformer Embedder
A word-level Transformer layer based on PyTorch and 🤗 Transformers.
How to use
Install the library
pip install transformer-embedder
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face 🤗 Transformers library. Here is a quick example:
import transformer_embedder as tre
model = tre.TransformerEmbedder("bert-base-cased", subtoken_pooling="mean", output_layer="sum")
tokenizer = tre.Tokenizer("bert-base-cased")
example = "This is a sample sentence".split(" ")
inputs = tokenizer(example, return_tensors=True)
"""
{
'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
'offsets': tensor([[[1, 1], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]]]),
'attention_mask': tensor([[True, True, True, True, True, True, True]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]])
}
"""
outputs = model(**inputs)
# outputs[:, 1:-1].shape  # strip the [CLS] and [SEP] embeddings
# torch.Size([1, 5, 768])
# len(example)
# 5
Info
One of the annoyances of using transformer-based models is that it is not trivial to compute word embeddings from the sub-token embeddings they output. With this library, getting word-level embeddings is as easy as using the 🤗 Transformers API, for theoretically every transformer model it supports.
Model
The TransformerEmbedder class offers three ways to retrieve the word embeddings, selected with the subtoken_pooling parameter (a minimal sketch of the three strategies follows this list):

first: uses only the embedding of the first sub-token of each word
last: uses only the embedding of the last sub-token of each word
mean: computes the mean of the embeddings of the sub-tokens of each word
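For intuition, here is a minimal sketch of what the three strategies compute, assuming a (num_subtokens, hidden_size) embedding tensor and inclusive (start, end) sub-token offsets per word, like the offsets returned by the tokenizer above. The function name pool_subtokens is hypothetical; this is an illustration, not the library's actual implementation:

import torch

def pool_subtokens(subtoken_embeddings: torch.Tensor, offsets, pooling: str = "first"):
    # subtoken_embeddings: (num_subtokens, hidden_size)
    # offsets: one inclusive (start, end) sub-token span per word
    word_embeddings = []
    for start, end in offsets:
        span = subtoken_embeddings[start : end + 1]  # the sub-tokens of one word
        if pooling == "first":
            word_embeddings.append(span[0])
        elif pooling == "last":
            word_embeddings.append(span[-1])
        elif pooling == "mean":
            word_embeddings.append(span.mean(dim=0))
        else:
            raise ValueError(f"Unknown pooling strategy: {pooling}")
    return torch.stack(word_embeddings)  # (num_words, hidden_size)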
There are also multiple output types, selected with the output_layer parameter (see the sketch after this list):

last: returns the last hidden state of the transformer model
concat: returns the concatenation of the last four hidden states of the transformer model
sum: returns the sum of the last four hidden states of the transformer model
pooled: returns the output of the pooling layer
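For intuition, the following sketch shows how each output type can be derived from a plain 🤗 Transformers model's hidden states. It illustrates what the options mean, not the library's internal code; the model name and the 768 hidden size assume bert-base-cased:

import torch
from transformers import AutoModel, AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
hf_model = AutoModel.from_pretrained("bert-base-cased")

encoding = hf_tokenizer("This is a sample sentence", return_tensors="pt")
output = hf_model(**encoding, output_hidden_states=True)

hidden_states = output.hidden_states                 # embedding layer + one state per layer
last = hidden_states[-1]                             # "last":   (1, seq_len, 768)
concat = torch.cat(hidden_states[-4:], dim=-1)       # "concat": (1, seq_len, 3072)
summed = torch.stack(hidden_states[-4:]).sum(dim=0)  # "sum":    (1, seq_len, 768)
pooled = output.pooler_output                        # "pooled": (1, 768)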
class TransformerEmbedder(torch.nn.Module):
    def __init__(
        self,
        model_name: str,
        subtoken_pooling: str = "first",
        output_layer: str = "last",
        fine_tune: bool = False,
    )
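One practical consequence of these options, following from the output types described above: with output_layer="concat" the word embeddings are four hidden states wide, and fine_tune=False presumably leaves the transformer weights frozen. A hypothetical example:

model = tre.TransformerEmbedder(
    "bert-base-cased",
    output_layer="concat",  # concatenate the last four hidden states
    fine_tune=True,         # update the transformer weights during training
)
# With bert-base-cased (hidden size 768), "concat" yields
# 4 * 768 = 3072-dimensional word embeddings.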
Tokenizer
TO-DO
Acknowledgement
Most of the code in the TransformerEmbedder class is taken from the AllenNLP library. The pretrained models and the core of the tokenizer come from 🤗 Transformers.