Word-level transformer-based embeddings
Project description
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.
How to use
Install the library from PyPI:
pip install transformers-embedder
or from Conda:
conda install -c riccorl transformers-embedder
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face 🤗 Transformers library. Here is a quick example:
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformersEmbedder("bert-base-cased", return_words="mean", output_layer="sum")
example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
{
'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
'attention_mask': tensor([[True, True, True, True, True, True, True]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),
'sentence_length': 7 # with special tokens included
}
outputs = model(**inputs)
# word-level output shape: [CLS] and [SEP] are removed, leaving one vector per word
torch.Size([1, 5, 768])
# number of words in the example: len(example.split(" "))
5
Info
One of the annoyances of using transformer-based models is that it is not trivial to compute word embeddings from the sub-token embeddings they output. With this API, getting word-level embeddings is as easy as using 🤗 Transformers directly, for virtually every transformer model the library supports.
Model
The TransformersEmbedder offers two ways to retrieve the embeddings:
- return_words=True: computes the mean of the embeddings of the sub-tokens of each word (a sketch of this pooling appears right after the constructor signature below)
- return_words=False: returns the raw output of the transformer model, without sub-token pooling
There are also multiple types of output you can get using the output_layer parameter:
- last: returns the last hidden state of the transformer model
- concat: returns the concatenation of the last four hidden states of the transformer model
- sum: returns the sum of the last four hidden states of the transformer model
- pooled: returns the output of the pooling layer
If you also want all the outputs from the Hugging Face model, you can set return_all=True to get them.
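For example, with bert-base-cased (hidden size 768), the choice of output_layer changes the size of the returned word embeddings. The snippet below is a minimal sketch; the sizes in the comments are derived from the descriptions above rather than from measured output:

import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
inputs = tokenizer("This is a sample sentence", return_tensors=True)

# "last" and "sum" keep the model's hidden size (768 for bert-base-cased),
# while "concat" stacks the last four hidden states, i.e. 4 * 768 = 3072 per word
sum_embedder = tre.TransformersEmbedder("bert-base-cased", output_layer="sum")
concat_embedder = tre.TransformersEmbedder("bert-base-cased", output_layer="concat")

sum_outputs = sum_embedder(**inputs)        # word-level vectors of size 768
concat_outputs = concat_embedder(**inputs)  # word-level vectors of size 3072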
class TransformersEmbedder(torch.nn.Module):
def __init__(
self,
model: Union[str, tr.PreTrainedModel],
return_words: bool = True,
output_layer: str = "last",
fine_tune: bool = True,
return_all: bool = False,
)
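To make the sub-token pooling concrete, here is a minimal, self-contained sketch of the idea behind word-level pooling: sub-tokens that share the same entry in offsets are averaged into a single word vector. This is only an illustration of the technique (the library relies on a scatter-based mean, as noted in the Acknowledgements), and the toy tensors below are made up:

import torch

# toy example: 7 sub-token embeddings (hidden size 4) that belong to 5 words
subtoken_embeddings = torch.randn(7, 4)
offsets = torch.tensor([0, 1, 2, 2, 3, 4, 4])  # sub-token position -> word index (illustrative)

num_words = int(offsets.max()) + 1
word_embeddings = torch.zeros(num_words, 4)
counts = torch.zeros(num_words, 1)

# scatter-add each sub-token vector into its word slot, then divide by the sub-token count
word_embeddings.index_add_(0, offsets, subtoken_embeddings)
counts.index_add_(0, offsets, torch.ones(len(offsets), 1))
word_embeddings = word_embeddings / counts

print(word_embeddings.shape)  # torch.Size([5, 4]): one mean-pooled vector per word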
Tokenizer
The Tokenizer class provides the tokenize method to preprocess the input for the TransformersEmbedder layer. You can pass raw sentences, pre-tokenized sentences, and batches of sentences. It will preprocess them and return a dictionary with the inputs for the model. By passing return_tensors=True it will return the inputs as torch.Tensor.
By default, if you pass text (or a batch) as strings, it splits them on spaces:
text = "This is a sample sentence"
tokenizer(text)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
You can also use SpaCy to pre-tokenize the inputs into words first, using use_spacy=True:
text = "This is a sample sentence"
tokenizer(text, use_spacy=True)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text, use_spacy=True)
or you can pass a pre-tokenized sentence (or batch of sentences) by setting is_split_into_words=True:
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)
text = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True) # here is_split_into_words is redundant
Examples
First, initialize the tokenizer
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
- You can pass a single sentence as a string:
text = "This is a sample sentence"
tokenizer(text)
{
'input_ids': [101, 1188, 1110, 170, 6876, 5650, 102],
'offsets': [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)],
'attention_mask': [True, True, True, True, True, True, True],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
'sentence_length': 7
}
- A sentence pair:
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
{
'input_ids': [101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102],
'attention_mask': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'offsets': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'sentence_length': 15
}
- A batch of sentences or sentence pairs. Using padding=True and return_tensors=True, the tokenizer returns the text ready for the model:
batch = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
["This", "is", "a", "sample", "sentence", "3"],
# ...
["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)
batch_pair = [
["This", "is", "a", "sample", "sentence", "pair", "1"],
["This", "is", "sample", "sentence", "pair", "2"],
["This", "is", "a", "sample", "sentence", "pair", "3"],
# ...
["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
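Once padded and converted to tensors, a batch like this can go straight into the TransformersEmbedder from the quick example above; a minimal sketch (the shape in the comment is indicative, assuming bert-base-cased with word-level pooling):

inputs = tokenizer(batch, padding=True, return_tensors=True)
outputs = model(**inputs)
# one embedding per word for every sentence, padded to the longest sentence in the batch,
# i.e. roughly (batch_size, max_words, 768)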
Custom fields
It is possible to add custom fields to the model input and tell the tokenizer how to pad them using add_padding_ops.
Start by simply tokenizing the input (without padding or tensor mapping)
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
text = [
["This", "is", "a", "sample", "sentence"],
["This", "is", "another", "example", "sentence", "just", "make", "it", "longer"]
]
inputs = tokenizer(text)
Then add the custom fields to the result:
custom_fields = {
"custom_filed_1": [
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
]
}
inputs.update(custom_fields)
Now we can add the padding logic for our custom field custom_filed_1. The add_padding_ops method takes as input:
- key: name of the field in the tokenizer input
- value: value to use for padding
- length: length to pad to. It can be an int, or one of two string values: subtoken, where the element is padded to the batch max length relative to the sub-token length, and word, where the element is padded to the batch max length relative to the original word length
tokenizer.add_padding_ops("custom_filed_1", 0, "word")
Finally, pad the input and convert it to a tensor:
# manual processing
inputs = tokenizer.pad_batch(inputs)
inputs = tokenizer.to_tensor(inputs)
The inputs are ready for the model, including the custom field.
>>> inputs
{
"input_ids": tensor(
[
[101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0],
[101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 102],
]
),
"attention_mask": tensor(
[
[True, True, True, True, True, True, True, False, False, False, False],
[True, True, True, True, True, True, True, True, True, True, True],
]
),
"word_mask": tensor(
[
[True, True, True, True, True, True, True, False, False, False, False],
[True, True, True, True, True, True, True, True, True, True, True],
]
),
"token_type_ids": tensor(
[[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
),
"offsets": tensor(
[
[0, 1, 2, 3, 4, 5, 6, 7, 10, 10, 10],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
]
),
"sentence_length": tensor([7, 11]),
"custom_filed_1": tensor(
[[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]
),
}
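From here the batch is ready for a training or inference loop. Below is a minimal sketch of one possible way to consume it, assuming model is the TransformersEmbedder from the quick example and that the custom field is meant for your own downstream logic (e.g. labels or a mask) rather than for the embedder itself:

# keep the custom field aside and feed the standard fields to the embedder
custom_filed_1 = inputs.pop("custom_filed_1")
outputs = model(**inputs)
# custom_filed_1 is now a tensor padded to the word length, ready to be used
# alongside the word-level embeddings, e.g. in your own loss computation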
SpaCy Tokenizer
By default, it uses the multilingual model xx_sent_ud_sm. You can change it with the language parameter during the tokenizer initialization. For example, if you prefer an English tokenizer:
tokenizer = tre.Tokenizer("bert-base-cased", language="en_core_web_sm")
For a complete list of languages and models, see the SpaCy models documentation.
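Note that the chosen SpaCy model needs to be available locally; if it is not already installed, it can usually be downloaded beforehand with SpaCy's own CLI (for example, python -m spacy download en_core_web_sm for the English model above).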
To-Do
Future developments
- Add an optional word tokenizer, maybe using SpaCy
- Add add_special_tokens wrapper
- Make pad_batch function more general
- Add logic (like how to pad, etc.) for custom fields
  - Documentation
- Include all model outputs
  - Documentation
- A TensorFlow version (improbable)
Acknowledgements
Some of the code in the TransformersEmbedder class is taken from the PyTorch Scatter library. The pretrained models and the core of the tokenizer come from 🤗 Transformers.
File details
Details for the file transformers_embedder-1.8.3.tar.gz.
File metadata
- Download URL: transformers_embedder-1.8.3.tar.gz
- Upload date:
- Size: 16.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 42985b3ce65f8f10e65da7ad98dbd387367fef47819809200b9c29cfcd34e340
MD5 | dfe8e44e90373972da88a9af7362724e
BLAKE2b-256 | 841045591dfbb2285a23c4d822ffbe742ab72c0dff5c71329cf42ccd51082f89
File details
Details for the file transformers_embedder-1.8.3-py3-none-any.whl.
File metadata
- Download URL: transformers_embedder-1.8.3-py3-none-any.whl
- Upload date:
- Size: 14.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | acb1791b16a9370c13ba0fe26ba13293f8ff49b4ffc29aacbc6a4f30223157b8
MD5 | 877a546aee95df5da94f5da2579726d2
BLAKE2b-256 | 329d8e12139adb530f070601c3b89bb552ec70ca37d625bf2affd903f2925566