A score-based implementation of WordPiece tokenization training, compatible with HuggingFace tokenizers.
real-wordpiece
This repository provides a Python implementation of the WordPiece tokenizer, which is compatible with the 🤗 tokenizers library.
Motivation
🤗 tokenizers provides state-of-the-art text tokenizer implementations. It is also used by 🤗 transformers, helping to convert text into a format that can be fed into LLMs or embedding models. 🤗 tokenizers is broadly adopted in the NLP community and has become the de facto standard for tokenization, providing models such as BPE, WordPiece, Unigram, and WordLevel.
Surprisingly, the WordPiece tokenizer described in the brilliant Hugging Face NLP Course is not the same as the one implemented in the 🤗 tokenizers library. That fact is not well known and only vaguely documented. Instead of using the original WordPiece training algorithm, the library uses the ## prefix to mark word continuations and then applies BPE-style merging to build the vocabulary. This is mentioned in the NLP Course, but it might come as a surprise to those who haven't read it.
This library fills that gap by providing a Python implementation of the original WordPiece tokenizer, in the way described in the course.
Installation
real-wordpiece can be installed from PyPI using pip or the package manager of your choice:
pip install real-wordpiece
Usage
Since the 🤗 tokenizers library is written in Rust, it is not possible to directly extend its interfaces from Python. Thus, the real-wordpiece package provides a Python implementation of WordPiece training, which can produce a compatible model, but its interface is slightly different.
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import Tokenizer
from real_wordpiece.trainer import RealWordPieceTrainer
# Create the Tokenizer, in the same way as you would with the 🤗 tokenizers
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()
# Define the training data
training_data = [
"walker walked a long walk",
]
# Finally, train the tokenizer using the RealWordPieceTrainer
trainer = RealWordPieceTrainer(vocab_size=28, special_tokens=["[UNK]"])
trainer.train_tokenizer(training_data, tokenizer)
# tokenizer.model is now a WordPiece instance with the vocabulary trained above
print(tokenizer.encode("walker walked a long walk").tokens)
# Out: ['walk', '##er', 'walk', '##ed', 'a', 'long', 'walk']
In real-world applications the training corpus should be much larger, and the vocab_size should be set to a higher value.
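As a rough sketch of what that might look like, the snippet below reads a plain-text corpus line by line and trains with a larger vocabulary. The file name corpus.txt, the vocabulary size of 30,000, and the output path are placeholder assumptions rather than recommendations from the library; the trainer calls mirror the example above, and save() is the standard 🤗 tokenizers serialization.

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from real_wordpiece.trainer import RealWordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

# Read one training sentence per line (the file name is a placeholder)
with open("corpus.txt", encoding="utf-8") as corpus_file:
    training_data = [line.strip() for line in corpus_file if line.strip()]

# vocab_size of 30,000 is an illustrative choice, not a recommendation
trainer = RealWordPieceTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
trainer.train_tokenizer(training_data, tokenizer)

# Persist the trained tokenizer with the standard 🤗 tokenizers serialization
tokenizer.save("real-wordpiece.json")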
WordPiece basics
WordPiece and Byte-Pair Encoding (BPE) are two of the most popular subword tokenization algorithms, and they have much in common. Let's consider an example and assume we have just a single word in our corpus.
word = "reappear"
The training process of the BPE algorithm starts with a vocabulary that contains all the characters.
vocab = {"r", "e", "a", "p"}
The algorithm iteratively merges the most frequent pair of adjacent tokens in the corpus and adds the merged token to the vocabulary, until the vocabulary reaches the desired size. In the case of the word "reappear", the BPE algorithm would merge the pair ("e", "a") to create the token "ea".
vocab = {"r", "e", "a", "p", "ea"}
The process would continue until the vocabulary reaches the desired size or there are no more pairs to merge.
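To make the merge step concrete, here is a minimal, self-contained sketch of a single BPE merge decision on the toy word above. It only counts adjacent pairs and picks the most frequent one; it is an illustration of the idea, not the 🤗 tokenizers implementation.

from collections import Counter

# Toy word split into single-character tokens
symbols = list("reappear")  # ['r', 'e', 'a', 'p', 'p', 'e', 'a', 'r']

# Count every adjacent pair of tokens
pair_counts = Counter(zip(symbols, symbols[1:]))

# BPE merges the most frequent pair; here ("e", "a") occurs twice
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])
# Out: ('e', 'a') 2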
The WordPiece algorithm is similar to BPE, but it distinguishes the first character of a word from the characters that continue it; the continuation characters are prefixed with ##. The WordPiece algorithm therefore starts with a vocabulary that contains all the word-initial characters as well as their ##-prefixed counterparts.
vocab = {"r", "e", "a", "p", "##r", "##e", "##a", "##p"}
WordPiece also uses a different heuristic to select the pair of tokens to merge. Instead of merging the most frequent pair, WordPiece merges the pair that maximizes a score function defined as:
$$ score(u, v) = \frac{frequency(u, v)}{frequency(u) \cdot frequency(v)} $$
where $u$ and $v$ are tokens, $frequency(u, v)$ is the frequency of the pair in the corpus, and $frequency(u)$ and $frequency(v)$ are the frequencies of the tokens $u$ and $v$ on their own. BPE merges are a bit more intuitive, and the two algorithms may lead to different tokenizations.
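The following sketch applies this score to the same toy word, with the word-initial character kept as-is and the continuations prefixed with ##. It is an illustration of the scoring rule only, not code from real-wordpiece or 🤗 tokenizers.

from collections import Counter

# Toy word with WordPiece-style continuation prefixes
symbols = ["r", "##e", "##a", "##p", "##p", "##e", "##a", "##r"]

token_counts = Counter(symbols)
pair_counts = Counter(zip(symbols, symbols[1:]))

def score(pair):
    u, v = pair
    return pair_counts[pair] / (token_counts[u] * token_counts[v])

# BPE would merge the most frequent pair
print(max(pair_counts, key=pair_counts.get))  # ('##e', '##a'), frequency 2
# WordPiece merges the highest-scoring pair; several pairs tie at 0.5 here,
# so the chosen merge depends on how ties are broken
print(max(pair_counts, key=score))

Even in this tiny example the two criteria diverge: frequency alone singles out ("##e", "##a"), while the score also rewards pairs built from rare tokens such as ("r", "##e") and ("##a", "##r").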
Why does that matter?
The choice of tokenization algorithm is another hyperparameter that can significantly affect model performance. A Large Language Model's ability to solve different tasks might be limited by the tokenizer it uses.