
A score-based implementation of WordPiece tokenization training, compatible with HuggingFace tokenizers.


real-wordpiece

This repository provides a Python implementation of the WordPiece tokenizer, which is compatible with the 🤗 tokenizers.

Motivation

🤗 tokenizers provides state-of-the-art text tokenizer implementations. They are also used in 🤗 transformers, helping to convert text into a format that can be fed into LLMs or embedding models. 🤗 tokenizers is broadly adopted in the NLP community and has become the de facto standard for tokenization, providing models such as Byte-Pair Encoding (BPE), WordPiece, and Unigram.

Surprisingly, the WordPiece tokenizer described in the brilliant Hugging Face NLP Course is not the same as the one implemented in the 🤗 tokenizers library. This fact is not well known and only vaguely documented. Instead of using the original WordPiece training algorithm, the library marks word continuations with a ## prefix and then relies on BPE-style merging to build the vocabulary. This process is mentioned in the NLP Course, but may come as a surprise to those who haven't read it.

HF tokenizers implementation of WordPiece

This library fills the gap by providing a Python implementation of the original WordPiece training algorithm, in the way described in the course.

Installation

real-wordpiece can be installed from PyPI using pip or the package manager of your choice:

pip install real-wordpiece

Usage

Since the 🤗 tokenizers library is written in Rust, it is not possible to directly extend its interfaces with Python. Thus, the real-wordpiece package provides a Python implementation of the WordPiece tokenizer, which can produce a compatible model, but its interface is slightly different.

from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import Tokenizer

from real_wordpiece.trainer import RealWordPieceTrainer

# Create the Tokenizer, in the same way as you would with the 🤗 tokenizers
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

# Define the training data
training_data = [
    "walker walked a long walk",
]

# Finally, train the tokenizer using the RealWordPieceTrainer
trainer = RealWordPieceTrainer(vocab_size=28, special_tokens=["[UNK]"])
trainer.train_tokenizer(training_data, tokenizer)

# tokenizer.model is now the WordPiece model trained above
print(tokenizer.encode("walker walked a long walk").tokens)
# Out: ['walk', '##er', 'walk', '##ed', 'a', 'long', 'walk']

In real-world applications the training corpus should be much larger, and the vocab_size should be set to a higher value.
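Since the trained model is a standard WordPiece instance, it can be persisted and reloaded with the usual 🤗 tokenizers API. A minimal sketch, assuming the example above has already run (the file name is arbitrary):

from tokenizers import Tokenizer

# Serialize the trained tokenizer to a single JSON file
tokenizer.save("real_wordpiece.json")

# Restore it later without retraining
restored = Tokenizer.from_file("real_wordpiece.json")
print(restored.encode("walker walked a long walk").tokens)
# Out: ['walk', '##er', 'walk', '##ed', 'a', 'long', 'walk']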

WordPiece basics

WordPiece and Byte-Pair Encoding (BPE) are two of the most popular subword tokenization algorithms, and they have much in common. Let's consider an example and assume we have just a single word in our corpus.

word = "reappear"

The training process of the BPE algorithm starts with a vocabulary that contains all the characters.

vocab = {"r", "e", "a", "p"}

The algorithm iteratively merges the most frequent pair of adjacent tokens in the corpus and adds the merged token to the vocabulary, until the vocabulary reaches the desired size. In the case of the word "reappear", BPE would merge the pair ("e", "a"), which occurs twice, to create the token "ea".

vocab = {"r", "e", "a", "p", "ea"}

The process would continue until the vocabulary reaches the desired size or there are no more pairs to merge.
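To make the selection rule concrete, here is a toy sketch of a single BPE merge step on that one-word corpus. It only illustrates the counting logic and is not how 🤗 tokenizers implements it internally:

from collections import Counter

# The corpus, pre-split into characters
corpus = [["r", "e", "a", "p", "p", "e", "a", "r"]]  # "reappear"

# Count every adjacent pair of tokens
pair_counts = Counter(
    (word[i], word[i + 1]) for word in corpus for i in range(len(word) - 1)
)

# BPE merges the most frequent pair
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])
# Out: ('e', 'a') 2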

The WordPiece algorithm is similar to BPE, but it distinguishes the first letter of a word from the letters inside it; the inside (continuation) letters are prefixed with ##. WordPiece therefore starts with a vocabulary that contains all the characters along with their ##-prefixed continuation variants.

vocab = {"r", "e", "a", "p", "##r", "##e", "##a", "##p"}

WordPiece also uses a different heuristic to select the pair of tokens to merge. Instead of merging the most frequent pair, WordPiece merges the pair that maximizes a score function defined as:

$$ score(u, v) = \frac{frequency(u, v)}{frequency(u) \cdot frequency(v)} $$

Where $u$ and $v$ are tokens, $frequency(u, v)$ is the frequency of the pair in the corpus, and $frequency(u)$ and $frequency(v)$ are the frequencies of the tokens $u$ and $v$ on their own. BPE's merge rule is a bit more intuitive, and the two algorithms can produce different tokenizations.
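To see how the scoring changes the outcome, here is a toy sketch of one WordPiece merge step on the same one-word corpus. Again, this only illustrates the formula and is not the real-wordpiece internals:

from collections import Counter

# "reappear" with continuation letters prefixed by ##
word = ["r", "##e", "##a", "##p", "##p", "##e", "##a", "##r"]

token_counts = Counter(word)
pair_counts = Counter((word[i], word[i + 1]) for i in range(len(word) - 1))

# score(u, v) = frequency(u, v) / (frequency(u) * frequency(v))
scores = {
    (u, v): count / (token_counts[u] * token_counts[v])
    for (u, v), count in pair_counts.items()
}
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
# Out: ('r', '##e') 0.5  (several pairs tie at 0.5; max() returns the first one seen)

The pair ("##e", "##a") that BPE would merge first only ties here at 0.5: its high individual frequencies inflate the denominator, so WordPiece no longer prefers it over pairs built from rarer tokens. This is exactly how the two algorithms end up with different vocabularies.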

Why does that matter?

The choice of tokenization algorithm is another hyperparameter that can significantly affect model performance. A Large Language Model's ability to solve certain tasks may be limited by the tokenizer it uses.

References

The importance of the tokenization model is becoming more and more apparent.
