
A score-based implementation of WordPiece tokenization training, compatible with HuggingFace tokenizers.


real-wordpiece

This repository provides a Python implementation of the WordPiece tokenizer, which is compatible with the 🤗 tokenizers.

Motivation

🤗 tokenizers provides state-of-the-art text tokenizer implementations. They are also used in 🤗 transformers, helping to convert text into a format that can be fed into LLMs or embedding models. 🤗 tokenizers is broadly adopted in the NLP community and has become the de-facto standard for tokenization.

Surprisingly, the WordPiece tokenizer described in the brilliant Hugging Face NLP Course is not the same as the one implemented in the 🤗 tokenizers library. This fact is not well known and only vaguely documented. Instead of using the original WordPiece algorithm, the library uses a ## prefix to mark the continuation of a word and then applies the BPE algorithm to merge tokens. This behaviour is mentioned in the NLP Course, but may be surprising for those who haven't read it.

See the 🤗 tokenizers implementation of WordPiece.

This library fills the gap by providing a Python implementation of the original WordPiece tokenizer, in a way that is described in the course.

Installation

real-wordpiece can be installed from PyPI using pip or the package manager of your choice:

pip install real-wordpiece

Usage

Since the 🤗 tokenizers library is written in Rust, its interfaces cannot be extended directly from Python. The real-wordpiece package therefore provides a Python implementation of WordPiece training that produces a compatible model, though its interface differs slightly.

from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import Tokenizer

from real_wordpiece.trainer import RealWordPieceTrainer

# Create the Tokenizer, in the same way as you would with the 🤗 tokenizers
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

# Define the training data
training_data = [
    "walker walked a long walk",
]

# Finally, train the tokenizer using the RealWordPieceTrainer
trainer = RealWordPieceTrainer(vocab_size=28, special_tokens=["[UNK]"])
trainer.train_tokenizer(training_data, tokenizer)

# tokenizer.model is now the WordPiece model trained above
print(tokenizer.encode("walker walked a long walk").tokens)
# Out: ['walk', '##er', 'walk', '##ed', 'a', 'long', 'walk']

In real-world applications the training corpus should be much larger, and the vocab_size should be set to a higher value.
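The encoding shown above follows WordPiece's greedy longest-match-first rule at inference time: at each position, the longest piece found in the vocabulary is taken, with word-internal pieces looked up under the ## prefix. A minimal pure-Python sketch of that rule (the vocab below is a hypothetical example, not the trained model's actual vocabulary):

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first lookup; word-internal pieces carry the ## prefix
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            # No matching piece at this position: the whole word becomes [UNK]
            return [unk_token]
        start = end
    return tokens

vocab = {"walk", "##er", "##ed", "a", "long"}
print(wordpiece_encode("walker", vocab))  # ['walk', '##er']
print(wordpiece_encode("walked", vocab))  # ['walk', '##ed']
```

This explains why "walker" splits into "walk" + "##er": "walk" is the longest vocabulary entry starting the word, and "##er" matches the remainder.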

WordPiece basics

WordPiece and Byte-Pair Encoding (BPE) are two of the most popular subword tokenization algorithms, and they have much in common. Let's consider an example and assume we have just a single word in our corpus.

word = "reappear"

The training process of the BPE algorithm starts with a vocabulary that contains all the characters.

vocab = {"r", "e", "a", "p"}

The algorithm iteratively merges the most frequent pair of adjacent tokens in the corpus and adds the result to the vocabulary, until the vocabulary reaches the desired size. In the case of the word "reappear", the BPE algorithm would first merge the pair ("e", "a"), which occurs twice, to create the token "ea".

vocab = {"r", "e", "a", "p", "ea"}

The process would continue until the vocabulary reaches the desired size or there are no more pairs to merge.
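This merge step can be sketched in a few lines of plain Python (an illustration of the idea, not the 🤗 tokenizers implementation):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent token pairs and return the most frequent one
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with the concatenated token
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("reappear")          # ['r', 'e', 'a', 'p', 'p', 'e', 'a', 'r']
pair = most_frequent_pair(tokens)  # ('e', 'a') occurs twice
print(merge_pair(tokens, pair))    # ['r', 'ea', 'p', 'p', 'ea', 'r']
```

Each BPE training iteration repeats these two steps, recording the merge so it can be replayed at inference time.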

The WordPiece algorithm is similar to BPE, but it distinguishes the characters that start a word from those inside it. The word-internal characters are prefixed with ##. The WordPiece algorithm therefore starts with a vocabulary that contains all the characters in both their word-initial and ##-prefixed forms.

vocab = {"r", "e", "a", "p", "##r", "##e", "##a", "##p"}
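Constructing this initial vocabulary is a one-liner (illustrative sketch for the single-word corpus):

```python
word = "reappear"
# Every character in both its word-initial and ##-prefixed form
vocab = set(word) | {"##" + c for c in set(word)}
print(sorted(vocab))
```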

WordPiece also uses a different heuristic to select the pair of tokens to merge. Instead of merging the most frequent pair, WordPiece merges the pair that maximizes a score function defined as:

$$ score(u, v) = \frac{frequency(u, v)}{frequency(u) \cdot frequency(v)} $$

Where $u$ and $v$ are tokens, $frequency(u, v)$ is the frequency of the pair in the corpus, and $frequency(u)$ and $frequency(v)$ are the frequencies of the tokens $u$ and $v$ alone. BPE merges are a bit more intuitive, and the two algorithms may lead to different tokenizations.
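As a sketch, the score can be computed from plain counts. Applied to the ##-prefixed characters of "reappear" (illustrative only), the denominator rewards pairs whose tokens are themselves rare: ("r", "##e") occurs once but ties with the twice-occurring ("##e", "##a"), because "r" appears only once.

```python
from collections import Counter

def wordpiece_scores(tokens):
    # score(u, v) = freq(u, v) / (freq(u) * freq(v))
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    return {
        (u, v): count / (unigrams[u] * unigrams[v])
        for (u, v), count in pairs.items()
    }

# "reappear" with ## marking the word-internal characters
tokens = ["r", "##e", "##a", "##p", "##p", "##e", "##a", "##r"]
scores = wordpiece_scores(tokens)
print(scores[("##e", "##a")])  # 2 / (2 * 2) = 0.5
print(scores[("##a", "##p")])  # 1 / (2 * 2) = 0.25
print(scores[("r", "##e")])    # 1 / (1 * 2) = 0.5
```

By raw pair frequency alone, BPE would pick ("e", "a") outright; under the WordPiece score, pairs built from rarer tokens catch up, which is one way the two algorithms diverge.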

Why does that matter?

The choice of tokenization algorithm is another hyperparameter that can significantly affect the performance of a model. A Large Language Model's ability to solve certain tasks may be limited by the tokenizer it was trained with.

