Project description

bpetokenizer

A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer (tiktoken) and lets you train your own tokenizer. The tokenizer can handle special tokens and uses a customizable regex pattern for tokenization (including the GPT-4 regex pattern). It supports saving and loading tokenizers in json and file formats. The bpetokenizer package also ships with pretrained tokenizers.

Overview

The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units for a given text corpus. This tokenizer can be used to train your own LLM tokenizer on text corpora in various languages.

The algorithm was first introduced in the paper Neural Machine Translation of Rare Words with Subword Units and later used in the GPT-2 tokenizer (Language Models are Unsupervised Multitask Learners).

The notebook shows the BPE algorithm in detail and how the tokenizers work internally.

Every LLM (Llama, Gemini, Mistral, ...) uses its own tokenizer trained on its own text dataset.
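As a rough sketch (not the package's internal code), the core of BPE training is a loop that repeatedly counts adjacent token pairs and merges the most frequent pair into a new token until the target vocabulary size is reached:

# Minimal sketch of the core BPE training loop (illustrative only).
def train_bpe(text: str, vocab_size: int) -> dict:
    tokens = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = {}                          # (left, right) pair -> new token id
    next_id = 256
    while next_id < vocab_size:
        # count how often each adjacent pair occurs
        counts = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges[best] = next_id
        # replace every occurrence of the best pair with the new token id
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return merges

print(train_bpe("aaabdaaabac", 258))  # learns two merges on this toy string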

Features

  • Implements the Byte Pair Encoding (BPE) algorithm.
  • Handles special tokens.
  • Uses a customizable regex pattern for tokenization (see the sketch just below).
  • Compatible with Python 3.9 and above.
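For example, a custom split pattern can be supplied when constructing the tokenizer. This is a minimal sketch and assumes the constructor accepts a pattern argument alongside special_tokens (the exact keyword may differ); the encode call follows the usage shown later in this README:

from bpetokenizer import BPETokenizer

# minimal sketch: a simple word/whitespace/punctuation split pattern instead of
# the default GPT-4 pattern (assumes the constructor takes a `pattern` argument)
custom_pattern = r"\w+|\s+|[^\w\s]+"
tokenizer = BPETokenizer(pattern=custom_pattern)
tokenizer.train("Hello, World! A tiny corpus to fit a custom pattern.", vocab_size=260)
print(tokenizer.encode("Hello, World!", special_tokens="all"))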

This repository has 3 different Tokenizers:

  • BPETokenizer
  • Tokenizer
  • PreTrained
  1. Tokenizer: This class contains the train, encode, and decode functionality along with save and load. It also contains a few helper functions (get_stats, merge, replace_control_characters, ...) used to perform the BPE algorithm for the tokenizer.

  2. BPETokenizer: This class shows the real power of the tokenizer (as used in the GPT-4 tokenizer, tiktoken). It uses the GPT4_SPLIT_PATTERN to split the text as in the GPT-4 tokenizer (see the pattern sketch after this list) and also handles special_tokens (refer to sample_bpetokenizer). It inherits the save and load functionalities to save and load the tokenizer.

  3. PreTrained Tokenizer: The pretrained tokenizer wi17k_base has a vocabulary of 17316 tokens, trained on the wikitext dataset (len: 1000000) with 6 special_tokens.
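Item 2 refers to the GPT-4 split pattern. As a rough illustration of how such a regex pre-splits text into chunks before the byte-level merges are applied, here is the published cl100k_base pattern from tiktoken/minbpe (the package's constant may differ slightly); it needs the third-party regex module for \p{...} classes and possessive quantifiers:

import regex  # third-party `regex` module; the standard `re` cannot handle this pattern

# the GPT-4 (cl100k_base) split pattern as published in tiktoken/minbpe,
# shown here only to illustrate how text is chunked before BPE merges
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT4_SPLIT_PATTERN, "Hello, World! Let's test.")
print(chunks)  # ['Hello', ',', ' World', '!', ' Let', "'s", ' test', '.']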

Usage

This tutorial demonstrates the use of special_tokens with the tokenizer.

Install the package

pip install bpetokenizer

from bpetokenizer import BPETokenizer

special_tokens = {
    "<|endoftext|>": 1001,
    "<|startoftext|>": 1002,
    "[SPECIAL1]": 1003,
    "[SPECIAL2]": 1004,
}

tokenizer = BPETokenizer(special_tokens=special_tokens) # you can also use the method _special_tokens to register the special tokens (if not passed when initializing)
texts = "<|startoftext|> Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.<|endoftext|>"

tokenizer.train(texts, vocab_size=310, verbose=True)
# tokenizer._special_tokens(special_tokens) # if not passed at initialization of the BPETokenizer

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.
Hello, World! This is yet another sample text, with [SPECIAL1] and [SPECIAL2] making an appearance.
Hey there, World! Testing the tokenizer with [SPECIAL1] and [SPECIAL2] to see if it handles special tokens properly.
Salutations, Planet! The tokenizer should recognize [SPECIAL1] and [SPECIAL2] in this long string of text.
Hello again, World! [SPECIAL1] and [SPECIAL2] are special tokens that need to be handled correctly by the tokenizer.
Welcome, World! Including [SPECIAL1] and [SPECIAL2] multiple times in this large text to ensure proper encoding.
Hi, World! Let's add [SPECIAL1] and [SPECIAL2] in various parts of this long sentence to test the tokenizer thoroughly.
<|endoftext|>
"""
ids = tokenizer.encode(encode_text, special_tokens="all")
print(ids)

decode_text = tokenizer.decode(ids)
print(decode_text)

tokenizer.save("sample_bpetokenizer", mode="json") # mode: default is file

Refer to sample_bpetokenizer for an understanding of the vocab and the model file of the tokenizer trained on the texts above.

To Load the Tokenizer

from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()

tokenizer.load("sample_bpetokenizer.json", mode="json")

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""

print("vocab: ", tokenizer.vocab)
print('---')
print("merges: ", tokenizer.merges)
print('---')
print("special tokens: ", tokenizer.special_tokens)

ids = tokenizer.encode(encode_text, special_tokens="all")
print('---')
print(ids)

decode_text = tokenizer.decode(ids)
print('---')
print(decode_text)

# you can also print the tokens and the text chunks split with the pattern.
tokens = tokenizer.tokens(encode_text, verbose=True) # if verbose, prints the text chunks and also the pattern used to split.
print('---')
print("tokens: ", tokens)

Refer to load_json_vocab and run bpetokenizer_json for an overview of the vocab, merges, and special_tokens; to view the tokens the tokenizer splits out using its pattern, look at tokens.

To load the pretrained tokenizers

from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)

texts = """
def get_stats(tokens, counts=None) -> dict:
    "Get statistics of the tokens. Includes the frequency of each consecutive pair of tokens"
    counts = {} if counts is None else counts
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
"""
tokenizer.tokens(texts, verbose=True)
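An encode/decode round trip with the pretrained tokenizer works the same way as with a freshly trained one; a small sketch continuing the snippet above, reusing the encode/decode calls shown earlier:

# small sketch: encode/decode round trip with the pretrained wi17k_base tokenizer,
# reusing the encode/decode calls shown earlier in this README
ids = tokenizer.encode("def get_stats(tokens, counts=None) -> dict:", special_tokens="all")
print(ids)
print(tokenizer.decode(ids))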

For now, there is only a single 17k-vocab pretrained tokenizer, wi17k_base, available at pretrained.

Run Tests

The tests/ folder contains the tests for the tokenizer and uses pytest.

python3 -m pytest

Additionally, workflows are set up to run the tests whenever a PR is opened.
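For reference, a test for the tokenizer might look something like this (an illustrative sketch, not an actual file from tests/; it reuses the train/encode/decode calls shown above):

# illustrative sketch of a pytest-style test (not taken from tests/):
# train a tiny tokenizer and assert that encode/decode round-trips the text
from bpetokenizer import BPETokenizer

def test_encode_decode_roundtrip():
    tokenizer = BPETokenizer()
    text = "Hello, World! A tiny corpus for a tiny test."
    tokenizer.train(text, vocab_size=260)
    ids = tokenizer.encode(text, special_tokens="all")
    assert tokenizer.decode(ids) == text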

Contributing

Contributions to the BPE Tokenizer are most welcome! If you would like to contribute, please follow these steps:

  • Star and Fork the repository.
  • Create a new branch (git checkout -b feature/your-feature).
  • Commit your changes (git commit -m 'Add some feature').
  • Push to the branch (git push origin feature/your-feature).
  • Create a new Pull Request.

Please ensure your code follows the project's coding standards and includes appropriate tests. Also, update the documentation as necessary.

License

This project is licensed under the MIT License.


*This tokenizer is inspired by minbpe, but more optimized.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bpetokenizer-1.2.1.tar.gz (250.3 kB)

Uploaded Source

Built Distribution

bpetokenizer-1.2.1-py3-none-any.whl (247.8 kB)

Uploaded Python 3

File details

Details for the file bpetokenizer-1.2.1.tar.gz.

File metadata

  • Download URL: bpetokenizer-1.2.1.tar.gz
  • Upload date:
  • Size: 250.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for bpetokenizer-1.2.1.tar.gz

  • SHA256: 593cde30cc9de777c55a10ba91885ebbd05cd8ac95593a81d1b81f1e7a62af40
  • MD5: 824e18d9ba7355f7bc626ca90c90cc7e
  • BLAKE2b-256: 66f692b3986e527e712d7501a59a445432184b57dea18928a8ac642f9070fc67


File details

Details for the file bpetokenizer-1.2.1-py3-none-any.whl.

File metadata

  • Download URL: bpetokenizer-1.2.1-py3-none-any.whl
  • Upload date:
  • Size: 247.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for bpetokenizer-1.2.1-py3-none-any.whl

  • SHA256: 560a97281d7f79c57a8e404ca1fd478e84c37a6e87138133d0f7c8c05d963149
  • MD5: 8949f3bf0924b4aed4ae3bbcb7a2619e
  • BLAKE2b-256: aae8ba20b383752b8acc3eb0440b263e5c44e2286f749fb5133b99ba7ff1f4ba

