A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer (tiktoken) and lets you train your own tokenizer. It handles special tokens and uses a customizable regex pattern for splitting text (the GPT-4 regex pattern is included). It supports `save` and `load` of tokenizers in the `json` and `file` formats. The `bpetokenizer` package also supports [pretrained](bpetokenizer/pretrained/) tokenizers.
bpetokenizer
Overview
The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units from a given text corpus. This tokenizer can be used to train an LLM tokenizer on text corpora in various languages.
The algorithm was first applied to subword tokenization in the paper Neural Machine Translation of Rare Words with Subword Units and was later used in the GPT-2 tokenizer (Language Models are Unsupervised Multitask Learners).
The accompanying notebook shows the BPE algorithm in detail and how the tokenizers work internally.
Every LLM (Llama, Gemini, Mistral, ...) uses its own tokenizer, trained on its own text dataset.
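To make the idea concrete, here is a minimal sketch of byte-level BPE training. It is not the package's implementation: the function names `get_stats` and `merge` echo the helpers mentioned in this README, but the signatures, the `train_bpe` function, and the toy input are assumptions made for illustration.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn up to `vocab_size - 256` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break  # fewer than two tokens left, nothing to merge
        # Most frequent pair; ties go to the pair seen first.
        pair = max(stats, key=stats.get)
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", vocab_size=259)
print(merges)  # learned (left, right) -> new id mapping
print(ids)     # the training text after all merges were applied
```

Each round replaces the most frequent adjacent pair with a fresh id, so the sequence shrinks while the vocabulary grows by one entry per merge.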
Features
- Implements Byte Pair Encoding (BPE) algorithm.
- Handles special tokens.
- Uses a customizable regex pattern for tokenization.
- Compatible with Python 3.9 and above.
This repository has 3 different tokenizers:
- BPETokenizer
- Tokenizer
- PreTrained

- Tokenizer: This class contains `train`, `encode`, `decode`, and the functionality to `save` and `load`. It also contains a few helper functions (`get_stats`, `merge`, `replace_control_characters`, ...) that perform the BPE algorithm for the tokenizer.
- BPETokenizer: This class shows the real power of the tokenizer (as used in the GPT-4 tokenizer, tiktoken). It uses the `GPT4_SPLIT_PATTERN` to split the text as in the GPT-4 tokenizer and also handles `special_tokens` (refer to sample_bpetokenizer). It inherits the `save` and `load` functionality to save and load the tokenizer.
- PreTrained Tokenizer: The pretrained tokenizer wi17k_base has a vocabulary of 17316 tokens. It was trained on the wikitext dataset (len: 1000000) and has 6 special_tokens.
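The pre-tokenization step described above (keep special tokens whole, then split the remaining text with a regex) can be sketched with the standard library. This is an illustration only: `SIMPLE_SPLIT_PATTERN` is a simplified stand-in and not the real `GPT4_SPLIT_PATTERN`, and the function names `split_with_special_tokens` and `pre_tokenize` are hypothetical.

```python
import re

# Simplified stand-in for a GPT-style split pattern (an assumption for this sketch):
# runs of ASCII letters, runs of digits, or runs of other symbols, each with an
# optional leading space, plus any leftover whitespace.
SIMPLE_SPLIT_PATTERN = r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"


def split_with_special_tokens(text, special_tokens):
    """Split text into (piece, is_special) pairs, keeping special tokens whole."""
    if not special_tokens:
        return [(text, False)]
    # Longest first, so an overlapping shorter token never wins the match.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"
    pieces = re.split(pattern, text)  # the capturing group keeps the delimiters
    return [(piece, piece in special_tokens) for piece in pieces if piece]


def pre_tokenize(text, special_tokens):
    """Return the chunks that BPE merges would later be applied to, one by one."""
    chunks = []
    for piece, is_special in split_with_special_tokens(text, special_tokens):
        if is_special:
            chunks.append(piece)  # never split or merged across
        else:
            chunks.extend(re.findall(SIMPLE_SPLIT_PATTERN, piece))
    return chunks


special_tokens = {"<|endoftext|>": 1001, "[SPECIAL1]": 1003}
print(pre_tokenize("Hello, World! [SPECIAL1] ok<|endoftext|>", special_tokens))
```

Splitting before merging keeps merges from crossing word, number, and punctuation boundaries, and the special tokens map directly to their reserved ids instead of going through BPE.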
Usage
This tutorial demonstrates the use of `special_tokens` in the tokenizer.
Install the package
pip install bpetokenizer
from bpetokenizer import BPETokenizer
special_tokens = {
"<|endoftext|>": 1001,
"<|startoftext|>": 1002,
"[SPECIAL1]": 1003,
"[SPECIAL2]": 1004,
}
tokenizer = BPETokenizer(special_tokens=special_tokens) # you can also use the method _special_tokens to register the special tokens (if not passed when initializing)
texts = "<|startoftext|> Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.<|endoftext|>"
tokenizer.train(texts, vocab_size=310, verbose=True)
# tokenizer._special_tokens(special_tokens) # if not passed when initializing the BPETokenizer
encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.
Hello, World! This is yet another sample text, with [SPECIAL1] and [SPECIAL2] making an appearance.
Hey there, World! Testing the tokenizer with [SPECIAL1] and [SPECIAL2] to see if it handles special tokens properly.
Salutations, Planet! The tokenizer should recognize [SPECIAL1] and [SPECIAL2] in this long string of text.
Hello again, World! [SPECIAL1] and [SPECIAL2] are special tokens that need to be handled correctly by the tokenizer.
Welcome, World! Including [SPECIAL1] and [SPECIAL2] multiple times in this large text to ensure proper encoding.
Hi, World! Let's add [SPECIAL1] and [SPECIAL2] in various parts of this long sentence to test the tokenizer thoroughly.
<|endoftext|>
"""
ids = tokenizer.encode(encode_text, special_tokens="all")
print(ids)
decode_text = tokenizer.decode(ids)
print(decode_text)
tokenizer.save("sample_bpetokenizer", mode="json") # mode: default is file
Refer to sample_bpetokenizer to understand the vocab and the model file of the tokenizer trained on the above texts.
To Load the Tokenizer
from bpetokenizer import BPETokenizer
tokenizer = BPETokenizer()
tokenizer.load("sample_bpetokenizer.json", mode="json")
encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""
print("vocab: ", tokenizer.vocab)
print('---')
print("merges: ", tokenizer.merges)
print('---')
print("special tokens: ", tokenizer.special_tokens)
ids = tokenizer.encode(encode_text, special_tokens="all")
print('---')
print(ids)
decode_text = tokenizer.decode(ids)
print('---')
print(decode_text)
# you can also print the tokens and the text chunks split with the pattern.
tokens = tokenizer.tokens(encode_text, verbose=True) # if verbose, prints the text chunks and also the pattern used to split.
print('---')
print("tokens: ", tokens)
Refer to load_json_vocab and run bpetokenizer_json to get an overview of `vocab`, `merges`, and `special_tokens`. To view the tokens that the tokenizer splits using the pattern, look at tokens.
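How `vocab` and `merges` relate to the ids and tokens printed above can be sketched as follows. This is a minimal illustration under assumptions, not the package's code: the helper names, the merge table, and the dictionary layout are hypothetical, and special tokens and regex splitting are left out.

```python
def build_vocab(merges):
    """Map each token id to its bytes: 256 base bytes plus one entry per merge."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Process merges in the order they were learned, so both halves already exist.
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def encode(text, merges):
    """Apply learned merges to the UTF-8 bytes of `text`, earliest merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # The pair with the lowest merge id was learned first, so it goes first.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids, vocab):
    """Concatenate the bytes of each id and decode them as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Hypothetical merge table, as it might look after training on a tiny corpus.
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
vocab = build_vocab(merges)
ids = encode("aaabdaaabac", merges)
print(ids)
print([vocab[i] for i in ids])  # the byte string behind each token id
print(decode(ids, vocab))
```

Decoding only needs `vocab`, while encoding only needs `merges`, which is why a saved tokenizer has to carry the merge table and can rebuild the vocabulary from it.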
To load the pretrained tokenizers
from bpetokenizer import BPETokenizer
tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)
texts = """
def get_stats(tokens, counts=None) -> dict:
"Get statistics of the tokens. Includes the frequency of each consecutive pair of tokens"
counts = {} if counts is None else counts
for pair in zip(tokens, tokens[1:]):
counts[pair] = counts.get(pair, 0) + 1
return counts
"""
tokenizer.tokens(texts, verbose=True)
For now, there is only a single 17k-vocabulary tokenizer, wi17k_base, available at pretrained.
Run Tests
The tests/ folder contains the tests for the tokenizer, which use pytest.
python3 -m pytest
Additionally, the workflows are set up to run the tests when a PR is opened.
Contributing
Contributions to the BPE Tokenizer are most welcome! If you would like to contribute, please follow these steps:
- Star and Fork the repository.
- Create a new branch (git checkout -b feature/your-feature).
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/your-feature).
- Create a new Pull Request.
Please ensure your code follows the project's coding standards and includes appropriate tests. Also, update the documentation as necessary.
License
This project is licensed under the MIT License.
*This tokenizer is inspired by minbpe, but is more optimized.