Byte-pair encoding tokenizer built from scratch
BPE from Scratch
A ground-up Python implementation of Byte-Level Byte Pair Encoding (BBPE) — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, and other modern LLMs.
What is Byte-Level BPE?
BPE was originally a data compression algorithm that replaces the most frequent pair of bytes in a sequence with a single unused symbol. Applied to NLP, it builds a subword vocabulary by iteratively merging the most frequent adjacent token pairs in a corpus.
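As a rough illustration of the original compression idea, the sketch below performs a single substitution step on a short string. The function name `compress_step` and the choice of `"Z"` as the unused symbol are assumptions made for this example; they are not part of this package.

```python
from collections import Counter


def compress_step(data: str, unused_symbol: str) -> tuple[str, str]:
    """Replace the most frequent adjacent pair in `data` with `unused_symbol`.

    Assumes `data` has at least two characters and that `unused_symbol`
    does not already occur in it.
    """
    # Count every adjacent two-character window (overlapping windows included).
    pair_counts = Counter(a + b for a, b in zip(data, data[1:]))
    most_frequent_pair, _ = pair_counts.most_common(1)[0]
    # str.replace substitutes non-overlapping occurrences from left to right.
    return data.replace(most_frequent_pair, unused_symbol), most_frequent_pair


compressed, pair = compress_step("aaabdaaabac", "Z")
print(pair, compressed)  # aa ZabdZabac
```

Repeating this step with a fresh symbol each time, and recording the substitutions, gives the compression scheme. The NLP variant keeps the recorded substitutions as its merge rules.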
The byte-level variant (introduced by OpenAI for GPT-2) operates directly on raw UTF-8 bytes rather than characters or words:
- Base vocabulary of 256: one token per possible byte value (0–255), no unknown tokens ever
- Language-agnostic: any Unicode text (code, math, emoji, CJK, ...) is representable without a special `<UNK>` token
- Lossless: encoding and decoding are exact roundtrips
- Merges learned greedily: at each step, the most frequent adjacent pair is merged and assigned a new token ID (256, 257, ...)
This is the same fundamental approach used by tiktoken (OpenAI) and Hugging Face tokenizers for GPT-style models.
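The byte-level properties listed above can be checked with the standard library alone. This is a minimal sketch, not code from this package; the helper names `text_to_byte_tokens` and `byte_tokens_to_text` are made up for illustration.

```python
def text_to_byte_tokens(text: str) -> list[int]:
    """Map text to its UTF-8 bytes, each one an integer token in [0, 255]."""
    return list(text.encode("utf-8"))


def byte_tokens_to_text(tokens: list[int]) -> str:
    """Invert text_to_byte_tokens for a token list that is valid UTF-8."""
    return bytes(tokens).decode("utf-8")


sample = "héllo 👋"
tokens = text_to_byte_tokens(sample)

print(len(sample), len(tokens))              # 7 characters become 11 byte tokens
print(all(0 <= t <= 255 for t in tokens))    # True: nothing falls outside the base vocabulary
print(byte_tokens_to_text(tokens) == sample) # True: the roundtrip is exact
```

Non-ASCII characters expand to several byte tokens (here `é` takes two and the emoji takes four), which is the cost that learned merges later win back.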
Algorithm Phases
| Phase | Description |
|---|---|
| 1 | UTF-8 encoding — normalize and encode input text to bytes |
| 2 | Byte → token conversion — represent each byte as an integer token ID in [0, 255] |
| 3 | Pair counting — count all adjacent token pairs in the sequence |
| 4 | Merge — replace the most frequent pair with a new token ID |
| 5 | Repeat — iterate until the target vocabulary size is reached |
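The five phases can be sketched in a few lines of plain Python. This is an illustrative outline under simplifying assumptions (no text normalization, no pre-splitting of the input, ties broken by first occurrence), and the names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical rather than this package's API.

```python
from collections import Counter


def count_pairs(tokens: list[int]) -> Counter:
    """Phase 3: count every adjacent token pair in the sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Phase 4: replace each non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Phases 1, 2, and 5: encode to bytes, then merge until vocab_size is reached."""
    tokens = list(text.encode("utf-8"))   # phases 1 and 2
    merges: dict[tuple[int, int], int] = {}
    for new_id in range(256, vocab_size):  # phase 5
        counts = count_pairs(tokens)
        if not counts:                     # fewer than two tokens left
            break
        best_pair = max(counts, key=counts.get)
        merges[best_pair] = new_id
        tokens = merge_pair(tokens, best_pair, new_id)
    return merges


merges = train_bpe("aaabdaaabac", vocab_size=258)
print(merges)  # two merge rules, with IDs 256 and 257
```

Recounting all pairs after every merge is quadratic in the worst case. It keeps the sketch short, but a production tokenizer would update the counts incrementally.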
Installation
pip install bpe-from-scratch
Usage
Train from scratch
from bpe_from_scratch import ByteLevelBPE
bpe = ByteLevelBPE()
# text: the training corpus, loaded beforehand
bpe.train(text, vocab_size=1024)  # 1024 total tokens = 256 byte tokens + 768 merge rules
bpe.save("my_model.json")
Or train directly from a folder of .txt files:
from bpe_from_scratch.train import train_from_folder
bpe = train_from_folder(
    folder_path="data/corpus_A/",
    model_path="my_model.json",
    vocab_size=1024,
)
Encode and decode
tokens = bpe.encode("Hello, world!") # list[int]
text = bpe.decode(tokens) # "Hello, world!"
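For intuition about what encoding and decoding involve, here is a minimal sketch that works from a table of merge rules. It assumes the rules are stored as a mapping from token pair to new token ID and that rules are applied in the order they were learned. The package's internal representation may differ, and the function names are hypothetical.

```python
def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned merges to the UTF-8 bytes of `text`, earliest rule first."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        pairs = set(zip(tokens, tokens[1:]))
        # The lowest merge ID marks the rule that was learned first.
        candidate = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if candidate not in merges:
            break  # no remaining pair has a merge rule
        tokens = merge_pair(tokens, candidate, merges[candidate])
    return tokens


def decode(tokens: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand every token back to its bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8", errors="replace")


# Hand-written rules for illustration: "he" -> 256, "ll" -> 257, "hell" -> 258
example_merges = {(104, 101): 256, (108, 108): 257, (256, 257): 258}
ids = encode("hello", example_merges)
print(ids)                          # [258, 111]
print(decode(ids, example_merges))  # hello
```

The `errors="replace"` argument only matters when decoding token sequences that did not come from `encode`, since an arbitrary sequence can end in the middle of a multi-byte character.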
Continue training on new data
Load an existing model and extend the vocabulary without discarding what was already learned:
bpe = ByteLevelBPE()
bpe.load("my_model.json")
bpe.continue_train(new_text, new_vocab_size=1280) # extend to 1280 total tokens
bpe.save("my_model.json")
All previously learned token IDs remain stable — documents encoded with the old model are still valid after the update.
Or use the folder helper:
from bpe_from_scratch.train import continue_train_from_folder
bpe = continue_train_from_folder(
    folder_path="data/corpus_B/",
    model_path="my_model.json",
    new_vocab_size=1280,
)
Note: `continue_train` uses a frozen-base approach: existing merges are replayed on the new text before new rules are learned. This keeps token IDs stable but is not equivalent to a full retrain on the combined corpus. See TRAINING_GUIDE.md for details and tradeoffs.
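The frozen-base idea can be sketched as follows. This is an assumption-laden outline rather than the package's actual `continue_train`: it supposes merge rules are kept as a pair-to-ID mapping with contiguous IDs starting at 256, and the name `continue_training` is hypothetical.

```python
from collections import Counter


def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def continue_training(
    merges: dict[tuple[int, int], int], new_text: str, new_vocab_size: int
) -> dict[tuple[int, int], int]:
    """Extend `merges` using `new_text` without changing any existing rule."""
    tokens = list(new_text.encode("utf-8"))
    # Frozen base: replay the old rules in the order they were learned.
    for pair, token_id in sorted(merges.items(), key=lambda kv: kv[1]):
        tokens = merge_pair(tokens, pair, token_id)
    extended = dict(merges)  # old pairs keep their old IDs
    # New rules continue from the next free ID (assumes contiguous IDs from 256).
    for new_id in range(256 + len(merges), new_vocab_size):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        best_pair = max(counts, key=counts.get)
        extended[best_pair] = new_id
        tokens = merge_pair(tokens, best_pair, new_id)
    return extended


base = {(97, 98): 256}  # "ab" -> 256, learned earlier
extended = continue_training(base, "abcabcabc", new_vocab_size=258)
print(extended)  # {(97, 98): 256, (256, 99): 257}
```

Because new rules are chosen from pair counts in the new text only, after the old merges have been applied, the result can differ from what a fresh run over the combined corpus would learn.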
Project Structure
src/bpe.py # Core implementation
tests/test_bpe.py # Unit tests
tests/manual/ # Interactive notebooks for experimentation
Running Tests
PYTHONPATH=src python3 -m unittest discover -s tests -v
Acknowledgements
Inspired by Andrej Karpathy's minbpe.
References
- minbpe — Minimal BPE implementation by Andrej Karpathy
- GPT-2 Paper — Language Models are Unsupervised Multitask Learners (Radford et al., 2019)
- Byte-Pair Encoding tokenization — Hugging Face NLP Course
- Neural Machine Translation of Rare Words with Subword Units — original BPE-for-NLP paper (Sennrich et al., 2016)