Byte-pair encoding tokenizer built from scratch

BPE from Scratch

A ground-up Python implementation of Byte-Level Byte Pair Encoding (BBPE) — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, and other modern LLMs.

What is Byte-Level BPE?

BPE was originally a data compression algorithm that replaces the most frequent pair of bytes in a sequence with a single unused symbol. Applied to NLP, it builds a subword vocabulary by iteratively merging the most frequent adjacent token pairs in a corpus.

The byte-level variant (introduced by OpenAI for GPT-2) operates directly on raw UTF-8 bytes rather than characters or words:

  • Base vocabulary of 256 — one token per possible byte value (0–255), no unknown tokens ever
  • Language-agnostic — any Unicode text (code, math, emoji, CJK, ...) is representable without a special <UNK> token
  • Lossless — encoding and decoding are exact roundtrips
  • Merges learned greedily — at each step, the most frequent adjacent pair is merged and assigned a new token ID (256, 257, ...)

This is the same fundamental approach used by tiktoken (OpenAI) and Hugging Face tokenizers for GPT-style models.
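
The properties listed above can be checked with plain Python, independent of this package. The snippet below is only an illustration of the byte-level idea (the sample string is arbitrary): it shows that any Unicode text maps onto the 256-entry base vocabulary and that the mapping is reversible.

```python
# Illustration only: plain Python, not this package's API.
text = "héllo 👋 世界"

# Every string becomes a sequence of integers in [0, 255].
byte_tokens = list(text.encode("utf-8"))

# No value can fall outside the 256-entry base vocabulary,
# so an <UNK> token is never needed.
assert all(0 <= b <= 255 for b in byte_tokens)

# Decoding the same bytes recovers the original text exactly.
roundtrip = bytes(byte_tokens).decode("utf-8")
assert roundtrip == text

# Non-ASCII characters cost more than one base token each,
# which is what the learned merges later compress.
print(len(text), "characters ->", len(byte_tokens), "byte tokens")
```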

Algorithm Phases

Phase  Description
1      UTF-8 encoding: normalize and encode input text to bytes
2      Byte → token conversion: represent each byte as an integer token ID in [0, 255]
3      Pair counting: count all adjacent token pairs in the sequence
4      Merge: replace the most frequent pair with a new token ID
5      Repeat: iterate until the target vocabulary size is reached
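
The five phases can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not necessarily how this package implements them: the function names `count_pairs`, `merge`, and `train` are hypothetical, ties between equally frequent pairs are broken by first occurrence (an assumption), and any pre-tokenization step is left out.

```python
from collections import Counter

def count_pairs(ids):
    """Phase 3: count every adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Phase 4: replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Phases 1-5: learn up to (vocab_size - 256) merge rules from `text`."""
    ids = list(text.encode("utf-8"))      # phases 1-2: text -> bytes -> IDs
    merges = {}                           # (left, right) -> new token ID
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:
            break                         # nothing left to merge
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        ids = merge(ids, best, new_id)
        merges[best] = new_id
    return merges, ids

merges, ids = train("aaabdaaabac", vocab_size=259)
print(merges)  # three learned rules with IDs 256, 257, 258
print(ids)     # the training text re-expressed with merged tokens
```

Each iteration rescans the whole sequence, which is fine for a demonstration but slow on a large corpus.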

Installation

pip install bpe-from-scratch

Usage

Train from scratch

from bpe_from_scratch import ByteLevelBPE

bpe = ByteLevelBPE()
bpe.train(text, vocab_size=1024)  # 1024 total tokens, 768 merge rules
bpe.save("my_model.json")

Or train directly from a folder of .txt files:

from bpe_from_scratch.train import train_from_folder

bpe = train_from_folder(
    folder_path="data/corpus_A/",
    model_path="my_model.json",
    vocab_size=1024,
)

Utilize all CPU cores

Pass num_workers=os.cpu_count() to parallelize pre-tokenization across all cores (useful for large corpora):

import os
from bpe_from_scratch import ByteLevelBPE

bpe = ByteLevelBPE()
bpe.train(text, vocab_size=50_257, num_workers=os.cpu_count())

Or via the folder helper:

import os
from bpe_from_scratch.train import train_from_folder

train_from_folder(
    folder_path="data/corpus/",
    model_path="my_model.json",
    vocab_size=50_257,
    num_workers=os.cpu_count(),
)

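Note: On Windows and macOS (since Python 3.8), multiprocessing starts workers with spawn by default, so the call site must be guarded with if __name__ == "__main__":. Linux has traditionally defaulted to fork, where the guard is not strictly required, but adding it is harmless and keeps the script portable across platforms and Python versions.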

Encode and decode

tokens = bpe.encode("Hello, world!")  # list[int]
text   = bpe.decode(tokens)           # "Hello, world!"
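
To make the roundtrip concrete, here is a hedged sketch of how encoding and decoding typically work once a merge table exists. The standalone `encode` and `decode` functions and the two-rule merge table are hypothetical and for illustration only; the package's own methods may differ in detail.

```python
def encode(text, merges):
    """Apply learned merges, earliest-learned rule first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Among the adjacent pairs present, pick the one learned earliest.
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=merges.get)
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, merges):
    """Expand every token back to its bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Hypothetical merge table: "l" + "l" -> 256, "H" + "e" -> 257.
merges = {(108, 108): 256, (72, 101): 257}
tokens = encode("Hello, world!", merges)
print(tokens)
print(decode(tokens, merges))
```

Decoding needs no search: each token ID maps to a fixed byte string, so it is a plain concatenation.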

Continue training on new data

Load an existing model and extend the vocabulary without discarding what was already learned:

from bpe_from_scratch import ByteLevelBPE

bpe = ByteLevelBPE()
bpe.load("my_model.json")
bpe.continue_train(new_text, new_vocab_size=1280)  # extend to 1280 total tokens
bpe.save("my_model.json")

All previously learned token IDs remain stable: token sequences produced by the old model still decode to the same text after the update.

Or use the folder helper:

from bpe_from_scratch.train import continue_train_from_folder

bpe = continue_train_from_folder(
    folder_path="data/corpus_B/",
    model_path="my_model.json",
    new_vocab_size=1280,
)

Note: continue_train uses a frozen-base approach — existing merges are replayed on the new text before new rules are learned. This keeps token IDs stable but is not equivalent to a full retrain on the combined corpus. See TRAINING_GUIDE.md for details and tradeoffs.
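
The frozen-base idea described in the note can be sketched as follows. This is an assumption-laden illustration, not the package's actual code: the standalone `continue_train` function, its signature, and the tie-breaking rule are hypothetical.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def continue_train(new_text, merges, new_vocab_size):
    """Extend `merges` up to `new_vocab_size` tokens; existing IDs never change."""
    ids = list(new_text.encode("utf-8"))
    # 1. Replay the frozen rules in learned order, so the new text is
    #    tokenized exactly as the old model would tokenize it.
    for pair, tok in sorted(merges.items(), key=lambda kv: kv[1]):
        ids = merge(ids, pair, tok)
    # 2. Learn new rules on top, numbering them after the highest old ID.
    extended = dict(merges)
    next_id = max(merges.values(), default=255) + 1
    for new_id in range(next_id, new_vocab_size):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        ids = merge(ids, best, new_id)
        extended[best] = new_id
    return extended

old = {(97, 97): 256}                      # "a" + "a" -> 256
new = continue_train("aaabaaab", old, new_vocab_size=258)
print(new)
```

Because new rules are chosen from pair counts on the new text alone, after the old merges have been applied, the result can differ from a retrain on the combined corpus, which counts pairs over all the data from the first merge onward.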

Project Structure

src/bpe.py        # Core implementation
tests/test_bpe.py # Unit tests
tests/manual/     # Interactive notebooks for experimentation

Running Tests

PYTHONPATH=src python3 -m unittest discover -s tests -v

Acknowledgements

Inspired by Andrej Karpathy's minbpe.
