Byte-pair encoding tokenizer built from scratch

BPE from Scratch

A ground-up Python implementation of Byte-Level Byte Pair Encoding (BBPE) — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, and other modern LLMs.

What is Byte-Level BPE?

BPE was originally a data compression algorithm that replaces the most frequent pair of bytes in a sequence with a single unused symbol. Applied to NLP, it builds a subword vocabulary by iteratively merging the most frequent adjacent token pairs in a corpus.
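As a small illustration of the compression idea, the sketch below performs one merge step on a short string. The helper names `most_frequent_pair` and `replace_pair` and the placeholder symbol `"Z"` are made up for this example and are not taken from this project's code.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair; the first pair seen wins ties."""
    counts = Counter(zip(seq, seq[1:]))
    return max(counts, key=counts.get)


def replace_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # skip both members of the merged pair
        else:
            out.append(seq[i])
            i += 1
    return out


seq = list("aaabdaaabac")
pair = most_frequent_pair(seq)        # ('a', 'a'), which occurs 4 times
merged = replace_pair(seq, pair, "Z")  # "Z" stands in for an unused symbol
print("".join(merged))                # ZabdZabac
```

Repeating this step on the shortened sequence is what builds up a vocabulary of progressively longer units.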

The byte-level variant (introduced by OpenAI for GPT-2) operates directly on raw UTF-8 bytes rather than characters or words:

  • Base vocabulary of 256: one token per possible byte value (0–255), so every input is covered before any merge is learned
  • Language-agnostic: any Unicode text (code, math, emoji, CJK, ...) is representable without a special <UNK> token
  • Lossless: decoding an encoded sequence reproduces the original text exactly
  • Merges learned greedily: at each step, the most frequent adjacent pair is merged and assigned a new token ID (256, 257, ...)

This is the same fundamental approach used by tiktoken (OpenAI) and Hugging Face tokenizers for GPT-style models.
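The base-vocabulary and lossless properties can be checked with nothing but the standard library. This is a minimal sketch; `to_byte_tokens` and `from_byte_tokens` are hypothetical helper names, not functions from this package.

```python
def to_byte_tokens(text):
    """Encode text as UTF-8 and expose each byte as an integer token ID."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids):
    """Inverse of to_byte_tokens: rebuild the bytes and decode them as UTF-8."""
    return bytes(ids).decode("utf-8")


sample = "def f(x): return x  # \u4e16\u754c \U0001F44B"
ids = to_byte_tokens(sample)

# Every token falls in the 256-entry base vocabulary, whatever the script.
print(all(0 <= i <= 255 for i in ids))
# Non-ASCII characters expand to several bytes, so there are more tokens than characters.
print(len(sample), len(ids))
# Decoding the tokens gives back the original string.
print(from_byte_tokens(ids) == sample)
```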

Algorithm Phases

Phase  Description
1      UTF-8 encoding: normalize and encode input text to bytes
2      Byte → token conversion: represent each byte as an integer token ID in [0, 255]
3      Pair counting: count all adjacent token pairs in the sequence
4      Merge: replace the most frequent pair with a new token ID
5      Repeat: iterate until the target vocabulary size is reached
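The phases above can be sketched end to end in a few dozen lines. This is an illustrative outline under stated assumptions, not the code in src/bpe.py: the function names `train`, `merge`, `encode`, and `decode` are hypothetical, the normalization part of phase 1 is skipped, and ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Phase 4: replace each left-to-right occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn merges until the vocabulary reaches vocab_size (at least 256)."""
    ids = list(text.encode("utf-8"))           # phases 1-2 (no normalization here)
    merges = {}                                # (left, right) -> new ID, in learned order
    for new_id in range(256, vocab_size):      # phase 5: repeat
        counts = Counter(zip(ids, ids[1:]))    # phase 3: count adjacent pairs
        if not counts:
            break                              # fewer than two tokens left
        pair = max(counts, key=counts.get)     # most frequent pair, first seen wins ties
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Apply the learned merges to new text in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand every token back to its bytes and decode the result as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train("aaabdaaabac", vocab_size=259)  # room for 3 merges
tokens = encode("aaabdaaabac", merges)
print(tokens)                                  # [258, 100, 258, 97, 99]
print(decode(tokens, merges))                  # aaabdaaabac
```

Recounting every pair after each merge keeps the sketch short but costs a full pass over the sequence per merge; production tokenizers typically update counts incrementally instead.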

Project Structure

src/bpe.py        # Core implementation
tests/test_bpe.py # Unit tests
tests/manual/     # Interactive notebooks for experimentation

Running Tests

PYTHONPATH=src python3 -m unittest discover -s tests -v

Acknowledgements

Inspired by Andrej Karpathy's minbpe.
