Byte-pair encoding tokenizer built from scratch

BPE from Scratch

A ground-up Python implementation of Byte-Level Byte Pair Encoding (BBPE) — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, and other modern LLMs.

What is Byte-Level BPE?

BPE was originally a data compression algorithm that replaces the most frequent pair of bytes in a sequence with a single unused symbol. Applied to NLP, it builds a subword vocabulary by iteratively merging the most frequent adjacent token pairs in a corpus.
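
To make the compression idea concrete, here is a minimal sketch of a single merge step on a toy string. The helper names `most_frequent_pair` and `replace_pair` and the placeholder symbol `"Z"` are illustrative assumptions, not part of this project's API.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most frequent adjacent pair in seq."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def replace_pair(seq, pair, symbol):
    """Replace each non-overlapping occurrence of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

data = list("aaabdaaabac")
pair = most_frequent_pair(data)             # ('a', 'a') occurs 4 times
compressed = replace_pair(data, pair, "Z")  # "Z" stands in for an unused symbol
print(pair, "".join(compressed))            # ('a', 'a') ZabdZabac
```

Repeating this step on the compressed output is what builds up longer and longer units.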

The byte-level variant (introduced by OpenAI for GPT-2) operates directly on raw UTF-8 bytes rather than characters or words:

Base vocabulary of 256 — one token per possible byte value (0–255), no unknown tokens ever
Language-agnostic — any Unicode text (code, math, emoji, CJK, ...) is representable without a special <UNK> token
Lossless — decoding an encoded sequence reproduces the original text exactly
Merges learned greedily — at each step, the most frequent adjacent pair is merged and assigned a new token ID (256, 257, ...)
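
The properties above follow directly from working on UTF-8 bytes. The short sketch below uses only Python built-ins (the sample string is an arbitrary choice) to show that any text maps to IDs in 0 to 255 and back again without loss.

```python
# Sample text with an accented letter and an emoji (chosen arbitrarily).
text = "h\u00e9llo \U0001F44B"

# Every UTF-8 byte becomes one base token ID in [0, 255].
ids = list(text.encode("utf-8"))
print(len(text), len(ids))  # 7 11  (7 code points, 11 bytes)

# No byte can fall outside the base vocabulary, so no <UNK> is needed.
assert all(0 <= i <= 255 for i in ids)

# Converting the IDs back to bytes and decoding restores the text exactly.
restored = bytes(ids).decode("utf-8")
assert restored == text
```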

This is the same fundamental approach used by tiktoken (OpenAI) and Hugging Face tokenizers for GPT-style models.

Algorithm Phases

Phase	Description
1	UTF-8 encoding — normalize and encode input text to bytes
2	Byte → token conversion — represent each byte as an integer token ID in [0, 255]
3	Pair counting — count all adjacent token pairs in the sequence
4	Merge — replace every occurrence of the most frequent pair with a new token ID and record the merge rule
5	Repeat — repeat phases 3–4 until the target vocabulary size is reached
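
The phases above can be sketched as a small training loop with matching `encode` and `decode` functions. This is a minimal illustration under stated assumptions, not the code in `src/bpe.py`: the function names are hypothetical, ties between equally frequent pairs go to the pair seen first, and no normalization or pre-tokenization is applied.

```python
from collections import Counter

def count_pairs(ids):
    """Phase 3: count all adjacent token pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Phase 4: replace each non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn merge rules until the vocabulary reaches vocab_size."""
    ids = list(text.encode("utf-8"))      # phases 1-2: text -> bytes -> IDs
    merges = {}                           # (left, right) -> new token ID
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:                     # fewer than two tokens left
            break
        pair = max(pairs, key=pairs.get)  # assumption: first-seen pair wins ties
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Expand every token back to its bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

merges = train("aaabdaaabac", vocab_size=259)  # 256 base tokens + 3 merges
tokens = encode("aaabdaaabac", merges)
print(tokens)                  # [258, 100, 258, 97, 99]
print(decode(tokens, merges))  # aaabdaaabac
```

Applying each merge over the whole sequence in turn keeps `encode` easy to follow at the cost of one pass per merge rule; faster implementations avoid rescanning the full sequence each time.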

Project Structure

src/bpe.py        # Core implementation
tests/test_bpe.py # Unit tests
tests/manual/     # Interactive notebooks for experimentation

Running Tests

PYTHONPATH=src python3 -m unittest discover -s tests -v

Acknowledgements

Inspired by Andrej Karpathy's minbpe.

References

minbpe — Minimal BPE implementation by Andrej Karpathy
GPT-2 Paper — Language Models are Unsupervised Multitask Learners (Radford et al., 2019)
Byte-Pair Encoding tokenization — Hugging Face NLP Course
Neural Machine Translation of Rare Words with Subword Units — original BPE-for-NLP paper (Sennrich et al., 2016)

Project details

This version

0.1.1

Mar 28, 2026

Download files

Source Distribution

bpe_from_scratch-0.1.1.tar.gz (676.4 kB)

Uploaded Mar 28, 2026 Source

Built Distribution

bpe_from_scratch-0.1.1-py3-none-any.whl (6.8 kB)

Uploaded Mar 28, 2026 Python 3

File details

Details for the file bpe_from_scratch-0.1.1.tar.gz.

File metadata

Download URL: bpe_from_scratch-0.1.1.tar.gz
Upload date: Mar 28, 2026
Size: 676.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for bpe_from_scratch-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`39c1297a0f0aa30400a4d91d85dce0cbcc35c7504230ca3b6eec3311bdded792`
MD5	`a368d8ee8817abe5e020659f16ab2f60`
BLAKE2b-256	`c539c945954f8b68191a5f9339900b978f6944764ba6cbede317032f28b74a27`

File details

Details for the file bpe_from_scratch-0.1.1-py3-none-any.whl.

File metadata

Download URL: bpe_from_scratch-0.1.1-py3-none-any.whl
Upload date: Mar 28, 2026
Size: 6.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for bpe_from_scratch-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`153eae623c8e2c6561a87f5eb38bbde06480a78cea55ad0abbf076c1d46e0d44`
MD5	`4ad61cca1963c6e265f3c3026cec3477`
BLAKE2b-256	`bb51324e5611213a577e0e3daca399a56cefda9c80e1528c1cbd682cb20151c3`

bpe-from-scratch 0.1.1
