bpe-lite ⚡️

A lightweight Byte Pair Encoding (BPE) tokenizer built from scratch in Python.

bpe-lite is a fast, minimal, and educational implementation of BPE. It is heavily inspired by the curriculum of Stanford CS336 (Language Modeling from Scratch) and mirrors the structure and requirements of that course's foundational tokenizer assignment.

This package is designed to be easily readable for those learning how Large Language Models (LLMs) process text, while still implementing algorithmic optimizations that make it practical for small-to-medium scale dataset tokenization.

✨ Features

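Inverted Index Optimization: Training keeps an inverted index from each byte pair to the words that contain it, so a merge only revisits the affected words instead of recounting every pair in the corpus, as naive brute-force counting does.
$O(1)$ Merge Lookups: The inference class (Tokenizer) pre-computes merge ranks, so finding a pair's priority during encoding is a constant-time lookup rather than an $O(N)$ scan of the merge list, where $N$ is the number of learned merges.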
Special Token Support: Safely isolates and preserves special tokens (like <|endoftext|>) during pre-tokenization.
Compression Artifacts: Automatically calculates and reports the dataset compression ratio upon training completion.
Modern Packaging: Built using the src/ layout and pyproject.toml for clean, reliable pip installation.
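
To make the inverted index idea concrete, here is a minimal sketch of the data structure. It is not taken from the bpe-lite source: the function name build_pair_index, the toy corpus, and the tie-breaking rule (prefer the lexicographically greater pair) are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def build_pair_index(word_freqs):
    """Count adjacent token pairs and record which words contain each pair."""
    pair_counts = Counter()
    pair_to_words = defaultdict(set)  # inverted index: pair -> words containing it
    for word, freq in word_freqs.items():
        for pair in zip(word, word[1:]):
            pair_counts[pair] += freq
            pair_to_words[pair].add(word)
    return pair_counts, pair_to_words

# Toy corpus (hypothetical): each word is a tuple of single-byte tokens.
word_freqs = {
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
    (b"n", b"e", b"w"): 6,
}

pair_counts, pair_to_words = build_pair_index(word_freqs)

# Pick the most frequent pair; ties go to the lexicographically greater pair
# (an assumed rule, not confirmed for bpe-lite).
best_pair = max(pair_counts, key=lambda p: (pair_counts[p], p))

# Only these words need their pairs recounted after the merge.
affected = pair_to_words[best_pair]
print(best_pair, sorted(affected))
```

After a merge, a trainer built this way would update counts only for the words in affected, which is where the saving over a full recount comes from.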

📦 Installation

You can install bpe-lite directly via pip:

pip install bpe-lite

(Note: regex is the only external dependency.)

🚀 Usage

You can use bpe-lite either directly from your terminal using the built-in CLI, or programmatically within your Python scripts.

1. Command Line Interface (CLI)

After installing the package, the bpe-train command is available in the environment where you installed it. Use it to train a new tokenizer on a raw text file.

bpe-train \
  --input ./data/train.txt \
  --vocab-size 10000 \
  --special-tokens "<|endoftext|>" "<|pad|>" \
  --out-dir ./tokenizer_models

This processes train.txt, learns the merges, prints the final compression ratio, and saves vocab.pkl and merges.pkl to the specified output directory.
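
The exact formula bpe-lite uses for the compression ratio is not stated here. A common definition is UTF-8 bytes of input per emitted token, and the sketch below assumes that definition; the function name compression_ratio and the token IDs are hypothetical.

```python
def compression_ratio(text: str, token_ids: list) -> float:
    """UTF-8 bytes of input per emitted token (higher means better compression)."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes / len(token_ids) if token_ids else 0.0

text = "low lower"                   # 9 bytes in UTF-8
token_ids = [256, 32, 256, 257]      # hypothetical IDs for "low", " ", "low", "er"
ratio = compression_ratio(text, token_ids)
print(f"{ratio:.2f} bytes per token")
```

Under this definition, a byte-level tokenizer with no merges scores exactly 1.0, so any value above that reflects what the learned merges contribute.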

2. Python API

Training a Tokenizer

You can invoke the training logic directly from Python if you are working inside a script or Jupyter Notebook.

from bpe_lite import train
import pickle

# Train the tokenizer
vocab, merges = train(
    input_path="./data/train.txt", 
    vocab_size=10000, 
    special_tokens=["<|endoftext|>"]
)

# Save the artifacts manually
with open("vocab.pkl", 'wb') as f:
    pickle.dump(vocab, f)
with open("merges.pkl", 'wb') as f:
    pickle.dump(merges, f)
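
The special_tokens argument keeps strings such as <|endoftext|> from being split or merged with neighbouring text. One way such isolation is often done is to split the input on the escaped special tokens before pre-tokenization. The sketch below shows that approach with the standard-library re module; the helper split_on_special_tokens is hypothetical and may differ from what bpe-lite does internally.

```python
import re

def split_on_special_tokens(text, special_tokens):
    """Split text so that each special token becomes its own piece."""
    if not special_tokens:
        return [text]
    # Longest first, so a token that contains another one is matched whole.
    ordered = sorted(special_tokens, key=len, reverse=True)
    # re.escape is required because "|" is a regex metacharacter.
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # The capturing group makes re.split keep the delimiters in the result.
    return [piece for piece in re.split(pattern, text) if piece]

pieces = split_on_special_tokens(
    "Hello world! <|endoftext|>Next doc", ["<|endoftext|>"]
)
print(pieces)
```

Ordinary pieces would then go through the usual pre-tokenization and merging, while the special-token pieces map straight to their reserved IDs.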

Encoding and Decoding (Inference)

Use the Tokenizer class to load your trained vocabulary and encode/decode text.

from bpe_lite.tokenizer import Tokenizer

# Initialize from saved files
tokenizer = Tokenizer.from_files(
    vocab_filepath="vocab.pkl", 
    merges_filepath="merges.pkl", 
    special_tokens=["<|endoftext|>"]
)

# Encode raw text to integer IDs
text = "Hello world! <|endoftext|>"
ids = tokenizer.encode(text)
print(f"Token IDs: {ids}")

# Decode integer IDs back to strings
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
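
For readers curious about what encode does per pre-token, the following is a minimal sketch of rank-based merging. It assumes merges is an ordered list of byte pairs and turns it into a dictionary so each pair lookup is constant time; encode_word and the sample merges are illustrative, not the package's actual internals.

```python
def encode_word(word: bytes, merge_ranks: dict) -> list:
    """Apply learned merges to one pre-token, lowest rank (earliest merge) first."""
    parts = [bytes([b]) for b in word]
    while len(parts) > 1:
        # Collect every adjacent pair that has a learned merge, with its rank.
        candidates = [
            (merge_ranks[pair], i)
            for i, pair in enumerate(zip(parts, parts[1:]))
            if pair in merge_ranks  # dict lookup, O(1) on average
        ]
        if not candidates:
            break
        _, i = min(candidates)  # earliest-learned merge, leftmost occurrence
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Hypothetical merges in the order they were learned.
merges = [(b"l", b"o"), (b"lo", b"w"), (b"e", b"r")]
merge_ranks = {pair: rank for rank, pair in enumerate(merges)}

tokens = encode_word(b"lower", merge_ranks)
print(tokens)
```

Without the dictionary, each candidate pair would need a scan through the merges list, which is the $O(N)$ cost mentioned in the feature list.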

Lazy Encoding for Large Datasets

If you are processing massive datasets, you can use encode_iterable to lazily yield token IDs instead of building the full list of IDs in memory:

def text_stream():
    yield "First chunk of text."
    yield "Second chunk of text."

# Yields token IDs one by one
for token_id in tokenizer.encode_iterable(text_stream()):
    print(token_id)
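
The lazy behaviour comes from the generator pattern, sketched below as a standalone function. This is an assumption about how such a method is typically written, not a copy of bpe-lite's code; byte_encode is a stand-in encoder that maps text to raw UTF-8 byte values.

```python
from typing import Callable, Iterable, Iterator

def encode_iterable(
    chunks: Iterable[str], encode: Callable[[str], list]
) -> Iterator[int]:
    """Encode one chunk at a time and yield its IDs before reading the next."""
    for chunk in chunks:
        yield from encode(chunk)

def byte_encode(text: str) -> list:
    # Stand-in encoder (hypothetical): one ID per UTF-8 byte.
    return list(text.encode("utf-8"))

def text_stream():
    yield "ab"
    yield "c"

ids = list(encode_iterable(text_stream(), byte_encode))
print(ids)
```

Because an open text file is itself an iterable of lines, the same pattern accepts a file handle directly, and only one line is held in memory at a time. Note that with this approach no token can span a chunk boundary.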

📚 Acknowledgments

This repository was built as an educational exercise inspired by Stanford CS336: Language Modeling from Scratch. It serves as a practical demonstration of how modern LLM tokenizers (like OpenAI's tiktoken) operate under the hood.

🗺 Roadmap

Initial BPE implementation (CS336-inspired)
PyPI packaging and CLI
Add unit tests (pytest) for round-trip encoding and special tokens
Add support for custom regex patterns

📄 License

This project is licensed under the MIT License.

Project details

This version

0.1.0

May 13, 2026

Download files

Source Distribution

bpe_lite-0.1.0.tar.gz (7.2 kB view details)

Uploaded May 13, 2026 Source

Built Distribution

bpe_lite-0.1.0-py3-none-any.whl (9.0 kB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file bpe_lite-0.1.0.tar.gz.

File metadata

Download URL: bpe_lite-0.1.0.tar.gz
Upload date: May 13, 2026
Size: 7.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for bpe_lite-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`931797a75df9b53ef074ee4929f887575df6b06cf7e4c9a2ed9e433f76fd2cfa`
MD5	`e240c1ee4e8eff6fe161183183076050`
BLAKE2b-256	`58451bef5fca5c7a287ee483936d7a5e7f18c0f73dd28c89954db2764e3d9778`

File details

Details for the file bpe_lite-0.1.0-py3-none-any.whl.

File metadata

Download URL: bpe_lite-0.1.0-py3-none-any.whl
Upload date: May 13, 2026
Size: 9.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for bpe_lite-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f57e1003945d154422c04dac9d6a71502b37b98e01854d34c11dc89db8ce0665`
MD5	`080e8e86c2677cdfdf0a20466010a6b6`
BLAKE2b-256	`0beaea2ad2329e2743ce2d12e198499306e15cd60af18d836b52575e9695667a`
