bpe-lite ⚡️
A lightweight Byte Pair Encoding (BPE) tokenizer built from scratch in Python.
bpe-lite is a fast, minimal, and educational implementation of BPE. It is heavily inspired by the curriculum of Stanford CS336 (Language Modeling from Scratch), specifically mirroring the structure and requirements of their foundational tokenizer assignment.
This package is designed to be easily readable for those learning how Large Language Models (LLMs) process text, while still implementing algorithmic optimizations that make it practical for small-to-medium scale dataset tokenization.
✨ Features
- Inverted Index Optimization: Training uses an inverted index to track byte pairs, drastically speeding up the merge process compared to naive brute-force counting.
- $O(1)$ Merge Lookups: The inference class (Tokenizer) pre-computes merge ranks, avoiding $O(N)$ list lookups during encoding.
- Special Token Support: Safely isolates and preserves special tokens (like <|endoftext|>) during pre-tokenization.
- Compression Artifacts: Automatically calculates and reports the dataset compression ratio upon training completion.
- Modern Packaging: Built using the src/ layout and pyproject.toml for clean, reliable pip installation.
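The inverted-index idea from the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not bpe-lite's actual code: the names build_index and apply_merge, the (symbols, frequency) word representation, and the tie-breaking rule in the demo are all hypothetical.

```python
from collections import Counter, defaultdict


def build_index(words):
    """Count adjacent pairs and record which words contain each pair.

    `words` is a list of (symbols, frequency), where symbols is a tuple of bytes.
    """
    pair_counts = Counter()
    pair_to_words = defaultdict(set)  # the inverted index: pair -> word indices
    for idx, (symbols, freq) in enumerate(words):
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
            pair_to_words[pair].add(idx)
    return pair_counts, pair_to_words


def apply_merge(words, pair_counts, pair_to_words, pair):
    """Merge `pair` only in the words that contain it, updating counts in place."""
    merged = pair[0] + pair[1]
    for idx in list(pair_to_words[pair]):
        symbols, freq = words[idx]
        # Remove this word's old contribution to the counts and the index.
        for p in zip(symbols, symbols[1:]):
            pair_counts[p] -= freq
            pair_to_words[p].discard(idx)
        # Rebuild the symbol sequence with every occurrence of the pair merged.
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        new_symbols = tuple(new_symbols)
        words[idx] = (new_symbols, freq)
        # Add the word's new contribution.
        for p in zip(new_symbols, new_symbols[1:]):
            pair_counts[p] += freq
            pair_to_words[p].add(idx)


# Toy corpus: "low" x5, "lower" x2, "new" x6, already split into single bytes.
words = [
    ((b"l", b"o", b"w"), 5),
    ((b"l", b"o", b"w", b"e", b"r"), 2),
    ((b"n", b"e", b"w"), 6),
]
pair_counts, pair_to_words = build_index(words)

# Assumed tie-break: highest count first, then the lexicographically greater pair.
best = max(pair_counts, key=lambda p: (pair_counts[p], p))
apply_merge(words, pair_counts, pair_to_words, best)
```

Only the words listed under the chosen pair are rewritten, so a word such as "new" is never rescanned when an unrelated pair is merged. A naive implementation would recount every pair in every word after each merge.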
📦 Installation
You can install bpe-lite directly via pip:
pip install bpe-lite
(Note: regex is the only external dependency.)
🚀 Usage
You can use bpe-lite either directly from your terminal using the built-in CLI, or programmatically within your Python scripts.
1. Command Line Interface (CLI)
After installing the package, the bpe-train command is available in the environment where you installed it. Use it to train a new tokenizer on a raw text file.
bpe-train \
--input ./data/train.txt \
--vocab-size 10000 \
--special-tokens "<|endoftext|>" "<|pad|>" \
--out-dir ./tokenizer_models
This processes train.txt, learns the BPE merges, prints the final compression ratio, and saves vocab.pkl and merges.pkl to the specified output directory.
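For readers unsure what the reported number means, here is one common way to define a compression ratio. This is a sketch that assumes the ratio is UTF-8 bytes of input per emitted token; the exact definition bpe-lite uses is not stated here, and compression_ratio is a hypothetical helper name.

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """UTF-8 bytes of input per emitted token (assumed definition)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


# 11 bytes of text encoded into 4 placeholder token IDs.
ratio = compression_ratio("hello world", [1, 2, 3, 4])
print(ratio)  # 2.75
```

Under this definition, a higher value means each token covers more bytes of the original text on average.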
2. Python API
Training a Tokenizer
You can invoke the training logic directly from Python if you are working inside a script or Jupyter Notebook.
from bpe_lite import train
import pickle

# Train the tokenizer
vocab, merges = train(
    input_path="./data/train.txt",
    vocab_size=10000,
    special_tokens=["<|endoftext|>"],
)

# Save the artifacts manually
with open("vocab.pkl", "wb") as f:
    pickle.dump(vocab, f)
with open("merges.pkl", "wb") as f:
    pickle.dump(merges, f)
Encoding and Decoding (Inference)
Use the Tokenizer class to load your trained vocabulary and encode/decode text.
from bpe_lite.tokenizer import Tokenizer
# Initialize from saved files
tokenizer = Tokenizer.from_files(
    vocab_filepath="vocab.pkl",
    merges_filepath="merges.pkl",
    special_tokens=["<|endoftext|>"],
)
# Encode raw text to integer IDs
text = "Hello world! <|endoftext|>"
ids = tokenizer.encode(text)
print(f"Token IDs: {ids}")
# Decode integer IDs back to strings
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
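To make the encoding step less of a black box, the sketch below shows the two ideas listed under Features: isolating special tokens before anything else, and applying merges by pre-computed rank. It is a minimal illustration rather than the Tokenizer class's real internals; split_on_special, encode_word, and the toy merge list are hypothetical, and the regex pre-tokenization step is omitted.

```python
import re


def split_on_special(text: str, special_tokens: list[str]) -> list[str]:
    """Split text so each special token becomes its own, untouched piece."""
    if not special_tokens:
        return [text]
    # Longest first, so overlapping special tokens match greedily.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    return [part for part in re.split(pattern, text) if part]


def encode_word(word: bytes, ranks: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    """Repeatedly merge the adjacent pair with the lowest rank (earliest learned)."""
    symbols = [bytes([b]) for b in word]
    while len(symbols) > 1:
        pairs = zip(symbols, symbols[1:])
        # Each rank lookup is a dict access, not a scan over the merge list.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(best[0] + best[1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge list in the order it was "learned"; rank = position in that list.
toy_merges = [(b"l", b"o"), (b"lo", b"w"), (b"e", b"r")]
ranks = {pair: rank for rank, pair in enumerate(toy_merges)}

parts = split_on_special("Hello world! <|endoftext|>", ["<|endoftext|>"])
pieces = encode_word(b"lower", ranks)
print(parts)   # ['Hello world! ', '<|endoftext|>']
print(pieces)  # [b'low', b'er']
```

Building the ranks dictionary once up front is what turns each "which merge comes first?" question into a constant-time lookup instead of a search through the merge list.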
Lazy Encoding for Large Datasets
If you are processing massive datasets, you can use encode_iterable to lazily yield token IDs without blowing up your RAM:
def text_stream():
    yield "First chunk of text."
    yield "Second chunk of text."

# Yields token IDs one by one
for token_id in tokenizer.encode_iterable(text_stream()):
    print(token_id)
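The memory saving comes from Python generators: only one chunk is held and encoded at a time. A rough sketch of the pattern follows, assuming a per-chunk encode function is available. Here lazy_encode and byte_encode are hypothetical stand-ins and not bpe-lite's implementation of encode_iterable.

```python
from typing import Callable, Iterable, Iterator


def lazy_encode(chunks: Iterable[str], encode: Callable[[str], list[int]]) -> Iterator[int]:
    """Encode one chunk at a time and yield its token IDs as they are produced."""
    for chunk in chunks:
        yield from encode(chunk)


def byte_encode(text: str) -> list[int]:
    """Stand-in encoder for the demo: one token ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


def demo_stream():
    yield "ab"
    yield "c"


ids = list(lazy_encode(demo_stream(), byte_encode))
print(ids)  # [97, 98, 99]
```

One consequence of this simple pattern is that chunks are encoded independently, so a token can never span a chunk boundary. Splitting the input at natural boundaries such as newlines keeps that from affecting the result much.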
📚 Acknowledgments
This repository was built as an educational exercise inspired by Stanford CS336: Language Modeling from Scratch. It serves as a practical demonstration of how modern LLM tokenizers (like OpenAI's tiktoken) operate under the hood.
🗺 Roadmap
- Initial BPE implementation (CS336-inspired)
- PyPI packaging and CLI
- Add unit tests (pytest) for round-trip encoding and special tokens
- Add support for custom regex patterns
📄 License
This project is licensed under the MIT License.