bpe-lite ⚡️

A lightweight Byte Pair Encoding (BPE) tokenizer built from scratch in Python.

bpe-lite is a fast, minimal, and educational implementation of BPE. It is heavily inspired by the curriculum of Stanford CS336 (Language Modeling from Scratch), specifically mirroring the structure and requirements of their foundational tokenizer assignment.

This package is designed to be easily readable for those learning how Large Language Models (LLMs) process text, while still implementing algorithmic optimizations that make it practical for small-to-medium scale dataset tokenization.

✨ Features

  • Inverted Index Optimization: Training keeps an inverted index from each byte pair to the places where it occurs, so a merge only updates the affected entries instead of recounting every pair from scratch as naive brute-force counting does.
  • $O(1)$ Merge Lookups: The inference class (Tokenizer) pre-computes merge ranks, avoiding $O(N)$ list lookups during encoding.
  • Special Token Support: Safely isolates and preserves special tokens (like <|endoftext|>) during pre-tokenization.
  • Compression Ratio Reporting: Automatically calculates and reports the dataset compression ratio upon training completion.
  • Modern Packaging: Built using the src/ layout and pyproject.toml for clean, reliable pip installation.
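To make the inverted-index idea concrete, here is a minimal sketch of a toy trainer. It is not the package's actual implementation: the function name `train_bpe_sketch`, the word-frequency input, and the tie-breaking rule (prefer the lexicographically larger pair) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe_sketch(word_freqs, num_merges):
    """Hypothetical toy trainer: word_freqs maps a word (str) to its count."""
    words = [[bytes([b]) for b in w.encode("utf-8")] for w in word_freqs]
    freqs = list(word_freqs.values())

    pair_counts = Counter()
    pair_to_words = defaultdict(set)  # inverted index: pair -> indices of words containing it
    for idx, symbols in enumerate(words):
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freqs[idx]
            pair_to_words[pair].add(idx)

    merges = []
    for _ in range(num_merges):
        if not pair_counts:
            break
        # Assumed tie-break: highest count first, then the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Only words that contain the chosen pair are revisited.
        for idx in list(pair_to_words[best]):
            old = words[idx]
            for pair in zip(old, old[1:]):  # retract this word's old contributions
                pair_counts[pair] -= freqs[idx]
                if pair_counts[pair] <= 0:
                    del pair_counts[pair]
                pair_to_words[pair].discard(idx)
            new = merge_pair(old, best)
            words[idx] = new
            for pair in zip(new, new[1:]):  # register the word's new pairs
                pair_counts[pair] += freqs[idx]
                pair_to_words[pair].add(idx)
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(train_bpe_sketch(corpus, 3))
```

The point of the index is that each merge step touches only the words listed under the chosen pair, while a brute-force trainer would rescan the whole corpus on every iteration.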

📦 Installation

You can install bpe-lite directly via pip:

pip install bpe-lite

(Note: the regex package is the only external dependency.)

🚀 Usage

You can use bpe-lite either directly from your terminal using the built-in CLI, or programmatically within your Python scripts.

1. Command Line Interface (CLI)

After installing the package, the bpe-train command is available in the environment where you installed it. You can use it to train a new tokenizer on a raw text file.

bpe-train \
  --input ./data/train.txt \
  --vocab-size 10000 \
  --special-tokens "<|endoftext|>" "<|pad|>" \
  --out-dir ./tokenizer_models

This will process train.txt, learn the merges, print the final compression ratio, and save vocab.pkl and merges.pkl to the specified output directory.
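For readers unfamiliar with the metric, a compression ratio for a tokenizer is commonly defined as UTF-8 bytes per token. The sketch below assumes that definition; the exact formula bpe-lite reports is not stated here, and `compression_ratio` is a hypothetical helper, not part of the package.

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """Bytes of UTF-8 input per emitted token (assumed definition)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)

# Placeholder IDs standing in for real tokenizer output.
text = "héllo world"          # 11 characters, 12 bytes in UTF-8
ids = [0, 1, 2, 3]
print(compression_ratio(text, ids))
```

Under this definition, a higher value means each token covers more raw bytes on average.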

2. Python API

Training a Tokenizer

You can invoke the training logic directly from Python if you are working inside a script or Jupyter Notebook.

from bpe_lite import train
import pickle

# Train the tokenizer
vocab, merges = train(
    input_path="./data/train.txt", 
    vocab_size=10000, 
    special_tokens=["<|endoftext|>"]
)

# Save the artifacts manually
with open("vocab.pkl", 'wb') as f:
    pickle.dump(vocab, f)
with open("merges.pkl", 'wb') as f:
    pickle.dump(merges, f)
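The special_tokens argument relates to how special tokens are isolated during pre-tokenization. One common way to do this, shown here as a sketch that may differ from the package's internals, is to split the text on an escaped alternation of the special tokens so they never take part in merges. The helper name `split_on_special_tokens` is hypothetical, and the sketch uses the standard-library `re` module for simplicity.

```python
import re

def split_on_special_tokens(text, special_tokens):
    """Return (piece, is_special) pairs, keeping special tokens intact."""
    if not special_tokens:
        return [(text, False)] if text else []
    # Longest first, so a token that contains another one is matched whole.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    specials = set(special_tokens)
    # A capturing group makes re.split keep the separators in the result.
    return [(piece, piece in specials) for piece in re.split(pattern, text) if piece]

print(split_on_special_tokens("Hello world! <|endoftext|>", ["<|endoftext|>"]))
```

Ordinary pieces would then go through the normal pre-tokenization and merge steps, while pieces flagged as special map straight to their reserved IDs.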

Encoding and Decoding (Inference)

Use the Tokenizer class to load your trained vocabulary and encode/decode text.

from bpe_lite.tokenizer import Tokenizer

# Initialize from saved files
tokenizer = Tokenizer.from_files(
    vocab_filepath="vocab.pkl", 
    merges_filepath="merges.pkl", 
    special_tokens=["<|endoftext|>"]
)

# Encode raw text to integer IDs
text = "Hello world! <|endoftext|>"
ids = tokenizer.encode(text)
print(f"Token IDs: {ids}")

# Decode integer IDs back to strings
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
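The merge-rank idea behind encoding can be sketched as follows. This is an illustration under assumptions, not the Tokenizer class itself: it assumes merges are stored as an ordered list of (bytes, bytes) pairs, and the names `build_ranks` and `apply_merges` are hypothetical. Turning the list into a dict gives constant-time rank lookups instead of scanning the list for every candidate pair.

```python
def build_ranks(merges):
    """Map each merge pair to its position; earlier merges have priority."""
    return {pair: rank for rank, pair in enumerate(merges)}

def apply_merges(word: bytes, ranks):
    """Greedily apply the lowest-ranked available merge until none applies."""
    symbols = [bytes([b]) for b in word]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies to any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

ranks = build_ranks([(b"s", b"t"), (b"e", b"st"), (b"o", b"w")])
print(apply_merges(b"lowest", ranks))
```

A final step, omitted here, would map each resulting byte string to its integer ID through the vocabulary.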

Lazy Encoding for Large Datasets

If you are processing massive datasets, you can use encode_iterable to lazily yield token IDs without blowing up your RAM:

def text_stream():
    yield "First chunk of text."
    yield "Second chunk of text."

# Yields token IDs one by one
for token_id in tokenizer.encode_iterable(text_stream()):
    print(token_id)
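To show why this pattern keeps memory use flat, here is a minimal sketch of a lazy encoder written as a generator. It is not the package's encode_iterable: `encode_iterable_sketch` and the byte-level `toy_encode` stand-in are hypothetical, and the sketch assumes each chunk is encoded independently, so no token spans a chunk boundary.

```python
import io
from typing import Callable, Iterable, Iterator

def encode_iterable_sketch(
    chunks: Iterable[str], encode: Callable[[str], list[int]]
) -> Iterator[int]:
    """Yield token IDs chunk by chunk; only one chunk is in memory at a time."""
    for chunk in chunks:
        yield from encode(chunk)

def toy_encode(text: str) -> list[int]:
    """Stand-in encoder: one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))

# A text file object iterates line by line, so it can serve as the chunk source.
stream = io.StringIO("ab\ncd\n")
for token_id in encode_iterable_sketch(stream, toy_encode):
    print(token_id)
```

If encode_iterable accepts any iterable of strings, which is an assumption here, an open file handle could be passed to it directly in the same way.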

📚 Acknowledgments

This repository was built as an educational exercise inspired by Stanford CS336: Language Modeling from Scratch. It serves as a practical demonstration of how modern LLM tokenizers (like OpenAI's tiktoken) operate under the hood.

🗺 Roadmap

  • Initial BPE implementation (CS336-inspired)
  • PyPI packaging and CLI
  • Add unit tests (pytest) for round-trip encoding and special tokens
  • Add support for custom regex patterns

📄 License

This project is licensed under the MIT License.
