ByteTok

A fast, modular, and lightweight BPE tokenizer for NLP research and prototyping.

ByteTok implements byte-level Byte Pair Encoding (BPE) with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes (0-255), then iteratively merged using learned pair statistics. The training algorithm is based on Algorithm 2 from "A Formal Perspective on Byte-Pair Encoding", achieving O(N log V) training and O(N log N) encoding versus the naive O(NV) approach.
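
For intuition, here is the merge loop in its naive O(NV) form, the one the Rust core replaces (a toy sketch; toy_bpe_train is illustrative and not part of ByteTok's API):

from collections import Counter

def toy_bpe_train(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Naive byte-level BPE training: merge the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))        # start from raw bytes 0-255
    merges: dict[tuple[int, int], int] = {}
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))  # O(N) recount per merge -> O(NV) total
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges[(a, b)] = new_id
        out, i = [], 0
        while i < len(ids):                 # rewrite the sequence with the new token
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges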

History

This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:

  • Train on custom text
  • Save learned encodings
  • Encode text
  • Decode text

Libraries like OpenAI's tiktoken and Google's sentencepiece exist and are probably better for production work. But ByteTok wasn't designed to compete with them or benchmaxx. I wanted a straightforward API that took a string and returned a list of integers, not something that forced me to read through documentation for 200 function arguments (looking at you, sentencepiece).

As my dataset requirements grew, the naive BPE implementation started struggling. So I rewrote the trainer and encoder in Rust using a much more efficient algorithm 😎.

Features

  • Fast Rust-backed training and encoding via PyO3/maturin, built for datasets larger than 100MB. ByteTok delivers 600x-1000x speedups over a naive O(NV) implementation.
  • Built-in regex patterns from GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, DeepSeek, StarCoder, Falcon, and BLOOM.
  • Custom patterns supported alongside the built-in presets.
  • Special token strategies for controlling how special tokens are handled during encoding.
  • Serialization with versioned .model / .vocab file format and from_pretrained() loader.

Benchmarks

Benchmarks were run on Linux x86_64 with an Intel Core i7-12700H (20 cores @ 4.70 GHz) and 32GB DDR5 RAM.

| Dataset | Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|---|---|---|---|---|---|---|---|
| Sci-Fi Books (Gutenberg) | 88.85 MB (93M chars) | 25,000 | 198s (~3.3 mins) | 2.99M chars/sec (2.85 MB/sec) | 19.2M tokens/sec | 1.43x | 30.3% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 10,000 | 523s (~8.7 mins) | 2.86M chars/sec (2.73 MB/sec) | 17.1M tokens/sec | 1.60x | 37.7% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 25,000 | 579s (~9.65 mins) | 2.85M chars/sec (2.72 MB/sec) | 17.0M tokens/sec | 1.68x | 40.6% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 50,000 | 640s (~10.7 mins) | 2.76M chars/sec (2.63 MB/sec) | 16.4M tokens/sec | 1.75x | 42.7% |
| Sci-Fi Books (Gutenberg) | 326.96 MB (343M chars) | 50,000 | 1048s (~17.5 mins) | 2.82M chars/sec (2.69 MB/sec) | 7.02M tokens/sec | 1.44x | 30.7% |
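
Incidentally, the last two columns are mutually consistent if the compression ratio is read as input characters per output token, giving size reduction = 1 - 1/ratio (this formula is an inference from the numbers, not a documented definition):

for ratio in (1.43, 1.60, 1.68, 1.75, 1.44):
    print(f"{ratio:.2f}x -> {1 - 1 / ratio:.1%} size reduction")
# ~30.1%, 37.5%, 40.5%, 42.9%, 30.6%: close to the table (ratios are rounded)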

Requirements

  • Python >= 3.13

Installation

Install from PyPI:

# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok

Building from Source

If you want to develop or build from source, you'll need a Rust toolchain (rustup):

# clone the repository
git clone https://github.com/vihanga-malaviarachchi/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
pip install maturin
maturin develop

Quick Start

import bytetok

# create a tokenizer with a built-in pattern (default: gpt4o)
tokenizer = bytetok.get_tokenizer("gpt4o")

# train on text
tokenizer.train("your training corpus here...", vocab_size=1000)

# encode and decode
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# save and reload
tokenizer.save("my_tokenizer")
reloaded = bytetok.from_pretrained("my_tokenizer.model")

API Reference

Factory Functions

bytetok.get_tokenizer(pattern="gpt4o", *, custom_pattern=None)

Create a RegexTokenizer with a built-in or custom regex pattern.

  • pattern (str) -- Name of a built-in pattern. Ignored when custom_pattern is set. Default: "gpt4o".
  • custom_pattern (str | None) -- A custom regex pattern string. Overrides pattern when provided.
  • Returns: Tokenizer
  • Raises: PatternError if the custom pattern is invalid regex.
# built-in pattern
tokenizer = bytetok.get_tokenizer("llama3")

# custom pattern
tokenizer = bytetok.get_tokenizer(custom_pattern=r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+")

bytetok.from_pretrained(model_path)

Load a previously saved tokenizer from a .model file. The tokenizer type is auto-detected from the file header.

  • model_path (str) -- Path to the .model file.
  • Returns: Tokenizer (either BasicTokenizer or RegexTokenizer depending on what was saved).
  • Raises: ModelLoadError if the file does not exist, has the wrong extension, contains an unknown tokenizer type, or has a version mismatch.
tokenizer = bytetok.from_pretrained("my_tokenizer.model")

bytetok.get_strategy(name="none-raise", allowed_subset=None)

Create a special token handling strategy for use with encode().

  • name ("all" | "none" | "none-raise" | "custom") -- Strategy name.
  • allowed_subset (set[str] | None) -- Required when name="custom". The set of special token strings to allow.
  • Returns: SpecialTokenStrategy
  • Raises: StrategyError if the name is unknown or "custom" is used without allowed_subset.
strategy = bytetok.get_strategy("all")
strategy = bytetok.get_strategy("custom", allowed_subset={"<|endoftext|>"})

bytetok.list_patterns()

Return the names of all available built-in regex patterns.

  • Returns: list[str]
bytetok.list_patterns()
# ['GPT2', 'GPT4', 'GPT4O', 'LLAMA3', 'QWEN2', 'DEEPSEEK_CODER', 'DEEPSEEK_LLM',
#  'STARCODER', 'FALCON', 'BLOOM']

bytetok.get_pattern(name)

Get the regex pattern string for a specific built-in pattern by name.

  • name (str) -- Name of the built-in pattern (case-insensitive).
  • Returns: str -- The regex pattern string.
  • Raises: PatternError if the pattern name is unknown.
# get a specific pattern string
pattern_str = bytetok.get_pattern("llama3")

# use it to create a tokenizer
tokenizer = bytetok.RegexTokenizer(pattern=pattern_str)

bytetok.list_strategies()

Return the names of all available special token strategies.

  • Returns: list[str]
bytetok.list_strategies()
# ['all', 'none', 'none-raise', 'custom']

Tokenizer Classes

All tokenizers inherit from the abstract base class Tokenizer. The two concrete implementations are BasicTokenizer and RegexTokenizer.

The BasicTokenizer serves as documentation of the simplest possible BPE tokenizer implementation. It is not recommended for actual use because decode() reconstructs text by decoding a raw byte stream as UTF-8 with replacement (errors="replace"), which can lose information for invalid UTF-8 byte sequences (e.g. broken multi-byte code points become the U+FFFD replacement character, �).
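
The information loss comes from errors="replace" decoding in Python itself; any invalid byte sequence collapses to U+FFFD and the original bytes cannot be recovered:

# The first two bytes of '€' form a truncated 3-byte UTF-8 sequence.
broken = "€".encode("utf-8")[:2]
text = broken.decode("utf-8", errors="replace")
print(text)                            # replacement character(s), not '€'
print(text.encode("utf-8") == broken)  # False: the original bytes are gone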

All of ByteTok's factory functions default to RegexTokenizer. For custom extensions or implementations, always inherit from RegexTokenizer.

Tokenizer (abstract base class)

Manages vocabulary, byte pair merges, and serialization. You do not instantiate this directly; use RegexTokenizer or the factory functions instead.

Attributes

| Attribute | Type | Description |
|---|---|---|
| merges | dict[tuple[int,int], int] | Byte pair -> merged token ID mapping. |
| vocab | dict[int, bytes] | Token ID -> byte sequence mapping. |
| pat | str | Regex pattern used for text splitting (if any). |
| special_toks | dict[str, int] | Special token string -> token ID mapping. |
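
For example, after training you can inspect these attributes directly (corpus.txt is a placeholder path; the printed values are illustrative):

tokenizer = bytetok.get_tokenizer("gpt4o")
tokenizer.train(open("corpus.txt").read(), vocab_size=512)

print(len(tokenizer.vocab))                  # 512 = 256 base bytes + 256 merges
print(next(iter(tokenizer.merges.items())))  # e.g. ((101, 32), 256)
print(tokenizer.pat)                         # the regex split pattern in use
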
train(text, vocab_size, verbose=False)

Train the tokenizer by learning byte pair merges from the input.

  • text (str | list[str]) -- Training corpus. Lists are concatenated.
  • vocab_size (int) -- Target vocabulary size. Must be > 256.
  • verbose (bool) -- Log each merge operation. Default: False.
  • Raises: VocabularyError if vocab_size <= 256. TrainingError if the input is empty.
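
For example, asking for a vocabulary no larger than the 256 base byte tokens raises:

from bytetok.errors import VocabularyError

tokenizer = bytetok.get_tokenizer()
try:
    tokenizer.train("hello world", vocab_size=256)  # must be > 256
except VocabularyError as e:
    print(e)
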
encode(text, strategy=None)

Encode text into a list of integer token IDs.

  • text (str) -- Text to encode.
  • strategy (SpecialTokenStrategy | None) -- How to handle special tokens. None means no special token handling.
  • Returns: list[int]
decode(tokens)

Decode a list of token IDs back into text.

  • tokens (list[int]) -- Token IDs to decode.
  • Returns: str
  • Raises: VocabularyError if a token ID is not in the vocabulary (RegexTokenizer).
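
Similarly, decoding an ID that was never assigned raises on a RegexTokenizer:

from bytetok.errors import VocabularyError

try:
    tokenizer.decode([10**9])  # not a valid token ID
except VocabularyError as e:
    print(e)
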
save(file_prefix)

Save the trained tokenizer to disk. Creates two files:

  • <file_prefix>.model -- Binary merge mappings (used by load() / from_pretrained()).
  • <file_prefix>.vocab -- Human-readable token representations.

Parameters:

  • file_prefix (str) -- Path prefix for the output files.
tokenizer.save("models/my_tok")
# creates models/my_tok.model and models/my_tok.vocab
load(model_filename)

Load tokenizer state from a .model file. Restores merges, special tokens, and rebuilds the vocabulary.

  • model_filename (str) -- Path to the .model file.
  • Raises: ModelLoadError on missing file, wrong extension, version mismatch, or type mismatch.
tokenizer = bytetok.RegexTokenizer()
tokenizer.load("models/my_tok.model")

BasicTokenizer()

Tokenizer that operates directly on raw byte sequences without any regex splitting. Does not support special token strategies.

tok = bytetok.BasicTokenizer()
tok.train("Hello world", vocab_size=300)
tokens = tok.encode("Hello")
text = tok.decode(tokens)

All methods are inherited from Tokenizer. The strategy parameter on encode() is accepted but ignored.

It is recommended not to use this class. Use RegexTokenizer instead.


RegexTokenizer(pattern=None)

Tokenizer that splits text with a regex pattern before applying BPE. Supports special token registration and strategies.

  • pattern (str | None) -- Regex pattern for text splitting. Defaults to the gpt4o pattern when None.
tok = bytetok.RegexTokenizer()                     # default gpt4o pattern
tok = bytetok.RegexTokenizer(pattern=r"\w+|\S")    # custom pattern

In addition to the methods inherited from Tokenizer, RegexTokenizer provides:

register_special_tokens(special_toks)

Register special tokens with auto-assigned IDs. Must be called after training. Token IDs are assigned sequentially starting from the current vocabulary size.

  • special_toks (list[str]) -- Special token strings to register.
  • Raises: SpecialTokenError if the tokenizer has not been trained yet.
tok.train(text, vocab_size=1000)
tok.register_special_tokens(["<|endoftext|>", "<|pad|>", "<|start|>"])

# encode with special token awareness
strategy = bytetok.get_strategy("all")
tokens = tok.encode("Hello<|endoftext|>", strategy=strategy)
text = tok.decode(tokens)

TokenPattern

TokenPattern is a str enum containing pre-defined regex patterns sourced from popular tokenizer implementations.

TokenPattern.get(name)

Look up a pattern by name (case-insensitive).

  • name (str) -- Pattern name.
  • Returns: str -- The regex pattern string.
  • Raises: PatternError if the name is unknown.
pattern = bytetok.TokenPattern.get("gpt4o")

Available Patterns

  • gpt2
  • gpt4
  • gpt4o
  • llama3
  • qwen2
  • deepseek-coder
  • deepseek-llm
  • starcoder
  • falcon
  • bloom

Special Token Strategies

Strategies control how special tokens are recognised during encode(). Pass a strategy instance as the strategy parameter.

SpecialTokenStrategy (abstract base class)

Base class. Subclass this to implement custom strategies.

handle(text, special_toks)

Decide which registered special tokens should be recognised in the given text.

  • text (str) -- The text being encoded.
  • special_toks (dict[str, int]) -- All registered special tokens.
  • Returns: dict[str, int] -- The subset of special tokens to apply.
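
Custom behaviour beyond the built-ins can be sketched as follows (a hypothetical subclass; it assumes the base class is exposed as bytetok.SpecialTokenStrategy and that handle() follows the contract above):

class AllowPrefixStrategy(bytetok.SpecialTokenStrategy):
    """Hypothetical strategy: allow only special tokens starting with a prefix."""

    def __init__(self, prefix: str):
        self.prefix = prefix

    def handle(self, text, special_toks):
        # Return only the registered special tokens that match the prefix.
        return {tok: tid for tok, tid in special_toks.items()
                if tok.startswith(self.prefix)}

strategy = AllowPrefixStrategy("<|")
tokens = tokenizer.encode("Hello<|endoftext|>", strategy=strategy)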

AllowAllStrategy

Allows all registered special tokens to be recognised during encoding.

AllowNoneStrategy

Silently ignores all special tokens. They are treated as regular text.

AllowNoneRaiseStrategy

Raises SpecialTokenError if any registered special token is found in the input text.

AllowCustomStrategy(allowed_subset)

Allows only a specified subset of special tokens.

  • allowed_subset (set[str]) -- The special token strings to allow.
# via factory (recommended)
strategy = bytetok.get_strategy("custom", allowed_subset={"<|endoftext|>"})

# or instantiate directly
strategy = bytetok.AllowCustomStrategy({"<|endoftext|>"})

Exceptions

All exceptions inherit from ByteTokError (importable from bytetok.errors).

| Exception | Raised when |
|---|---|
| ByteTokError | Base exception for all bytetok errors. |
| VocabularyError | vocab_size <= 256 during training, or unknown token ID during decode. |
| TrainingError | Training input is empty or too short. |
| ModelLoadError | Loading a .model file fails (missing, wrong format, version mismatch). |
| PatternError | A regex pattern fails to compile. |
| SpecialTokenError | Special token handling fails (e.g. AllowNoneRaiseStrategy finds one). |
| StrategyError | Unknown strategy name or missing allowed_subset for custom strategy. |
| TokenizationError | General tokenization failure. |
from bytetok.errors import ModelLoadError

try:
    tok = bytetok.from_pretrained("missing.model")
except ModelLoadError as e:
    print(e)

Model File Format

save() produces two files:

.model -- Machine-readable format used by load() and from_pretrained():

ByteTok 0.1.0
type regex
re <pattern>
---
<n_special_tokens>
<special_token_string> <token_id>
...
---
<tok_a> <tok_b> <merged_tok>
...
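
As an illustration of this layout (load() and from_pretrained() remain the supported readers; the sketch below only parses the header of a regex-type file):

def read_model_header(path: str) -> dict[str, str]:
    """Parse the header of a ByteTok .model file per the layout above."""
    with open(path, encoding="utf-8") as f:
        magic, version = f.readline().split()   # "ByteTok 0.1.0"
        _, tok_type = f.readline().split()      # "type regex"
        pattern = f.readline().removeprefix("re ").rstrip("\n")
        assert f.readline().strip() == "---"    # end of header block
    return {"magic": magic, "version": version, "type": tok_type, "pattern": pattern}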

.vocab -- Human-readable vocabulary for inspection:

ST [256] <|endoftext|>
[0] \u0000
...
[258] [he][llo] -> hello

Acknowledgment

ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.

License

This project is licensed under the MIT License. See the LICENSE file for details.
