ByteTok

A fast, modular, and lightweight BPE tokenizer for NLP research and prototyping.

ByteTok implements byte-level Byte Pair Encoding (BPE) with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes (0-255), then iteratively merged using learned pair statistics. The training algorithm is based on Algorithm 2 from "A Formal Perspective on Byte-Pair Encoding", achieving O(N log V) training and O(N log N) encoding versus the naive O(NV) approach.
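
For intuition, here is the merge loop in its naive O(NV) form, the one the Rust core replaces (a toy sketch; toy_bpe_train is illustrative and not part of ByteTok's API):

from collections import Counter

def toy_bpe_train(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Naive byte-level BPE training: merge the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))        # start from raw bytes 0-255
    merges: dict[tuple[int, int], int] = {}
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))  # O(N) recount per merge -> O(NV) total
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges[(a, b)] = new_id
        out, i = [], 0
        while i < len(ids):                 # rewrite the sequence with the new token
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges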

History

This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:

  • Train on custom text
  • Save learned encodings
  • Encode text
  • Decode text

Libraries like OpenAI's tiktoken and Google's sentencepiece exist and are probably better for production work. But ByteTok wasn't designed to compete with them or benchmaxx. I wanted a straightforward API that took a string and returned a list of integers, not something that forced me to read through documentation for 200 function arguments (looking at you, sentencepiece).

As my dataset requirements grew, the naive BPE implementation started struggling. So I rewrote the trainer and encoder in Rust using a much more efficient algorithm 😎.

Features

  • Fast Rust-backed training and encoding via PyO3/maturin, built for datasets larger than 100MB. ByteTok delivers 600x-1000x speedups over a naive O(NV) implementation.
  • Built-in regex patterns from GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, DeepSeek, StarCoder, Falcon, and BLOOM.
  • Custom patterns supported alongside the built-in presets.
  • Special token strategies for controlling how special tokens are handled during encoding.
  • Serialization with versioned .model / .vocab file format and from_pretrained() loader.

Benchmarks

Benchmarks were run on Linux x86_64 with an Intel Core i7-12700H (20 cores @ 4.70 GHz) and 32GB DDR5 RAM.

| Dataset | Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|---|---|---|---|---|---|---|---|
| Sci-Fi Books (Gutenberg) | 88.85 MB (93M chars) | 25,000 | 198s (~3.3 mins) | 2.99M chars/sec (2.85 MB/sec) | 19.2M tokens/sec | 1.43x | 30.3% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 10,000 | 523s (~8.7 mins) | 2.86M chars/sec (2.73 MB/sec) | 17.1M tokens/sec | 1.60x | 37.7% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 25,000 | 579s (~9.65 mins) | 2.85M chars/sec (2.72 MB/sec) | 17.0M tokens/sec | 1.68x | 40.6% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 50,000 | 640s (~10.7 mins) | 2.76M chars/sec (2.63 MB/sec) | 16.4M tokens/sec | 1.75x | 42.7% |
| Sci-Fi Books (Gutenberg) | 326.96 MB (343M chars) | 50,000 | 1048s (~17.5 mins) | 2.82M chars/sec (2.69 MB/sec) | 7.02M tokens/sec | 1.44x | 30.7% |
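
Incidentally, the last two columns are mutually consistent if the compression ratio is read as input characters per output token, giving size reduction = 1 - 1/ratio (this formula is an inference from the numbers, not a documented definition):

for ratio in (1.43, 1.60, 1.68, 1.75, 1.44):
    print(f"{ratio:.2f}x -> {1 - 1 / ratio:.1%} size reduction")
# ~30.1%, 37.5%, 40.5%, 42.9%, 30.6%: close to the table (ratios are rounded)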

Requirements

  • Python >= 3.13

Installation

Install from PyPI:

# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok

Building from Source

If you want to develop or build from source, you'll need a Rust toolchain (rustup):

# clone the repository
git clone https://github.com/vihanga-malaviarachchi/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
pip install maturin
maturin develop

Quick Start

import bytetok

# create a tokenizer with a built-in pattern (default: gpt4o)
tokenizer = bytetok.get_tokenizer("gpt4o")

# train on text
tokenizer.train("your training corpus here...", vocab_size=1000)

# encode and decode
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# save and reload
tokenizer.save("my_tokenizer")
reloaded = bytetok.from_pretrained("my_tokenizer.model")

API Reference

Factory Functions

bytetok.get_tokenizer(pattern="gpt4o", *, custom_pattern=None)

Create a RegexTokenizer with a built-in or custom regex pattern.

  • pattern (str) -- Name of a built-in pattern. Ignored when custom_pattern is set. Default: "gpt4o".
  • custom_pattern (str | None) -- A custom regex pattern string. Overrides pattern when provided.
  • Returns: Tokenizer
  • Raises: PatternError if the custom pattern is invalid regex.
# built-in pattern
tokenizer = bytetok.get_tokenizer("llama3")

# custom pattern
tokenizer = bytetok.get_tokenizer(custom_pattern=r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+")

bytetok.from_pretrained(model_path)

Load a previously saved tokenizer from a .model file. The tokenizer type is auto-detected from the file header.

  • model_path (str) -- Path to the .model file.
  • Returns: Tokenizer (either BasicTokenizer or RegexTokenizer depending on what was saved).
  • Raises: ModelLoadError if the file does not exist, has the wrong extension, contains an unknown tokenizer type, or has a version mismatch.
tokenizer = bytetok.from_pretrained("my_tokenizer.model")

bytetok.get_strategy(name="none-raise", allowed_subset=None)

Create a special token handling strategy for use with encode().

  • name ("all" | "none" | "none-raise" | "custom") -- Strategy name.
  • allowed_subset (set[str] | None) -- Required when name="custom". The set of special token strings to allow.
  • Returns: SpecialTokenStrategy
  • Raises: StrategyError if the name is unknown or "custom" is used without allowed_subset.
strategy = bytetok.get_strategy("all")
strategy = bytetok.get_strategy("custom", allowed_subset={"<|endoftext|>"})

bytetok.list_patterns()

Return the names of all available built-in regex patterns.

  • Returns: list[str]
bytetok.list_patterns()
# ['GPT2', 'GPT4', 'GPT4O', 'LLAMA3', 'QWEN2', 'DEEPSEEK_CODER', 'DEEPSEEK_LLM',
#  'STARCODER', 'FALCON', 'BLOOM']

bytetok.get_pattern(name)

Get the regex pattern string for a specific built-in pattern by name.

  • name (str) -- Name of the built-in pattern (case-insensitive).
  • Returns: str -- The regex pattern string.
  • Raises: PatternError if the pattern name is unknown.
# get a specific pattern string
pattern_str = bytetok.get_pattern("llama3")

# use it to create a tokenizer
tokenizer = bytetok.RegexTokenizer(pattern=pattern_str)

bytetok.list_strategies()

Return the names of all available special token strategies.

  • Returns: list[str]
bytetok.list_strategies()
# ['all', 'none', 'none-raise', 'custom']

Tokenizer Classes

All tokenizers inherit from the abstract base class Tokenizer. The two concrete implementations are BasicTokenizer and RegexTokenizer.

The BasicTokenizer serves as documentation of the simplest possible BPE tokenizer implementation. It is not recommended for actual use because decode() reconstructs text by decoding a raw byte stream as UTF-8 with replacement (errors="replace"), which can lose information for invalid UTF-8 byte sequences (e.g. broken multi-byte code points become the U+FFFD replacement character, �).
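
The information loss comes from errors="replace" decoding in Python itself; any invalid byte sequence collapses to U+FFFD and the original bytes cannot be recovered:

# The first two bytes of '€' form a truncated 3-byte UTF-8 sequence.
broken = "€".encode("utf-8")[:2]
text = broken.decode("utf-8", errors="replace")
print(text)                            # replacement character(s), not '€'
print(text.encode("utf-8") == broken)  # False: the original bytes are gone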

All of ByteTok's factory functions default to RegexTokenizer. For custom extensions or implementations, always inherit from RegexTokenizer.

Tokenizer (abstract base class)

Manages vocabulary, byte pair merges, and serialization. You do not instantiate this directly; use RegexTokenizer or the factory functions instead.

Attributes

| Attribute | Type | Description |
|---|---|---|
| merges | dict[tuple[int,int], int] | Byte pair -> merged token ID mapping. |
| vocab | dict[int, bytes] | Token ID -> byte sequence mapping. |
| pat | str | Regex pattern used for text splitting (if any). |
| special_toks | dict[str, int] | Special token string -> token ID mapping. |
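
For example, after training you can inspect these attributes directly (corpus.txt is a placeholder path; the printed values are illustrative):

tokenizer = bytetok.get_tokenizer("gpt4o")
tokenizer.train(open("corpus.txt").read(), vocab_size=512)

print(len(tokenizer.vocab))                  # 512 = 256 base bytes + 256 merges
print(next(iter(tokenizer.merges.items())))  # e.g. ((101, 32), 256)
print(tokenizer.pat)                         # the regex split pattern in use
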
train(text, vocab_size, verbose=False)

Train the tokenizer by learning byte pair merges from the input.

  • text (str | list[str]) -- Training corpus. Lists are concatenated.
  • vocab_size (int) -- Target vocabulary size. Must be > 256.
  • verbose (bool) -- Log each merge operation. Default: False.
  • Raises: VocabularyError if vocab_size <= 256. TrainingError if the input is empty.
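
For example, asking for a vocabulary no larger than the 256 base byte tokens raises:

from bytetok.errors import VocabularyError

tokenizer = bytetok.get_tokenizer()
try:
    tokenizer.train("hello world", vocab_size=256)  # must be > 256
except VocabularyError as e:
    print(e)
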
encode(text, strategy=None)

Encode text into a list of integer token IDs.

  • text (str) -- Text to encode.
  • strategy (SpecialTokenStrategy | None) -- How to handle special tokens. None means no special token handling.
  • Returns: list[int]
decode(tokens)

Decode a list of token IDs back into text.

  • tokens (list[int]) -- Token IDs to decode.
  • Returns: str
  • Raises: VocabularyError if a token ID is not in the vocabulary (RegexTokenizer).
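
Similarly, decoding an ID that was never assigned raises on a RegexTokenizer:

from bytetok.errors import VocabularyError

try:
    tokenizer.decode([10**9])  # not a valid token ID
except VocabularyError as e:
    print(e)
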
save(file_prefix)

Save the trained tokenizer to disk. Creates two files:

  • <file_prefix>.model -- Binary merge mappings (used by load() / from_pretrained()).
  • <file_prefix>.vocab -- Human-readable token representations.

Parameters:

  • file_prefix (str) -- Path prefix for the output files.
tokenizer.save("models/my_tok")
# creates models/my_tok.model and models/my_tok.vocab
load(model_filename)

Load tokenizer state from a .model file. Restores merges, special tokens, and rebuilds the vocabulary.

  • model_filename (str) -- Path to the .model file.
  • Raises: ModelLoadError on missing file, wrong extension, version mismatch, or type mismatch.
tokenizer = bytetok.RegexTokenizer()
tokenizer.load("models/my_tok.model")

BasicTokenizer()

Tokenizer that operates directly on raw byte sequences without any regex splitting. Does not support special token strategies.

tok = bytetok.BasicTokenizer()
tok.train("Hello world", vocab_size=300)
tokens = tok.encode("Hello")
text = tok.decode(tokens)

All methods are inherited from Tokenizer. The strategy parameter on encode() is accepted but ignored.

It is recommended not to use this class. Use RegexTokenizer instead.


RegexTokenizer(pattern=None)

Tokenizer that splits text with a regex pattern before applying BPE. Supports special token registration and strategies.

  • pattern (str | None) -- Regex pattern for text splitting. Defaults to the gpt4o pattern when None.
tok = bytetok.RegexTokenizer()                     # default gpt4o pattern
tok = bytetok.RegexTokenizer(pattern=r"\w+|\S")    # custom pattern

In addition to the methods inherited from Tokenizer, RegexTokenizer provides:

register_special_tokens(special_toks)

Register special tokens with auto-assigned IDs. Must be called after training. Token IDs are assigned sequentially starting from the current vocabulary size.

  • special_toks (list[str]) -- Special token strings to register.
  • Raises: SpecialTokenError if the tokenizer has not been trained yet.
tok.train(text, vocab_size=1000)
tok.register_special_tokens(["<|endoftext|>", "<|pad|>", "<|start|>"])

# encode with special token awareness
strategy = bytetok.get_strategy("all")
tokens = tok.encode("Hello<|endoftext|>", strategy=strategy)
text = tok.decode(tokens)

TokenPattern

TokenPattern is a str enum containing pre-defined regex patterns sourced from popular tokenizer implementations.

TokenPattern.get(name)

Look up a pattern by name (case-insensitive).

  • name (str) -- Pattern name.
  • Returns: str -- The regex pattern string.
  • Raises: PatternError if the name is unknown.
pattern = bytetok.TokenPattern.get("gpt4o")

Available Patterns

  • gpt2
  • gpt4
  • gpt4o
  • llama3
  • qwen2
  • deepseek-coder
  • deepseek-llm
  • starcoder
  • falcon
  • bloom

Special Token Strategies

Strategies control how special tokens are recognised during encode(). Pass a strategy instance as the strategy parameter.

SpecialTokenStrategy (abstract base class)

Base class. Subclass this to implement custom strategies.

handle(text, special_toks)

Decide which registered special tokens should be recognised in the given text.

  • text (str) -- The text being encoded.
  • special_toks (dict[str, int]) -- All registered special tokens.
  • Returns: dict[str, int] -- The subset of special tokens to apply.
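
Custom behaviour beyond the built-ins can be sketched as follows (a hypothetical subclass; it assumes the base class is exposed as bytetok.SpecialTokenStrategy and that handle() follows the contract above):

class AllowPrefixStrategy(bytetok.SpecialTokenStrategy):
    """Hypothetical strategy: allow only special tokens starting with a prefix."""

    def __init__(self, prefix: str):
        self.prefix = prefix

    def handle(self, text, special_toks):
        # Return only the registered special tokens that match the prefix.
        return {tok: tid for tok, tid in special_toks.items()
                if tok.startswith(self.prefix)}

strategy = AllowPrefixStrategy("<|")
tokens = tokenizer.encode("Hello<|endoftext|>", strategy=strategy)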

AllowAllStrategy

Allows all registered special tokens to be recognised during encoding.

AllowNoneStrategy

Silently ignores all special tokens. They are treated as regular text.

AllowNoneRaiseStrategy

Raises SpecialTokenError if any registered special token is found in the input text.

AllowCustomStrategy(allowed_subset)

Allows only a specified subset of special tokens.

  • allowed_subset (set[str]) -- The special token strings to allow.
# via factory (recommended)
strategy = bytetok.get_strategy("custom", allowed_subset={"<|endoftext|>"})

# or instantiate directly
strategy = bytetok.AllowCustomStrategy({"<|endoftext|>"})

Exceptions

All exceptions inherit from ByteTokError (importable from bytetok.errors).

| Exception | Raised when |
|---|---|
| ByteTokError | Base exception for all bytetok errors. |
| VocabularyError | vocab_size <= 256 during training, or unknown token ID during decode. |
| TrainingError | Training input is empty or too short. |
| ModelLoadError | Loading a .model file fails (missing, wrong format, version mismatch). |
| PatternError | A regex pattern fails to compile. |
| SpecialTokenError | Special token handling fails (e.g. AllowNoneRaiseStrategy finds one). |
| StrategyError | Unknown strategy name or missing allowed_subset for custom strategy. |
| TokenizationError | General tokenization failure. |
from bytetok.errors import ModelLoadError

try:
    tok = bytetok.from_pretrained("missing.model")
except ModelLoadError as e:
    print(e)

Model File Format

save() produces two files:

.model -- Machine-readable format used by load() and from_pretrained():

ByteTok 0.1.0
type regex
re <pattern>
---
<n_special_tokens>
<special_token_string> <token_id>
...
---
<tok_a> <tok_b> <merged_tok>
...
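
As an illustration of this layout (load() and from_pretrained() remain the supported readers; the sketch below only parses the header of a regex-type file):

def read_model_header(path: str) -> dict[str, str]:
    """Parse the header of a ByteTok .model file per the layout above."""
    with open(path, encoding="utf-8") as f:
        magic, version = f.readline().split()   # "ByteTok 0.1.0"
        _, tok_type = f.readline().split()      # "type regex"
        pattern = f.readline().removeprefix("re ").rstrip("\n")
        assert f.readline().strip() == "---"    # end of header block
    return {"magic": magic, "version": version, "type": tok_type, "pattern": pattern}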

.vocab -- Human-readable vocabulary for inspection:

ST [256] <|endoftext|>
[0] \u0000
...
[258] [he][llo] -> hello

Acknowledgment

ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.

License

This project is licensed under the MIT License. See the LICENSE file for details.
