ByteTok

A fast, modular, and lightweight BPE tokenizer for NLP research and prototyping.

ByteTok implements byte-level Byte Pair Encoding (BPE) with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes, then merged according to learned pair statistics.

The training pipeline first pretokenizes the corpus, deduplicates identical pieces, and tracks their frequencies as weighted counts. Merge steps then operate over those weighted pieces instead of repeatedly rescanning the full token stream, which cuts redundant work while preserving the same merge decisions.
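As a rough illustration of that idea, here is a toy Python sketch (not ByteTok's actual Rust core; the function name and pretokenization pattern are invented for the example). Pair statistics are accumulated once per unique piece and weighted by how often that piece occurs:

import re
from collections import Counter

def train_weighted_bpe(corpus: str, num_merges: int) -> list[tuple[int, int]]:
    """Toy weighted BPE trainer: merges are chosen from weighted pair counts."""
    # Pretokenize, then deduplicate pieces and remember their corpus frequencies.
    piece_counts = Counter(re.findall(r"\w+|\s+|[^\w\s]", corpus))
    # Each unique piece starts as a tuple of raw byte values.
    pieces = {p: tuple(p.encode("utf-8")) for p in piece_counts}

    merges: list[tuple[int, int]] = []
    for step in range(num_merges):
        # Count adjacent pairs once per unique piece, weighted by frequency,
        # instead of rescanning every occurrence in the corpus.
        pair_counts: Counter = Counter()
        for piece, toks in pieces.items():
            for pair in zip(toks, toks[1:]):
                pair_counts[pair] += piece_counts[piece]
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_id = 256 + step  # raw byte values occupy IDs 0-255
        # Apply the chosen merge inside every unique piece.
        for piece, toks in pieces.items():
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    merged.append(new_id)
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            pieces[piece] = tuple(merged)
    return merges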

If this methodology seems familiar to you, that's because ByteTok's current training algorithm draws inspiration from Hugging Face's implementation!

Features

  • High-performance Rust-powered training, encoding, and decoding: Engineered from the ground up with a parallel processing pipeline to handle large-scale NLP datasets (1 GB+) efficiently, with modern LLM applications in mind.
  • Built-in regex patterns: Choose from pre-tokenization regex presets for GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, and DeepSeek.
  • Custom regex patterns: Supported alongside the built-in presets.
  • Special token strategies: Control how special tokens are handled during encoding.
  • Serialization: Supports versioned .model / .vocab file formats for saving tokenizer state, as well as easy loading via a from_pretrained() function.

History

This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:

  • Train on custom text
  • Save learned encodings
  • Encode text
  • Decode text

Feel free to check out robust libraries such as OpenAI's tiktoken and Google's sentencepiece, which are widely adopted in production environments. Of the two, tiktoken resembles ByteTok the most, though ByteTok provides a training pipeline that tiktoken lacks.

ByteTok was developed with a different focus: it prioritizes simplicity and usability, offering a clear API that efficiently maps strings to lists of token IDs without burdening users with complex configuration or excessive parameters.

Benchmarks

These benchmarks were conducted on a Linux x86_64 system equipped with an Intel Core i7-12700H processor (20 threads @ up to 4.70 GHz) and 32 GB DDR5 RAM. Encoding and decoding throughput measure the speed of encode_batch() and decode_batch() operations, respectively.

Dataset: Sci-Fi Books (Gutenberg)

| Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|-------------|------------|---------------|---------------------|---------------------|-------------------|----------------|
| 132.36 MB   | 10,000     | 32.4 secs     | 14.13 MB/sec        | 80.9M tokens/sec    | 3.53x             | 71.6%          |
| 216.96 MB   | 25,000     | 1.26 mins     | 13.65 MB/sec        | 83.8M tokens/sec    | 3.66x             | 72.7%          |
| 216.96 MB   | 50,000     | 1.38 mins     | 12.86 MB/sec        | 81.6M tokens/sec    | 3.80x             | 73.7%          |
| 326.96 MB   | 50,000     | 2.09 mins     | 12.43 MB/sec        | 81.6M tokens/sec    | 3.84x             | 74.0%          |
| 420.36 MB   | 100,000    | 4.06 mins     | 12.00 MB/sec        | 84.7M tokens/sec    | 3.96x             | 74.7%          |
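Size reduction follows directly from the compression ratio as 1 - 1/ratio: for example, the 3.66x ratio in the second row gives 1 - 1/3.66 ≈ 72.7%.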

Requirements

  • Python >= 3.12

Installation

Install from PyPI:

# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok

Building from Source

If you want to develop or build from source, you will need the Rust toolchain, installed via rustup.

# clone the repository
git clone https://github.com/VihangaFTW/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
uv sync --group dev
uv run maturin develop --release

Quick Start

Here you will find the primary workflows for using ByteTok tokenizers. For detailed API usage and additional features, see the full documentation in the Wiki.

Basics

The API has been designed with simplicity in mind:

import bytetok as btok


# Create a tokenizer with a built-in pattern (default: gpt4o).
tokenizer = btok.get_tokenizer("gpt4o")

# Train on text.
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode and decode.
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# Save and reload.
tokenizer.save("my_tokenizer")
reloaded = btok.from_pretrained("my_tokenizer.model")
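Here save("my_tokenizer") writes the versioned .model / .vocab files mentioned under Features; as shown above, from_pretrained() only needs the .model path.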

Custom regex patterns can be used for pre-tokenization:

import bytetok as btok


# Create a tokenizer with a custom pattern
# For example, split on whitespace and punctuation.
tokenizer = btok.get_tokenizer(custom_pattern=r"\w+|[^\w\s]")
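To preview how this pattern splits text before any merges are applied, you can run it through Python's standard re module (this only exercises the regex itself, not ByteTok):

import re

# Words and punctuation become separate pieces; whitespace is skipped
# because it matches neither alternative.
print(re.findall(r"\w+|[^\w\s]", "Hello, world!"))
# ['Hello', ',', 'world', '!']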

For best results, choose from the built-in presets, which have been extensively validated.

Parallel Encoding

ByteTok supports parallel encoding and decoding for faster processing of large batches of text.

Use encode_batch to encode large collections of texts in parallel, then decode the resulting list of token sequences in parallel with decode_batch:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode a batch of texts in parallel.
texts = ["First document...", "Second document...", "Third document..."]
encoded = tokenizer.encode_batch(texts, show_progress=False)

# Decode the batch in parallel.
decoded = tokenizer.decode_batch(encoded, errors="replace", show_progress=False)
assert decoded[0] == "First document..."
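For a rough sense of the speedup, compare batch encoding against a sequential loop (a hypothetical measurement sketch: the numbers depend on your machine and corpus, and it assumes encode_batch yields the same token IDs as per-document encode):

import time

docs = ["First document...", "Second document...", "Third document..."] * 10_000

t0 = time.perf_counter()
sequential = [tokenizer.encode(d) for d in docs]
t1 = time.perf_counter()
batched = tokenizer.encode_batch(docs, show_progress=False)
t2 = time.perf_counter()

# Same token IDs either way; only the execution strategy differs.
assert sequential == batched
print(f"sequential: {t1 - t0:.2f}s  batch: {t2 - t1:.2f}s")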

Special Tokens

Register special tokens after training, then encode with a strategy to control how they are handled:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Set special tokens (IDs must be >= vocab size).
tokenizer.set_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005})

# Encode with strategy: "all" allows special tokens in text; "none" ignores them.
strategy = btok.get_strategy("all")
tokens = tokenizer.encode("Hello<|endoftext|>world", strategy=strategy)

# Batch encoding with special tokens.
encoded = tokenizer.encode_batch(
    ["Doc one.", "Doc two<|pad|>padding", "Doc three."],
    strategy=strategy,
)

ByteTok automatically checks for conflicts, such as special tokens that would replace existing tokens in the vocabulary or duplicate one another.
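To make the difference between strategies concrete, here is a small sketch (assuming, per the comment above, that "none" treats the special-token text as ordinary input):

# "all": the marker maps to its registered ID.
all_ids = tokenizer.encode("Hi<|endoftext|>", strategy=btok.get_strategy("all"))
assert 15005 in all_ids

# "none": "<|endoftext|>" is tokenized like any other text.
none_ids = tokenizer.encode("Hi<|endoftext|>", strategy=btok.get_strategy("none"))
assert 15005 not in none_ids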

For a complete list of special token strategies, see the Wiki documentation.

Acknowledgment

ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.

License

This project is licensed under the MIT License. See the LICENSE file for details.
