ByteTok

A fast, modular, and lightweight BPE tokenizer for NLP research and prototyping.

ByteTok implements byte-level Byte Pair Encoding (BPE) with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes, then merged according to learned pair statistics.

The training pipeline first pretokenizes the corpus, deduplicates identical pieces, and tracks their frequencies as weighted counts. Merge steps then operate over those weighted pieces instead of repeatedly rescanning the full token stream, which cuts redundant work while preserving the same merge decisions.
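As a rough illustration of that idea, here is a toy Python sketch (not ByteTok's actual Rust core; the function name and pretokenization pattern are invented for the example). Pair statistics are accumulated once per unique piece and weighted by how often that piece occurs:

import re
from collections import Counter

def train_weighted_bpe(corpus: str, num_merges: int) -> list[tuple[int, int]]:
    """Toy weighted BPE trainer: merges are chosen from weighted pair counts."""
    # Pretokenize, then deduplicate pieces and remember their corpus frequencies.
    piece_counts = Counter(re.findall(r"\w+|\s+|[^\w\s]", corpus))
    # Each unique piece starts as a tuple of raw byte values.
    pieces = {p: tuple(p.encode("utf-8")) for p in piece_counts}

    merges: list[tuple[int, int]] = []
    for step in range(num_merges):
        # Count adjacent pairs once per unique piece, weighted by frequency,
        # instead of rescanning every occurrence in the corpus.
        pair_counts: Counter = Counter()
        for piece, toks in pieces.items():
            for pair in zip(toks, toks[1:]):
                pair_counts[pair] += piece_counts[piece]
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_id = 256 + step  # raw byte values occupy IDs 0-255
        # Apply the chosen merge inside every unique piece.
        for piece, toks in pieces.items():
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    merged.append(new_id)
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            pieces[piece] = tuple(merged)
    return merges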

If this methodology seems familiar to you, that's because ByteTok's current training algorithm draws inspiration from Hugging Face's implementation!

Features

  • High-performance Rust-powered training, encoding, and decoding: Engineered from the ground up with a parallel processing pipeline to handle large-scale NLP datasets (1 GB+) efficiently, with modern LLM applications in mind.
  • Built-in regex patterns: Choose from pre-tokenization regex presets for GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, and DeepSeek.
  • Custom regex patterns: Supported alongside the built-in presets.
  • Special token strategies: Control how special tokens are handled during encoding.
  • Serialization: Supports versioned .model / .vocab file formats for saving tokenizer state, as well as easy loading via a from_pretrained() function.

History

This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:

  • Train on custom text
  • Save learned encodings
  • Encode text
  • Decode text

Feel free to check out robust libraries such as OpenAI's tiktoken and Google's sentencepiece, which are widely adopted in production environments. Of the two, tiktoken resembles ByteTok the most, though ByteTok provides a training pipeline that tiktoken lacks.

ByteTok was developed with a different focus: it prioritizes simplicity and usability, offering a clear API that efficiently maps strings to lists of token IDs without burdening users with complex configuration or excessive parameters.

Benchmarks

These benchmarks were conducted on a Linux x86_64 system equipped with an Intel Core i7-12700H processor (20 threads @ up to 4.70 GHz) and 32 GB DDR5 RAM. Encoding and decoding throughput measure the speed of encode_batch() and decode_batch() operations, respectively.

Dataset: Sci-Fi Books (Gutenberg)

| Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|-------------|------------|---------------|---------------------|---------------------|-------------------|----------------|
| 132.36 MB   | 10,000     | 32.4 secs     | 14.13 MB/sec        | 80.9M tokens/sec    | 3.53x             | 71.6%          |
| 216.96 MB   | 25,000     | 1.26 mins     | 13.65 MB/sec        | 83.8M tokens/sec    | 3.66x             | 72.7%          |
| 216.96 MB   | 50,000     | 1.38 mins     | 12.86 MB/sec        | 81.6M tokens/sec    | 3.80x             | 73.7%          |
| 326.96 MB   | 50,000     | 2.09 mins     | 12.43 MB/sec        | 81.6M tokens/sec    | 3.84x             | 74.0%          |
| 420.36 MB   | 100,000    | 4.06 mins     | 12.00 MB/sec        | 84.7M tokens/sec    | 3.96x             | 74.7%          |
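Size reduction follows directly from the compression ratio as 1 - 1/ratio: for example, the 3.66x ratio in the second row gives 1 - 1/3.66 ≈ 72.7%.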

Requirements

  • Python >= 3.12

Installation

Install from PyPI:

# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok

Building from Source

If you want to develop or build from source, you will need the Rust toolchain, installed via rustup.

# clone the repository
git clone https://github.com/VihangaFTW/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
uv sync --group dev
uv run maturin develop --release

Quick Start

Here you will find the primary workflows for using ByteTok tokenizers. For detailed API usage and additional features, see the full documentation in the Wiki.

Basics

The API has been designed with simplicity in mind:

import bytetok as btok


# Create a tokenizer with a built-in pattern (default: gpt4o).
tokenizer = btok.get_tokenizer("gpt4o")

# Train on text.
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode and decode.
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# Save and reload.
tokenizer.save("my_tokenizer")
reloaded = btok.from_pretrained("my_tokenizer.model")
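Here save("my_tokenizer") writes the versioned .model / .vocab files mentioned under Features; as shown above, from_pretrained() only needs the .model path.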

Custom regex patterns can be used for pre-tokenization:

import bytetok as btok


# Create a tokenizer with a custom pattern
# For example, split on whitespace and punctuation.
tokenizer = btok.get_tokenizer(custom_pattern=r"\w+|[^\w\s]")
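To preview how this pattern splits text before any merges are applied, you can run it through Python's standard re module (this only exercises the regex itself, not ByteTok):

import re

# Words and punctuation become separate pieces; whitespace is skipped
# because it matches neither alternative.
print(re.findall(r"\w+|[^\w\s]", "Hello, world!"))
# ['Hello', ',', 'world', '!']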

For best results, choose from the built-in presets, which have been extensively validated.

Parallel Encoding

ByteTok supports parallel encoding and decoding for faster processing of large batches of text.

Use encode_batch to encode large collections of texts in parallel, then decode the resulting list of token sequences in parallel with decode_batch:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode a batch of texts in parallel.
texts = ["First document...", "Second document...", "Third document..."]
encoded = tokenizer.encode_batch(texts, show_progress=False)

# Decode the batch in parallel.
decoded = tokenizer.decode_batch(encoded, errors="replace", show_progress=False)
assert decoded[0] == "First document..."
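For a rough sense of the speedup, compare batch encoding against a sequential loop (a hypothetical measurement sketch: the numbers depend on your machine and corpus, and it assumes encode_batch yields the same token IDs as per-document encode):

import time

docs = ["First document...", "Second document...", "Third document..."] * 10_000

t0 = time.perf_counter()
sequential = [tokenizer.encode(d) for d in docs]
t1 = time.perf_counter()
batched = tokenizer.encode_batch(docs, show_progress=False)
t2 = time.perf_counter()

# Same token IDs either way; only the execution strategy differs.
assert sequential == batched
print(f"sequential: {t1 - t0:.2f}s  batch: {t2 - t1:.2f}s")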

Special Tokens

Register special tokens after training, then encode with a strategy to control how they are handled:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Set special tokens (IDs must be >= vocab size).
tokenizer.set_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005})

# Encode with strategy: "all" allows special tokens in text; "none" ignores them.
strategy = btok.get_strategy("all")
tokens = tokenizer.encode("Hello<|endoftext|>world", strategy=strategy)

# Batch encoding with special tokens.
encoded = tokenizer.encode_batch(
    ["Doc one.", "Doc two<|pad|>padding", "Doc three."],
    strategy=strategy,
)

ByteTok automatically checks for conflicts, such as special tokens that would replace existing tokens in the vocabulary or duplicate one another.
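To make the difference between strategies concrete, here is a small sketch (assuming, per the comment above, that "none" treats the special-token text as ordinary input):

# "all": the marker maps to its registered ID.
all_ids = tokenizer.encode("Hi<|endoftext|>", strategy=btok.get_strategy("all"))
assert 15005 in all_ids

# "none": "<|endoftext|>" is tokenized like any other text.
none_ids = tokenizer.encode("Hi<|endoftext|>", strategy=btok.get_strategy("none"))
assert 15005 not in none_ids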

For a complete list of special token strategies, see the Wiki documentation.

Acknowledgment

ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.

License

This project is licensed under the MIT License. See the LICENSE file for details.
