
ByteTok

A fast, modular, and lightweight BPE tokenizer for NLP research and prototyping.

ByteTok implements Byte Pair Encoding (BPE) at the byte level with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes (0-255), then iteratively merged using learned pair statistics.

The training algorithm is based on an optimized BPE algorithm from the paper A Formal Perspective on Byte-Pair Encoding, which enables ByteTok to achieve O(N log V) training time and O(N log N) encoding time, versus the naive O(NV) approach.

Here, N denotes the length of the input text and V is the tokenizer's vocabulary size.
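
For intuition, here is a minimal, naive byte-level BPE trainer in plain Python. It re-scans the entire sequence for every merge (the O(NV) behavior the optimized algorithm avoids), so treat it as a sketch of the idea rather than ByteTok's Rust implementation:

from collections import Counter

def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    # Start from raw UTF-8 bytes (IDs 0-255); each merge mints one new ID.
    ids = list(text.encode("utf-8"))
    merges: list[tuple[int, int]] = []
    next_id = 256
    while next_id < vocab_size:
        # Count adjacent pairs and pick the most frequent one.
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        # Replace every occurrence of the best pair with the new token ID.
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        merges.append(best)
        next_id += 1
    return merges

merges = train_bpe("aaabdaaabac", 259)  # first learned merge: (97, 97), i.e. "aa"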

Features

  • High-performance Rust-powered training, encoding, and decoding: Engineered from the ground up with a parallel processing pipeline to handle large-scale NLP datasets (1 GB+) efficiently, with modern LLM workloads in mind.
  • Built-in regex patterns: Choose from pre-tokenization regex presets, including GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, and DeepSeek.
  • Custom regex patterns: Supported alongside the built-in presets.
  • Special token strategies: Control how special tokens are handled during encoding.
  • Serialization: Supports versioned .model / .vocab file formats for saving tokenizer state, as well as easy loading via a from_pretrained() function.

History

This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:

  • Train on custom text
  • Save learned encodings
  • Encode text
  • Decode text

Feel free to check out robust libraries such as OpenAI's tiktoken and Google's sentencepiece, which are widely adopted in production environments. Of these, tiktoken resembles ByteTok the most, though ByteTok provides a training pipeline that tiktoken lacks.

ByteTok, in contrast, prioritizes simplicity and usability: a clear API that efficiently maps strings to lists of token IDs, without burdening users with complex configuration or excessive parameters.

Benchmarks

These benchmarks were conducted on a Linux x86_64 system equipped with an Intel Core i7-12700H processor (14 cores / 20 threads, up to 4.70 GHz) and 32 GB DDR5 RAM. Encoding and decoding throughput measure the speed of encode_batch() and decode_batch() operations, respectively.

Dataset: Sci-Fi Books (Gutenberg)

Corpus Size  Vocab Size  Training Time  Encoding Throughput  Decoding Throughput  Compression Ratio  Size Reduction
132.36 MB    10,000      4.58 mins      16.12 MB/sec         82.4M tokens/sec     1.38x              27.5%
216.96 MB    10,000      8.75 mins      13.82 MB/sec         81.4M tokens/sec     1.60x              37.7%
216.96 MB    25,000      9.74 mins      14.55 MB/sec         70.2M tokens/sec     1.68x              40.6%
216.96 MB    50,000      10.67 mins     14.99 MB/sec         77.0M tokens/sec     1.75x              42.7%
326.96 MB    50,000      16.19 mins     14.61 MB/sec         79.3M tokens/sec     1.44x              30.7%
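
For reference, the size-reduction column follows from the compression ratio as size reduction = 1 - 1 / compression ratio (e.g., 1 - 1/1.38 ≈ 27.5%); small deviations in the table come from rounding of the reported ratios.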

Requirements

  • Python >= 3.12

Installation

Install from PyPI:

# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok

Building from Source

If you want to develop or build from source, you will need the Rust toolchain, installed via rustup.

# clone the repository
git clone https://github.com/VihangaFTW/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
uv sync --group dev
uv run maturin develop --release

Quick Start

Here you will find the primary workflows for using ByteTok tokenizers. For detailed API usage and additional features, see the full documentation in the Wiki.

Basics

The API has been designed with simplicity in mind:

import bytetok as btok


# Create a tokenizer with a built-in pattern (default: gpt4o).
tokenizer = btok.get_tokenizer("gpt4o")

# Train on text.
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode and decode.
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# Save and reload.
tokenizer.save("my_tokenizer")
reloaded = btok.from_pretrained("my_tokenizer.model")

Custom regex patterns can be used for pre-tokenization:

import bytetok as btok


# Create a tokenizer with a custom pattern.
# For example, split on whitespace and punctuation.
tokenizer = btok.get_tokenizer(custom_pattern=r"\w+|[^\w\s]")
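
To preview how a custom pattern chunks text before any merges are applied, you can run it through Python's standard re module (shown purely for illustration; ByteTok applies the pattern internally):

import re

# The same pattern passed to get_tokenizer above: words or single punctuation marks.
print(re.findall(r"\w+|[^\w\s]", "Hello, world!"))
# ['Hello', ',', 'world', '!']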

For best results, it is recommended to choose from the built-in presets, which have been extensively validated.

Parallel Encoding

ByteTok supports parallel encoding and decoding for faster processing of large batches of text.

Use encode_batch to encode large collections of texts efficiently in parallel; you can then decode the resulting list of token sequences in parallel using decode_batch:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode a batch of texts in parallel.
texts = ["First document...", "Second document...", "Third document..."]
encoded = tokenizer.encode_batch(texts)

# Decode the batch in parallel.
decoded = tokenizer.decode_batch(encoded, errors="replace")
assert decoded[0] == "First document..."
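
To gauge the benefit on your own data, you can compare a sequential loop of encode() calls against a single encode_batch() call. This is a hypothetical measurement sketch (actual speedups depend on core count, batch size, and text length):

import time
import bytetok as btok

tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)
texts = ["First document...", "Second document..."] * 5_000

# Sequential baseline: one encode() call per text.
start = time.perf_counter()
sequential = [tokenizer.encode(t) for t in texts]
print(f"sequential: {time.perf_counter() - start:.2f}s")

# Parallel: a single encode_batch() call over the whole list.
start = time.perf_counter()
parallel = tokenizer.encode_batch(texts)
print(f"batch:      {time.perf_counter() - start:.2f}s")

# Both paths should produce identical token sequences.
assert sequential == parallel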

Special Tokens

Register special tokens after training, then encode with a strategy to control how they are handled:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Set special tokens (IDs must be >= vocab size).
tokenizer.set_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005})

# Encode with strategy: "all" allows special tokens in text; "none" ignores them.
strategy = btok.get_strategy("all")
tokens = tokenizer.encode("Hello<|endoftext|>world", strategy=strategy)

# Batch encoding with special tokens.
encoded = tokenizer.encode_batch(
    ["Doc one.", "Doc two<|pad|>padding", "Doc three."],
    strategy=strategy,
)

ByteTok automatically checks for conflicts: special tokens whose IDs would replace existing tokens in the vocabulary, or duplicate registrations.
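
For example, registering a special token whose ID falls inside the learned vocabulary range should be flagged. The exact error type is not pinned down here, so this sketch assumes the conflict surfaces as an exception and catches it generically:

import bytetok as btok

tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# ID 100 collides with the learned vocabulary (0-999), so this is a conflict.
try:
    tokenizer.set_special_tokens({"<|endoftext|>": 100})
except Exception as err:
    print(f"rejected: {err}")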

Acknowledgment

ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.

License

This project is licensed under the MIT License. See the LICENSE file for details.
