Fast Rust BPE tokenizer with Python bindings

splintr

A high-performance BPE tokenizer implemented in Rust with Python bindings, designed for efficient tokenization of text in machine learning applications, particularly for large language models.

Features

splintr implements several optimizations that make tokenization faster and more efficient:

  • PCRE2 with JIT compilation: Uses PCRE2's just-in-time compilation for regex matching, providing a 2-4x speedup over fancy-regex on pattern matching operations
  • Rayon parallelism: Leverages multiple CPU cores for encoding batches of text and individual regex chunks within each text
  • Linked-list BPE algorithm: Implements BPE using a linked-list structure that avoids O(N²) complexity on pathological inputs with many repetitive patterns
  • FxHashMap: Uses rustc's FxHasher for faster lookups compared to the default SipHash, trading cryptographic security for speed in non-adversarial contexts
  • Aho-Corasick for special tokens: Employs the Aho-Corasick algorithm for fast multi-pattern matching of special tokens, avoiding regex alternation overhead
  • LRU cache: Caches frequently encoded text chunks to avoid redundant BPE encoding operations
  • UTF-8 streaming decoder: Safely handles token-by-token decoding for LLM output, buffering incomplete UTF-8 sequences across token boundaries
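
To make the linked-list idea concrete, here is a minimal Python sketch of lowest-rank-first BPE merging over a doubly linked list with a priority heap. It is an illustration of the general technique only, not splintr's Rust implementation; the function name `bpe_merge` and the toy `ranks` table are assumptions made for this example.

```python
from __future__ import annotations

import heapq


def bpe_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Merge adjacent parts, lowest rank first, using index-based links."""
    n = len(piece)
    parts = [piece[i:i + 1] for i in range(n)]  # one node per byte
    prev = list(range(-1, n - 1))               # -1 marks "no left neighbour"
    nxt = list(range(1, n + 1))                 # n marks "no right neighbour"
    alive = [True] * n
    heap: list[tuple[int, int, bytes, bytes]] = []

    def push(i: int, j: int) -> None:
        # Queue the pair (i, j) only if its concatenation is a known merge.
        rank = ranks.get(parts[i] + parts[j])
        if rank is not None:
            heapq.heappush(heap, (rank, i, parts[i], parts[j]))

    for i in range(n - 1):
        push(i, i + 1)

    while heap:
        _, i, left, right = heapq.heappop(heap)
        if not alive[i]:
            continue  # node i was merged into its left neighbour
        j = nxt[i]
        if j >= n or parts[i] != left or parts[j] != right:
            continue  # stale entry: one side changed since it was queued
        # Merge node j into node i and unlink j in O(1).
        parts[i] = left + right
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[i] < n:
            prev[nxt[i]] = i
        # Only the two pairs touching the merged node need re-queuing.
        if prev[i] >= 0:
            push(prev[i], i)
        if nxt[i] < n:
            push(i, nxt[i])

    out = []
    i = 0
    while i < n:
        out.append(parts[i])
        i = nxt[i]
    return out


toy_ranks = {b"ab": 0, b"bc": 1, b"abc": 2}  # hypothetical merge table
print(bpe_merge(b"abcab", toy_ranks))  # [b'abc', b'ab']
```

Each merge touches a constant number of links and pushes at most two heap entries, so the whole pass costs O(N log N) instead of rescanning the sequence after every merge.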

Installation

Python

pip install splintr-rs

Rust

[dependencies]
splintr = "0.1.0-beta.1"

Quick Start

Python

from splintr import Tokenizer

# Load a pretrained tokenizer
tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # [9906, 11, 1917, 0]

# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

# Batch encode multiple texts in parallel
texts = ["Hello, world!", "How are you?", "Machine learning is fun!"]
batch_tokens = tokenizer.encode_batch(texts)
print(batch_tokens)  # [[9906, 11, 1917, 0], [4438, 527, 499, 30], ...]

Rust

use splintr::{Tokenizer, CL100K_BASE_PATTERN};
use rustc_hash::FxHashMap;

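// Load vocabulary and create tokenizer
// (`load_tiktoken_bpe_file` is not imported above; it is assumed to be a
// helper already in scope that parses a .tiktoken vocabulary file)
let encoder = load_tiktoken_bpe_file("cl100k_base.tiktoken")?;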
let special_tokens = FxHashMap::default();
let tokenizer = Tokenizer::new(encoder, special_tokens, CL100K_BASE_PATTERN)?;

// Encode text
let tokens = tokenizer.encode("Hello, world!");
println!("{:?}", tokens);

// Decode tokens
let text = tokenizer.decode(&tokens)?;
println!("{}", text);

// Batch encode
let texts = vec!["Hello".to_string(), "World".to_string()];
let batch_tokens = tokenizer.encode_batch(&texts);

API Reference

Python API

Tokenizer

Loading a tokenizer:

from splintr import Tokenizer, CL100K_BASE_PATTERN

# Load a pretrained model (includes vocabulary and special tokens)
tokenizer = Tokenizer.from_pretrained("cl100k_base")  # or "o200k_base"

# Load from a custom vocabulary file
tokenizer = Tokenizer(
    vocab_path="path/to/vocab.tiktoken",
    pattern=CL100K_BASE_PATTERN,
    special_tokens={"<|endoftext|>": 100257}
)

Encoding:

  • encode(text: str) -> list[int]: Encode text to token IDs, treating special tokens as regular text
  • encode_with_special(text: str) -> list[int]: Encode text, recognizing special tokens in the input
  • encode_batch(texts: list[str]) -> list[list[int]]: Encode multiple texts in parallel
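
The difference between encode and encode_with_special comes down to whether the input is first split on special-token strings. The sketch below shows that splitting step in isolation. It uses a plain regex alternation from the standard library as a stand-in (splintr itself is described above as using Aho-Corasick), and the helper name `split_on_special` is hypothetical.

```python
from __future__ import annotations

import re


def split_on_special(text: str, special_tokens: dict[str, int]) -> list[tuple[bool, str]]:
    """Split text into (is_special, segment) pairs, in input order."""
    if not special_tokens:
        return [(False, text)] if text else []
    # Longest-first so a longer special token wins over a shorter prefix.
    alternation = "|".join(
        re.escape(tok) for tok in sorted(special_tokens, key=len, reverse=True)
    )
    segments = []
    pos = 0
    for match in re.finditer(alternation, text):
        if match.start() > pos:
            segments.append((False, text[pos:match.start()]))  # ordinary text
        segments.append((True, match.group()))                 # special token
        pos = match.end()
    if pos < len(text):
        segments.append((False, text[pos:]))
    return segments


specials = {"<|endoftext|>": 100257}
print(split_on_special("Hi<|endoftext|>Bye", specials))
# [(False, 'Hi'), (True, '<|endoftext|>'), (False, 'Bye')]
```

In this picture, encode_with_special would map each special segment directly to its ID and run BPE on the rest, while encode would skip the split and treat the whole input as ordinary text.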

Decoding:

  • decode(tokens: list[int]) -> str: Decode token IDs to text (raises an error on invalid UTF-8)
  • decode_bytes(tokens: list[int]) -> bytes: Decode token IDs to raw bytes
  • decode_lossy(tokens: list[int]) -> str: Decode token IDs, replacing invalid UTF-8 with �
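
The three decoding modes mirror the choices Python's own bytes type offers. The sketch below uses only the standard library to show strict versus lossy handling of a truncated multi-byte character; the helper names `decode_strict` and `decode_lossy_bytes` are made up for this illustration and are not part of splintr.

```python
def decode_strict(raw: bytes) -> str:
    """Strict decoding: raises UnicodeDecodeError on invalid UTF-8."""
    return raw.decode("utf-8")


def decode_lossy_bytes(raw: bytes) -> str:
    """Lossy decoding: invalid sequences become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# First two of the three bytes of "世" (0xE4 0xB8 0x96).
truncated = "世".encode("utf-8")[:2]

try:
    decode_strict(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)                   # True
print(decode_lossy_bytes(truncated))   # contains the replacement character
```

Strict decoding suits complete outputs where invalid bytes signal a bug, lossy decoding suits logging or display of partial output, and raw bytes are the safe choice when the caller does its own buffering.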

Properties:

  • vocab_size: int: Total vocabulary size including special tokens
  • cache_len: int: Number of entries in the LRU cache

Cache management:

  • clear_cache(): Clear the encoding cache
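
For readers unfamiliar with the pattern, a chunk-level LRU cache can be sketched in a few lines of Python. This is a generic illustration under assumed names (`LruCache`, `get_or_compute`); it does not reflect splintr's internal data structure or its capacity.

```python
from collections import OrderedDict


class LruCache:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()

    def get_or_compute(self, key, compute):
        if key in self._entries:
            self._entries.move_to_end(key)    # mark as most recently used
            return self._entries[key]
        value = compute(key)                  # cache miss: do the real work
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the oldest entry
        return value

    def __len__(self):
        return len(self._entries)

    def clear(self):
        self._entries.clear()


calls = []

def fake_encode(chunk):
    # Stand-in for an expensive BPE encode of one text chunk.
    calls.append(chunk)
    return [len(chunk)]

cache = LruCache(capacity=2)
cache.get_or_compute("hello", fake_encode)
cache.get_or_compute("hello", fake_encode)  # served from the cache
print(len(calls), len(cache))  # 1 1
```

In this sketch, `len(cache)` plays the role of cache_len and `clear()` the role of clear_cache().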

StreamingDecoder

The streaming decoder is essential for real-time LLM applications where you receive tokens one at a time and need to display text incrementally:

# Create a streaming decoder
decoder = tokenizer.streaming_decoder()

# Process tokens one at a time (typical LLM streaming scenario)
for token_id in token_stream:
    # Returns text only when complete UTF-8 characters are available
    if text := decoder.add_token(token_id):
        print(text, end="", flush=True)

# Flush any remaining buffered bytes at the end
print(decoder.flush())

Why use a streaming decoder?

BPE tokens don't always align with UTF-8 character boundaries. For example, a multi-byte Unicode character like "世" (3 bytes: 0xE4 0xB8 0x96) might be split across multiple tokens. The streaming decoder buffers incomplete byte sequences and only outputs text when complete characters are available, preventing display corruption.
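
The buffering behaviour can be reproduced with Python's standard incremental UTF-8 decoder. The sketch below works on raw byte chunks rather than token IDs and is not splintr's implementation; the class name `Utf8StreamBuffer` and its methods are assumptions chosen to resemble the API described here.

```python
import codecs


class Utf8StreamBuffer:
    """Emit text only once complete UTF-8 characters have arrived."""

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def add_bytes(self, chunk: bytes):
        # final=False keeps an incomplete trailing sequence buffered.
        text = self._decoder.decode(chunk, final=False)
        return text or None

    def flush(self) -> str:
        # final=True turns any leftover incomplete sequence into U+FFFD.
        return self._decoder.decode(b"", final=True)


buf = Utf8StreamBuffer()
# "世" arrives one byte at a time, as if split across three tokens.
outputs = [buf.add_bytes(b) for b in (b"\xe4", b"\xb8", b"\x96")]
print(outputs)  # [None, None, '世']
```

Nothing is emitted for the first two bytes, and the full character appears as soon as the third byte completes the sequence.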

Methods:

  • add_token(token_id: int) -> str | None: Add a token and return complete characters, or None if still buffering
  • add_tokens(token_ids: list[int]) -> str | None: Add multiple tokens at once
  • flush() -> str: Flush remaining buffered bytes (incomplete sequences become �)
  • reset(): Clear the buffer and start fresh

Properties:

  • has_pending: bool: Whether there are buffered bytes waiting for completion
  • pending_bytes: int: Number of bytes currently buffered

Rust API

The Rust API provides similar functionality with strongly-typed interfaces. See the API documentation for detailed information.

Streaming Decoder

The streaming decoder is particularly important when working with LLM APIs that stream tokens:

import openai
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
decoder = tokenizer.streaming_decoder()

# Example with the OpenAI streaming API (this call uses the pre-1.0 `openai` package interface)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        # Get token IDs from the API. get_token_ids is a placeholder (pseudo-code):
        # it is not defined here, and the actual API may vary
        token_ids = get_token_ids(chunk)

        for token_id in token_ids:
            if text := decoder.add_token(token_id):
                print(text, end="", flush=True)

# Don't forget to flush at the end
print(decoder.flush())

This approach ensures that:

  1. Users see text as soon as complete characters are available
  2. Multi-byte Unicode characters display correctly
  3. No corruption occurs at token boundaries

Performance

Benchmarks were run on Linux (6.16.8-arch3-1) with 24 CPU cores, comparing splintr to tiktoken (OpenAI's reference tokenizer).

Single Text Encoding

Performance on various text types:

Content Type     | Size          | splintr (ms) | tiktoken (ms) | Speedup
Long English     | 450,000 chars | 7.94         | 19.91         | 2.5x
Python Code      | 59,200 chars  | 1.67         | 5.90          | 3.5x
JSON             | 29,000 chars  | 1.20         | 2.76          | 2.3x
Numbers          | 55,000 chars  | 2.27         | 6.09          | 2.7x
Whitespace-heavy | 50,000 chars  | 1.36         | 4.91          | 3.6x
Chinese          | 11,500 chars  | 1.09         | 1.45          | 1.3x
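
The Speedup column appears to be the tiktoken time divided by the splintr time, rounded to one decimal. The short check below recomputes it from the timings listed above; the variable names are only for illustration.

```python
# (content type, splintr ms, tiktoken ms) copied from the table above
timings = [
    ("Long English", 7.94, 19.91),
    ("Python Code", 1.67, 5.90),
    ("JSON", 1.20, 2.76),
    ("Numbers", 2.27, 6.09),
    ("Whitespace-heavy", 1.36, 4.91),
    ("Chinese", 1.09, 1.45),
]

# speedup = baseline time / splintr time
speedups = {
    name: round(tiktoken_ms / splintr_ms, 1)
    for name, splintr_ms, tiktoken_ms in timings
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

The recomputed values match the published column, with Chinese text showing the smallest gain at 1.3x.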

Batch Encoding

Batch operations show significant speedup through parallelism:

Configuration      | splintr parallel (ms) | tiktoken (ms) | Speedup vs tiktoken
10 × 1,000 chars   | 0.25                  | 0.48          | 1.9x
100 × 1,000 chars  | 1.11                  | 4.66          | 4.2x
1,000 × 100 chars  | 1.42                  | 6.95          | 4.9x
100 × 10,000 chars | 8.24                  | 45.72         | 5.5x

Parallel speedup within splintr:

  • 100 × 1,000 chars: 8.6x faster (parallel vs sequential)
  • 1,000 × 100 chars: 16.8x faster (parallel vs sequential)

Running Benchmarks

To reproduce these benchmarks or test on your own hardware:

# Clone the repository
git clone https://github.com/farhan-syah/splintr.git
cd splintr

# Install dependencies (requires Python 3.8+)
pip install -e .
pip install tiktoken

# Run the benchmark suite
cd benchmarks
python benchmark.py --model cl100k_base --output results/my_benchmark.json

# View results
cat results/my_benchmark.md

The benchmark suite tests:

  • Single text encoding across various content types (English, code, multilingual, etc.)
  • Batch encoding with different batch sizes and text lengths
  • Streaming decoder performance
  • Special token handling

You can customize the benchmark by modifying benchmark.py or adding your own test data in the data/ directory.

Supported Models

Model       | Use Case             | Vocabulary Size | Special Tokens | Import Constant
cl100k_base | GPT-4, GPT-3.5-turbo | ~100,000        | 5              | CL100K_BASE_PATTERN
o200k_base  | GPT-4o               | ~200,000        | 2              | O200K_BASE_PATTERN

Special tokens:

  • cl100k_base: <|endoftext|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, <|endofprompt|>
  • o200k_base: <|endoftext|>, <|endofprompt|>

Use Cases

splintr is designed for:

  • LLM applications: Tokenizing prompts and decoding streamed output for real-time display
  • Training pipelines: Fast batch encoding of large datasets for model training
  • Token counting: Estimating API costs or enforcing token limits
  • Text preprocessing: Converting text to tokens for embedding models or other NLP tasks
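
As a rough illustration of the token-counting use case, the helpers below work with any object exposing encode(text) -> list[int], which matches the Python API described above. The helper names, the `WhitespaceStub` stand-in tokenizer, and the price figure are all hypothetical and exist only so the sketch runs without splintr installed.

```python
from __future__ import annotations


def count_tokens(tokenizer, text: str) -> int:
    """Number of tokens the tokenizer produces for text."""
    return len(tokenizer.encode(text))


def within_limit(tokenizer, text: str, max_tokens: int) -> bool:
    """True if text fits inside a token budget."""
    return count_tokens(tokenizer, text) <= max_tokens


def estimate_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost estimate given a per-1,000-token price (supplied by the caller)."""
    return n_tokens / 1000 * price_per_1k_tokens


class WhitespaceStub:
    """Stand-in tokenizer: one fake token per whitespace-separated word."""

    def encode(self, text: str) -> list[int]:
        return [len(word) for word in text.split()]


stub = WhitespaceStub()
prompt = "Tell me a story"
n = count_tokens(stub, prompt)
print(n, within_limit(stub, prompt, max_tokens=3))  # 4 False
```

With splintr installed, passing `Tokenizer.from_pretrained("cl100k_base")` in place of the stub should give real counts for that encoding.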

Contributing

Contributions are welcome! Here's how you can help:

  1. Report bugs: Open an issue with a minimal reproduction case
  2. Suggest features: Describe your use case and why the feature would be helpful
  3. Submit pull requests:
    • Add tests for new functionality
    • Run cargo test and cargo clippy before submitting
    • Update documentation as needed

Development Setup

# Clone the repository
git clone https://github.com/farhan-syah/splintr.git
cd splintr

# Install pre-commit hook (recommended)
cp hooks/pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

# Build the Rust library
cargo build --release

# Build Python bindings
pip install maturin
maturin develop --release

# Run tests
cargo test                    # Rust tests
cargo clippy --all-targets    # Linting
cargo fmt --all --check       # Format check

The pre-commit hook automatically runs formatting, clippy, and tests before each commit.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

splintr builds upon concepts from:

  • tiktoken - OpenAI's reference BPE tokenizer
  • tokenizers - Hugging Face's tokenization library

The performance optimizations are informed by profiling real-world usage patterns in LLM applications.


