Fast Rust BPE tokenizer with Python bindings
splintr
A high-performance BPE tokenizer implemented in Rust with Python bindings, designed for efficient tokenization of text in machine learning applications, particularly for large language models.
Features
splintr implements several optimizations that make tokenization faster and more efficient:
- PCRE2 with JIT compilation: Uses PCRE2's just-in-time compilation for regex matching, providing 2-4x speedup over fancy-regex on pattern matching operations
- Rayon parallelism: Leverages multiple CPU cores for encoding batches of text and individual regex chunks within each text
- Linked-list BPE algorithm: Implements BPE using a linked-list structure that avoids O(N²) complexity on pathological inputs with many repetitive patterns
- FxHashMap: Uses rustc's FxHasher for faster lookups compared to the default SipHash, trading cryptographic security for speed in non-adversarial contexts
- Aho-Corasick for special tokens: Employs the Aho-Corasick algorithm for fast multi-pattern matching of special tokens, avoiding regex alternation overhead
- LRU cache: Caches frequently encoded text chunks to avoid redundant BPE encoding operations
- UTF-8 streaming decoder: Safely handles token-by-token decoding for LLM output, buffering incomplete UTF-8 sequences across token boundaries
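The linked-list BPE idea from the list above can be sketched in a few lines of Python. This is an illustrative sketch, not splintr's actual Rust implementation: the function name `bpe_merge`, the toy `ranks` table, and the use of a binary heap to find the next merge are all assumptions made for the example. The point is that each merge only relinks two neighbours and re-checks the pairs next to them, so the sequence is never rescanned from the start.

```python
import heapq
from typing import Dict, List


def bpe_merge(piece: bytes, ranks: Dict[bytes, int]) -> List[bytes]:
    """Merge adjacent byte pairs in rank order (lowest rank first, leftmost on ties)."""
    n = len(piece)
    if n == 0:
        return []
    parts = [piece[i:i + 1] for i in range(n)]   # one node per byte
    prev = [i - 1 for i in range(n)]             # index of left neighbour, -1 = none
    nxt = [i + 1 for i in range(n)]              # index of right neighbour, -1 = none
    nxt[-1] = -1
    alive = [True] * n
    heap: list = []

    def push(i: int) -> None:
        """Queue the pair (node i, its right neighbour) if it is a known merge."""
        j = nxt[i]
        if j == -1:
            return
        merged = parts[i] + parts[j]
        rank = ranks.get(merged)
        if rank is not None:
            heapq.heappush(heap, (rank, i, merged))

    for i in range(n - 1):
        push(i)

    while heap:
        _, i, merged = heapq.heappop(heap)
        j = nxt[i] if alive[i] else -1
        # Skip stale entries whose nodes were changed by an earlier merge.
        if j == -1 or parts[i] + parts[j] != merged:
            continue
        # Merge node j into node i and unlink j from the list.
        parts[i] = merged
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prev[nxt[j]] = i
        # Only the two pairs touching the merged node need re-checking.
        if prev[i] != -1:
            push(prev[i])
        push(i)

    out, i = [], 0   # node 0 is never unlinked, so it is always the head
    while i != -1:
        out.append(parts[i])
        i = nxt[i]
    return out


# Toy rank table (hypothetical): "ab" merges first, then "abc".
toy_ranks = {b"ab": 0, b"abc": 1}
result = bpe_merge(b"abcab", toy_ranks)
print(result)
```

A real vocabulary maps every token's bytes to its rank, so the same lookup doubles as the token ID table.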
Installation
Python
pip install splintr-rs
Rust
[dependencies]
splintr = "0.1.0-beta.1"
Quick Start
Python
from splintr import Tokenizer
# Load a pretrained tokenizer
tokenizer = Tokenizer.from_pretrained("cl100k_base")
# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
print(tokens) # [9906, 11, 1917, 0]
# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text) # "Hello, world!"
# Batch encode multiple texts in parallel
texts = ["Hello, world!", "How are you?", "Machine learning is fun!"]
batch_tokens = tokenizer.encode_batch(texts)
print(batch_tokens) # [[9906, 11, 1917, 0], [4438, 527, 499, 30], ...]
Rust
use splintr::{Tokenizer, CL100K_BASE_PATTERN};
use rustc_hash::FxHashMap;
// Load vocabulary and create tokenizer
let encoder = load_tiktoken_bpe_file("cl100k_base.tiktoken")?;
let special_tokens = FxHashMap::default();
let tokenizer = Tokenizer::new(encoder, special_tokens, CL100K_BASE_PATTERN)?;
// Encode text
let tokens = tokenizer.encode("Hello, world!");
println!("{:?}", tokens);
// Decode tokens
let text = tokenizer.decode(&tokens)?;
println!("{}", text);
// Batch encode
let texts = vec!["Hello".to_string(), "World".to_string()];
let batch_tokens = tokenizer.encode_batch(&texts);
API Reference
Python API
Tokenizer
Loading a tokenizer:
# Load a pretrained model (includes vocabulary and special tokens)
tokenizer = Tokenizer.from_pretrained("cl100k_base") # or "o200k_base"
# Load from a custom vocabulary file
tokenizer = Tokenizer(
    vocab_path="path/to/vocab.tiktoken",
    pattern=CL100K_BASE_PATTERN,
    special_tokens={"<|endoftext|>": 100257},
)
Encoding:
- encode(text: str) -> list[int]: Encode text to token IDs, treating special tokens as regular text
- encode_with_special(text: str) -> list[int]: Encode text, recognizing special tokens in the input
- encode_batch(texts: list[str]) -> list[list[int]]: Encode multiple texts in parallel
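The difference between treating special tokens as plain text and recognizing them comes down to whether the input is first partitioned around the special-token strings. The sketch below illustrates that partitioning step only. It is an assumption about the general technique, not splintr's code: `split_on_special` is a hypothetical helper, and it uses a regex alternation from the standard library as a stand-in for the Aho-Corasick matcher splintr is described as using.

```python
import re
from typing import Dict, List, Optional, Tuple

# ID taken from the custom-vocabulary example above.
SPECIAL_TOKENS = {"<|endoftext|>": 100257}


def split_on_special(
    text: str, special_tokens: Dict[str, int]
) -> List[Tuple[str, Optional[int]]]:
    """Split text into (segment, special_id) pairs; special_id is None for ordinary text."""
    if not special_tokens:
        return [(text, None)] if text else []
    # Longest tokens first so a longer special token wins over a shorter prefix of it.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tok) for tok in alternatives))

    segments: List[Tuple[str, Optional[int]]] = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], None))  # ordinary text: goes to BPE
        segments.append((match.group(), special_tokens[match.group()]))  # maps to one ID
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], None))
    return segments


segments = split_on_special("Hi<|endoftext|>Bye", SPECIAL_TOKENS)
print(segments)
```

Ordinary segments would then go through the regular BPE path, while each special segment becomes a single token ID.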
Decoding:
- decode(tokens: list[int]) -> str: Decode token IDs to text (raises error on invalid UTF-8)
- decode_bytes(tokens: list[int]) -> bytes: Decode token IDs to raw bytes
- decode_lossy(tokens: list[int]) -> str: Decode token IDs, replacing invalid UTF-8 with �
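The three decoding modes mirror the ways Python itself can turn bytes into text. The sketch below uses only the standard library to show the distinction on a deliberately truncated byte sequence; the helper names are hypothetical and the bytes stand in for whatever a token sequence would decode to.

```python
# First two bytes of the three-byte UTF-8 encoding of "世": an incomplete character.
truncated = b"ok" + "世".encode("utf-8")[:2]


def decode_strict(data: bytes) -> str:
    """Strict decoding: raises UnicodeDecodeError on invalid or incomplete UTF-8."""
    return data.decode("utf-8")


def decode_lossy(data: bytes) -> str:
    """Lossy decoding: invalid or incomplete sequences become U+FFFD."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

lossy = decode_lossy(truncated)
print(strict_failed, repr(lossy))
```

Raw bytes are the safest choice when the token sequence may end mid-character and more tokens are still to come.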
Properties:
- vocab_size: int: Total vocabulary size including special tokens
- cache_len: int: Number of entries in the LRU cache
Cache management:
clear_cache(): Clear the encoding cache
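To illustrate what an LRU cache over encoded chunks buys, here is a minimal sketch using `functools.lru_cache` from the standard library. The `encode_chunk` function and its byte-level stand-in encoding are assumptions for the example and say nothing about how splintr keys or sizes its cache; `currsize` and `cache_clear()` play roughly the roles of `cache_len` and `clear_cache()` above.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=4096)
def encode_chunk(chunk: str) -> Tuple[int, ...]:
    """Stand-in for an expensive BPE merge: here just the chunk's UTF-8 byte values."""
    return tuple(chunk.encode("utf-8"))


# Repeated chunks (common words, whitespace runs) are served from the cache.
for chunk in ["the", " cat", "the", "the"]:
    encode_chunk(chunk)

info = encode_chunk.cache_info()
print(info.hits, info.misses, info.currsize)

# Dropping all cached entries, e.g. after processing an unusual corpus.
encode_chunk.cache_clear()
size_after_clear = encode_chunk.cache_info().currsize
```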
StreamingDecoder
The streaming decoder is essential for real-time LLM applications where you receive tokens one at a time and need to display text incrementally:
# Create a streaming decoder
decoder = tokenizer.streaming_decoder()
# Process tokens one at a time (typical LLM streaming scenario)
for token_id in token_stream:
    # Returns text only when complete UTF-8 characters are available
    if text := decoder.add_token(token_id):
        print(text, end="", flush=True)
# Flush any remaining buffered bytes at the end
print(decoder.flush())
Why use a streaming decoder?
BPE tokens don't always align with UTF-8 character boundaries. For example, a multi-byte Unicode character like "世" (3 bytes: 0xE4 0xB8 0x96) might be split across multiple tokens. The streaming decoder buffers incomplete byte sequences and only outputs text when complete characters are available, preventing display corruption.
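The buffering behaviour can be reproduced with Python's standard incremental UTF-8 decoder. The sketch below is not splintr's decoder: `Utf8Buffer` is a hypothetical class that works on raw token bytes rather than token IDs, and it only demonstrates the principle of holding back incomplete sequences.

```python
import codecs
from typing import Optional


class Utf8Buffer:
    """Hypothetical stand-in for a streaming decoder, fed with raw token bytes."""

    def __init__(self) -> None:
        # errors="replace" turns invalid or unfinished sequences into U+FFFD.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def add_bytes(self, token_bytes: bytes) -> Optional[str]:
        """Return completed characters, or None while a sequence is still incomplete."""
        text = self._decoder.decode(token_bytes)
        return text or None

    def flush(self) -> str:
        """Emit whatever is left; an unfinished sequence becomes U+FFFD."""
        return self._decoder.decode(b"", final=True)


buf = Utf8Buffer()
first = buf.add_bytes(b"\xe4\xb8")   # first two bytes of "世": nothing to show yet
second = buf.add_bytes(b"\x96")      # final byte arrives: the character is complete
print(first, second)
```

Feeding a lone `b"\xe4"` and then calling `flush()` yields the replacement character, which matches the flush behaviour described in the method list.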
Methods:
- add_token(token_id: int) -> str | None: Add a token and return complete characters, or None if still buffering
- add_tokens(token_ids: list[int]) -> str | None: Add multiple tokens at once
- flush() -> str: Flush remaining buffered bytes (incomplete sequences become �)
- reset(): Clear the buffer and start fresh
Properties:
- has_pending: bool: Whether there are buffered bytes waiting for completion
- pending_bytes: int: Number of bytes currently buffered
Rust API
The Rust API provides similar functionality with strongly-typed interfaces. See the API documentation for detailed information.
Streaming Decoder
The streaming decoder is particularly important when working with LLM APIs that stream tokens:
import openai
from splintr import Tokenizer
tokenizer = Tokenizer.from_pretrained("cl100k_base")
decoder = tokenizer.streaming_decoder()
# Example with the OpenAI streaming API (legacy openai<1.0 interface)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        # Get token IDs from the API (pseudo-code, actual API may vary)
        token_ids = get_token_ids(chunk)
        for token_id in token_ids:
            if text := decoder.add_token(token_id):
                print(text, end="", flush=True)
# Don't forget to flush at the end
print(decoder.flush())
This approach ensures that:
- Users see text as soon as complete characters are available
- Multi-byte Unicode characters display correctly
- No corruption occurs at token boundaries
Performance
Benchmarks were run on Linux (6.16.8-arch3-1) with 24 CPU cores, comparing splintr to tiktoken (OpenAI's reference BPE tokenizer).
Single Text Encoding
Performance on various text types:
| Content Type | Size | splintr (ms) | tiktoken (ms) | Speedup |
|---|---|---|---|---|
| Long English | 450,000 chars | 7.94 | 19.91 | 2.5x |
| Python Code | 59,200 chars | 1.67 | 5.90 | 3.5x |
| JSON | 29,000 chars | 1.20 | 2.76 | 2.3x |
| Numbers | 55,000 chars | 2.27 | 6.09 | 2.7x |
| Whitespace-heavy | 50,000 chars | 1.36 | 4.91 | 3.6x |
| Chinese | 11,500 chars | 1.09 | 1.45 | 1.3x |
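The Speedup column is simply the tiktoken time divided by the splintr time. The short script below recomputes it from the figures in the table and also derives throughput in millions of characters per second. The throughput figure is an added convenience calculation, not a number reported by the benchmark suite.

```python
# (content type, size in chars, splintr ms, tiktoken ms), copied from the table above.
ROWS = [
    ("Long English", 450_000, 7.94, 19.91),
    ("Python Code", 59_200, 1.67, 5.90),
    ("JSON", 29_000, 1.20, 2.76),
    ("Numbers", 55_000, 2.27, 6.09),
    ("Whitespace-heavy", 50_000, 1.36, 4.91),
    ("Chinese", 11_500, 1.09, 1.45),
]


def speedup(splintr_ms: float, tiktoken_ms: float) -> float:
    """How many times faster splintr is than tiktoken for one row."""
    return tiktoken_ms / splintr_ms


def mchars_per_second(chars: int, ms: float) -> float:
    """Throughput in millions of characters per second."""
    return chars / ms / 1000


for name, chars, s_ms, t_ms in ROWS:
    print(f"{name:<18} {speedup(s_ms, t_ms):.1f}x  "
          f"{mchars_per_second(chars, s_ms):.1f} Mchars/s")
```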
Batch Encoding
Batch operations show significant speedup through parallelism:
| Configuration | splintr parallel (ms) | tiktoken (ms) | Speedup vs tiktoken |
|---|---|---|---|
| 10 × 1,000 chars | 0.25 | 0.48 | 1.9x |
| 100 × 1,000 chars | 1.11 | 4.66 | 4.2x |
| 1,000 × 100 chars | 1.42 | 6.95 | 4.9x |
| 100 × 10,000 chars | 8.24 | 45.72 | 5.5x |
Parallel speedup within splintr:
- 100 × 1,000 chars: 8.6x faster (parallel vs sequential)
- 1,000 × 100 chars: 16.8x faster (parallel vs sequential)
Running Benchmarks
To reproduce these benchmarks or test on your own hardware:
# Clone the repository
git clone https://github.com/farhan-syah/splintr.git
cd splintr
# Install dependencies (requires Python 3.8+)
pip install -e .
pip install tiktoken
# Run the benchmark suite
cd benchmarks
python benchmark.py --model cl100k_base --output results/my_benchmark.json
# View results
cat results/my_benchmark.md
The benchmark suite tests:
- Single text encoding across various content types (English, code, multilingual, etc.)
- Batch encoding with different batch sizes and text lengths
- Streaming decoder performance
- Special token handling
You can customize the benchmark by modifying benchmark.py or adding your own test data in the data/ directory.
Supported Models
| Model | Use Case | Vocabulary Size | Special Tokens | Import Constant |
|---|---|---|---|---|
| cl100k_base | GPT-4, GPT-3.5-turbo | ~100,000 | 5 | CL100K_BASE_PATTERN |
| o200k_base | GPT-4o | ~200,000 | 2 | O200K_BASE_PATTERN |
Special tokens:
- cl100k_base: <|endoftext|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, <|endofprompt|>
- o200k_base: <|endoftext|>, <|endofprompt|>
Use Cases
splintr is designed for:
- LLM applications: Tokenizing prompts and streaming decoder for real-time output display
- Training pipelines: Fast batch encoding of large datasets for model training
- Token counting: Estimating API costs or enforcing token limits
- Text preprocessing: Converting text to tokens for embedding models or other NLP tasks
Contributing
Contributions are welcome! Here's how you can help:
- Report bugs: Open an issue with a minimal reproduction case
- Suggest features: Describe your use case and why the feature would be helpful
- Submit pull requests:
  - Add tests for new functionality
  - Run cargo test and cargo clippy before submitting
  - Update documentation as needed
Development Setup
# Clone the repository
git clone https://github.com/farhan-syah/splintr.git
cd splintr
# Install pre-commit hook (recommended)
cp hooks/pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
# Build the Rust library
cargo build --release
# Build Python bindings
pip install maturin
maturin develop --release
# Run tests
cargo test # Rust tests
cargo clippy --all-targets # Linting
cargo fmt --all --check # Format check
The pre-commit hook automatically runs formatting, clippy, and tests before each commit.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
splintr builds upon concepts from:
- tiktoken - OpenAI's reference BPE tokenizer
- tokenizers - Hugging Face's tokenization library
The performance optimizations are informed by profiling real-world usage patterns in LLM applications.