BudTikTok Python Bindings
Ultra-fast HuggingFace-compatible tokenizer with SIMD and multi-core support.
Features
- 4-20x faster than HuggingFace tokenizers (the speedup varies with batch size; see Performance)
- Drop-in replacement for HuggingFace tokenizers API
- SIMD acceleration (AVX2/AVX-512/NEON)
- Multi-core parallelism via Rayon (work-stealing thread pool)
- GIL release during tokenization for true Python parallelism
- Zero-copy where possible
Installation
pip install budtiktok
Or build from source:
cd crates/budtiktok-python
maturin develop --release
Usage
Basic Usage
from budtiktok import Tokenizer
# Load from tokenizer.json
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")
# Single encoding
encoding = tokenizer.encode("Hello, world!", add_special_tokens=True)
print(encoding.ids)  # e.g. [101, 7592, 1010, 2088, 999, 102] with a bert-base-uncased vocabulary
# Batch encoding (parallel)
encodings = tokenizer.encode_batch(["Hello", "World"], add_special_tokens=True)
for enc in encodings:
    print(enc.ids)
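Because the GIL is released during tokenization, several Python threads can tokenize at the same time. The following is a minimal sketch of that pattern, not part of the BudTikTok API: `encode_in_threads` is a hypothetical helper, and `whitespace_encode_batch` is a stand-in encoder so the example runs without the package installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def encode_in_threads(
    encode_batch: Callable[[List[str]], list],
    texts: Sequence[str],
    num_threads: int = 4,
) -> list:
    """Split texts into contiguous chunks and encode each chunk on its own thread."""
    if not texts:
        return []
    num_threads = max(1, min(num_threads, len(texts)))
    chunk_size = -(-len(texts) // num_threads)  # ceiling division
    chunks = [list(texts[i:i + chunk_size]) for i in range(0, len(texts), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so results line up with texts.
        results = pool.map(encode_batch, chunks)
        return [enc for chunk_result in results for enc in chunk_result]


def whitespace_encode_batch(batch: List[str]) -> list:
    """Stand-in encoder (assumption): splits on whitespace instead of tokenizing."""
    return [text.split() for text in batch]


encoded = encode_in_threads(whitespace_encode_batch, ["a b", "c", "d e f"], num_threads=2)
print(encoded)
```

With a real tokenizer, the callable would be something like `lambda batch: tokenizer.encode_batch(batch, add_special_tokens=True)`. Since `encode_batch` already parallelizes internally, extra threads are most likely to help when independent callers (for example, request handlers) tokenize concurrently; measure before relying on it.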
HuggingFace-Compatible Interface
from budtiktok import Tokenizer
tokenizer = Tokenizer.from_pretrained("path/to/model")
# Just like HuggingFace tokenizers
result = tokenizer(
    ["Hello, world!", "How are you?"],
    max_length=512,
    padding="longest",
    truncation=True,
    return_tensors="np",  # or "pt" for PyTorch
)
print(result["input_ids"].shape) # (2, max_len)
print(result["attention_mask"].shape) # (2, max_len)
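To make the `(2, max_len)` shapes concrete, here is a pure-Python sketch of what `padding="longest"` conventionally produces. `pad_to_longest` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is an assumption; it illustrates the layout only, not how BudTikTok pads internally.

```python
from typing import List, Tuple


def pad_to_longest(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask


# Made-up token IDs, for illustration only.
ids, mask = pad_to_longest([[101, 7, 8, 102], [101, 9, 102]])
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```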
Token Length Estimation (for Token-Budget Batching)
# Fast path for getting just token lengths
texts = ["Hello, world!", "How are you?"]
lengths = tokenizer.get_token_lengths(texts, add_special_tokens=True)
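Once the lengths are known, texts can be grouped so that each padded batch stays within a token budget. The sketch below shows one common greedy approach; `batch_by_token_budget` and `max_tokens` are hypothetical names, and the lengths are hard-coded so the example runs without a tokenizer.

```python
from typing import List


def batch_by_token_budget(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Group text indices so that (batch size * longest text) stays within max_tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Lengths are visited in ascending order, so the newest item is the longest.
        padded_cost = (len(current) + 1) * lengths[idx]
        if current and padded_cost > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


# In practice, lengths would come from tokenizer.get_token_lengths(texts, ...).
batches = batch_by_token_budget([5, 120, 7, 64, 6], max_tokens=128)
print(batches)  # [[0, 4, 2], [3], [1]]
```

Sorting by length first keeps similar-sized texts together, which reduces padding waste. A text longer than the budget still gets a batch of its own, so truncation has to be handled separately.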
Configuration Info
import budtiktok
config = budtiktok.get_config()
print(f"ISA: {config['best_isa']}")
print(f"Physical cores: {config['physical_cores']}")
print(f"SIMD pretokenizer: {config['use_simd_pretokenizer']}")
Performance
Benchmarks on an Intel i9-13900K with a BERT tokenizer (time to tokenize one batch):
| Batch Size | HuggingFace | BudTikTok | Speedup |
|---|---|---|---|
| 1 | 100 µs | 5 µs | 20x |
| 32 | 2000 µs | 200 µs | 10x |
| 1024 | 40000 µs | 4000 µs | 10x |
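Numbers like these depend on hardware and input text, so it is worth measuring on your own data. The following is a minimal timing sketch, not the harness used for the table above; `median_latency_us` and `speedup` are hypothetical helpers, and the whitespace splitter is a stand-in workload so the example runs without either tokenizer installed.

```python
import statistics
import time
from typing import Callable, List


def median_latency_us(
    fn: Callable[[List[str]], object],
    batch: List[str],
    repeats: int = 20,
    warmup: int = 3,
) -> float:
    """Return the median wall-clock time of fn(batch) in microseconds."""
    for _ in range(warmup):
        fn(batch)  # warm caches before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def speedup(baseline_us: float, candidate_us: float) -> float:
    """Ratio of baseline latency to candidate latency."""
    return baseline_us / candidate_us


# Stand-in workload (assumption): replace with the tokenizer call to measure.
latency = median_latency_us(lambda b: [t.split() for t in b], ["Hello, world!"] * 32)
print(f"median latency: {latency:.1f} µs")

# The Speedup column follows from the two latency columns:
print(speedup(100.0, 5.0))     # 20.0 (batch size 1)
print(speedup(2000.0, 200.0))  # 10.0 (batch size 32)
```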
Integration with LatentBud
BudTikTok is designed for seamless integration with LatentBud:
from infinity_emb.inference.optimizations.budtiktok_tokenizer import (
    create_budtiktok_tokenizer,
    BUDTIKTOK_AVAILABLE,
)
# Automatically uses BudTikTok if available, falls back to HF
tokenizer = create_budtiktok_tokenizer(model_path, use_budtiktok=True)
Development
Building from Source
Prerequisites:
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install maturin
pip install maturin
Development build:
cd crates/budtiktok-python
maturin develop --release
Build wheel:
maturin build --release
# Output: target/wheels/budtiktok-*.whl
Installation Options
- From PyPI (recommended):
  pip install budtiktok
- From GitHub:
  pip install git+https://github.com/BudEcosystem/budtiktok.git#subdirectory=crates/budtiktok-python
- From local source:
  git clone https://github.com/BudEcosystem/budtiktok.git
  cd budtiktok/crates/budtiktok-python
  pip install maturin
  maturin develop --release
Platform Support
| Platform | Architecture | Status |
|---|---|---|
| Linux | x86_64 | ✅ Fully supported |
| Linux | aarch64 | ✅ Fully supported |
| macOS | x86_64 (Intel) | ✅ Fully supported |
| macOS | aarch64 (Apple Silicon) | ✅ Fully supported |
| Windows | x64 | ✅ Fully supported |
Minimum requirements:
- Python 3.8+
- glibc 2.28+ (Linux)
- macOS 10.12+ (Sierra) on Intel; macOS 11.0+ (Big Sur) on Apple Silicon
- Windows 10+
Running Tests
# Run Python tests (if available)
pytest crates/budtiktok-python/tests -v
# Run Rust tests
cargo test -p budtiktok-python
# Quick smoke test
python -c "from budtiktok import Tokenizer, get_config; print(get_config())"
Contributing
Contributions are welcome! Please see the main repository for contribution guidelines.
For releases and CI/CD documentation, see RELEASE.md.
License
Apache-2.0