BudTikTok Python Bindings

Ultra-fast HuggingFace-compatible tokenizer with SIMD and multi-core support.

Features

  • 4-20x faster than HuggingFace tokenizers
  • Drop-in replacement for HuggingFace tokenizers API
  • SIMD acceleration (AVX2/AVX-512/NEON)
  • Multi-core parallelism via Rayon (work-stealing thread pool)
  • GIL release during tokenization for true Python parallelism
  • Zero-copy where possible
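
Because the GIL is released during tokenization, ordinary Python threads can tokenize concurrently. The sketch below illustrates the pattern under stated assumptions: `encode_in_threads` and `fake_encode` are hypothetical names, not part of BudTikTok, and the helper accepts any callable that maps a string to a result so it runs without the library installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def encode_in_threads(
    encode: Callable[[str], T],
    texts: Sequence[str],
    max_workers: int = 4,
) -> List[T]:
    """Apply `encode` to each text on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(encode, texts))


# Stand-in encoder (hypothetical) so the sketch runs without BudTikTok.
def fake_encode(text: str) -> List[int]:
    return [len(word) for word in text.split()]


results = encode_in_threads(fake_encode, ["Hello world", "How are you today"])
print(results)  # [[5, 5], [3, 3, 3, 5]]
```

With a loaded tokenizer, the callable could be `lambda t: tokenizer.encode(t, add_special_tokens=True)`. Threads are mainly useful when independent requests arrive concurrently; for one large list of texts, `encode_batch` is already described as parallel.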

Installation

pip install budtiktok

Or build from source:

cd crates/budtiktok-python
maturin develop --release

Usage

Basic Usage

from budtiktok import Tokenizer

# Load from tokenizer.json
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")

# Single encoding
encoding = tokenizer.encode("Hello, world!", add_special_tokens=True)
print(encoding.ids)  # e.g. [101, 7592, 1010, 2088, 999, 102] with a bert-base-uncased vocabulary

# Batch encoding (parallel)
encodings = tokenizer.encode_batch(["Hello", "World"], add_special_tokens=True)
for enc in encodings:
    print(enc.ids)

HuggingFace-Compatible Interface

from budtiktok import Tokenizer

tokenizer = Tokenizer.from_pretrained("path/to/model")

# Just like HuggingFace tokenizers
result = tokenizer(
    ["Hello, world!", "How are you?"],
    max_length=512,
    padding="longest",
    truncation=True,
    return_tensors="np",  # or "pt" for PyTorch
)

print(result["input_ids"].shape)       # (2, max_len)
print(result["attention_mask"].shape)  # (2, max_len)
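
To make the shapes above concrete, here is a minimal pure-Python sketch of what `padding="longest"` conventionally produces: every sequence is padded to the longest one in the batch, and the attention mask marks real tokens with 1 and padding with 0. The helper name `pad_to_longest`, right-side padding, and `pad_id=0` are assumptions for illustration; the token ids are made-up examples, and this is not BudTikTok's internal implementation.

```python
from typing import Dict, List


def pad_to_longest(
    batch_ids: List[List[int]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the mask."""
    max_len = max((len(ids) for ids in batch_ids), default=0)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids only; real values depend on the vocabulary.
padded = pad_to_longest([[101, 7592, 102], [101, 2129, 2024, 2017, 102]])
print(padded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2129, 2024, 2017, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```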

Token Length Estimation (for Token-Budget Batching)

# Fast path for getting just token lengths
texts = ["Hello, world!", "How are you?"]
lengths = tokenizer.get_token_lengths(texts, add_special_tokens=True)
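
Token-budget batching groups texts so that each padded batch stays under a fixed number of tokens. The sketch below shows one simple greedy strategy over a list of lengths such as the one returned by `get_token_lengths`. The function name `batch_by_token_budget` and the cost model (batch size times longest sequence) are assumptions for illustration, not something BudTikTok or LatentBud is known to implement.

```python
from typing import List, Sequence


def batch_by_token_budget(lengths: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Group text indices so each padded batch costs at most max_tokens.

    Cost model (an assumption): batch_size * longest_sequence_in_batch.
    A single text longer than the budget still gets a batch of its own.
    """
    # Visit texts from shortest to longest so similar lengths share a batch.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Sorted ascending, so lengths[idx] would be the longest in the batch.
        padded_cost = lengths[idx] * (len(current) + 1)
        if current and padded_cost > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


example_lengths = [5, 120, 8, 64, 7, 130]
print(batch_by_token_budget(example_lengths, max_tokens=256))
# [[0, 4, 2, 3], [1], [5]]
```

Each inner list holds indices into the original `texts`, so a batch can be materialised with `[texts[i] for i in batch]`.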

Configuration Info

import budtiktok

config = budtiktok.get_config()
print(f"ISA: {config['best_isa']}")
print(f"Physical cores: {config['physical_cores']}")
print(f"SIMD pretokenizer: {config['use_simd_pretokenizer']}")

Performance

Benchmarks on Intel i9-13900K with BERT tokenizer:

Batch Size | HuggingFace | BudTikTok | Speedup
1          | 100 µs      | 5 µs      | 20x
32         | 2000 µs     | 200 µs    | 10x
1024       | 40000 µs    | 4000 µs   | 10x
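
The benchmark methodology is not described here, so the following is only a sketch of how such numbers could be measured locally. `time_call_us` and `speedup` are hypothetical helpers: the first reports the median wall-clock time of a zero-argument callable in microseconds, the second divides baseline time by candidate time. The final lines simply recompute the speedup column from the table's own figures.

```python
import time
from statistics import median
from typing import Callable, List


def time_call_us(fn: Callable[[], object], repeats: int = 20, warmup: int = 3) -> float:
    """Median wall-clock time of fn() in microseconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation before measuring
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return median(samples)


def speedup(baseline_us: float, candidate_us: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


# Recompute the speedup column from the table's timings.
print(speedup(100, 5))       # 20.0
print(speedup(2000, 200))    # 10.0
print(speedup(40000, 4000))  # 10.0
```

To compare the two libraries, one could pass `lambda: tokenizer.encode_batch(texts, add_special_tokens=True)` for each tokenizer and feed the two medians to `speedup`.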

Integration with LatentBud

BudTikTok is designed for seamless integration with LatentBud:

from infinity_emb.inference.optimizations.budtiktok_tokenizer import (
    create_budtiktok_tokenizer,
    BUDTIKTOK_AVAILABLE,
)

# Automatically uses BudTikTok if available, falls back to HF
tokenizer = create_budtiktok_tokenizer(model_path, use_budtiktok=True)
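
The fallback behaviour mentioned in the comment above follows a common optional-dependency pattern. The sketch below shows that pattern in isolation; it is not the actual `create_budtiktok_tokenizer` implementation. `budtiktok_available` and `load_tokenizer` are hypothetical names, and the fallback assumes the HuggingFace `tokenizers` package is installed.

```python
import importlib.util


def budtiktok_available() -> bool:
    """True if the budtiktok package can be imported in this environment."""
    return importlib.util.find_spec("budtiktok") is not None


def load_tokenizer(tokenizer_json: str, prefer_budtiktok: bool = True):
    """Load tokenizer.json with BudTikTok when present, else HuggingFace.

    Hypothetical helper: both libraries expose Tokenizer.from_file,
    so the choice of backend is made at import time.
    """
    if prefer_budtiktok and budtiktok_available():
        from budtiktok import Tokenizer
    else:
        from tokenizers import Tokenizer  # HuggingFace fallback
    return Tokenizer.from_file(tokenizer_json)


print(f"BudTikTok available: {budtiktok_available()}")
```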

Development

Building from Source

Prerequisites:

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin
pip install maturin

Development build:

cd crates/budtiktok-python
maturin develop --release

Build wheel:

maturin build --release
# Output: target/wheels/budtiktok-*.whl

Installation Options

  1. From PyPI (recommended):

    pip install budtiktok
    
  2. From GitHub:

    pip install git+https://github.com/BudEcosystem/budtiktok.git#subdirectory=crates/budtiktok-python
    
  3. From local source:

    git clone https://github.com/BudEcosystem/budtiktok.git
    cd budtiktok/crates/budtiktok-python
    pip install maturin
    maturin develop --release
    

Platform Support

Platform | Architecture            | Status
Linux    | x86_64                  | ✅ Fully supported
Linux    | aarch64                 | ✅ Fully supported
macOS    | x86_64 (Intel)          | ✅ Fully supported
macOS    | aarch64 (Apple Silicon) | ✅ Fully supported
Windows  | x64                     | ✅ Fully supported

Minimum requirements:

  • Python 3.8+
  • glibc 2.28+ (Linux)
  • macOS 10.12+ (Sierra) on Intel; macOS 11.0+ on Apple Silicon
  • Windows 10+

Running Tests

# Run Python tests (if available)
pytest crates/budtiktok-python/tests -v

# Run Rust tests
cargo test -p budtiktok-python

# Quick smoke test
python -c "from budtiktok import Tokenizer, get_config; print(get_config())"

Contributing

Contributions are welcome! Please see the main repository for contribution guidelines.

For releases and CI/CD documentation, see RELEASE.md.

License

Apache-2.0
