toklab
Fast bulk tokenizer + token counter for OpenAI BPE encodings. Rust core wrapping tiktoken-rs, Python frontend.
The problem
You need to count tokens for 100k strings to plan a context budget,
truncate inputs, or estimate cost. Calling tiktoken from Python works
fine for one-shot calls, but a Python loop over a long list spends most
of its time on interpreter overhead and per-call setup.
toklab keeps the same encoding (it uses tiktoken-rs, which ships the
exact byte tables from the official tiktoken release) but exposes a bulk
API that releases the GIL and parallelizes across cores.
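One way to check the speedup on your own data is to time a per-item loop against a single bulk call and confirm that both return the same counts. The sketch below is a hypothetical harness, not part of toklab: `compare_loop_and_bulk` is an illustrative name, and the whitespace counter is a stand-in so the snippet runs without toklab installed.

```python
import time
from typing import Callable, Sequence


def compare_loop_and_bulk(
    texts: Sequence[str],
    count_one: Callable[[str], int],
    count_bulk: Callable[[Sequence[str]], Sequence[int]],
) -> tuple[list[int], float, float]:
    """Time a per-item loop against one bulk call and check that they agree."""
    start = time.perf_counter()
    looped = [count_one(t) for t in texts]
    loop_seconds = time.perf_counter() - start

    start = time.perf_counter()
    bulk = list(count_bulk(texts))
    bulk_seconds = time.perf_counter() - start

    if looped != bulk:
        raise ValueError("bulk counts differ from per-item counts")
    return looped, loop_seconds, bulk_seconds


# Stand-in counter (whitespace split), NOT a BPE tokenizer.
def words(text: str) -> int:
    return len(text.split())


counts, loop_s, bulk_s = compare_loop_and_bulk(
    ["hello world", "lorem ipsum dolor"],
    words,
    lambda ts: [words(t) for t in ts],
)
print(counts, f"loop={loop_s:.6f}s", f"bulk={bulk_s:.6f}s")
```

With toklab installed, the same harness could presumably be called as `compare_loop_and_bulk(texts, tok.count, lambda ts: tok.count_many(ts, parallel=True))`, using the methods listed under API.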
Install
pip install toklab
30-second quickstart
from toklab import Tokenizer

tok = Tokenizer.for_model("gpt-4")
print(tok.count("hello world"))  # 2

texts = ["hello", "world", "lorem ipsum"]
print(tok.count_many(texts))  # [1, 1, 4]
print(tok.count_many(texts, parallel=True))  # same, distributed across cores

# Length-budgeting helpers.
print(tok.fits("hello world", budget=5))  # True
print(tok.truncate_to("a long sentence " * 100, budget=20))
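Once per-text counts are available (for example from `count_many`), planning a context budget or estimating cost is plain arithmetic on a list of integers. The following is a minimal sketch under stated assumptions: `pack_by_budget` and `estimate_cost` are hypothetical helpers that toklab does not provide, and the price is a made-up placeholder.

```python
def pack_by_budget(counts: list[int], budget: int) -> list[list[int]]:
    """Group item indices into consecutive batches whose token totals fit the budget."""
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for i, n in enumerate(counts):
        if n > budget:
            raise ValueError(f"item {i} has {n} tokens, over the budget of {budget}")
        # Start a new batch when adding this item would overflow the current one.
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches


def estimate_cost(counts: list[int], *, usd_per_million_tokens: float) -> float:
    """Total token count times a per-million-token price (price is an assumption)."""
    return sum(counts) / 1_000_000 * usd_per_million_tokens


counts = [1, 1, 4]  # the counts shown for the quickstart texts
print(pack_by_budget(counts, budget=5))  # [[0, 1], [2]]
print(estimate_cost(counts, usd_per_million_tokens=10.0))
```

The packing is greedy and order-preserving, so it does not minimize the number of batches; it only guarantees that no batch exceeds the budget.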
API
class Tokenizer:
    @classmethod
    def for_model(cls, model: str) -> "Tokenizer": ...

    # name in {"cl100k_base", "o200k_base"}
    @classmethod
    def for_encoding(cls, name: str) -> "Tokenizer": ...

    def count(self, text: str) -> int: ...
    def count_many(self, texts: list[str], *, parallel: bool = False) -> list[int]: ...

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...

    def fits(self, text: str, *, budget: int) -> bool: ...
    def truncate_to(self, text: str, *, budget: int) -> str: ...
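The budgeting helpers can be understood in terms of `encode` and `decode`. The sketch below shows one plausible reading of their semantics, not toklab's actual implementation: it assumes `fits` means "token count is at most `budget`" and that `truncate_to` keeps the first `budget` tokens. `EncoderLike` and `ByteTokenizer` are hypothetical names, and the byte-level tokenizer is a toy stand-in so the snippet runs without toklab.

```python
from typing import Protocol


class EncoderLike(Protocol):
    """Anything with the encode/decode pair from the API listing."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


def fits(tok: EncoderLike, text: str, *, budget: int) -> bool:
    # Assumption: the budget is inclusive.
    return len(tok.encode(text)) <= budget


def truncate_to(tok: EncoderLike, text: str, *, budget: int) -> str:
    tokens = tok.encode(text)
    if len(tokens) <= budget:
        return text  # already within budget, return unchanged
    # Assumption: keep the leading tokens and decode them back to text.
    return tok.decode(tokens[:budget])


class ByteTokenizer:
    """Toy stand-in: one token per UTF-8 byte. NOT a BPE encoding."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return bytes(tokens).decode("utf-8", errors="replace")


toy = ByteTokenizer()
print(fits(toy, "hello", budget=5))  # True
print(truncate_to(toy, "hello world", budget=5))  # hello
```

Cutting a token sequence at an arbitrary position can split a multi-byte character, both in this toy and with real BPE tokens, so the decoded tail may contain a replacement character. How toklab handles that case is not stated here.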
License
Dual-licensed under MIT or Apache-2.0 at your option.
File details
Details for the file toklab-0.1.1-cp310-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: toklab-0.1.1-cp310-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 4.2 MB
- Tags: CPython 3.10+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fced022058f36f2e0d80591bc31124b12428a5593408b02b52d67b4e92e18263 |
| MD5 | d45316da9d2d89d849e30f13cd85786c |
| BLAKE2b-256 | 10fd8e2d5a1f03e153484cb501fcfa8cc512a169df90f84b799c2defd19ccf46 |