High-performance Python tokenizer backed by IREE
iree-tokenizer
Python bindings for the IREE tokenizer —
a high-performance C tokenizer with full HuggingFace tokenizer.json and
OpenAI tiktoken compatibility.
- Fast. 3–12x faster than tiktoken, 10–20x faster than HF tokenizers. Pure C hot path with zero allocations per token.
- Zero Python dependencies beyond numpy.
- Small. ~317KiB (compared to 1-3MiB for alternatives).
- Streaming encode/decode. First-class support for incremental tokenization — feed chunks in, get tokens out. Ideal for LLM inference.
- Drop-in compatible. Loads any HuggingFace tokenizer.json or OpenAI .tiktoken vocabulary. Supports BPE, WordPiece, and Unigram models.
Based on the IREE high-speed tokenizer library:
- Optimized for cache utilization. Uses cache efficiently on both large and small CPUs. With no dependencies and a small footprint, it is well suited to embedded and client use and to inclusion in other projects.
- Unique algorithmic optimizations. Pull-based streaming processor with small, bounded, deterministic memory usage. Various novel optimizations not seen elsewhere.
- GPU-ready. Designed to be compatible with tiled execution on the GPU, not just the host.
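To make the pull-based streaming idea concrete, here is a minimal Python sketch. It is not IREE's algorithm: whitespace splitting stands in for real tokenization, and the names `pull_tokens` and `max_carry` are hypothetical. It only shows the general shape, where the consumer pulls pieces on demand and the producer keeps nothing between chunks except a bounded carry-over buffer.

```python
from typing import Iterable, Iterator


def pull_tokens(chunks: Iterable[str], max_carry: int = 64) -> Iterator[str]:
    """Yield space-delimited pieces from a stream of text chunks.

    Only the unfinished piece at the end of a chunk is kept between
    iterations, so retained state is bounded by max_carry characters
    regardless of total input size.
    """
    carry = ""
    for chunk in chunks:
        pieces = (carry + chunk).split(" ")
        # The last piece may continue in the next chunk; hold it back.
        carry = pieces.pop()
        for piece in pieces:
            if piece:
                yield piece
        # Enforce the bound: flush an over-long partial piece early.
        while len(carry) > max_carry:
            yield carry[:max_carry]
            carry = carry[max_carry:]
    if carry:
        yield carry


# Chunk boundaries fall mid-word; the output matches the unchunked input.
tokens = list(pull_tokens(["Hel", "lo wor", "ld again"]))
print(tokens)  # ['Hello', 'world', 'again']
```

Because `pull_tokens` is a generator, nothing is read from `chunks` until the caller asks for the next piece, which is what keeps memory use predictable.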
Performance
GPT-2 tokenizer, single-threaded, p50 latency over 50 iterations.
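Encode (22K chars → 5000 tokens)
| Tokenizer | p50 latency | Throughput | Relative to iree |
|---|---|---|---|
| iree | 469 µs | 10.6M tok/s | baseline |
| tiktoken | 1251 µs | 4.0M tok/s | 2.7x slower |
| hf | 5420 µs | 0.9M tok/s | 11.6x slower |

Decode (5000 tokens → text)
| Tokenizer | p50 latency | Relative to iree |
|---|---|---|
| iree | 72 µs | baseline |
| tiktoken | 78 µs | 1.1x slower |
| hf | 599 µs | 8.3x slower |

Batch Encode (100 × 880 chars)
| Tokenizer | p50 latency | Throughput | Relative to iree |
|---|---|---|---|
| iree | 1942 µs | 10.3M tok/s | baseline |
| tiktoken | 5148 µs | 3.8M tok/s | 2.7x slower |
| hf | 22022 µs | 0.9M tok/s | 11.3x slower |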
Measured on AMD Threadripper 3970X, 128 GB DDR4, Fedora 43, GCC 15.2, Python 3.14.
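The "slower" factors are ratios of p50 latencies, and throughput is token count divided by latency. The short check below recomputes them from the figures above; the helper names are illustrative only. The recomputed throughput is approximate because the stated token counts appear to be rounded.

```python
# p50 latencies in microseconds, copied from the figures above.
latency_us = {
    "encode": {"iree": 469, "tiktoken": 1251, "hf": 5420},
    "decode": {"iree": 72, "tiktoken": 78, "hf": 599},
    "batch_encode": {"iree": 1942, "tiktoken": 5148, "hf": 22022},
}


def slowdown(bench: str, name: str) -> float:
    """Latency of `name` relative to iree on the same benchmark."""
    return latency_us[bench][name] / latency_us[bench]["iree"]


def tokens_per_second(n_tokens: int, micros: float) -> float:
    """Throughput implied by a token count and a latency in microseconds."""
    return n_tokens / (micros * 1e-6)


for bench, rows in latency_us.items():
    for name in rows:
        print(f"{bench:13s} {name:9s} {slowdown(bench, name):5.1f}x")

# About 10.66M tok/s for 5000 tokens in 469 µs, close to the reported 10.6M.
iree_encode_rate = tokens_per_second(5000, latency_us["encode"]["iree"])
print(f"iree encode: {iree_encode_rate / 1e6:.2f}M tok/s")
```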
Quick Start
from iree.tokenizer import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
# Or load a tiktoken vocabulary
tok = Tokenizer.from_tiktoken("cl100k_base.tiktoken", encoding="cl100k_base")
# Encode / decode
ids = tok.encode("Hello world") # [15496, 995]
text = tok.decode(ids) # "Hello world"
# Batch
tok.encode_batch(["Hello", "world"]) # [[15496], [995]]
# Numpy (zero-copy)
arr = tok.encode_to_array("Hello world") # int32 ndarray
# Rich encoding with byte offsets
enc = tok.encode_rich("Hello world", track_offsets=True)
# enc.ids, enc.offsets, enc.type_ids
# Streaming decode (LLM token-at-a-time pattern)
from iree.tokenizer import decode_stream_iter
for chunk in decode_stream_iter(tok, token_generator):
    print(chunk, end="", flush=True)
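Streaming decode has to carry state between tokens, because a token boundary can fall in the middle of a multi-byte UTF-8 character. The sketch below illustrates the problem with the standard library's incremental decoder. It is a stand-in for the pattern, not a description of how `decode_stream_iter` is implemented, and the per-token byte strings are made up for the example.

```python
import codecs

# Hypothetical per-token byte pieces: "é" (0xC3 0xA9) is split across two tokens.
token_bytes = [b"caf", b"\xc3", b"\xa9", b" ok"]

decoder = codecs.getincrementaldecoder("utf-8")()
chunks = []
for piece in token_bytes:
    # Returns "" while a character is incomplete, then emits it once whole.
    text = decoder.decode(piece)
    if text:
        chunks.append(text)

# Flush the decoder; this raises if the stream ended mid-character.
tail = decoder.decode(b"", final=True)
if tail:
    chunks.append(tail)

print(chunks)  # ['caf', 'é', ' ok']
```

Decoding each piece independently with `bytes.decode` would fail on the second and third pieces, which is why a streaming decoder yields text chunks rather than exactly one string per token.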
API
| Method | Returns | Description |
|---|---|---|
| Tokenizer.from_file(path) | Tokenizer | Load from tokenizer.json |
| Tokenizer.from_str(json) | Tokenizer | Load from JSON string |
| Tokenizer.from_buffer(bytes) | Tokenizer | Load from bytes |
| Tokenizer.from_tiktoken(path, encoding) | Tokenizer | Load from .tiktoken file |
| Tokenizer.from_tiktoken_str(data, encoding) | Tokenizer | Load from tiktoken data string |
| Tokenizer.from_tiktoken_buffer(bytes, encoding) | Tokenizer | Load from tiktoken bytes |
| tok.encode(text) | list[int] | Encode text to token IDs |
| tok.encode_to_array(text) | np.ndarray | Encode to numpy int32 array |
| tok.encode_rich(text) | Encoding | IDs + byte offsets + type IDs |
| tok.decode(ids) | str | Decode token IDs to text |
| tok.encode_batch(texts) | list[list[int]] | Batch encode |
| tok.decode_batch(id_lists) | list[str] | Batch decode |
| tok.encode_stream() | EncodeStream | Streaming encoder (context manager) |
| tok.decode_stream() | DecodeStream | Streaming decoder (context manager) |
| tok.vocab_size | int | Vocabulary size |
| tok.model_type | str | "BPE", "WordPiece", or "Unigram" |
| tok.token_to_id(token) | int \| None | Look up token ID |
| tok.id_to_token(id) | str \| None | Look up token text |
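Since `encode_rich` reports byte offsets, those offsets index into the UTF-8 encoding of the input rather than into the Python `str`. The sketch below shows the difference on hand-written spans. The `(start, end)` pair layout is an assumption made for illustration; the actual shape of `enc.offsets` should be checked against the library.

```python
text = "naïve café"
data = text.encode("utf-8")

# Hand-written (start, end) byte spans for the two words: hypothetical
# values standing in for what an offset-tracking encoder might report.
byte_spans = [(0, 6), (7, 12)]

# Correct: slice the encoded bytes, then decode each slice.
words = [data[start:end].decode("utf-8") for start, end in byte_spans]
print(words)

# Incorrect: treating byte offsets as character indices drifts after
# any multi-byte character.
wrong = [text[start:end] for start, end in byte_spans]
print(wrong)
```

For pure-ASCII input the two approaches agree, so this kind of bug tends to surface only once non-ASCII text shows up.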
CLI
A streaming iree-tokenizer-python command is included. It reads from stdin, writes
JSONL to stdout, and shows live throughput on stderr.
# Encode text to token IDs (HuggingFace tokenizer.json)
echo "Hello world" | iree-tokenizer-python encode -t tokenizer.json
# Encode with a tiktoken vocabulary
echo "Hello world" | iree-tokenizer-python encode -t cl100k_base.tiktoken --encoding cl100k_base
# {"seq":0,"text":"Hello world","ids":[15496,995],"n_tokens":2,...}
# Decode token IDs back to text
echo '[15496, 995]' | iree-tokenizer-python decode -t tokenizer.json
# {"seq":0,"ids":[15496,995],"text":"Hello world","n_tokens":2,...}
# Chain encode → decode (round-trip)
cat corpus.txt | iree-tokenizer-python encode -t tokenizer.json | iree-tokenizer-python decode -t tokenizer.json
# Tokenizer info
iree-tokenizer-python info -t tokenizer.json
Output is chainable: encode output feeds directly into decode and vice versa.
Use --compact to omit timing fields, --rich for byte offsets, or
--no-progress to suppress the stderr throughput display.
Note that this tool illustrates streaming processing, but the overhead of JSON processing is significant and skews throughput. Treat it as an example of how to drive the streaming API, not as a benchmarking tool or a tool expected to achieve maximum throughput.
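Because the CLI writes one JSON object per line, its output can be consumed with a few lines of Python. The sketch below parses records shaped like the sample output above, using only the `seq`, `text`, `ids`, and `n_tokens` fields shown there. The sample lines are hand-written, and the fields elided in the sample output are not assumed.

```python
import io
import json

# Hand-written stand-in for the CLI's stdout; in practice this would be
# sys.stdin when piping from `iree-tokenizer-python encode`.
sample = io.StringIO(
    '{"seq":0,"text":"Hello world","ids":[15496,995],"n_tokens":2}\n'
    '{"seq":1,"text":"Hello","ids":[15496],"n_tokens":1}\n'
)

total_tokens = 0
records = []
for line in sample:
    line = line.strip()
    if not line:
        continue  # tolerate blank lines
    rec = json.loads(line)
    # Sanity check: the reported count should match the ID list length.
    assert rec["n_tokens"] == len(rec["ids"])
    total_tokens += rec["n_tokens"]
    records.append(rec)

print(f"{len(records)} records, {total_tokens} tokens")  # 2 records, 3 tokens
```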
License
Apache 2.0 with LLVM Exceptions — see LICENSE.