High-performance Python tokenizer backed by IREE

iree-tokenizer

Python bindings for the IREE tokenizer — a high-performance C tokenizer with full HuggingFace tokenizer.json and OpenAI tiktoken compatibility.

  • Fast. 3–12x faster than tiktoken, 10–20x faster than HF tokenizers. Pure C hot path with zero allocations per token.
  • Zero Python dependencies beyond numpy.
  • Small. ~317KiB (compared to 1-3MiB for alternatives).
  • Streaming encode/decode. First-class support for incremental tokenization — feed chunks in, get tokens out. Ideal for LLM inference.
  • Drop-in compatible. Loads any HuggingFace tokenizer.json or OpenAI .tiktoken vocabulary. Supports BPE, WordPiece, and Unigram models.

Based on the IREE high-speed tokenizer library:

  • Optimized for cache utilization. Uses the cache efficiently on both large and small CPUs. With no dependencies and a small footprint, it suits embedded and client targets and is easy to include in other projects.
  • Unique algorithmic optimizations. A pull-based streaming processor with small, bounded, deterministic memory usage, plus various novel optimizations not seen elsewhere.
  • GPU-ready. Designed to be compatible with tiled execution on the GPU, not just the host.
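
To make the "feed chunks in, get tokens out" idea concrete, here is a toy sketch of a pull-based streaming splitter in plain Python. It is not the IREE algorithm and does not use the library: the function name `stream_words` and the whitespace-splitting rule are assumptions chosen only to show how a consumer can pull output while the producer holds nothing but the unfinished tail of the input.

```python
from typing import Iterable, Iterator


def stream_words(chunks: Iterable[str]) -> Iterator[str]:
    """Yield space-separated words as soon as they are complete.

    Toy stand-in for a streaming tokenizer: only the unfinished
    trailing piece is buffered between chunks.
    """
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Everything before the last space is complete; the rest may continue.
        *complete, pending = pending.split(" ")
        for word in complete:
            if word:
                yield word
    if pending:
        # Flush whatever is left once the input ends.
        yield pending


# Chunk boundaries fall in the middle of words, as they would on a network stream.
words = list(stream_words(["Hel", "lo wor", "ld ", "again"]))
print(words)  # ['Hello', 'world', 'again']
```

Because `stream_words` is a generator, nothing is processed until the caller asks for the next item, which is the pull-based shape described above.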

Performance

GPT-2 tokenizer, single-threaded, p50 latency over 50 iterations.

Encode (22K chars → 5000 tokens)
  iree       469 µs    10.6M tok/s
  tiktoken  1251 µs     4.0M tok/s   2.7x slower
  hf        5420 µs     0.9M tok/s  11.6x slower

Decode (5000 tokens → text)
  iree        72 µs
  tiktoken    78 µs                   1.1x slower
  hf         599 µs                   8.3x slower

Batch Encode (100 × 880 chars)
  iree      1942 µs    10.3M tok/s
  tiktoken  5148 µs     3.8M tok/s   2.7x slower
  hf       22022 µs     0.9M tok/s  11.3x slower

Measured on AMD Threadripper 3970X, 128 GB DDR4, Fedora 43, GCC 15.2, Python 3.14.
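
The benchmark harness itself is not shown here, so the following is only a sketch of how a p50 latency and the "x slower" column could be computed. The helper names `p50_latency_us` and `slowdown` are hypothetical; the latencies are the Encode figures from the table above.

```python
import statistics
import time
from typing import Callable


def p50_latency_us(fn: Callable[[], object], iterations: int = 50) -> float:
    """Run fn repeatedly and return the median wall-clock time in microseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def slowdown(latency_us: float, baseline_us: float) -> float:
    """Ratio of a latency to the baseline, rounded to one decimal place."""
    return round(latency_us / baseline_us, 1)


# p50 encode latencies in microseconds, copied from the Encode table.
encode_p50 = {"iree": 469, "tiktoken": 1251, "hf": 5420}
ratios = {name: slowdown(us, encode_p50["iree"]) for name, us in encode_p50.items()}
print(ratios)  # {'iree': 1.0, 'tiktoken': 2.7, 'hf': 11.6}
```

To time a real tokenizer, pass a zero-argument callable such as `lambda: tok.encode(text)` to `p50_latency_us`.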

Quick Start

from iree.tokenizer import Tokenizer, decode_stream_iter

# Load a HuggingFace tokenizer.json (the IDs shown below are for the GPT-2 vocabulary)
tok = Tokenizer.from_file("tokenizer.json")

# Or load a tiktoken vocabulary (token IDs will differ from the GPT-2 ones below)
tiktoken_tok = Tokenizer.from_tiktoken("cl100k_base.tiktoken", encoding="cl100k_base")

# Encode / decode
ids = tok.encode("Hello world")          # [15496, 995]
text = tok.decode(ids)                    # "Hello world"

# Batch
tok.encode_batch(["Hello", "world"])      # [[15496], [995]]

# Numpy (zero-copy)
arr = tok.encode_to_array("Hello world")  # int32 ndarray

# Rich encoding with byte offsets
enc = tok.encode_rich("Hello world", track_offsets=True)
# enc.ids, enc.offsets, enc.type_ids

# Streaming decode (LLM token-at-a-time pattern)
# token_generator is a placeholder for your own source of token IDs,
# e.g. a model's sampling loop; it is not defined in this snippet.
for chunk in decode_stream_iter(tok, token_generator):
    print(chunk, end="", flush=True)
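
One reason token-at-a-time decoding needs a stateful stream is that byte-level vocabularies can split a multi-byte UTF-8 character across two tokens. The sketch below illustrates that problem using only the Python standard library; it does not call iree.tokenizer, and `stream_decode_bytes` and the hand-written byte pieces are assumptions for illustration rather than the library's implementation.

```python
import codecs
from typing import Iterable, Iterator


def stream_decode_bytes(pieces: Iterable[bytes]) -> Iterator[str]:
    """Decode byte pieces incrementally, holding back incomplete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for piece in pieces:
        text = decoder.decode(piece)
        if text:
            # Only emit once at least one full character is available.
            yield text
    # final=True raises if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is two bytes (0xC3 0xA9); here it arrives split across two pieces.
pieces = [b"caf", b"\xc3", b"\xa9", b" ok"]
chunks = list(stream_decode_bytes(pieces))
print(chunks)  # ['caf', 'é', ' ok']
```

Decoding each piece on its own with `bytes.decode("utf-8")` would raise on the lone `0xC3` byte, which is why the decoder keeps state between calls.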

API

Method                                           Returns          Description
Tokenizer.from_file(path)                        Tokenizer        Load from tokenizer.json
Tokenizer.from_str(json)                         Tokenizer        Load from JSON string
Tokenizer.from_buffer(bytes)                     Tokenizer        Load from bytes
Tokenizer.from_tiktoken(path, encoding)          Tokenizer        Load from .tiktoken file
Tokenizer.from_tiktoken_str(data, encoding)      Tokenizer        Load from tiktoken data string
Tokenizer.from_tiktoken_buffer(bytes, encoding)  Tokenizer        Load from tiktoken bytes
tok.encode(text)                                 list[int]        Encode text to token IDs
tok.encode_to_array(text)                        np.ndarray       Encode to numpy int32 array
tok.encode_rich(text)                            Encoding         IDs + byte offsets + type IDs
tok.decode(ids)                                  str              Decode token IDs to text
tok.encode_batch(texts)                          list[list[int]]  Batch encode
tok.decode_batch(id_lists)                       list[str]        Batch decode
tok.encode_stream()                              EncodeStream     Streaming encoder (context manager)
tok.decode_stream()                              DecodeStream     Streaming decoder (context manager)
tok.vocab_size                                   int              Vocabulary size
tok.model_type                                   str              "BPE", "WordPiece", or "Unigram"
tok.token_to_id(token)                           int | None       Look up token ID
tok.id_to_token(id)                              str | None       Look up token text
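
A common sanity check when adopting a tokenizer is an encode/decode round trip. The sketch below relies only on the `encode(text)` and `decode(ids)` shapes listed in the table, so it should work with any object that exposes them. `ByteTokenizer` is a hypothetical stand-in used so the example runs without the package installed, and `failed_round_trips` is an illustrative helper, not part of the library.

```python
from typing import Iterable, Protocol


class TokenizerLike(Protocol):
    """Minimal interface assumed here: the encode/decode pair from the table."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


def failed_round_trips(tok: TokenizerLike, texts: Iterable[str]) -> list[str]:
    """Return the inputs for which decode(encode(text)) differs from text."""
    return [text for text in texts if tok.decode(tok.encode(text)) != text]


failures = failed_round_trips(ByteTokenizer(), ["Hello world", "naïve café", ""])
print(failures)  # []
```

Tokenizers whose pipeline normalizes text (for example lowercasing) are not expected to round-trip exactly, so a non-empty result is not necessarily a bug.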

CLI

A streaming iree-tokenizer-python command is included. It reads from stdin, writes JSONL to stdout, and shows live throughput on stderr.

# Encode text to token IDs (HuggingFace tokenizer.json)
echo "Hello world" | iree-tokenizer-python encode -t tokenizer.json
# {"seq":0,"text":"Hello world","ids":[15496,995],"n_tokens":2,...}

# Encode with a tiktoken vocabulary (same JSONL shape; token IDs depend on the vocabulary)
echo "Hello world" | iree-tokenizer-python encode -t cl100k_base.tiktoken --encoding cl100k_base

# Decode token IDs back to text
echo '[15496, 995]' | iree-tokenizer-python decode -t tokenizer.json
# {"seq":0,"ids":[15496,995],"text":"Hello world","n_tokens":2,...}

# Chain encode → decode (round-trip)
cat corpus.txt | iree-tokenizer-python encode -t tokenizer.json | iree-tokenizer-python decode -t tokenizer.json

# Tokenizer info
iree-tokenizer-python info -t tokenizer.json

Output is chainable: encode output feeds directly into decode and vice versa. Use --compact to omit timing fields, --rich for byte offsets, or --no-progress to suppress the stderr throughput display.
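
Since the output is JSONL, each line can be consumed independently from Python. The following is a minimal sketch, assuming records shaped like the sample output shown above (only the `seq`, `text`, `ids`, and `n_tokens` fields are used; the elided fields are left out of the sample). The helper name `iter_token_ids` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_token_ids(lines: Iterable[str]) -> Iterator[list[int]]:
    """Yield the "ids" list from each non-empty JSONL record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["ids"]


# Sample record with the fields visible in the CLI example output.
sample = ['{"seq":0,"text":"Hello world","ids":[15496,995],"n_tokens":2}']
all_ids = list(iter_token_ids(sample))
print(all_ids)  # [[15496, 995]]
```

In a pipeline, `sys.stdin` can be passed in place of `sample` to process records as they arrive.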

Note that this tool illustrates streaming processing, but the JSON processing overhead is significant and skews throughput. Treat it as an example of how to drive the streaming API, not as a benchmarking tool or one expected to achieve maximum throughput.

License

Apache 2.0 with LLVM Exceptions — see LICENSE.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • iree_tokenizer-0.3.0-cp312-abi3-win_amd64.whl (285.1 kB): CPython 3.12+, Windows x86-64
  • iree_tokenizer-0.3.0-cp312-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (341.3 kB): CPython 3.12+, manylinux (glibc 2.27+, glibc 2.28+), x86-64
  • iree_tokenizer-0.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (345.2 kB): CPython 3.11, manylinux (glibc 2.27+, glibc 2.28+), x86-64
  • iree_tokenizer-0.3.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (345.5 kB): CPython 3.10, manylinux (glibc 2.27+, glibc 2.28+), x86-64