The fastest tokenizer for modern LLMs, up to 20x faster. Drop-in for transformers.AutoTokenizer, byte-exact, never quadratic.

tuetoken

The fastest tokenizer for modern LLMs, up to 20x faster.

tuetoken is a BPE tokenizer with a fast, safe Rust core. It is a drop-in replacement for 🤗 transformers.AutoTokenizer: it loads a model's own tokenizer.json and reproduces its tokenization exactly (special tokens, chat templates, padding/truncation), up to 20x faster. It also loads OpenAI/tiktoken encodings natively. Its merger is O(n), so it stays fast even on adversarial inputs (hashes, base64, minified code) where other tokenizers degrade to O(n²).

from tuetoken import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "Hello"}]                      # example chat history
out = tok.apply_chat_template(messages, add_generation_prompt=True)   # -> {"input_ids", "attention_mask"}

Tokenizer detection is entirely config-driven: tuetoken reads the model's tokenizer.json and never keys on the model name. The same code therefore works across families: Llama, Qwen, Mistral/Mixtral, DeepSeek, Gemma, Phi, GPT-OSS, GLM, Kimi, and more.
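To illustrate what config-driven detection can look like, the sketch below classifies a tokenizer purely from fields in a parsed tokenizer.json. The helper name detect_family and its return labels are hypothetical and not part of tuetoken's API. The field names (model.type, model.byte_fallback, pre_tokenizer, pretokenizers) follow the 🤗 tokenizers JSON layout, and the rules are deliberately simplified.

```python
import json

def _component_types(node):
    """Collect the "type" of a component and of any nested components."""
    if not isinstance(node, dict):
        return set()
    types = {node.get("type")}
    # A "Sequence" pre-tokenizer nests its parts under "pretokenizers".
    for child in node.get("pretokenizers", []):
        types |= _component_types(child)
    return types - {None}

def detect_family(config: dict) -> str:
    """Hypothetical helper: classify a tokenizer from its config alone."""
    model = config.get("model") or {}
    if model.get("type") != "BPE":
        raise ValueError(f"unsupported model type: {model.get('type')!r}")
    pre_types = _component_types(config.get("pre_tokenizer"))
    if "ByteLevel" in pre_types:
        return "bytelevel-bpe"
    if model.get("byte_fallback"):
        return "sentencepiece-bpe"
    # Unknown layouts raise instead of guessing.
    raise ValueError("unrecognized BPE configuration")

config = json.loads("""
{"model": {"type": "BPE", "byte_fallback": false},
 "pre_tokenizer": {"type": "Sequence",
                   "pretokenizers": [{"type": "Split"}, {"type": "ByteLevel"}]}}
""")
print(detect_family(config))   # bytelevel-bpe
```

Note that the model name never enters the decision; two checkpoints with the same configuration are classified identically.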

Install

pip install tuetoken

tuetoken has no required runtime dependencies: the core is the compiled Rust extension. Some features need extra packages: from_pretrained needs huggingface_hub, chat templates need jinja2, and return_tensors= needs numpy or torch (all optional: pip install tuetoken[auto]).
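If you want to check up front which optional features will work in the current environment, a small standard-library probe is enough. This is only a sketch: the OPTIONAL mapping and the missing_optional helper are illustrative names, not something tuetoken provides, and the module names are taken from the list above.

```python
from importlib.util import find_spec

# Feature -> importable module names, per the dependency list above.
OPTIONAL = {
    "from_pretrained": ["huggingface_hub"],
    "chat templates": ["jinja2"],
    "return_tensors='np'": ["numpy"],
    "return_tensors='pt'": ["torch"],
}

def missing_optional(features=OPTIONAL):
    """Return {feature: [missing modules]} for features that cannot run yet."""
    report = {}
    for feature, modules in features.items():
        # find_spec returns None when a top-level module is not installed.
        missing = [m for m in modules if find_spec(m) is None]
        if missing:
            report[feature] = missing
    return report

for feature, modules in missing_optional().items():
    print(f"{feature}: install {', '.join(modules)}")
```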

Build from source (development, or a platform with no prebuilt wheel):

git clone https://github.com/pyybor/tuetoken && cd tuetoken
pip install maturin
maturin develop --release

Drop-in AutoTokenizer

The full 🤗 API, byte-exact with transformers.AutoTokenizer:

from tuetoken import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

texts = ["Hello world", "A longer second example"]   # example inputs
messages = [{"role": "user", "content": "Hello"}]

ids = tok.encode("Hello <|eot_id|> world")           # special-token aware -> list[int]
tok.decode(ids, skip_special_tokens=True)            # -> str
tok(texts, padding=True, truncation=True,            # batch dict: input_ids + attention_mask
    max_length=512, return_tensors="np")             #   (also "pt" for torch)
tok.apply_chat_template(messages, add_generation_prompt=True)  # -> {input_ids, attention_mask}
tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=False)  # -> list[int]
tok.batch_decode(...) ; tok.convert_ids_to_tokens(...) ; tok.tokenize(...)
tok.bos_token_id ; tok.eos_token ; tok.pad_token_id ; tok.vocab_size

This matches transformers.AutoTokenizer token-for-token across byte-level models (Llama, Qwen, DeepSeek, …) and SentencePiece models (Mistral, Phi-3, CodeLlama, …).
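For readers new to the batch call, the following pure-Python sketch shows what padding=True with truncation=True and max_length conceptually produces. It assumes right-side padding and a pad id of 0; real tokenizers take both from their configuration, and pad_and_truncate is an illustrative helper, not a tuetoken function.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max((len(ids) for ids in truncated), default=0)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(out["input_ids"])       # [[5, 6, 7, 8], [3, 4, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```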

OpenAI / tiktoken encodings

tuetoken loads OpenAI's encodings natively, with no tiktoken dependency, and is faster than tiktoken itself, by up to an order of magnitude on long inputs:

from tuetoken import Tokenizer
enc = Tokenizer.from_tiktoken("cl100k_base")   # also "o200k_base", "gpt2", ...
enc.encode_ordinary("Hello world")             # list[int]

Lower-level core

Tokenizer is the raw BPE engine (no special tokens or chat templates; that is what AutoTokenizer is for). Reach for it when you only need fast token ids or counts:

from tuetoken import Tokenizer
enc = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")   # or Tokenizer("tokenizer.json")

texts = ["Hello world", "Hi"]                     # example inputs
ids = enc.encode_ordinary("Hello world")          # list[int]
enc.encode_ordinary_batch(texts, num_threads=0)   # parallel (0 = all cores), GIL released
enc.decode(ids) ; enc.count_tokens("Hi") ; len(enc)

For ML pipelines there are zero-copy numpy paths, a padded training-batch helper, and byte-span offsets:

import numpy as np
# enc is the Tokenizer from the previous snippet
text  = "Hello café"
texts = [text, "Hi"]
arr   = np.frombuffer(enc.encode_to_bytes(text), dtype=np.uint32)   # skip per-token boxing
text  = enc.decode_array(arr)
batch = enc.encode_batch(texts, max_length=512, pad_id=0)           # input_ids + attention_mask
ids, offsets = enc.encode_with_offsets("Hello café")                # byte spans, ByteLevel only
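Byte spans differ from character indices as soon as the text contains non-ASCII characters, which is why the example above uses "café". The sketch below makes the difference concrete with the standard library only. The split into pieces is made up for illustration and is not real tokenizer output, and byte_spans is a hypothetical helper.

```python
def byte_spans(pieces):
    """Return (start, end) UTF-8 byte offsets for consecutive text pieces."""
    spans, pos = [], 0
    for piece in pieces:
        size = len(piece.encode("utf-8"))
        spans.append((pos, pos + size))
        pos += size
    return spans

text = "Hello caf\u00e9"
pieces = ["Hello", " caf", "\u00e9"]     # made-up split, not real tokenizer output
spans = byte_spans(pieces)
print(spans)                             # [(0, 5), (5, 9), (9, 11)]

# Slicing the encoded bytes with the spans recovers each piece.
raw = text.encode("utf-8")
recovered = [raw[start:end].decode("utf-8") for start, end in spans]
print(recovered == pieces)               # True
print(len(text), len(raw))               # 10 11: "é" is one character, two bytes
```

When mapping spans back to the source, slice the UTF-8 bytes rather than the str, otherwise every offset after the first multi-byte character is shifted.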

Coverage

tuetoken works with essentially every modern LLM tokenizer: ByteLevel BPE (Llama, Qwen, DeepSeek, Mistral/Mixtral, GPT-OSS, GLM, Phi, OLMo, Yi, …), SentencePiece (Llama-2, Mistral, Phi-3, CodeLlama, Gemma, …), and OpenAI/tiktoken encodings. We extend coverage constantly; if a tokenizer you need isn't supported yet, please open an issue.

Anything tuetoken can't reproduce exactly fails closed (raises at load) rather than mistokenizing, so you never get silently wrong tokens.

Linear-time on any input

Classic BPE merging is O(n²) per chunk in the worst case, so it slows down sharply on long, poorly merging content (random identifiers, hashes, base64, minified code); this includes tiktoken. tuetoken's merger is O(n), so adversarial inputs that hang other tokenizers for minutes stay in the millisecond range, with byte-identical output.
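To see where the quadratic cost comes from, here is a minimal sketch of the textbook greedy merge loop, which rescans the whole chunk after every merge. It illustrates the classic approach only; it is not tuetoken's merger, whose algorithm is not described here, and the merge table is a toy one.

```python
def naive_bpe(symbols, ranks):
    """Classic greedy BPE: repeatedly merge the adjacent pair with the lowest rank.

    Returns the merged symbols and the number of pair lookups performed.
    """
    symbols = list(symbols)
    lookups = 0
    while len(symbols) > 1:
        best_rank, best_i = None, None
        for i in range(len(symbols) - 1):           # full rescan on every merge
            lookups += 1
            rank = ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break                                    # nothing left to merge
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols, lookups

# Toy merge table (hypothetical): a lower rank merges first.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(naive_bpe("lower", ranks)[0])                  # ['low', 'er']

# Doubling the chunk length roughly quadruples the work.
_, small = naive_bpe("a" * 64, {("a", "a"): 0})
_, large = naive_bpe("a" * 128, {("a", "a"): 0})
print(large / small)
```

Each of the roughly n/2 merges triggers a scan over up to n symbols, which is the n² term the paragraph above refers to.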

Correctness

Every claim above is validated byte-exact against the reference tokenizers (transformers, tokenizers, tiktoken) on a large, adversarial corpus. That is the only reason those libraries appear in the test dependencies; they are not runtime dependencies of tuetoken.
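A parity check of this kind can be sketched in a few lines. The encoders below are stand-ins (one id per UTF-8 byte) so that the example runs without any third-party package; in a real test, reference would be the encode function of a reference tokenizer and candidate the one under test. The name first_mismatch and the tiny corpus are illustrative assumptions, not the project's actual test suite.

```python
def first_mismatch(reference, candidate, corpus):
    """Return (text, expected, got) for the first disagreement, or None."""
    for text in corpus:
        expected, got = reference(text), candidate(text)
        if expected != got:
            return text, expected, got
    return None

# Stand-in encoders (hypothetical): one id per UTF-8 byte.
def reference_encode(text):
    return list(text.encode("utf-8"))

def candidate_encode(text):
    return [byte for byte in text.encode("utf-8")]

# Mix ordinary text with edge cases: non-ASCII, empty, long runs, base64-like noise.
corpus = ["Hello world", "caf\u00e9", "", "a" * 1000, "0f3a9c==\n\t"]
print(first_mismatch(reference_encode, candidate_encode, corpus))   # None
```

Returning the offending text along with both id lists makes a failure directly reproducible.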

Download files

Download the file for your platform.

Source Distribution

  • tuetoken-0.1.1.tar.gz (703.5 kB): Source

Built Distributions

  • tuetoken-0.1.1-cp313-cp313-win_amd64.whl (1.1 MB): CPython 3.13, Windows x86-64
  • tuetoken-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB): CPython 3.13, manylinux (glibc 2.17+) x86-64
  • tuetoken-0.1.1-cp313-cp313-macosx_11_0_arm64.whl (1.0 MB): CPython 3.13, macOS 11.0+ ARM64
  • tuetoken-0.1.1-cp312-cp312-win_amd64.whl (1.1 MB): CPython 3.12, Windows x86-64
  • tuetoken-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB): CPython 3.12, manylinux (glibc 2.17+) x86-64
  • tuetoken-0.1.1-cp312-cp312-macosx_11_0_arm64.whl (1.0 MB): CPython 3.12, macOS 11.0+ ARM64
  • tuetoken-0.1.1-cp311-cp311-win_amd64.whl (1.1 MB): CPython 3.11, Windows x86-64
  • tuetoken-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB): CPython 3.11, manylinux (glibc 2.17+) x86-64
  • tuetoken-0.1.1-cp311-cp311-macosx_11_0_arm64.whl (1.0 MB): CPython 3.11, macOS 11.0+ ARM64
  • tuetoken-0.1.1-cp310-cp310-win_amd64.whl (1.1 MB): CPython 3.10, Windows x86-64
  • tuetoken-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB): CPython 3.10, manylinux (glibc 2.17+) x86-64
  • tuetoken-0.1.1-cp310-cp310-macosx_11_0_arm64.whl (1.0 MB): CPython 3.10, macOS 11.0+ ARM64
