The fastest tokenizer for modern LLMs, up to 20x faster. Drop-in for transformers.AutoTokenizer, byte-exact, never quadratic.

tuetoken

The fastest tokenizer for modern LLMs, up to 20x faster.

tuetoken is a BPE tokenizer with a fast, safe Rust core. It is a drop-in replacement for 🤗 transformers.AutoTokenizer: it loads any model's own tokenizer.json and reproduces tokenization exactly (special tokens, chat templates, padding/truncation), up to 20x faster. It also loads OpenAI/tiktoken encodings natively, and its O(n) merger stays fast even on adversarial inputs (hashes, base64, minified code) where other tokenizers degrade to O(n²).

from tuetoken import AutoTokenizer

messages = [{"role": "user", "content": "Hello!"}]   # example chat messages (role/content dicts)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
out = tok.apply_chat_template(messages, add_generation_prompt=True)   # {"input_ids", "attention_mask"}

Detection is 100% config-driven (the model's tokenizer.json, never its name), so the same code works across families: Llama, Qwen, Mistral/Mixtral, DeepSeek, Gemma, Phi, GPT-OSS, GLM, Kimi, and more.

Install

pip install tuetoken

Build from source (development, or a platform with no prebuilt wheel):

git clone https://github.com/tuetoken-org/tuetoken && cd tuetoken
pip install maturin
maturin develop --release

Performance

[Figure: Performances]

Drop-in AutoTokenizer

The full 🤗 API, byte-exact with transformers.AutoTokenizer:

from tuetoken import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

tok.encode("Hello <|eot_id|> world")                 # special-token aware -> list[int]
tok.decode(ids, skip_special_tokens=True)            # -> str
tok(texts, padding=True, truncation=True,            # batch dict: input_ids + attention_mask
    max_length=512, return_tensors="np")             #   (also "pt" for torch)
tok.apply_chat_template(messages, add_generation_prompt=True)  # -> {input_ids, attention_mask}
tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=False)  # -> list[int]
tok.batch_decode(...) ; tok.convert_ids_to_tokens(...) ; tok.tokenize(...)
tok.bos_token_id ; tok.eos_token ; tok.pad_token_id ; tok.vocab_size

This matches transformers.AutoTokenizer token-for-token across byte-level models (Llama, Qwen, DeepSeek, …) and SentencePiece models (Mistral, Phi-3, CodeLlama, …).
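
To make the batch output concrete, here is a minimal pure-Python sketch of what padding and truncation do to a batch of token id lists. It does not call tuetoken: `pad_batch` is a hypothetical helper written for this illustration, and right-side padding is an assumption (the actual padding side depends on the model's tokenizer configuration).

```python
def pad_batch(batch_ids, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max((len(ids) for ids in truncated), default=0)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths; the first is longer than max_length.
out = pad_batch([[5, 6, 7, 8, 9], [5, 6]], max_length=4, pad_id=0)
```

The attention mask is what lets a model tell padding apart from real tokens, which is why the batch call returns it alongside `input_ids`.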

OpenAI / tiktoken encodings

tuetoken loads OpenAI's encodings natively, with no tiktoken dependency, and is faster than tiktoken itself, by up to an order of magnitude on long inputs:

from tuetoken import Tokenizer
enc = Tokenizer.from_tiktoken("cl100k_base")   # also "o200k_base", "gpt2", ...
enc.encode_ordinary("Hello world")             # list[int]

Lower-level core

Tokenizer is the raw BPE engine (no special tokens or chat templates; that is what AutoTokenizer is for). Reach for it when you only need fast token ids or counts:

from tuetoken import Tokenizer
enc = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")   # or Tokenizer("tokenizer.json")

enc.encode_ordinary("Hello world")                # list[int]
enc.encode_ordinary_batch(texts, num_threads=0)   # parallel (0 = all cores), GIL released
enc.decode(ids) ; enc.count_tokens("Hi") ; len(enc)

For ML pipelines there are zero-copy numpy paths, a padded training-batch helper, and byte-span offsets:

import numpy as np
arr   = np.frombuffer(enc.encode_to_bytes(text), dtype=np.uint32)   # skip per-token boxing
text  = enc.decode_array(arr)
batch = enc.encode_batch(texts, max_length=512, pad_id=0)           # input_ids + attention_mask
ids, offsets = enc.encode_with_offsets("Hello café")                # byte spans, ByteLevel only
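
Byte spans index into the UTF-8 encoding of the text, not into the Python string, which matters as soon as the text contains non-ASCII characters. The sketch below shows the idea in plain Python. It does not call tuetoken: `byte_spans` is a hypothetical helper, and the segmentation into pieces is hand-picked for illustration rather than real tokenizer output.

```python
def byte_spans(pieces):
    """Return (start, end) byte offsets for consecutive token byte strings."""
    spans, pos = [], 0
    for piece in pieces:
        spans.append((pos, pos + len(piece)))
        pos += len(piece)
    return spans


text = "Hello café"
data = text.encode("utf-8")  # one byte longer than the str: "é" takes two bytes

# Hypothetical segmentation, chosen only to illustrate the offsets.
pieces = [b"Hello", b" caf", "é".encode("utf-8")]
spans = byte_spans(pieces)

# Slicing the bytes (not the str) with each span recovers the piece.
recovered = [data[start:end] for start, end in spans]
```

Applying these spans to the `str` instead of the encoded bytes would drift by one position for every multi-byte character that precedes the token.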

Coverage

tuetoken works with essentially every modern LLM tokenizer: ByteLevel BPE (Llama, Qwen, DeepSeek, Mistral/Mixtral, GPT-OSS, GLM, Phi, OLMo, Yi, …), SentencePiece (Llama-2, Mistral, Phi-3, CodeLlama, Gemma, …), and OpenAI/tiktoken encodings. We extend coverage constantly; if a tokenizer you need isn't supported yet, please open an issue.

Anything tuetoken can't reproduce exactly fails closed (raises at load) rather than mistokenizing, so you never get silently wrong tokens.
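
A fail-closed, config-driven check can be sketched as follows. This is not tuetoken's actual logic: `check_config`, `UnsupportedTokenizerError`, and the two allow-lists are assumptions made for this illustration. Only the `model.type` and `pre_tokenizer.type` keys come from the Hugging Face tokenizer.json layout.

```python
import json

# Assumed allow-lists for this sketch; the real supported set is not stated here.
SUPPORTED_MODELS = {"BPE"}
SUPPORTED_PRE_TOKENIZERS = {"ByteLevel", "Metaspace"}


class UnsupportedTokenizerError(ValueError):
    """Raised at load time instead of risking silently wrong tokens."""


def check_config(config):
    """Decide from the config alone (never the model name) whether to load."""
    model_type = (config.get("model") or {}).get("type")
    if model_type not in SUPPORTED_MODELS:
        raise UnsupportedTokenizerError(f"unsupported model type: {model_type!r}")
    pre_type = (config.get("pre_tokenizer") or {}).get("type")
    if pre_type not in SUPPORTED_PRE_TOKENIZERS:
        raise UnsupportedTokenizerError(f"unsupported pre-tokenizer: {pre_type!r}")
    return model_type, pre_type


ok = json.loads('{"model": {"type": "BPE"}, "pre_tokenizer": {"type": "ByteLevel"}}')
bad = json.loads('{"model": {"type": "Unigram"}, "pre_tokenizer": null}')
```

Here `check_config(ok)` returns the detected pair, while `check_config(bad)` raises before any text is tokenized. The point of the design is that an unknown component stops the load instead of falling back to a best guess.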

Linear-time on any input

Classic BPE is O(n²) per chunk in the worst case, so it slows down sharply on long, poorly merging content (random identifiers, hashes, base64, minified code); tiktoken is affected too. tuetoken's merger is O(n), so adversarial inputs that hang other tokenizers for minutes stay in the millisecond range, with byte-identical output.
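
To see where the quadratic cost comes from, the sketch below contrasts a classic merge loop (rescan the whole chunk for the best pair after every merge) with a heap-driven variant that only revisits the neighbours of each merge. Neither function is tuetoken's merger, whose O(n) algorithm is not described here. The heap variant runs in O(n log n) and is shown only to illustrate that the rescans can be avoided without changing the output. All names and the toy `ranks` table are assumptions made for this example.

```python
import heapq


def split_bytes(data):
    """Split a bytes object into single-byte pieces."""
    return [bytes([b]) for b in data]


def bpe_naive(parts, ranks):
    """Classic loop: one O(n) scan per merge, so O(n^2) in the worst case."""
    parts = list(parts)
    while len(parts) > 1:
        best = None
        for i in range(len(parts) - 1):
            r = ranks.get((parts[i], parts[i + 1]))
            if r is not None and (best is None or r < best[0]):
                best = (r, i)  # strict "<" keeps the leftmost lowest-rank pair
        if best is None:
            break
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def bpe_heap(parts, ranks):
    """Same merge order, driven by a heap over a linked list of pieces."""
    tok = list(parts)
    n = len(tok)
    prev = list(range(-1, n - 1))
    nxt = list(range(1, n + 1))
    if n:
        nxt[-1] = -1
    alive = [True] * n
    heap = []

    def push(i):
        j = nxt[i]
        if j != -1:
            r = ranks.get((tok[i], tok[j]))
            if r is not None:
                heapq.heappush(heap, (r, i, tok[i], tok[j]))

    for i in range(n):
        push(i)
    while heap:
        _, i, left, right = heapq.heappop(heap)
        j = nxt[i]
        if not alive[i] or j == -1 or tok[i] != left or tok[j] != right:
            continue  # stale entry: one side was merged away in the meantime
        tok[i] = left + right
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prev[nxt[j]] = i
        if prev[i] != -1:
            push(prev[i])  # the pair to the left of the merge changed
        push(i)            # so did the pair to the right
    return [tok[i] for i in range(n) if alive[i]]


# Toy merge table: lower rank means the pair is merged earlier.
ranks = {(b"a", b"b"): 0, (b"ab", b"c"): 1, (b"b", b"c"): 2}
result = bpe_naive(split_bytes(b"abcabc"), ranks)
```

On inputs where few pairs merge, the naive loop exits early, but on long chunks that merge slowly it pays a full scan for every merge, which is the quadratic behaviour the paragraph above refers to.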

Correctness

Every claim above is validated byte-exact against the reference tokenizers (transformers, tokenizers, tiktoken) on a large, adversarial corpus. That is the only reason those libraries appear in the test dependencies; they are not runtime dependencies of tuetoken.
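
The kind of differential check described above can be sketched as follows. This is not the project's test suite: `adversarial_corpus` and `first_mismatch` are hypothetical helpers, and the three encoders are stand-ins defined here so the example runs without any tokenizer installed.

```python
import base64
import hashlib


def adversarial_corpus(n=4):
    """Poorly merging strings: hex digests and base64 of hash output."""
    corpus = []
    for i in range(n):
        digest = hashlib.sha256(str(i).encode("utf-8")).digest()
        corpus.append(digest.hex())
        corpus.append(base64.b64encode(digest * 8).decode("ascii"))
    return corpus


def first_mismatch(candidate, reference, corpus):
    """Return the first text on which the two encoders disagree, else None."""
    for text in corpus:
        if candidate(text) != reference(text):
            return text
    return None


# Stand-in encoders (assumptions for this sketch, not real tokenizers).
def reference(text):
    return list(text.encode("utf-8"))


def good(text):
    return [b for b in text.encode("utf-8")]


def buggy(text):
    return list(text.encode("utf-8"))[:64]  # silently truncates long inputs


corpus = adversarial_corpus()
```

In a real setup, `candidate` and `reference` would wrap the tokenizer under test and the reference library, and the corpus would be far larger. The short hex digests pass through the buggy encoder unchanged, so only the longer base64 strings expose it, which is why the corpus mixes lengths.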

Download files

Source Distribution

tuetoken-0.1.2.tar.gz (1.1 MB), Source

Built Distributions

tuetoken-0.1.2-cp313-cp313-win_amd64.whl (1.1 MB), CPython 3.13, Windows x86-64
tuetoken-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.13, manylinux: glibc 2.17+ x86-64
tuetoken-0.1.2-cp313-cp313-macosx_11_0_arm64.whl (1.0 MB), CPython 3.13, macOS 11.0+ ARM64
tuetoken-0.1.2-cp312-cp312-win_amd64.whl (1.1 MB), CPython 3.12, Windows x86-64
tuetoken-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.12, manylinux: glibc 2.17+ x86-64
tuetoken-0.1.2-cp312-cp312-macosx_11_0_arm64.whl (1.0 MB), CPython 3.12, macOS 11.0+ ARM64
tuetoken-0.1.2-cp311-cp311-win_amd64.whl (1.1 MB), CPython 3.11, Windows x86-64
tuetoken-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.11, manylinux: glibc 2.17+ x86-64
tuetoken-0.1.2-cp311-cp311-macosx_11_0_arm64.whl (1.0 MB), CPython 3.11, macOS 11.0+ ARM64
tuetoken-0.1.2-cp310-cp310-win_amd64.whl (1.1 MB), CPython 3.10, Windows x86-64
tuetoken-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.10, manylinux: glibc 2.17+ x86-64
tuetoken-0.1.2-cp310-cp310-macosx_11_0_arm64.whl (1.0 MB), CPython 3.10, macOS 11.0+ ARM64
