Blazingly fast tokenizer — 50x faster, 10x smaller, 100% accurate

tokie

10-136x faster tokenization than HuggingFace tokenizers in the benchmarks below, ~10x smaller tokenizer files (.tkz vs. tokenizer.json), 100% token-accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library built in Rust with Python bindings. Drop-in replacement for HuggingFace tokenizers — supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — callable syntax or .encode()
encoding = tokenizer("Hello, world!")
print(encoding.ids)               # [101, 7592, 1010, 2088, 999, 102]
print(encoding.tokens)            # ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.attention_mask)    # [1, 1, 1, 1, 1, 1]
print(encoding.special_tokens_mask)  # [1, 0, 0, 0, 0, 1]

# Decode
text = tokenizer.decode(encoding.ids)  # "hello , world !"

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Batch encode (parallel across all cores)
encodings = tokenizer.encode_batch(["Hello!", "World"], add_special_tokens=True)

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32
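
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch of the resulting sequence layout. It is not tokie's implementation: the helper name truncate_and_pad is hypothetical, and right-side padding with BERT-style [CLS]=101 / [SEP]=102 ids is an assumption taken from the Quick Start output.

```python
def truncate_and_pad(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Illustrative only: wrap content in [CLS]/[SEP], then fit to max_length.

    Assumes right-side padding and a BERT-style template; both are
    assumptions for this sketch, not documented tokie behavior.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Reserve two slots so the special tokens survive truncation.
    kept = content_ids[: max_length - 2]
    ids = [cls_id] + kept + [sep_id]
    n_pad = max_length - len(ids)
    # Real tokens get 1 in the attention mask, padding gets 0.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


short_ids, short_mask = truncate_and_pad([7592, 2088], max_length=6)
print(short_ids)   # [101, 7592, 2088, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0]

long_ids, long_mask = truncate_and_pad([2129, 2024, 2017, 1029, 1045, 2572], max_length=5)
print(long_ids)    # [101, 2129, 2024, 2017, 102]
```

Both outputs have exactly max_length entries, which is what lets a batch be stacked into a single rectangular tensor.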

Pair Encoding (Cross-Encoders)

pair = tokenizer("How are you?", "I am fine.")  # or tokenizer.encode_pair(...)
print(pair.ids)                # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.type_ids)           # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(pair.special_tokens_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
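
The three lists above follow the BERT-style pair layout [CLS] A [SEP] B [SEP]. The sketch below rebuilds them by hand from the per-sentence ids shown in the output; build_pair is a hypothetical helper for illustration, not part of tokie.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Illustrative BERT-style pair layout: [CLS] A [SEP] B [SEP]."""
    ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # 1 marks template tokens, 0 marks tokens that came from the text.
    special_tokens_mask = [1] + [0] * len(ids_a) + [1] + [0] * len(ids_b) + [1]
    return ids, type_ids, special_tokens_mask


question = [2129, 2024, 2017, 1029]   # "how are you ?"
answer = [1045, 2572, 2986, 1012]     # "i am fine ."
ids, type_ids, special_tokens_mask = build_pair(question, answer)
print(ids)                  # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(type_ids)             # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(special_tokens_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```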

Byte Offsets

enc = tokenizer.encode_with_offsets("Hello world")
for token_id, (start, end) in zip(enc.ids, enc.offsets):
    print(f"  token {token_id}: bytes [{start}:{end}]")
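
Because the offsets are byte positions rather than character positions, they should be applied to the encoded text, not to the Python str. The sketch below shows the difference on non-ASCII input. It assumes UTF-8 encoding, and the spans are hand-written for illustration rather than produced by tokie; byte_span_to_char_span is a hypothetical helper.

```python
text = "café au lait"
data = text.encode("utf-8")  # assumption: offsets index into UTF-8 bytes

# Hand-written byte spans for the three words. "é" takes two bytes in
# UTF-8, so every byte offset after it is one larger than the char index.
spans = [(0, 5), (6, 8), (9, 13)]

# Correct: slice the bytes, then decode.
pieces = [data[start:end].decode("utf-8") for start, end in spans]
print(pieces)  # ['café', 'au', 'lait']


def byte_span_to_char_span(text, start, end):
    """Convert a UTF-8 byte span into str indices (illustrative helper)."""
    data = text.encode("utf-8")
    # Decoding the prefix counts how many characters precede each offset.
    return len(data[:start].decode("utf-8")), len(data[:end].decode("utf-8"))


char_start, char_end = byte_span_to_char_span(text, 9, 13)
print(text[char_start:char_end])  # lait
```

For pure-ASCII text the two kinds of offsets coincide, which is why the mismatch tends to surface only on accented or non-Latin input.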

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5 ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")

Supported Models

Works with any HuggingFace tokenizer — GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, XLM-RoBERTa, and more.

Benchmarks

Model      Text size   tokie      HF tokenizers   Speedup
BERT       900 KB      1.69 ms    229 ms          136x
GPT-2      900 KB      1.70 ms    181 ms          107x
Llama 3    900 KB      2.04 ms    190 ms          93x
Qwen 3     45 KB       0.15 ms    8.18 ms         54x
Gemma 3    45 KB       1.01 ms    9.62 ms         10x

100% token-accurate across all models.
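
The timings in the benchmark table can be turned into throughput and speedup figures with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1000 bytes) and uses the rounded timings as published, so a recomputed ratio can differ by about one from the Speedup column (GPT-2 comes out near 106.5 rather than 107).

```python
# (model, text size in KB, tokie time in ms, HF tokenizers time in ms)
rows = [
    ("BERT", 900, 1.69, 229.0),
    ("GPT-2", 900, 1.70, 181.0),
    ("Llama 3", 900, 2.04, 190.0),
    ("Qwen 3", 45, 0.15, 8.18),
    ("Gemma 3", 45, 1.01, 9.62),
]


def throughput_mb_per_s(size_kb, millis):
    # With decimal units, KB per millisecond equals MB per second.
    return size_kb / millis


def speedup(baseline_ms, candidate_ms):
    # How many times faster the candidate is than the baseline.
    return baseline_ms / candidate_ms


for model, size_kb, tokie_ms, hf_ms in rows:
    print(
        f"{model:8s} {throughput_mb_per_s(size_kb, tokie_ms):7.1f} MB/s "
        f"{speedup(hf_ms, tokie_ms):6.1f}x"
    )
```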

License

MIT OR Apache-2.0

Download files

Source distribution: tokie-0.0.8.tar.gz (135.6 kB)

Built distributions (wheels) for tokie 0.0.8 are available for CPython 3.11, 3.12, and 3.13 on each of these platforms:

Platform                      Wheel tag                                     Size
Windows x86-64                win_amd64                                     2.3 MB
Linux ARM64 (glibc 2.28+)     manylinux_2_28_aarch64                        2.7 MB
Linux x86-64 (glibc 2.17+)    manylinux_2_17_x86_64.manylinux2014_x86_64    2.8 MB
macOS 11.0+ ARM64             macosx_11_0_arm64                             2.5 MB
macOS 10.12+ x86-64           macosx_10_12_x86_64                           2.6 MB
