Blazingly fast tokenizer — 50x faster, 10x smaller, 100% accurate

tokie

10-136x faster tokenization, 10x smaller model files, 100% accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library built in Rust with Python bindings. It is a drop-in replacement for HuggingFace tokenizers and supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — callable syntax or .encode()
encoding = tokenizer("Hello, world!")
print(encoding.ids)               # [101, 7592, 1010, 2088, 999, 102]
print(encoding.tokens)            # ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.attention_mask)    # [1, 1, 1, 1, 1, 1]
print(encoding.special_tokens_mask)  # [1, 0, 0, 0, 0, 1]

# Decode
text = tokenizer.decode(encoding.ids)  # "hello , world !"

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Batch encode (parallel across all cores)
encodings = tokenizer.encode_batch(["Hello!", "World"], add_special_tokens=True)
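
A common use for a fast token counter is grouping texts under a token budget. The sketch below is illustrative only: `pack_by_token_budget` and `whitespace_count` are hypothetical helpers, not part of tokie. The counter is passed in as a plain callable, so with tokie installed one would presumably pass `tokenizer.count_tokens`. A whitespace word counter stands in here so the example runs on its own.

```python
from typing import Callable, Iterable, List


def pack_by_token_budget(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[List[str]]:
    """Greedily group texts so each group's token total stays within max_tokens.

    A single text that is larger than the budget still gets a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Close the current group when this text would push it over the budget.
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append(current)
    return groups


def whitespace_count(text: str) -> int:
    """Stand-in counter (whitespace-separated words, not real tokens)."""
    return len(text.split())


docs = ["a b c", "d e", "f g h i", "j"]
groups = pack_by_token_budget(docs, whitespace_count, max_tokens=5)
print(groups)
```

Taking the counter as a parameter keeps the packing logic independent of any particular tokenizer.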

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32
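
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch that operates on plain id lists. It is not tokie's implementation. It assumes BERT-style `[CLS]`/`[SEP]` ids (101 and 102, as in the examples above), a pad id of 0, right-side truncation, and right-side padding. `truncate_and_pad` is a hypothetical helper.

```python
from typing import List, Tuple

PAD = 0  # assumed pad id; 0 is the fallback used in the example above


def truncate_and_pad(ids: List[int], length: int, pad_id: int = PAD) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), both exactly `length` long.

    Assumes `ids` starts with [CLS] and ends with [SEP]. Truncation drops
    content tokens from the right and keeps both special tokens.
    """
    if length < 2:
        raise ValueError("length must leave room for [CLS] and [SEP]")
    if len(ids) > length:
        # Keep the first length-1 ids, then re-append the trailing [SEP].
        ids = ids[: length - 1] + [ids[-1]]
    mask = [1] * len(ids)
    n_pad = length - len(ids)
    # Padding positions get pad_id and an attention mask of 0.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


hello = [101, 7592, 1010, 2088, 999, 102]  # "Hello, world!" ids from Quick Start

padded_ids, padded_mask = truncate_and_pad(hello, 8)
print(padded_ids)   # [101, 7592, 1010, 2088, 999, 102, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]

short_ids, short_mask = truncate_and_pad(hello, 4)
print(short_ids)    # [101, 7592, 1010, 102]
print(short_mask)   # [1, 1, 1, 1]
```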

Pair Encoding (Cross-Encoders)

pair = tokenizer("How are you?", "I am fine.")  # or tokenizer.encode_pair(...)
print(pair.ids)                # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.type_ids)           # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(pair.special_tokens_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
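
The three lists above follow the usual BERT pair layout, `[CLS] A [SEP] B [SEP]`. The sketch below rebuilds them from the two segments' content ids to show where each value comes from. `build_pair` is a hypothetical helper written for illustration, not a tokie function, and the layout is assumed from the output shown above.

```python
from typing import Dict, List

CLS, SEP = 101, 102  # BERT special-token ids, as they appear in the output above


def build_pair(a: List[int], b: List[int]) -> Dict[str, List[int]]:
    """Assemble ids, type_ids and special_tokens_mask for [CLS] A [SEP] B [SEP]."""
    ids = [CLS] + a + [SEP] + b + [SEP]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the final [SEP].
    type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    # 1 marks special tokens, 0 marks content tokens.
    special = [1] + [0] * len(a) + [1] + [0] * len(b) + [1]
    return {"ids": ids, "type_ids": type_ids, "special_tokens_mask": special}


question = [2129, 2024, 2017, 1029]  # "How are you?" content ids from the example
answer = [1045, 2572, 2986, 1012]    # "I am fine." content ids from the example

pair_sketch = build_pair(question, answer)
print(pair_sketch["ids"])
print(pair_sketch["type_ids"])
print(pair_sketch["special_tokens_mask"])
```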

Byte Offsets

enc = tokenizer.encode_with_offsets("Hello world")
for token_id, (start, end) in zip(enc.ids, enc.offsets):
    print(f"  token {token_id}: bytes [{start}:{end}]")
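
Because the offsets are byte positions, they should be applied to the UTF-8 encoded text rather than to the Python `str`, which is indexed by code point. The sketch below shows the difference on non-ASCII input. It does not call tokie: `byte_spans` is a hypothetical stand-in that returns byte ranges for space-separated words.

```python
from typing import List, Tuple


def byte_spans(text: str) -> List[Tuple[int, int]]:
    """Byte (start, end) ranges of words separated by single spaces."""
    spans = []
    pos = 0
    for word in text.split(" "):
        n = len(word.encode("utf-8"))
        spans.append((pos, pos + n))
        pos += n + 1  # skip the one-byte space separator
    return spans


text = "naïve café"
data = text.encode("utf-8")
spans = byte_spans(text)

print(len(text), len(data))  # 10 code points, 12 bytes
for start, end in spans:
    # Slice the bytes, then decode: correct for byte offsets.
    print((start, end), data[start:end].decode("utf-8"))

# Slicing the str with byte offsets gives the wrong substring.
start, end = spans[1]
print(repr(text[start:end]))  # 'afé', not 'café'
```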

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")

Supported Models

Works with any HuggingFace tokenizer — GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, XLM-RoBERTa, and more.

Benchmarks

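| Model   | Text size | tokie   | HF tokenizers | Speedup |
|---------|-----------|---------|---------------|---------|
| BERT    | 900 KB    | 1.69 ms | 229 ms        | 136x    |
| GPT-2   | 900 KB    | 1.70 ms | 181 ms        | 107x    |
| Llama 3 | 900 KB    | 2.04 ms | 190 ms        | 93x     |
| Qwen 3  | 45 KB     | 0.15 ms | 8.18 ms       | 54x     |
| Gemma 3 | 45 KB     | 1.01 ms | 9.62 ms       | 10x     |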

100% token-accurate across all models.
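
The speedup column is the ratio of the two timings. As a quick sanity check, the sketch below recomputes it from the rounded times in the benchmark figures above. The recomputed ratios can differ from the reported values by up to about 1x, presumably because the reported speedups were derived from unrounded measurements.

```python
# (model, tokie ms, HF tokenizers ms, reported speedup), copied from the benchmark figures
rows = [
    ("BERT", 1.69, 229.0, 136),
    ("GPT-2", 1.70, 181.0, 107),
    ("Llama 3", 2.04, 190.0, 93),
    ("Qwen 3", 0.15, 8.18, 54),
    ("Gemma 3", 1.01, 9.62, 10),
]

# Speedup = baseline time / tokie time
ratios = {model: hf_ms / tokie_ms for model, tokie_ms, hf_ms, _ in rows}

for model, tokie_ms, hf_ms, reported in rows:
    print(f"{model:8s} {ratios[model]:6.1f}x (reported {reported}x)")
```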

License

MIT OR Apache-2.0

Download files

Source distribution: tokie-0.0.9.tar.gz (135.3 kB)

Built distributions (tokie 0.0.9 wheels for CPython 3.10 to 3.13; sizes per wheel):

| Platform                  | Wheel platform tag                         | cp310  | cp311  | cp312  | cp313  |
|---------------------------|--------------------------------------------|--------|--------|--------|--------|
| Windows x86-64            | win_amd64                                  | 2.3 MB | 2.3 MB | 2.3 MB | 2.3 MB |
| Linux ARM64, glibc 2.28+  | manylinux_2_28_aarch64                     | 2.7 MB | 2.7 MB | 2.7 MB | 2.7 MB |
| Linux x86-64, glibc 2.17+ | manylinux_2_17_x86_64.manylinux2014_x86_64 | 2.8 MB | 2.8 MB | 2.8 MB | 2.8 MB |
| macOS 11.0+ ARM64         | macosx_11_0_arm64                          | 2.5 MB | 2.5 MB | 2.5 MB | 2.5 MB |
| macOS 10.12+ x86-64       | macosx_10_12_x86_64                        | 2.6 MB | 2.6 MB | 2.5 MB | 2.5 MB |
