Blazingly fast tokenizer — 50x faster, 10x smaller, 100% accurate

tokie

10-136x faster tokenization than HuggingFace tokenizers, ~10x smaller model files than tokenizer.json, 100% token-accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library built in Rust with Python bindings. It is a drop-in replacement for HuggingFace tokenizers and supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — callable syntax or .encode()
encoding = tokenizer("Hello, world!")
print(encoding.ids)               # [101, 7592, 1010, 2088, 999, 102]
print(encoding.tokens)            # ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.attention_mask)    # [1, 1, 1, 1, 1, 1]
print(encoding.special_tokens_mask)  # [1, 0, 0, 0, 0, 1]

# Decode
text = tokenizer.decode(encoding.ids)  # "hello , world !"

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Batch encode (parallel across all cores)
encodings = tokenizer.encode_batch(["Hello!", "World"], add_special_tokens=True)
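
Because count_tokens skips building an Encoding, it is a natural fit for budget checks before batching. The following is a minimal sketch, assuming only that a counting callable such as tokenizer.count_tokens takes a string and returns an int. The helper pack_by_token_budget and the whitespace counter in the demo are hypothetical stand-ins written for illustration, not part of tokie.

```python
from typing import Callable, Iterable, List


def pack_by_token_budget(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[List[str]]:
    """Greedily group texts so each group's total token count stays within budget.

    A text that alone exceeds the budget is placed in a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Start a new group when adding this text would overflow the budget.
        if current and used + n > budget:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append(current)
    return groups


# Stand-in counter for the demo; with tokie you would pass tokenizer.count_tokens.
def whitespace_count(text: str) -> int:
    return len(text.split())


groups = pack_by_token_budget(["a b c", "d e", "f", "g h i j"], whitespace_count, budget=5)
print(groups)  # [['a b c', 'd e'], ['f', 'g h i j']]
```

The grouping is greedy and order-preserving, which keeps it simple; it does not try to find the tightest possible packing.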

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32
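
To make the effect of enable_truncation and enable_padding concrete, here is a pure-Python sketch of what fixed-length truncation and padding do to one sequence of ids. It is an illustration under stated assumptions (BERT-style [CLS]=101 and [SEP]=102 wrapping, right-side padding, pad id 0), not tokie's implementation, and fit_to_length is a hypothetical helper.

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # BERT-style ids, assumed for illustration


def fit_to_length(content_ids: List[int], length: int) -> Tuple[List[int], List[int]]:
    """Wrap content in [CLS]/[SEP], truncate content if needed, then right-pad.

    Returns (ids, attention_mask), both exactly `length` long.
    """
    if length < 2:
        raise ValueError("length must leave room for [CLS] and [SEP]")
    # Truncate the content only, so the special tokens are preserved.
    kept = content_ids[: length - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    mask = [1] * len(ids)
    # Pad on the right; padded positions get attention_mask 0.
    pad = length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


short_ids, short_mask = fit_to_length([7592, 2088], length=6)
long_ids, long_mask = fit_to_length(list(range(1000, 1010)), length=6)
print(short_ids, short_mask)  # [101, 7592, 2088, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1000, 1001, 1002, 1003, 102] [1, 1, 1, 1, 1, 1]
```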

Pair Encoding (Cross-Encoders)

pair = tokenizer("How are you?", "I am fine.")  # or tokenizer.encode_pair(...)
print(pair.ids)                # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.type_ids)           # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(pair.special_tokens_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
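
The three lists above follow the BERT-style pair layout [CLS] A [SEP] B [SEP]. The sketch below rebuilds type_ids and special_tokens_mask from the two segments' ids to show where each value comes from. build_pair is a hypothetical helper written for illustration, not a tokie function, and the ids are copied from the output above.

```python
from typing import List, Tuple

CLS_ID, SEP_ID = 101, 102  # BERT special token ids, as seen in the output above


def build_pair(a: List[int], b: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Assemble [CLS] a [SEP] b [SEP] with matching type ids and special-token mask."""
    ids = [CLS_ID] + a + [SEP_ID] + b + [SEP_ID]
    # Segment 0 covers [CLS], a, and the first [SEP]; segment 1 covers b and the final [SEP].
    type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    # 1 marks a special token, 0 marks a token that came from the input text.
    special = [1] + [0] * len(a) + [1] + [0] * len(b) + [1]
    return ids, type_ids, special


ids, type_ids, special = build_pair([2129, 2024, 2017, 1029], [1045, 2572, 2986, 1012])
print(ids)       # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(special)   # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```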

Byte Offsets

enc = tokenizer.encode_with_offsets("Hello world")
for token_id, (start, end) in zip(enc.ids, enc.offsets):
    print(f"  token {token_id}: bytes [{start}:{end}]")
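
Since the offsets are byte positions rather than character positions, slicing the Python str directly can give wrong results for non-ASCII text; slice the UTF-8 bytes instead. A minimal sketch, assuming each offset is a (start, end) pair into the UTF-8 encoding of the input with an exclusive end. spans_from_byte_offsets is a hypothetical helper, and the offsets in the demo are written by hand rather than produced by tokie.

```python
from typing import List, Tuple


def spans_from_byte_offsets(text: str, offsets: List[Tuple[int, int]]) -> List[str]:
    """Recover the substring for each (start, end) byte range of the UTF-8 text."""
    data = text.encode("utf-8")
    return [data[start:end].decode("utf-8") for start, end in offsets]


text = "héllo wörld"
offsets = [(0, 6), (7, 13)]  # hand-written byte ranges; "é" and "ö" take 2 bytes each
print(spans_from_byte_offsets(text, offsets))  # ['héllo', 'wörld']
print(repr(text[0:6]))  # 'héllo ' -- character slicing with the same numbers drifts
```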

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")
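
To check the size claim on your own model, you can compare the two files on disk. This is a small sketch using only the standard library; size_ratio is a hypothetical helper, and the file names in the usage comment assume you have both a tokenizer.json and a saved model.tkz locally.

```python
import os


def size_ratio(larger_path: str, smaller_path: str) -> float:
    """Return how many times bigger the first file is than the second."""
    smaller = os.path.getsize(smaller_path)
    if smaller == 0:
        raise ValueError(f"{smaller_path} is empty")
    return os.path.getsize(larger_path) / smaller


# Usage (paths are assumptions; adjust to where your files live):
# print(f"{size_ratio('tokenizer.json', 'model.tkz'):.1f}x smaller")
```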

Supported Models

Works with any HuggingFace tokenizer — GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, XLM-RoBERTa, and more.

Benchmarks

Model     Text size   tokie     HF tokenizers   Speedup
BERT      900 KB      1.69 ms   229 ms          136x
GPT-2     900 KB      1.70 ms   181 ms          107x
Llama 3   900 KB      2.04 ms   190 ms          93x
Qwen 3    45 KB       0.15 ms   8.18 ms         54x
Gemma 3   45 KB       1.01 ms   9.62 ms         10x

100% token-accurate across all models.
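
The Speedup column is the ratio of the two timings. Recomputing it from the rounded times in the table reproduces the reported figures to within 1x; the small differences presumably come from the times being rounded for display (an assumption, since the raw measurements are not given here).

```python
# (model, tokie ms, HF tokenizers ms, reported speedup) copied from the table above
rows = [
    ("BERT", 1.69, 229.0, 136),
    ("GPT-2", 1.70, 181.0, 107),
    ("Llama 3", 2.04, 190.0, 93),
    ("Qwen 3", 0.15, 8.18, 54),
    ("Gemma 3", 1.01, 9.62, 10),
]


def speedup(candidate_ms: float, baseline_ms: float) -> float:
    """Speedup of the candidate over the baseline: baseline time / candidate time."""
    return baseline_ms / candidate_ms


recomputed = {model: speedup(tokie_ms, hf_ms) for model, tokie_ms, hf_ms, _ in rows}
for model, tokie_ms, hf_ms, reported in rows:
    print(f"{model:8s} recomputed {recomputed[model]:6.1f}x, reported {reported}x")
```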

License

MIT OR Apache-2.0

Download files

Download the file for your platform.

Source Distribution

tokie-0.0.6.tar.gz (140.8 kB): Source

Built Distributions

tokie-0.0.6-cp313-cp313-win_amd64.whl (1.6 MB): CPython 3.13, Windows x86-64
tokie-0.0.6-cp313-cp313-manylinux_2_28_aarch64.whl (1.9 MB): CPython 3.13, manylinux: glibc 2.28+ ARM64
tokie-0.0.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.13, manylinux: glibc 2.17+ x86-64
tokie-0.0.6-cp313-cp313-macosx_11_0_arm64.whl (1.8 MB): CPython 3.13, macOS 11.0+ ARM64
tokie-0.0.6-cp313-cp313-macosx_10_12_x86_64.whl (1.8 MB): CPython 3.13, macOS 10.12+ x86-64
tokie-0.0.6-cp312-cp312-win_amd64.whl (1.6 MB): CPython 3.12, Windows x86-64
tokie-0.0.6-cp312-cp312-manylinux_2_28_aarch64.whl (1.9 MB): CPython 3.12, manylinux: glibc 2.28+ ARM64
tokie-0.0.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.12, manylinux: glibc 2.17+ x86-64
tokie-0.0.6-cp312-cp312-macosx_11_0_arm64.whl (1.8 MB): CPython 3.12, macOS 11.0+ ARM64
tokie-0.0.6-cp312-cp312-macosx_10_12_x86_64.whl (1.8 MB): CPython 3.12, macOS 10.12+ x86-64
tokie-0.0.6-cp311-cp311-win_amd64.whl (1.6 MB): CPython 3.11, Windows x86-64
tokie-0.0.6-cp311-cp311-manylinux_2_28_aarch64.whl (1.9 MB): CPython 3.11, manylinux: glibc 2.28+ ARM64
tokie-0.0.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64
tokie-0.0.6-cp311-cp311-macosx_11_0_arm64.whl (1.8 MB): CPython 3.11, macOS 11.0+ ARM64
tokie-0.0.6-cp311-cp311-macosx_10_12_x86_64.whl (1.8 MB): CPython 3.11, macOS 10.12+ x86-64
