tokie

10-136x faster tokenization, 10x smaller model files, 100% accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library built in Rust with Python bindings. It is a drop-in replacement for HuggingFace tokenizers and supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — callable syntax or .encode()
encoding = tokenizer("Hello, world!")
print(encoding.ids)               # [101, 7592, 1010, 2088, 999, 102]
print(encoding.tokens)            # ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.attention_mask)    # [1, 1, 1, 1, 1, 1]
print(encoding.special_tokens_mask)  # [1, 0, 0, 0, 0, 1]

# Decode
text = tokenizer.decode(encoding.ids)  # "hello , world !"

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Batch encode (parallel across all cores)
encodings = tokenizer.encode_batch(["Hello!", "World"], add_special_tokens=True)
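One common use for a cheap token count is packing text into chunks that fit a token budget. The sketch below is illustrative only: `chunk_by_token_budget` and `whitespace_count` are hypothetical helpers, not part of tokie, and the whitespace counter is a stand-in so the example runs without downloading a model. With tokie installed, `tokenizer.count_tokens` could presumably be passed in as the counter instead.

```python
from typing import Callable, List


def chunk_by_token_budget(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[str]:
    """Greedily pack sentences into chunks of at most `budget` tokens.

    A single sentence that exceeds the budget on its own is kept as its
    own (oversized) chunk rather than being split.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        # Re-count the joined text: token counts are not always additive.
        if current and count_tokens(candidate) > budget:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


chunks = chunk_by_token_budget(
    ["Hello world.", "This is tokie.", "Fast and small."],
    whitespace_count,
    budget=5,
)
print(chunks)  # ['Hello world. This is tokie.', 'Fast and small.']
```

The counter is injected as a plain callable, so the packing logic stays independent of which tokenizer supplies the counts.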

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32
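To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch of the general technique. It is not tokie's implementation: `truncate_and_pad` is a hypothetical helper, and right-side padding, a pad id of 0, and the [CLS] ... [SEP] layout are assumptions chosen to match the BERT examples above.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: pad id 0, as in the `pad_token_id or 0` fallback above


def truncate_and_pad(
    ids: List[int], length: int, pad_id: int = PAD_ID
) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), each exactly `length` long.

    Assumes `ids` starts with [CLS], ends with [SEP], and `length >= 2`.
    """
    if len(ids) > length:
        # Drop content tokens from the end but keep the trailing [SEP].
        ids = ids[: length - 1] + [ids[-1]]
    mask = [1] * len(ids)
    padding = length - len(ids)
    # Pad on the right; padded positions get attention mask 0.
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = truncate_and_pad([101, 7592, 2088, 102], length=6)
print(short_ids)   # [101, 7592, 2088, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0]

long_ids, long_mask = truncate_and_pad([101, 1, 2, 3, 4, 5, 6, 102], length=6)
print(long_ids)    # [101, 1, 2, 3, 4, 102]
print(long_mask)   # [1, 1, 1, 1, 1, 1]
```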

Pair Encoding (Cross-Encoders)

pair = tokenizer("How are you?", "I am fine.")  # or tokenizer.encode_pair(...)
print(pair.ids)                # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.type_ids)           # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(pair.special_tokens_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
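The three lists above follow the usual BERT pair layout, [CLS] A [SEP] B [SEP]. The sketch below rebuilds them from the two segments' token ids in plain Python. `build_pair` is a hypothetical helper for illustration, not a tokie function; the ids are copied from the output above.

```python
from typing import Dict, List

CLS_ID, SEP_ID = 101, 102  # BERT special-token ids, as seen in the output above


def build_pair(ids_a: List[int], ids_b: List[int]) -> Dict[str, List[int]]:
    """Lay out two segments as [CLS] A [SEP] B [SEP]."""
    ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the last [SEP].
    type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # 1 marks a special token, 0 marks a content token.
    special = [1] + [0] * len(ids_a) + [1] + [0] * len(ids_b) + [1]
    return {"ids": ids, "type_ids": type_ids, "special_tokens_mask": special}


layout = build_pair([2129, 2024, 2017, 1029], [1045, 2572, 2986, 1012])
print(layout["ids"])                  # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(layout["type_ids"])             # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(layout["special_tokens_mask"])  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```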

Byte Offsets

enc = tokenizer.encode_with_offsets("Hello world")
for token_id, (start, end) in zip(enc.ids, enc.offsets):
    print(f"  token {token_id}: bytes [{start}:{end}]")
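Byte offsets and character offsets differ as soon as the text contains non-ASCII characters. The sketch below assumes the offsets index into the UTF-8 encoding of the input, as the "bytes" label above suggests. The offset values are made up for illustration and `byte_to_char_span` is a hypothetical helper, not part of tokie.

```python
from typing import Tuple

text = "héllo wörld"
data = text.encode("utf-8")  # 13 bytes for 11 characters

# Hypothetical byte offsets for two word-level tokens (illustrative values).
offsets = [(0, 6), (7, 13)]

# Slice the encoded bytes, not the str, to recover each token's text.
pieces = [data[start:end].decode("utf-8") for start, end in offsets]
print(pieces)  # ['héllo', 'wörld']

# Using the same numbers on the str gives the wrong spans.
wrong = [text[start:end] for start, end in offsets]
print(wrong)  # ['héllo ', 'örld']


def byte_to_char_span(raw: bytes, start: int, end: int) -> Tuple[int, int]:
    """Convert a UTF-8 byte span to a character span.

    Assumes both offsets fall on character boundaries.
    """
    return len(raw[:start].decode("utf-8")), len(raw[:end].decode("utf-8"))


char_spans = [byte_to_char_span(data, s, e) for s, e in offsets]
print(char_spans)  # [(0, 5), (6, 11)]
```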

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")

Supported Models

Works with any HuggingFace tokenizer — GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, XLM-RoBERTa, and more.

Benchmarks

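| Model   | Text Size | tokie   | HF tokenizers | Speedup |
|---------|-----------|---------|---------------|---------|
| BERT    | 900 KB    | 1.69 ms | 229 ms        | 136x    |
| GPT-2   | 900 KB    | 1.70 ms | 181 ms        | 107x    |
| Llama 3 | 900 KB    | 2.04 ms | 190 ms        | 93x     |
| Qwen 3  | 45 KB     | 0.15 ms | 8.18 ms       | 54x     |
| Gemma 3 | 45 KB     | 1.01 ms | 9.62 ms       | 10x     |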

100% token-accurate across all models.
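The Speedup column is the ratio of the HF tokenizers time to the tokie time. As a quick sanity check, the sketch below recomputes it from the timings listed above. Because those timings are rounded, the recomputed ratios land within about 1x of the listed values rather than matching exactly.

```python
# (model, tokie ms, HF tokenizers ms, listed speedup), copied from the benchmark table
rows = [
    ("BERT", 1.69, 229.0, 136),
    ("GPT-2", 1.70, 181.0, 107),
    ("Llama 3", 2.04, 190.0, 93),
    ("Qwen 3", 0.15, 8.18, 54),
    ("Gemma 3", 1.01, 9.62, 10),
]

ratios = {}
for model, tokie_ms, hf_ms, listed in rows:
    # Speedup = baseline time / tokie time
    ratios[model] = hf_ms / tokie_ms
    print(f"{model:8s} {ratios[model]:6.1f}x (listed: {listed}x)")
```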

License

MIT OR Apache-2.0
