Blazingly fast tokenizer — 50x faster, 10x smaller, 100% accurate

tokie

50x faster tokenization, 10x smaller model files, 100% accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library written in Rust with Python bindings. It is designed as a drop-in replacement for HuggingFace tokenizers and supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — returns Encoding with ids, attention_mask, type_ids
encoding = tokenizer.encode("Hello, world!")
print(encoding.ids)             # [101, 7592, 1010, 2088, 999, 102]
print(encoding.attention_mask)  # [1, 1, 1, 1, 1, 1]

# Decode
text = tokenizer.decode([101, 7592, 1010, 2088, 999, 102])

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Vocabulary size
print(tokenizer.vocab_size)  # 30522

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32
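
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch of the idea. It does not use tokie and is not its implementation: the helper name `truncate_and_pad`, right-side padding, and the "keep the first and last special token" truncation rule are assumptions chosen for illustration.

```python
from typing import List, Tuple


def truncate_and_pad(
    ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), both exactly max_length long.

    Assumes `ids` starts and ends with one special token each
    (e.g. [CLS] ... [SEP]). Hypothetical helper, not part of tokie.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")

    if len(ids) > max_length:
        # Keep the leading special token and as many content tokens as fit,
        # then re-attach the trailing special token.
        ids = ids[: max_length - 1] + ids[-1:]

    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)

    # Pad on the right; padded positions get attention_mask 0.
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


long_ids = [101, 7592, 1010, 2088, 999, 102]
print(truncate_and_pad(long_ids, 4))          # ([101, 7592, 1010, 102], [1, 1, 1, 1])
print(truncate_and_pad([101, 7592, 102], 5))  # ([101, 7592, 102, 0, 0], [1, 1, 1, 0, 0])
```

The attention mask is what lets a model ignore the padded positions, which is why padding and the mask are produced together.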

Pair Encoding (Cross-Encoders)

pair = tokenizer.encode_pair("How are you?", "I am fine.")
print(pair.ids)             # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(pair.type_ids)        # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
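
The output above follows the usual BERT pair layout, `[CLS] A [SEP] B [SEP]`, with `type_ids` marking which segment each position belongs to. The sketch below rebuilds the three lists by hand from the per-sentence token ids shown above. `build_pair` is a hypothetical helper for illustration, not a tokie function; 101 and 102 are the `[CLS]` and `[SEP]` ids in bert-base-uncased.

```python
from typing import List, Tuple


def build_pair(
    a_ids: List[int], b_ids: List[int], cls_id: int = 101, sep_id: int = 102
) -> Tuple[List[int], List[int], List[int]]:
    """Assemble ids, attention_mask and type_ids for a BERT-style pair."""
    ids = [cls_id] + a_ids + [sep_id] + b_ids + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    type_ids = [0] * (len(a_ids) + 2) + [1] * (len(b_ids) + 1)
    # No padding here, so every position is attended to.
    attention_mask = [1] * len(ids)
    return ids, attention_mask, type_ids


ids, attention_mask, type_ids = build_pair(
    [2129, 2024, 2017, 1029],  # "How are you?"
    [1045, 2572, 2986, 1012],  # "I am fine."
)
print(ids)       # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```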

Byte Offsets

# Get byte offsets for each token in the (normalized) input
enc = tokenizer.encode_with_offsets("Hello world")
for token_id, (start, end) in zip(enc.ids, enc.offsets):
    print(f"  token {token_id}: bytes [{start}:{end}]")
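
Byte offsets differ from Python string indices as soon as the text contains non-ASCII characters, so slicing a `str` directly with them can return the wrong span. The sketch below shows two ways to handle this in plain Python. The spans are written by hand for illustration and are not tokie output, and both helper names are hypothetical. Since the offsets refer to the normalized input, they should be applied to the normalized text, which may differ from the raw string.

```python
from typing import List, Tuple


def slice_bytes(text: str, start: int, end: int) -> str:
    """Extract a span given UTF-8 byte offsets."""
    return text.encode("utf-8")[start:end].decode("utf-8")


def byte_to_char_offsets(
    text: str, spans: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Convert UTF-8 byte spans into Python str index spans."""
    byte_to_char = {}
    pos = 0
    for i, ch in enumerate(text):
        byte_to_char[pos] = i
        pos += len(ch.encode("utf-8"))
    byte_to_char[pos] = len(text)  # one-past-the-end position
    return [(byte_to_char[s], byte_to_char[e]) for s, e in spans]


text = "café au lait"
byte_spans = [(0, 5), (6, 8), (9, 13)]  # hand-written; "é" takes 2 bytes

print([slice_bytes(text, s, e) for s, e in byte_spans])  # ['café', 'au', 'lait']
print(byte_to_char_offsets(text, byte_spans))            # [(0, 4), (5, 7), (8, 12)]
```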

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")

Supported Models

Works with any HuggingFace tokenizer — GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, XLM-RoBERTa, and more.

License

MIT OR Apache-2.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokie-0.0.5.tar.gz (123.7 kB): Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.
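
As a rough guide to reading the file names on this page, a wheel name has the form `{distribution}-{version}(-{build})-{python tag}-{abi tag}-{platform tag}.whl`, where each tag may hold several dot-separated values. The parser below is a simplified sketch with a hypothetical name (`parse_wheel_name`); for real tooling, `packaging.utils.parse_wheel_filename` from the `packaging` library is the more robust choice.

```python
from typing import List, NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tags: List[str]
    abi_tags: List[str]
    platform_tags: List[str]


def parse_wheel_name(filename: str) -> WheelName:
    """Split a wheel file name into its components (simplified sketch)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    # Compressed tag sets are joined with "." inside each field.
    return WheelName(dist, version, build, py.split("."), abi.split("."), plat.split("."))


info = parse_wheel_name(
    "tokie-0.0.5-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info.python_tags)    # ['cp313']
print(info.platform_tags)  # ['manylinux_2_17_x86_64', 'manylinux2014_x86_64']
```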

tokie-0.0.5-cp313-cp313-win_amd64.whl (1.6 MB): CPython 3.13, Windows x86-64

tokie-0.0.5-cp313-cp313-manylinux_2_28_aarch64.whl (1.9 MB): CPython 3.13, manylinux: glibc 2.28+ ARM64

tokie-0.0.5-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.13, manylinux: glibc 2.17+ x86-64

tokie-0.0.5-cp313-cp313-macosx_11_0_arm64.whl (1.8 MB): CPython 3.13, macOS 11.0+ ARM64

tokie-0.0.5-cp313-cp313-macosx_10_12_x86_64.whl (1.8 MB): CPython 3.13, macOS 10.12+ x86-64

tokie-0.0.5-cp312-cp312-win_amd64.whl (1.6 MB): CPython 3.12, Windows x86-64

tokie-0.0.5-cp312-cp312-manylinux_2_28_aarch64.whl (1.9 MB): CPython 3.12, manylinux: glibc 2.28+ ARM64

tokie-0.0.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.12, manylinux: glibc 2.17+ x86-64

tokie-0.0.5-cp312-cp312-macosx_11_0_arm64.whl (1.8 MB): CPython 3.12, macOS 11.0+ ARM64

tokie-0.0.5-cp312-cp312-macosx_10_12_x86_64.whl (1.8 MB): CPython 3.12, macOS 10.12+ x86-64

tokie-0.0.5-cp311-cp311-win_amd64.whl (1.6 MB): CPython 3.11, Windows x86-64

tokie-0.0.5-cp311-cp311-manylinux_2_28_aarch64.whl (1.9 MB): CPython 3.11, manylinux: glibc 2.28+ ARM64

tokie-0.0.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64

tokie-0.0.5-cp311-cp311-macosx_11_0_arm64.whl (1.8 MB): CPython 3.11, macOS 11.0+ ARM64

tokie-0.0.5-cp311-cp311-macosx_10_12_x86_64.whl (1.8 MB): CPython 3.11, macOS 10.12+ x86-64

File details

Details for the file tokie-0.0.5.tar.gz.

File metadata

  • Download URL: tokie-0.0.5.tar.gz
  • Upload date:
  • Size: 123.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tokie-0.0.5.tar.gz
Algorithm Hash digest
SHA256 1d9adfdaf4d3ff052a31add37fbbc5f24372fec44ecd13004c52cc5da597ec8c
MD5 af7bef993181def9ba56c61ac0e1f76f
BLAKE2b-256 9a1048660d8919f6204ffbfb6fd0988220ea09792778973362e3ae851904bc21

See more details on using hashes here.
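
A downloaded file can be checked against the SHA256 digest listed for it using only the standard library. The helper names below (`sha256_of`, `verify`) are illustrative, and the example is a sketch rather than an official verification procedure.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_sha256: str) -> bool:
    """Compare a file's digest with the published one (case-insensitive)."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Example, assuming tokie-0.0.5.tar.gz was downloaded to the current directory:
# verify(
#     "tokie-0.0.5.tar.gz",
#     "1d9adfdaf4d3ff052a31add37fbbc5f24372fec44ecd13004c52cc5da597ec8c",
# )
```

Reading in chunks keeps memory use flat regardless of file size. pip can also enforce digests at install time via its `--require-hashes` mode.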

File details

Details for the file tokie-0.0.5-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: tokie-0.0.5-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tokie-0.0.5-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 da62014dc4dbb8c92be31c60b6aebc1d3bc6ffb68cc275d88a4756dbeb774aa2
MD5 dd7527d791c6386268ece7f42df22d57
BLAKE2b-256 49981c3e62992b37b494ce74b92a2e6de69f8617e7dc5956d97e5b502201f915

File details

Details for the file tokie-0.0.5-cp313-cp313-manylinux_2_28_aarch64.whl.

File hashes

Hashes for tokie-0.0.5-cp313-cp313-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 ba6e5c5438748a07d6d9765b204f947c7d62b6376d0b28fb373da6c33987df91
MD5 92ee50a87d0059b97ac18e07b317aef6
BLAKE2b-256 9bcdf0bef76f6b3f3813815b5d1638bf8802b8521d77f416df4b406e4447bcd4

File details

Details for the file tokie-0.0.5-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for tokie-0.0.5-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c44e4262f035c4c6526ac2fe4dfba79e16c9a7ef6bee197d8af3722151fab279
MD5 68fe92d6dd082c0432b29307e69e146e
BLAKE2b-256 d70f69769ecb204ac3e84588dde6e9ad9d29035e5892e9a42322e142396847bc

File details

Details for the file tokie-0.0.5-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for tokie-0.0.5-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bc11a6b684f5710e7278c5fe2410086c6c3a90f298a57161e84e4b8ed69ed64b
MD5 a7df29ed5867d8d665d9b2fc782695a0
BLAKE2b-256 1ebfaa0187fdd46bccda79194f48449e8e0cd464e8e112fef8f075317cca9fd2

File details

Details for the file tokie-0.0.5-cp313-cp313-macosx_10_12_x86_64.whl.

File hashes

Hashes for tokie-0.0.5-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 15293f74c3bdc7703a0469bd0208e275731a9e7b2bb94b60ec7dd55ac506b2d9
MD5 5f6fb394007f537c0ab514dfcccf83a6
BLAKE2b-256 468a8d5bb04b037c21b1129413e59f9e7c67467402a2827164cf64bb48cd2510

File details

Details for the file tokie-0.0.5-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: tokie-0.0.5-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tokie-0.0.5-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 5074fe374ee76c2c3a3eb8e3c5bd7d898757b5027d92648d11151424c85085fb
MD5 002eb76d8ec59c413791896a78f2136a
BLAKE2b-256 5ec47b39535bfb2aef313a005b0982b1ba6763a5f70193e8805cd7e73fc97c96

File details

Details for the file tokie-0.0.5-cp312-cp312-manylinux_2_28_aarch64.whl.

File hashes

Hashes for tokie-0.0.5-cp312-cp312-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 f3b40d58b315d08b850b1fad3f6cbf2e1c92612339ab3560b089639beac24ba2
MD5 24c10353ef22871a54ff1095b9bd873c
BLAKE2b-256 96d2d77c7373080691a07392760dac7f5f5cf97b39c3377c0568ca22b1edb56c

File details

Details for the file tokie-0.0.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for tokie-0.0.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5d935715f9dfc7a3eea7e74f12d2eae5cd7c712c67e45fb3e0439b558ef08950
MD5 da134704f02bc825a0e7c7ca7791972e
BLAKE2b-256 65f42d367997dd160258b1b504a31eefae2c51268080dd9a29a12dbfc669d54c

File details

Details for the file tokie-0.0.5-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for tokie-0.0.5-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9f5d5852c1797a8c56d37a2f04ef8ef69f880e3d45aa0547607332bef89eebac
MD5 b3581b8fd78cc6cfd3c681f0567fb9ec
BLAKE2b-256 cebc23909d1da404db17e89e08fb20c9c3ee898f7b70933ee4e12ed058b06c16

File details

Details for the file tokie-0.0.5-cp312-cp312-macosx_10_12_x86_64.whl.

File hashes

Hashes for tokie-0.0.5-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 1b0fb2577e31fce6c2fa27c88a2890a36efadd8a7f44e3eb5eccc609946f21bf
MD5 28c73f1b4de6063b807b87d7417380a7
BLAKE2b-256 1f44519f359293946e5c2038e2f1540d534e13545974921bb4933292fc2d75b2

File details

Details for the file tokie-0.0.5-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: tokie-0.0.5-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tokie-0.0.5-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8e2610f39141d15a54bb9d27b027e86004cd3ddde0dc3f60f8e342a243d41a97
MD5 c90dde8d955380a652851280270e8f12
BLAKE2b-256 1bb5397e8419036271b1405bb3ce1722e18f2c07f1663008cbcb76b1f4a75892

File details

Details for the file tokie-0.0.5-cp311-cp311-manylinux_2_28_aarch64.whl.

File hashes

Hashes for tokie-0.0.5-cp311-cp311-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 bea0cb413fd3dcb19435b897a11d2430290f436b407e5a896c64db4982af1fab
MD5 e3383166ad9d26e0e754aadc8fe8c7b8
BLAKE2b-256 60f6b3f18c6b4e2c67b5cd2944ab111e006accb6ff10e4dc3d9602bf317a8ed5

File details

Details for the file tokie-0.0.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for tokie-0.0.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7d24bd34e694f944a4c635fd17074d5cf34fd044cee8e527c1ba3d9aa84d13a8
MD5 d64af7f804bf6dead8969759f402d515
BLAKE2b-256 aa2a50e35bc29b68d69b2420331b18b93e5262f96af03bcfb8a55d36e5beddbf

File details

Details for the file tokie-0.0.5-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for tokie-0.0.5-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 90f0bcef07df483f433e221787b8691b61d60098719cf6bdb51ea59e3b61780d
MD5 4792b5d7b5e65854fed020ee4db43212
BLAKE2b-256 109bae26c8a70257248de52902b17d3968b987d52425a274367a98eff5de9987

File details

Details for the file tokie-0.0.5-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for tokie-0.0.5-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 c028cf474ae9862cf3dde3bacdf025e982f0bd4d3e20f1dc30d600f0ffc816ef
MD5 5622920f0b669f5a713252658b5f2286
BLAKE2b-256 bf100014e7580fe376b143112493063c7e4a1adb28caa8429bec5782edc2918d
