Blazingly fast tokenizer — 50x faster, 10x smaller, 100% accurate

tokie

50x faster tokenization, 10x smaller model files, 100% accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library built in Rust with Python bindings. It is a drop-in replacement for HuggingFace tokenizers and supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — returns Encoding with ids, attention_mask, type_ids
encoding = tokenizer.encode("Hello, world!")
print(encoding.ids)             # [101, 7592, 1010, 2088, 999, 102]
print(encoding.attention_mask)  # [1, 1, 1, 1, 1, 1]

# Decode
text = tokenizer.decode([101, 7592, 1010, 2088, 999, 102])

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Vocabulary size
print(tokenizer.vocab_size)  # 30522
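
One common use for a cheap token counter is packing texts into groups that fit a token budget. The sketch below is not part of tokie: `group_by_token_budget` and `whitespace_count` are hypothetical helpers, and the whitespace counter is only a stand-in so the example runs without downloading a tokenizer. In practice you would pass `tokenizer.count_tokens`, assuming it takes a string and returns an int as in the Quick Start.

```python
from typing import Callable, Iterable, List


def group_by_token_budget(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[List[str]]:
    """Greedily pack texts into groups of at most `budget` tokens.

    A single text that alone exceeds the budget still gets its own group.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Start a new group when adding this text would exceed the budget.
        if current and used + n > budget:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append(current)
    return groups


def whitespace_count(text: str) -> int:
    # Stand-in counter (assumption); swap in tokenizer.count_tokens.
    return len(text.split())


docs = ["Hello world", "Short", "A much longer sentence for testing"]
groups = group_by_token_budget(docs, whitespace_count, budget=6)
print(groups)
# [['Hello world', 'Short'], ['A much longer sentence for testing']]
```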

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32
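
To make the effect of these two settings concrete, here is a plain-Python sketch of the conventional truncate-then-pad layout for a BERT-style tokenizer. It does not call tokie: `truncate_and_pad` is a hypothetical helper, and right-side padding with attention mask 0 on padded positions is an assumption about the usual convention, not a statement of how tokie implements it.

```python
from typing import List, Tuple

# BERT-style special ids; 101 and 102 match the Quick Start output, 0 is the pad id.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def truncate_and_pad(content_ids: List[int], length: int) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), both exactly `length` long."""
    if length < 2:
        raise ValueError("length must leave room for [CLS] and [SEP]")
    # Reserve two slots so [CLS] and [SEP] survive truncation.
    kept = content_ids[: length - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    mask = [1] * len(ids)
    # Right-pad; padded positions get attention_mask 0.
    pad = length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


ids, mask = truncate_and_pad([7592, 1010, 2088, 999], length=8)
print(ids)   # [101, 7592, 1010, 2088, 999, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```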

Pair Encoding (Cross-Encoders)

pair = tokenizer.encode_pair("How are you?", "I am fine.")
print(pair.ids)             # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(pair.type_ids)        # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
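
The three lists above follow the standard BERT pair layout, `[CLS] A [SEP] B [SEP]`, with `type_ids` 0 for the first segment and 1 for the second. The sketch below rebuilds that layout in plain Python from the content ids in the example; `build_pair` is a hypothetical helper for illustration and is not tokie's implementation.

```python
from typing import Dict, List

CLS_ID, SEP_ID = 101, 102  # BERT [CLS] and [SEP], as seen in the output above


def build_pair(first: List[int], second: List[int]) -> Dict[str, List[int]]:
    """Lay out two tokenized segments as [CLS] first [SEP] second [SEP]."""
    # Segment A carries [CLS] and its own [SEP]; every position is type 0.
    seg_a = [CLS_ID] + first + [SEP_ID]
    # Segment B carries the final [SEP]; every position is type 1.
    seg_b = second + [SEP_ID]
    ids = seg_a + seg_b
    return {
        "ids": ids,
        "attention_mask": [1] * len(ids),
        "type_ids": [0] * len(seg_a) + [1] * len(seg_b),
    }


layout = build_pair([2129, 2024, 2017, 1029], [1045, 2572, 2986, 1012])
print(layout["ids"])       # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(layout["type_ids"])  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```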

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")
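
A possible way to combine these calls is a small cache helper that loads the local .tkz file when it exists and otherwise fetches the tokenizer once and saves it. This is a sketch under assumptions: `load_cached` is a hypothetical helper, and it relies only on `from_pretrained`, `from_file`, and `save` behaving as shown in this README. The class is passed in as an argument so the function itself has no import-time dependency.

```python
from pathlib import Path


def load_cached(tokenizer_cls, model_id: str, cache_path: str):
    """Load from `cache_path` if present; otherwise fetch, save, and return."""
    path = Path(cache_path)
    if path.exists():
        # Fast path: read the compact binary file from disk.
        return tokenizer_cls.from_file(str(path))
    # Slow path: fetch by model id, then write the cache for next time.
    tokenizer = tokenizer_cls.from_pretrained(model_id)
    tokenizer.save(str(path))
    return tokenizer


# Usage (requires tokie to be installed):
#   import tokie
#   tokenizer = load_cached(tokie.Tokenizer, "bert-base-uncased", "model.tkz")
```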

Supported Models

tokie works with any HuggingFace tokenizer, including GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, and XLM-RoBERTa.

License

MIT OR Apache-2.0

Download files

Source Distribution

  • tokie-0.0.4.tar.gz (115.9 kB)

Built Distributions

Wheels for tokie 0.0.4 are available for CPython 3.11, 3.12, and 3.13 (tags cp311, cp312, cp313) on each platform below:

  • Windows x86-64: win_amd64 (1.6 MB)
  • Linux ARM64, glibc 2.28+: manylinux_2_28_aarch64 (1.9 MB)
  • Linux x86-64, glibc 2.17+: manylinux_2_17_x86_64.manylinux2014_x86_64 (2.0 MB)
  • macOS 11.0+ ARM64: macosx_11_0_arm64 (1.7 MB)
  • macOS 10.12+ x86-64: macosx_10_12_x86_64 (1.8 MB)
