Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
Project description
kitoken
Tokenizer for language models.
from kitoken import Kitoken
encoder = Kitoken.from_file("models/llama2.kit")
tokens = encoder.encode("hello world!", True)
string = encoder.decode(tokens).decode("utf-8")
assert string == "hello world!"
Features
- Fast encoding and decoding
  Faster than most other tokenizers in both common and uncommon scenarios.
- Support for a wide variety of tokenizer formats and tokenization strategies
  Including support for Tokenizers, SentencePiece, Tiktoken and more.
- Compatible with many systems and platforms
  Runs on Windows, Linux, macOS and embedded, and comes with bindings for Web, Node and Python.
- Compact data format
  Definitions are stored in an efficient binary format and without a merge list.
- Support for normalization and pre-tokenization
  Including Unicode normalization, whitespace normalization, and many others.
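To make the last feature concrete, here is a minimal sketch of what Unicode normalization, whitespace normalization and pre-tokenization generally mean. It uses only the Python standard library and is not Kitoken's implementation; the function names `normalize` and `pre_tokenize` and the choice of NFKC are assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC Unicode normalization, then collapse whitespace runs.

    Illustrative only: a real tokenizer definition decides which
    normalization steps to apply and in which order.
    """
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 becomes "fi".
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def pre_tokenize(text: str) -> list[str]:
    """Split text into word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    cleaned = normalize("\ufb01ne \t  day!")
    print(cleaned)
    print(pre_tokenize(cleaned))
```

The pieces produced by pre-tokenization are what a subword algorithm such as BPE, Unigram or WordPiece would then split further.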
Overview
Kitoken is a fast and versatile tokenizer for language models. Multiple tokenization algorithms are supported:
- BytePair: A variation of the BPE algorithm, merging byte or character pairs.
- Unigram: The Unigram subword algorithm.
- WordPiece: The WordPiece subword algorithm.
Kitoken is compatible with many existing tokenizers, including SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken.
See the main README for more information.
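As a rough illustration of the pair merging that the BytePair entry above refers to, the following sketch learns merges from a string using the textbook BPE procedure. It is a toy under stated assumptions, not Kitoken's algorithm or API: the helper names `most_frequent_pair`, `merge_pair` and `learn_merges` are hypothetical, and it works on characters rather than bytes.

```python
from collections import Counter
from typing import Optional


def most_frequent_pair(seq: list[str]) -> Optional[tuple[str, str]]:
    """Return the most common adjacent pair, or None if no pair repeats."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None


def merge_pair(seq: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each left-to-right occurrence of `pair` with one joined symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(text: str, num_merges: int) -> tuple[list[tuple[str, str]], list[str]]:
    """Greedily merge the most frequent pair up to `num_merges` times."""
    seq = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break  # nothing left that occurs more than once
        merges.append(pair)
        seq = merge_pair(seq, pair)
    return merges, seq


if __name__ == "__main__":
    merges, symbols = learn_merges("aaabdaaabac", 10)
    print(merges)
    print(symbols)
```

Joining the resulting symbols always reproduces the input, which is the property that makes decoding lossless.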
Download files
Source Distribution
Built Distributions
File details
Details for the file kitoken-0.10.0.tar.gz.
File metadata
- Download URL: kitoken-0.10.0.tar.gz
- Upload date:
- Size: 59.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3152601e5fd28c751d8a4a70fb4cb8fac24637307fc58fce07c70636eb4e6d93
MD5 | 8fc1b552834382ca332c352e7b5ba1f5
BLAKE2b-256 | 13d05fd67f978c720cb9365f42ef26a49c7075ed276e887ec6a17da8909aa833
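A downloaded file can be checked against a listed digest before it is installed. The sketch below does this for SHA256 with Python's standard `hashlib` module; the helper names `sha256_of` and `matches_digest` and the local file path in the usage lines are assumptions for illustration.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; the digest is the SHA256 listed for the source distribution.
    path = "kitoken-0.10.0.tar.gz"
    expected = "3152601e5fd28c751d8a4a70fb4cb8fac24637307fc58fce07c70636eb4e6d93"
    if os.path.exists(path):
        print("digest ok" if matches_digest(path, expected) else "digest mismatch")
```

The same pattern applies to every file in this listing by swapping in that file's name and SHA256 value.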
File details
Details for the file kitoken-0.10.0-cp310-abi3-win_amd64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-win_amd64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 99f54fed7bd1e3edf85c3873fc18351ffd3144eb7d63c706e880f8c3c797afd6
MD5 | 4c05f26d1a419249d52ef8635d6d16ba
BLAKE2b-256 | 563774d24b61a46cc5600f481929e0a0047873c662278ae0baae3ffde50a8443
File details
Details for the file kitoken-0.10.0-cp310-abi3-win32.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-win32.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10+, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6b5e1f81982cbb7abbc92718c15a970bf1d13d7a4f95a968e87caa324b4bd2a4
MD5 | a22ac813b19343b20b79654bb285ccc1
BLAKE2b-256 | 8acbd0cdc7377e8d72e3335cb433f40a083b38dd5aaa7d89b1d5528a0be16270
File details
Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_x86_64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-musllinux_1_2_x86_64.whl
- Upload date:
- Size: 1.5 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 68e5ddf026d3835d9a5b22f2b95e93ce2b44379170175181d7dc29cf99414d15
MD5 | 14f101dff254526bf35f867c73890e0f
BLAKE2b-256 | 9bdb45ac2429a04eb1e136d672b44a44b2a7d4f810381c47374ab4482585276f
File details
Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_i686.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-musllinux_1_2_i686.whl
- Upload date:
- Size: 1.4 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ i686
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0ca5dbd0bb8381e57850c840098a64f12b03de316f1c3158a357b0555b679807
MD5 | 63986f3f976be857f060e5da18bd69f5
BLAKE2b-256 | 08fe8c13c98900c6f5cf86d801e299a9978aeec16ee5d539355091250d36d52f
File details
Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_armv7l.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-musllinux_1_2_armv7l.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ ARMv7l
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 097d271ccd06205dd6c27f46beaf18e708a69173a6d777818c8b2134b4d77bd9
MD5 | 3631d9b0a84605c244b51b45b568a833
BLAKE2b-256 | b1641954e35280ce901b485d1acb63bf6d21ff6b16e55e1324c7d1f439fea959
File details
Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_aarch64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-musllinux_1_2_aarch64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 21f7a7608f4046a7ef8b931e59c7ef29ecc73bee5dbf2ae6eb78ec88284c9bba
MD5 | de23ba069037df09253569c1b01f19f2
BLAKE2b-256 | e4f14264f3f97c97fe252b2998d4acc305b063a660864c8f2df3607e4619c258
File details
Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 1.5 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3053cf19e78ae759229f40303aad94b388bddc8c7a8df6cf8d372022c258cd29
MD5 | e9e1e7d021161b995b8c1a6696ff6d3f
BLAKE2b-256 | 5bb236b6525d1c86fba147d62afa16a5a3ca22cada6230b5d11e3c3fe0ff04ef
File details
Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_ppc64le.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-manylinux_2_28_ppc64le.whl
- Upload date:
- Size: 1.5 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ppc64le
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 742e304251aba958b2293c734e58bba494d265c4951967cd3d47bb5d63aad87c
MD5 | 80ab865fbe47a5addbe2a5871eb98ae3
BLAKE2b-256 | e520c8ea490cb54765a4d084d72fb3647305a266e3bf092c6198ab7241df8f36
File details
Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_i686.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-manylinux_2_28_i686.whl
- Upload date:
- Size: 1.5 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ i686
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 925317ca795b00f3640fd0765260ef7c071dc6ad22f59bed0486ebcee407fa34
MD5 | 930896d7b823b62efb05e8c085c61b3b
BLAKE2b-256 | de85b00bbef73ddc3c7f7ba57d3e311f8e5718014caba16cb7e884b897ec5eae
File details
Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_armv7l.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-manylinux_2_28_armv7l.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ARMv7l
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | f02406699c3653be0d546eb7f5164edfeb7d6b57842d7188aede856a1d4f151c
MD5 | 716b40e21927d95eb5658dcfc901f8ac
BLAKE2b-256 | 9ed14cfc181566d2f707a3d1f442d5b6394702b4a8086a60afd2d33375425f77
File details
Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 93a414ad0b28bdc6124fb7556e1e685eb278a784ce31c129c4588c3afe1587d4
MD5 | fcd1493543caee0cddeb2a0135896eb9
BLAKE2b-256 | 2a314366ff210447437aa6addc9b7f4d5ea653826d167dadb613c41d86c497c9
File details
Details for the file kitoken-0.10.0-cp310-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 831937386dcead31c29b5111f103632bcda17081196f6439db82a92416e5af81
MD5 | 30ab85c07d6d55bfb9d7bfec81f19f23
BLAKE2b-256 | 308c6006c83f554dbd5bdf1f5814c3e6f8471e9ba7a67a3067f4d778952e397b
File details
Details for the file kitoken-0.10.0-cp310-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: kitoken-0.10.0-cp310-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 1.4 MB
- Tags: CPython 3.10+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | bcc4b29c5e6d8ca2506e0d4afa759c483b7b1e458fb7e977af260fd0bcd2e4e2
MD5 | fa9b4cf73801ac785b4b79d00eb74f93
BLAKE2b-256 | dcf48844290f4a8384efcba1604b3b435724609e2f5e89672ee37ee3226183c6