Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization

Project description

kitoken

Tokenizer for language models.

from kitoken import Kitoken

# Load a tokenizer definition from a file.
encoder = Kitoken.from_file("models/llama2.kit")

# Encode text to tokens, then decode back; decode() returns bytes.
tokens = encoder.encode("hello world!", True)
string = encoder.decode(tokens).decode("utf-8")

assert string == "hello world!"

Features

  • Fast encoding and decoding
    Faster than most other tokenizers in both common and uncommon scenarios.
  • Support for a wide variety of tokenizer formats and tokenization strategies
    Including support for Tokenizers, SentencePiece, Tiktoken and more.
  • Compatible with many systems and platforms
    Runs on Windows, Linux, macOS and embedded systems, and comes with bindings for Web, Node and Python.
  • Compact data format
    Definitions are stored in an efficient binary format, without a merge list.
  • Support for normalization and pre-tokenization
    Including Unicode normalization, whitespace normalization, and many others.
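
To illustrate what normalization and pre-tokenization mean in general, here is a minimal sketch using only the Python standard library. It is not Kitoken's implementation or API; the function names `normalize` and `pre_tokenize` are hypothetical, and the choice of NFKC and of the splitting pattern is an assumption made for the example.

```python
import re
import unicodedata


def normalize(text):
    # NFKC folds compatibility characters (for example the "fi" ligature
    # U+FB01) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def pre_tokenize(text):
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


pieces = pre_tokenize(normalize("  hello \t world! "))
```

A real tokenizer would then apply its subword algorithm to each of these pieces rather than to the raw input.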

Overview

Kitoken is a fast and versatile tokenizer for language models. Multiple tokenization algorithms are supported:

  • BytePair: A variation of the BPE algorithm, merging byte or character pairs.
  • Unigram: The Unigram subword algorithm.
  • WordPiece: The WordPiece subword algorithm.
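
As a rough illustration of the pair-merging idea behind BPE, the sketch below applies a toy merge table to a single word. This is the textbook algorithm, not Kitoken's BytePair variation (which, per the feature list, stores definitions without a merge list); `bpe_encode` and `toy_ranks` are hypothetical names invented for this example.

```python
def bpe_encode(word, ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    parts = list(word)
    while len(parts) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = None
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no known pair is left to merge
        _, i = best
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


# Toy merge table: lower rank means the pair is merged earlier.
toy_ranks = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2, ("hell", "o"): 3}

whole = bpe_encode("hello", toy_ranks)   # every merge applies
partial = bpe_encode("help", toy_ranks)  # only ("h", "e") applies
```

With this table, "hello" collapses into a single piece, while "help" stops after one merge because no further pair is in the table.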

Kitoken is compatible with many existing tokenizers, including SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken.

See the main README for more information.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kitoken-0.10.0.tar.gz (59.6 kB)

Uploaded Source

Built Distributions

kitoken-0.10.0-cp310-abi3-win_amd64.whl (1.3 MB)

Uploaded CPython 3.10+, Windows x86-64

kitoken-0.10.0-cp310-abi3-win32.whl (1.2 MB)

Uploaded CPython 3.10+, Windows x86

kitoken-0.10.0-cp310-abi3-musllinux_1_2_x86_64.whl (1.5 MB)

Uploaded CPython 3.10+, musllinux: musl 1.2+ x86-64

kitoken-0.10.0-cp310-abi3-musllinux_1_2_i686.whl (1.4 MB)

Uploaded CPython 3.10+, musllinux: musl 1.2+ i686

kitoken-0.10.0-cp310-abi3-musllinux_1_2_armv7l.whl (1.3 MB)

Uploaded CPython 3.10+, musllinux: musl 1.2+ ARMv7l

kitoken-0.10.0-cp310-abi3-musllinux_1_2_aarch64.whl (1.3 MB)

Uploaded CPython 3.10+, musllinux: musl 1.2+ ARM64

kitoken-0.10.0-cp310-abi3-manylinux_2_28_x86_64.whl (1.5 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.28+ x86-64

kitoken-0.10.0-cp310-abi3-manylinux_2_28_ppc64le.whl (1.5 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.28+ ppc64le

kitoken-0.10.0-cp310-abi3-manylinux_2_28_i686.whl (1.5 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.28+ i686

kitoken-0.10.0-cp310-abi3-manylinux_2_28_armv7l.whl (1.3 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.28+ ARMv7l

kitoken-0.10.0-cp310-abi3-manylinux_2_28_aarch64.whl (1.3 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.28+ ARM64

kitoken-0.10.0-cp310-abi3-macosx_11_0_arm64.whl (1.2 MB)

Uploaded CPython 3.10+, macOS 11.0+ ARM64

kitoken-0.10.0-cp310-abi3-macosx_10_12_x86_64.whl (1.4 MB)

Uploaded CPython 3.10+, macOS 10.12+ x86-64
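
The wheel filenames listed above follow the standard dash-separated wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag). As a small sketch of how to read them, the hypothetical helper `parse_wheel_filename` below splits a filename into those fields using plain string operations; it is an illustration, not a replacement for a packaging library.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between version and Python tag.
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("kitoken-0.10.0-cp310-abi3-manylinux_2_28_x86_64.whl")
```

Here `info["python"]` is `cp310` and `info["abi"]` is `abi3`, which corresponds to the "CPython 3.10+" tag shown for each built distribution.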

File details

Details for the file kitoken-0.10.0.tar.gz.

File metadata

  • Download URL: kitoken-0.10.0.tar.gz
  • Size: 59.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.8

File hashes

Hashes for kitoken-0.10.0.tar.gz
Algorithm Hash digest
SHA256 3152601e5fd28c751d8a4a70fb4cb8fac24637307fc58fce07c70636eb4e6d93
MD5 8fc1b552834382ca332c352e7b5ba1f5
BLAKE2b-256 13d05fd67f978c720cb9365f42ef26a49c7075ed276e887ec6a17da8909aa833

See more details on using hashes here.
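
A downloaded file can be checked against the digests listed for it with Python's standard `hashlib` module. The sketch below is one possible way to do that; the helper names `file_digests` and `verify_sha256` are hypothetical, and it assumes the file has already been downloaded to a local path.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Compute SHA256, MD5 and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest matches the expected value."""
    return file_digests(path)["SHA256"] == expected_hex.lower()


# Example call, assuming the source distribution was saved locally:
# verify_sha256(
#     "kitoken-0.10.0.tar.gz",
#     "3152601e5fd28c751d8a4a70fb4cb8fac24637307fc58fce07c70636eb4e6d93",
# )
```

SHA256 is the digest normally used for verification; the MD5 value is listed for completeness and should not be relied on alone.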

File details

Details for the file kitoken-0.10.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: kitoken-0.10.0-cp310-abi3-win_amd64.whl
  • Size: 1.3 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.8

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 99f54fed7bd1e3edf85c3873fc18351ffd3144eb7d63c706e880f8c3c797afd6
MD5 4c05f26d1a419249d52ef8635d6d16ba
BLAKE2b-256 563774d24b61a46cc5600f481929e0a0047873c662278ae0baae3ffde50a8443

File details

Details for the file kitoken-0.10.0-cp310-abi3-win32.whl.

File metadata

  • Download URL: kitoken-0.10.0-cp310-abi3-win32.whl
  • Size: 1.2 MB
  • Tags: CPython 3.10+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.8

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 6b5e1f81982cbb7abbc92718c15a970bf1d13d7a4f95a968e87caa324b4bd2a4
MD5 a22ac813b19343b20b79654bb285ccc1
BLAKE2b-256 8acbd0cdc7377e8d72e3335cb433f40a083b38dd5aaa7d89b1d5528a0be16270

File details

Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 68e5ddf026d3835d9a5b22f2b95e93ce2b44379170175181d7dc29cf99414d15
MD5 14f101dff254526bf35f867c73890e0f
BLAKE2b-256 9bdb45ac2429a04eb1e136d672b44a44b2a7d4f810381c47374ab4482585276f

File details

Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_i686.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 0ca5dbd0bb8381e57850c840098a64f12b03de316f1c3158a357b0555b679807
MD5 63986f3f976be857f060e5da18bd69f5
BLAKE2b-256 08fe8c13c98900c6f5cf86d801e299a9978aeec16ee5d539355091250d36d52f

File details

Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_armv7l.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 097d271ccd06205dd6c27f46beaf18e708a69173a6d777818c8b2134b4d77bd9
MD5 3631d9b0a84605c244b51b45b568a833
BLAKE2b-256 b1641954e35280ce901b485d1acb63bf6d21ff6b16e55e1324c7d1f439fea959

File details

Details for the file kitoken-0.10.0-cp310-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 21f7a7608f4046a7ef8b931e59c7ef29ecc73bee5dbf2ae6eb78ec88284c9bba
MD5 de23ba069037df09253569c1b01f19f2
BLAKE2b-256 e4f14264f3f97c97fe252b2998d4acc305b063a660864c8f2df3607e4619c258

File details

Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 3053cf19e78ae759229f40303aad94b388bddc8c7a8df6cf8d372022c258cd29
MD5 e9e1e7d021161b995b8c1a6696ff6d3f
BLAKE2b-256 5bb236b6525d1c86fba147d62afa16a5a3ca22cada6230b5d11e3c3fe0ff04ef

File details

Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_ppc64le.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-manylinux_2_28_ppc64le.whl
Algorithm Hash digest
SHA256 742e304251aba958b2293c734e58bba494d265c4951967cd3d47bb5d63aad87c
MD5 80ab865fbe47a5addbe2a5871eb98ae3
BLAKE2b-256 e520c8ea490cb54765a4d084d72fb3647305a266e3bf092c6198ab7241df8f36

File details

Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_i686.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-manylinux_2_28_i686.whl
Algorithm Hash digest
SHA256 925317ca795b00f3640fd0765260ef7c071dc6ad22f59bed0486ebcee407fa34
MD5 930896d7b823b62efb05e8c085c61b3b
BLAKE2b-256 de85b00bbef73ddc3c7f7ba57d3e311f8e5718014caba16cb7e884b897ec5eae

File details

Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_armv7l.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-manylinux_2_28_armv7l.whl
Algorithm Hash digest
SHA256 f02406699c3653be0d546eb7f5164edfeb7d6b57842d7188aede856a1d4f151c
MD5 716b40e21927d95eb5658dcfc901f8ac
BLAKE2b-256 9ed14cfc181566d2f707a3d1f442d5b6394702b4a8086a60afd2d33375425f77

File details

Details for the file kitoken-0.10.0-cp310-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 93a414ad0b28bdc6124fb7556e1e685eb278a784ce31c129c4588c3afe1587d4
MD5 fcd1493543caee0cddeb2a0135896eb9
BLAKE2b-256 2a314366ff210447437aa6addc9b7f4d5ea653826d167dadb613c41d86c497c9

File details

Details for the file kitoken-0.10.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 831937386dcead31c29b5111f103632bcda17081196f6439db82a92416e5af81
MD5 30ab85c07d6d55bfb9d7bfec81f19f23
BLAKE2b-256 308c6006c83f554dbd5bdf1f5814c3e6f8471e9ba7a67a3067f4d778952e397b

File details

Details for the file kitoken-0.10.0-cp310-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for kitoken-0.10.0-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 bcc4b29c5e6d8ca2506e0d4afa759c483b7b1e458fb7e977af260fd0bcd2e4e2
MD5 fa9b4cf73801ac785b4b79d00eb74f93
BLAKE2b-256 dcf48844290f4a8384efcba1604b3b435724609e2f5e89672ee37ee3226183c6
