Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization

Project description

kitoken

Tokenizer for language models.

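from kitoken import Kitoken

# Load a tokenizer definition from a model file.
encoder = Kitoken.from_file("models/llama3.3.model")

tokens = encoder.encode("hello world!", True)
# decode returns bytes, so convert them to a str before comparing.
string = encoder.decode(tokens).decode("utf-8")

assert string == "hello world!"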

Features

  • Fast encoding and decoding
    Faster than most other tokenizers in both common and uncommon scenarios.
  • Support for a wide variety of tokenizer formats and tokenization strategies
    Including support for Tokenizers, SentencePiece, Tiktoken and more.
  • Compatible with many systems and platforms
    Runs on Windows, Linux, macOS and embedded, and comes with bindings for Web, Node and Python.
  • Compact data format
    Definitions are stored in an efficient binary format and without a merge list.
  • Support for normalization and pre-tokenization
    Including Unicode normalization, whitespace normalization, and many others.
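
To make the last feature concrete, here is a minimal sketch of what normalization and pre-tokenization steps can look like in general. It uses only the Python standard library and does not call Kitoken; the function names `normalize` and `pre_tokenize`, the choice of NFC, and the splitting rule are assumptions made for illustration, not Kitoken's actual configuration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC Unicode normalization, then collapse whitespace runs."""
    # NFC composes base characters with their combining marks where possible.
    composed = unicodedata.normalize("NFC", text)
    # Replace every run of whitespace with one space and trim both ends.
    return re.sub(r"\s+", " ", composed).strip()


def pre_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


if __name__ == "__main__":
    cleaned = normalize("cafe\u0301  au\tlait ")
    print(cleaned)
    print(pre_tokenize("hello world!"))
```

A real tokenizer definition typically fixes these steps per model, since the vocabulary was trained on text prepared the same way.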

Overview

Kitoken is a fast and versatile tokenizer for language models with support for multiple tokenization algorithms:

  • BytePair: A variation of the BPE algorithm, merging byte or character pairs.
  • Unigram: The Unigram subword algorithm.
  • WordPiece: The WordPiece subword algorithm.
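
As a rough illustration of the byte pair idea named in the list above, the following sketch learns merge rules from a text and applies them in order. It is a textbook variant written for clarity, not Kitoken's implementation (which, per the feature list, stores definitions without a merge list); the names `bpe_train`, `merge_pair` and `bpe_encode` are hypothetical.

```python
from collections import Counter


def merge_pair(tokens: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace each adjacent (left, right) occurrence with the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bpe_train(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules from the UTF-8 bytes of text."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so merging gains nothing
        merges.append((left, right))
        tokens = merge_pair(tokens, left, right)
    return merges


def bpe_encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Start from single bytes and apply the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


if __name__ == "__main__":
    rules = bpe_train("aaaa", 1)
    print(rules, bpe_encode("aaaa", rules))
```

Joining the encoded pieces always reproduces the input bytes, which is why byte-level schemes of this kind need no unknown-token fallback.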

Kitoken is compatible with many existing tokenizers, including SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken.

See the main README for more information.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • kitoken-0.10.1.tar.gz (60.0 kB): Source

Built Distributions

  • kitoken-0.10.1-cp310-abi3-win_amd64.whl (1.3 MB): CPython 3.10+, Windows x86-64
  • kitoken-0.10.1-cp310-abi3-win32.whl (1.3 MB): CPython 3.10+, Windows x86
  • kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl (1.5 MB): CPython 3.10+, musllinux: musl 1.2+ x86-64
  • kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl (1.4 MB): CPython 3.10+, musllinux: musl 1.2+ i686
  • kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl (1.3 MB): CPython 3.10+, musllinux: musl 1.2+ ARMv7l
  • kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl (1.3 MB): CPython 3.10+, musllinux: musl 1.2+ ARM64
  • kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl (1.5 MB): CPython 3.10+, manylinux: glibc 2.28+ x86-64
  • kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl (1.5 MB): CPython 3.10+, manylinux: glibc 2.28+ ppc64le
  • kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl (1.5 MB): CPython 3.10+, manylinux: glibc 2.28+ i686
  • kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl (1.3 MB): CPython 3.10+, manylinux: glibc 2.28+ ARMv7l
  • kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl (1.3 MB): CPython 3.10+, manylinux: glibc 2.28+ ARM64
  • kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10+, macOS 11.0+ ARM64
  • kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl (1.4 MB): CPython 3.10+, macOS 10.12+ x86-64

File details

Details for the file kitoken-0.10.1.tar.gz.

File metadata

  • Download URL: kitoken-0.10.1.tar.gz
  • Size: 60.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.8

File hashes

Hashes for kitoken-0.10.1.tar.gz
Algorithm Hash digest
SHA256 ecca8a68a63e11e048f8f8a6e3dabe32e914466d9f59d26ce664681c9c1f0cc5
MD5 e9e66323d9cf407d970823aac2bf5075
BLAKE2b-256 e66c807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70
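
A downloaded file can be checked against a digest like the ones listed here. The sketch below uses only the standard `hashlib` module; the helper names `sha256_of` and `verify` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("kitoken-0.10.1.tar.gz", "ecca8a68a63e11e048f8f8a6e3dabe32e914466d9f59d26ce664681c9c1f0cc5")` should return `True` for an intact copy of the source distribution.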

File details

Details for the file kitoken-0.10.1-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: kitoken-0.10.1-cp310-abi3-win_amd64.whl
  • Size: 1.3 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.8

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 0f255791d9bc73228683df853693362d6100cc61986dc01c1dd299ec9e9c6eb8
MD5 ab685d6855d0ef4dd14e001a99ad460d
BLAKE2b-256 1da4da01d3e81fad306f20c68a3e1bce4e905df67ee16c29aee4cdec9accf55d

File details

Details for the file kitoken-0.10.1-cp310-abi3-win32.whl.

File metadata

  • Download URL: kitoken-0.10.1-cp310-abi3-win32.whl
  • Size: 1.3 MB
  • Tags: CPython 3.10+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.8

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 8c2243783ef68a5422daaadea7c364f3b8e26e922fab84f5c87fa245d2949293
MD5 bc440b33f8cc8ef2129bd14daf3306e1
BLAKE2b-256 cf2b6ac41cfcd2d3348dce3c941b09828d6cc04ede398c0a62882ad50cae9d04

File details

Details for the file kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 e9474e6e3fbd1c71a6d76f21433247c27e6c2d92400bb10708766cb6eaea8171
MD5 bd8e910d6a1a4c31f7f6a38c5b88b33d
BLAKE2b-256 ce3a13f22496bad7f4ec4a3cb8333bc211b031386716697356374a31998f83e1

File details

Details for the file kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 84fde9161ede36fcf000178eaaa0c501e6a28fe9be8d9a489de9533b7035ea34
MD5 1876fe65b374bd8caf335c0cb38a2c13
BLAKE2b-256 466322ba3ee485dfe542d5bdc20792e5ec2fba4c48f866d64ab3243b1c7bad5e

File details

Details for the file kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 e98044200b329ed9e1a59dcb031e00d013b838786331c4fbed1916828a121b1b
MD5 8771e9caf7277ec60ca1f09ca46b81d1
BLAKE2b-256 8d25992213aefcba4f289e5f00d0dd6c2234bf5a7b5096ad3a5e8759ef3c22e3

File details

Details for the file kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 1d43d595ca405a9ea85ee532c09bd615f7745fff762d61eadbe3c249da52e05e
MD5 9e0f428c9a624f0c0e3dd5f7e449a4ae
BLAKE2b-256 93695c7a83a2da768836c0a37e254d116f859f6ac2ceab344a1902021ee8de78

File details

Details for the file kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 8c781e0a07ae7fe9a4bb633bda705fcb323454f694dbee12d08bca898d4ae3bb
MD5 3197042919ebe4ef3d1ee4e4e9d42a5b
BLAKE2b-256 343e15a79166cf69cad15696404696b70e878aa45e21ab87b1efa91a29321452

File details

Details for the file kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl
Algorithm Hash digest
SHA256 82346b7aedbf6b976cd2ee6eaca8777136ed2831e333b90800724086ebaf56b8
MD5 34d87a92bc4f766b33b9b019affd4806
BLAKE2b-256 b4bbc5c0092b219f5750f9ed5c2c27640c4765794b211304c21f03937ef84da3

File details

Details for the file kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl
Algorithm Hash digest
SHA256 3dbda305b095e1b896a1145fad1a6739cfd83c974a8eebbfa1fe4484cd4ae412
MD5 cf4670a72e3fc835fe6ae9d7f57276fe
BLAKE2b-256 1f10628393bd6ba59a4959c8df0fafb5d5ba3d0a90163018c7398e6e89225acd

File details

Details for the file kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl
Algorithm Hash digest
SHA256 617201015aec6d3d76bf7cd26a8b6de997e59ca1d6a0b5fd24578e406a36828f
MD5 7e1593473521991d705c917bdf2bc37c
BLAKE2b-256 7fafd23e0b03a836c8b7a61d029a27aa98a09b28abf20b44d3a9dc55af4f6322

File details

Details for the file kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 7a6a1d7b9220ae09d6022f5b408f55cb5e2fe2fa7984cb993461a7db1cce3d70
MD5 8602229357137c4154936cb977b7522b
BLAKE2b-256 d26dadb3a6b2b363033d27bf7560d63e62efcc0c3589a9ee91ed341290e1be47

File details

Details for the file kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ead6759319e6d799eaa39b2d4a88042e55f8068e256772be9a05c99a0772a896
MD5 e1c3705a6632fcf3f42287a761a90b43
BLAKE2b-256 7a9bda0915143d3b45ee8331e20dd00b9924e1413e3bcca731b3a2953fdf1eae

File details

Details for the file kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e8535845d96746951c105a3ce1ddba0f7a2a70c39aa4405261dc321d06c953c3
MD5 8cf043402d474dde8e2084c353d5d12b
BLAKE2b-256 46ef9d4447541f23a71c49c65c83ebffe034e468d2c431b359934b2685d35ab9
