🪙 toktkn

toktkn is a byte-pair encoding (BPE) tokenizer implemented in Rust and exposed to Python through PyO3 bindings.
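For readers new to byte-pair encoding, here is a minimal pure-Python sketch of the general idea: start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token. It illustrates the algorithm only and is not toktkn's Rust implementation; the names `merge`, `train_bpe`, `encode`, and `decode` are hypothetical helpers chosen for this example.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules on top of the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left, right) -> new token id, kept in learned order
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Apply the learned merges to `text` in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand merged tokens back into bytes, then decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Because the base vocabulary covers every byte value, any text round-trips through `encode` and `decode`, including text that never appeared in the training data.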

from toktkn import BPETokenizer, TokenizerConfig

# create new tokenizer
config = TokenizerConfig(vocab_size=10)
bpe = BPETokenizer(config)

# build encoding rules on some corpus
bpe.train("some really interesting training data here...")
text = "rust is pretty fun 🦀"

assert bpe.decode(bpe.encode(text)) == text

# serialize to disk
bpe.save_pretrained("tokenizer.json")
del bpe
bpe = BPETokenizer.from_pretrained("tokenizer.json")
assert len(bpe) == 10

Install

Install toktkn from PyPI:

pip install toktkn

Note: building from source requires cargo, so make sure it is installed first!
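One quick way to confirm that cargo is visible before attempting a source build is to look it up on PATH. This sketch uses only the standard library; `cargo_available` is a hypothetical helper name, not part of toktkn.

```python
import shutil


def cargo_available() -> bool:
    """Return True if a `cargo` executable can be found on PATH."""
    return shutil.which("cargo") is not None


if __name__ == "__main__":
    if cargo_available():
        print("cargo found: a source build should be possible")
    else:
        print("cargo not found: install the Rust toolchain or use a prebuilt wheel")
```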

Performance

Slightly faster than OpenAI's tiktoken and a lot quicker than 🤗 tokenizers!

Performance was measured on 2.5 MB of the wikitext test split, against OpenAI's tiktoken GPT-2 tokenizer (tiktoken==0.6.0) and the 🤗 tokenizers implementation (tokenizers==0.19.1).
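The benchmark script itself is not shown here, so the following is only a sketch of how such a throughput comparison could be set up, not the author's harness. `throughput_mb_per_s` is a hypothetical helper that times any `encode` callable. The commented usage assumes that tiktoken is installed, that `corpus` holds the test text, and that `bpe` is a trained tokenizer as in the example at the top.

```python
import time
from typing import Callable, Sequence


def throughput_mb_per_s(
    encode: Callable[[str], Sequence[int]], text: str, repeats: int = 5
) -> float:
    """Return the best-of-`repeats` encoding throughput in MB of UTF-8 input per second."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return n_bytes / 1e6 / max(best, 1e-12)


# Usage sketch (assumptions: tiktoken installed, `corpus` and `bpe` defined elsewhere):
#   import tiktoken
#   enc = tiktoken.get_encoding("gpt2")
#   print("tiktoken:", throughput_mb_per_s(enc.encode, corpus))
#   print("toktkn:  ", throughput_mb_per_s(bpe.encode, corpus))
```

Taking the best of several runs reduces noise from warm-up and background load, but the numbers remain machine-dependent.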

Download files

Source Distribution

  • toktkn-0.1.0.tar.gz (44.4 kB): Source

Built Distributions

  • toktkn-0.1.0-cp310-abi3-win_amd64.whl (302.8 kB): CPython 3.10+, Windows x86-64
  • toktkn-0.1.0-cp310-abi3-win32.whl (285.9 kB): CPython 3.10+, Windows x86
  • toktkn-0.1.0-cp310-abi3-musllinux_1_2_x86_64.whl (670.0 kB): CPython 3.10+, musllinux: musl 1.2+ x86-64
  • toktkn-0.1.0-cp310-abi3-musllinux_1_2_i686.whl (689.3 kB): CPython 3.10+, musllinux: musl 1.2+ i686
  • toktkn-0.1.0-cp310-abi3-musllinux_1_2_armv7l.whl (748.6 kB): CPython 3.10+, musllinux: musl 1.2+ ARMv7l
  • toktkn-0.1.0-cp310-abi3-musllinux_1_2_aarch64.whl (663.9 kB): CPython 3.10+, musllinux: musl 1.2+ ARM64
  • toktkn-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (499.1 kB): CPython 3.10+, manylinux: glibc 2.17+ x86-64
  • toktkn-0.1.0-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (574.6 kB): CPython 3.10+, manylinux: glibc 2.17+ s390x
  • toktkn-0.1.0-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (549.8 kB): CPython 3.10+, manylinux: glibc 2.17+ ppc64le
  • toktkn-0.1.0-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (518.0 kB): CPython 3.10+, manylinux: glibc 2.17+ i686
  • toktkn-0.1.0-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (487.2 kB): CPython 3.10+, manylinux: glibc 2.17+ ARMv7l
  • toktkn-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (488.5 kB): CPython 3.10+, manylinux: glibc 2.17+ ARM64
  • toktkn-0.1.0-cp310-abi3-macosx_11_0_arm64.whl (432.8 kB): CPython 3.10+, macOS 11.0+ ARM64
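The built distributions above follow the standard wheel naming layout, `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl`. As a small illustration, this sketch splits one of the file names into its parts. `parse_wheel_name` is a hypothetical helper that deliberately ignores the optional build tag, which none of these files use.

```python
from typing import List, NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tags: List[str]


def parse_wheel_name(filename: str) -> WheelName:
    """Split a wheel file name (without a build tag) into its five components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}")
    dist, version, py, abi, plat = parts
    # A compressed platform tag lists several compatible platforms separated by dots.
    return WheelName(dist, version, py, abi, plat.split("."))


name = parse_wheel_name(
    "toktkn-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(name.python_tag, name.abi_tag, name.platform_tags)
```

For anything beyond a quick look, the `packaging` library's `packaging.utils.parse_wheel_filename` handles build tags and name normalization.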

File details

Details for the file toktkn-0.1.0.tar.gz.

File metadata

  • Download URL: toktkn-0.1.0.tar.gz
  • Upload date:
  • Size: 44.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for toktkn-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d26fe77a98f803343663c3e07735296c5f7163c1d346b89a11fade0cae5c691d
MD5 009c0de44f74787e3942203a9229b2a4
BLAKE2b-256 61f1385d3ad17ab828d8c1313092628ac024e5f13bd9e7aa4a043c096e82ec34
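A downloaded file can be checked against the published digest with the standard library alone. The sketch below computes a SHA256 digest in chunks; `sha256_of_file` is a hypothetical helper name, and the commented check assumes the source distribution has been saved in the current directory under its original file name.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Published SHA256 for toktkn-0.1.0.tar.gz, copied from the hash listing above.
EXPECTED_SDIST_SHA256 = "d26fe77a98f803343663c3e07735296c5f7163c1d346b89a11fade0cae5c691d"

# Usage sketch (assumes the file was downloaded to the working directory):
#   assert sha256_of_file("toktkn-0.1.0.tar.gz") == EXPECTED_SDIST_SHA256
```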

File details

Details for the file toktkn-0.1.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: toktkn-0.1.0-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 302.8 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 770795f50c66c0fdfc29e9ca3a572e90c1c57f68e5f33fb1564636f153eaeaf5
MD5 7806fbac5c9df128849c27dfc501f22f
BLAKE2b-256 cf40b0d5a31c58c9f5a8cf064a52d8416ef841e6390c4d55523e5f971dcf9e04

File details

Details for the file toktkn-0.1.0-cp310-abi3-win32.whl.

File metadata

  • Download URL: toktkn-0.1.0-cp310-abi3-win32.whl
  • Upload date:
  • Size: 285.9 kB
  • Tags: CPython 3.10+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 0e239bf1158eda52408c3b4168cd52dd50bdad234e57d9ea154536562c7e8e45
MD5 65e18cdf2d151cb4628a86d4ef1b34a0
BLAKE2b-256 a582175d95595c8e327b0886cb63b0a54fe4c12952834823c764a47e3db53795

File details

Details for the file toktkn-0.1.0-cp310-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 49f09ef7b00531e328ed60d750540118868a501f2c1fada26096a52a2e1f5239
MD5 dbf61f3e7d7339f6e00ec60b46e4ff47
BLAKE2b-256 5ebc164de1657832e1a3e4f42a4593a064b8cdc25cb3f2bdf0715d542d0d3fd9

File details

Details for the file toktkn-0.1.0-cp310-abi3-musllinux_1_2_i686.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 5b2bd28f740f1076e1ed579f069b0be190cf4b2800b5fa6189a4c8e24d2b6566
MD5 066b3e0d3823f72845802e7d96e10a56
BLAKE2b-256 c5e2abb17bdad62efefe724e8296b57fbe08b816e3fa298e2cd3048c075247cf

File details

Details for the file toktkn-0.1.0-cp310-abi3-musllinux_1_2_armv7l.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 21bc9f47eaf47ed98c3b28294ec49eaf2ce29001d91ff4327a7a2f9250c1d590
MD5 9f0fa800ee28d016da2b050fb327e51b
BLAKE2b-256 ab308d6512ed9ed04262888cfcd13f33b524b1229dd6047a6beb10140837874a

File details

Details for the file toktkn-0.1.0-cp310-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 1f102440d2d89609c480392ab57c917f1e6f695717fab16c128987cd6ea1d5cf
MD5 8f7485836ff0d5c0f950cf0e65b73451
BLAKE2b-256 2141528b42ae08a76063233e934babce60e1fe24d6b59231389859a367ba9f8f

File details

Details for the file toktkn-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 965a9abd4486f21c3c744223264f5722e28d97951f9797e54e7c5d98b7158cdf
MD5 b1139ed894745f2ddd7cfc263173dd6d
BLAKE2b-256 fd711d1a35d301542971c9b4df94fae0375c788d6a4d79bc53c4e6898d06097a

File details

Details for the file toktkn-0.1.0-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 2307a3d5fbab343f8664113e6ffbe946d58540af5bbfc8fc5a685e919e6b1a2b
MD5 d8340500b503a9e77aee42f469d05898
BLAKE2b-256 706627b7933d63fbd4f22f0615012c3786c1b9cb3e848e2ea59b7a166634ee67

File details

Details for the file toktkn-0.1.0-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 953188079011e33cc1136a309ca85005dc6037f02a6c505b526b5f67598f80b6
MD5 7e85781c233b3eb43e6fb39d548ad332
BLAKE2b-256 90da731ac2892ae16185d682d91e29872895c579558aad2eae90590cf573a8fb

File details

Details for the file toktkn-0.1.0-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 70270b1abe6d5f7755f7fd5837ccf2fce3e9d97dccd9dc038f0c9fdc731d8537
MD5 d0dbd22acc960f9cb5d46d50785c98ac
BLAKE2b-256 c0ceaa454cf15ad6689da62c743dced17a6e355181c78808d2f28f20a23cbdcc

File details

Details for the file toktkn-0.1.0-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 562fb5ad5f014d8fae3db4fa8edc1bf189ed6b12bbb064e1f11a9e695fb67111
MD5 056da58d698757e94f022daee2184a07
BLAKE2b-256 c1f4a7d9c228c62709aa29bbf523937588ea079d8b8940bebee019031bc49f6d

File details

Details for the file toktkn-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8c3f9248d931a735b1d4fe2212a4a28f7f92e4e9ed498d06fbcce210f8cf2fb9
MD5 8e1f514538ffb33e9fce01db831ead4b
BLAKE2b-256 40f782c0e74bc2673d7decc780165bed5cb481dc0fb831f2e2b03a7521aa355c

File details

Details for the file toktkn-0.1.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for toktkn-0.1.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e119b49fff458da07d7cce628c8d8081ca6ec515404d0ab7cbb41e69e225e292
MD5 01a734f53f7d0c4476f86f2e716a091d
BLAKE2b-256 a306e4dcaede970381ced374bdb74cdcc0553888070f50f5bd1f20e22d064e64
