a minimal byte-pair encoding tokenizer implementation

Project description

🪙 toktkn

toktkn is a BPE tokenizer implemented in Rust and exposed to Python through PyO3 bindings.

from toktkn import BPETokenizer, TokenizerConfig

# create new tokenizer
config = TokenizerConfig(vocab_size=10)
bpe = BPETokenizer(config)

# build encoding rules on some corpus
bpe.train("some really interesting training data here...")
text = "rust is pretty fun 🦀"

assert bpe.decode(bpe.encode(text)) == text

# serialize to disk
bpe.save_pretrained("tokenizer.json")
del bpe

# load it back from disk; the vocabulary size is preserved
bpe = BPETokenizer.from_pretrained("tokenizer.json")
assert len(bpe) == 10
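
For readers new to byte-pair encoding, the train/encode/decode cycle used above can be sketched in plain Python. This is only an illustration of the general algorithm, not toktkn's Rust implementation: the function names `train_bpe`, `merge`, `encode`, and `decode` are hypothetical, and the sketch assumes that `vocab_size` counts the 256 raw byte values plus the learned merges, which may differ from how toktkn interprets `vocab_size`.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> dict:
    """Learn merge rules; assumes vocab_size includes the 256 byte tokens."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges


def encode(text: str, merges: dict) -> list:
    ids = list(text.encode("utf-8"))
    # Apply merges in the order they were learned (dicts keep insertion order).
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges: dict) -> str:
    # Rebuild the byte string behind every token id, then decode as UTF-8.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train_bpe("aaabdaaabac", vocab_size=259)  # learns 3 merges
text = "rust is pretty fun 🦀"
assert decode(encode(text, merges), merges) == text
```

Because every merge only concatenates the bytes of two existing tokens, decoding always recovers the original text, which is the same round-trip property the toktkn example asserts.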

Install

Install toktkn from PyPI with the following command:

pip install toktkn

Note: if you want to build from source, make sure cargo is installed!

Performance

Slightly faster than OpenAI's tiktoken and a lot quicker than 🤗 tokenizers!

Performance was measured on 2.5 MB of the wikitext test split, comparing against OpenAI's tiktoken GPT-2 tokenizer (tiktoken==0.6.0) and the 🤗 tokenizers implementation (tokenizers==0.19.1).
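
The exact benchmark script is not shown here. As a rough sketch of how such a comparison could be timed, the helper below measures encoding throughput for any `encode` callable. The name `throughput_mb_per_s` and the best-of-N timing strategy are assumptions made for illustration, not the method used to produce the numbers above.

```python
import time
from typing import Callable, Sequence


def throughput_mb_per_s(
    encode: Callable[[str], Sequence[int]], text: str, repeats: int = 5
) -> float:
    """Return the best observed encoding throughput in MB/s over `repeats` runs."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return n_bytes / max(best, 1e-12) / 1e6


# Stand-in encoder so the sketch runs without any tokenizer installed.
def byte_encoder(s: str) -> list:
    return list(s.encode("utf-8"))


rate = throughput_mb_per_s(byte_encoder, "some benchmark text " * 10_000)
print(f"{rate:.1f} MB/s")
```

To compare real tokenizers, pass each library's encode method (for example `bpe.encode` from the usage example) with the same input text.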

Download files

Download the file for your platform.

Source Distribution

  • toktkn-0.1.2.tar.gz (27.0 kB), Source

Built Distributions

  • toktkn-0.1.2-cp310-abi3-win_amd64.whl (302.9 kB), CPython 3.10+, Windows x86-64
  • toktkn-0.1.2-cp310-abi3-win32.whl (286.0 kB), CPython 3.10+, Windows x86
  • toktkn-0.1.2-cp310-abi3-musllinux_1_2_x86_64.whl (670.6 kB), CPython 3.10+, musllinux: musl 1.2+ x86-64
  • toktkn-0.1.2-cp310-abi3-musllinux_1_2_i686.whl (693.2 kB), CPython 3.10+, musllinux: musl 1.2+ i686
  • toktkn-0.1.2-cp310-abi3-musllinux_1_2_armv7l.whl (751.0 kB), CPython 3.10+, musllinux: musl 1.2+ ARMv7l
  • toktkn-0.1.2-cp310-abi3-musllinux_1_2_aarch64.whl (669.4 kB), CPython 3.10+, musllinux: musl 1.2+ ARM64
  • toktkn-0.1.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (500.8 kB), CPython 3.10+, manylinux: glibc 2.17+ x86-64
  • toktkn-0.1.2-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (577.6 kB), CPython 3.10+, manylinux: glibc 2.17+ s390x
  • toktkn-0.1.2-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (554.8 kB), CPython 3.10+, manylinux: glibc 2.17+ ppc64le
  • toktkn-0.1.2-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (521.6 kB), CPython 3.10+, manylinux: glibc 2.17+ i686
  • toktkn-0.1.2-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (489.6 kB), CPython 3.10+, manylinux: glibc 2.17+ ARMv7l
  • toktkn-0.1.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (492.7 kB), CPython 3.10+, manylinux: glibc 2.17+ ARM64
  • toktkn-0.1.2-cp310-abi3-macosx_11_0_arm64.whl (432.5 kB), CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file toktkn-0.1.2.tar.gz.

File metadata

  • Download URL: toktkn-0.1.2.tar.gz
  • Upload date:
  • Size: 27.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for toktkn-0.1.2.tar.gz
Algorithm Hash digest
SHA256 9767906765dfd052f2eedc2b021f2b0fb02a2ebce6529c9ea6aa1db6e55633a7
MD5 5b39ad93e1c3f4bb96fb7bacbe3813da
BLAKE2b-256 a98fe8e697594b1ab53809c0a7f0321cb15cb0eba4f8e30760c8b7da660b8897
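
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The helper names `sha256_of` and `matches_published` below are hypothetical, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes toktkn-0.1.2.tar.gz was downloaded to the current directory):
# matches_published(
#     "toktkn-0.1.2.tar.gz",
#     "9767906765dfd052f2eedc2b021f2b0fb02a2ebce6529c9ea6aa1db6e55633a7",
# )
```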

File details

Details for the file toktkn-0.1.2-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: toktkn-0.1.2-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 302.9 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b2de995e9129916f2b870d036a19d18ca2f25f7e9ef6d08914383db05057a62b
MD5 bf91b8e3f670144adc58c4cc66506767
BLAKE2b-256 92ac33f31d8b0239dc4890adf56aa38ad58645fc6fb5b72b12257d17985c5fba

File details

Details for the file toktkn-0.1.2-cp310-abi3-win32.whl.

File metadata

  • Download URL: toktkn-0.1.2-cp310-abi3-win32.whl
  • Upload date:
  • Size: 286.0 kB
  • Tags: CPython 3.10+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 176527b5d38264a31625cd98ab9e4a9988075c1d5e7e640df7b155080d38da30
MD5 32bedf5efae41c5bb0cbf913ed3fe93c
BLAKE2b-256 1de6c75c3026b0d7ebff410bce523f36155d04e27c8fc103ff935f3aa6e3e384

File details

Details for the file toktkn-0.1.2-cp310-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 95f5305b6239c8032ab04c72a91031726e98cf6559c8aba459f18c83b8018e8e
MD5 a816a066c658440aa3e5b3caf7df4420
BLAKE2b-256 56dad42860bec76da0c831f7636cfe05e4bd02a218aca58d00c9dac503b2a0fe

File details

Details for the file toktkn-0.1.2-cp310-abi3-musllinux_1_2_i686.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 51c13bb28597d253d6663053df1f9d1e49fd241f06c296d9808ded67354d8528
MD5 fd49a7e774bd47cfe18075a89e075ce5
BLAKE2b-256 a5242a532902d3923a863d820c299474dc1856032bcac34160f54413f4bace16

File details

Details for the file toktkn-0.1.2-cp310-abi3-musllinux_1_2_armv7l.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 b42c9c3f282b50502bd3e40aa510abc789ec5aafc814d8e67f541f4d300d32f5
MD5 78c8cd2723e2909d46da0a0e7adea3f5
BLAKE2b-256 a4775c69f5a8edcfdac2b3d3a27facd45fc81f3ed0a7bfcb3f38a66d7e37adca

File details

Details for the file toktkn-0.1.2-cp310-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 a18ee8360446f6879b77a128c462f9c154aeb6ce8ebe38227f25d5a1bfbefc53
MD5 d533b4431ebb8379697d2addb25b607c
BLAKE2b-256 36190ba13ab1d59ca096d5dcfb0c1f6548d4da3f11db8a1f4acbc7d7170b79ad

File details

Details for the file toktkn-0.1.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fb6f101a20e8411f1813d674b31b42b0c05a06a50dc4fa8ca458558068039f2a
MD5 185133f6e47e9c12a5acd5184acc3840
BLAKE2b-256 176bd774a79a55083264a514062fe2900c515bbe8068bb8d32123aab77765bf6

File details

Details for the file toktkn-0.1.2-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 86ef31f6c4a41c9b4dcaaf8ac0d79d3c44c03f8ce3eeace8836f02cdd151e186
MD5 13bc426175c84b4b45070ed1c731ca6c
BLAKE2b-256 470e0ac8c1cc39d03cca10942ad5dcf30ff8d086c9ce1b679d627a1b3f71542e

File details

Details for the file toktkn-0.1.2-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 a441769c1341ed74f6a3ac8b4978e81fc16cad3a60fc19c42af6254255005544
MD5 98538d1520e1798057ce6b8339c4a53e
BLAKE2b-256 f519b273b6c89b1b46861d50f29a6e6de32bb402fd2148cec3364fa7ae435fb1

File details

Details for the file toktkn-0.1.2-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 129acd2aa4d1dd9f846b87935ad0c49d66c710e966b044c7bd66143eb6c864d3
MD5 50ebbaee1db9adf99a63c290a0b8ee2b
BLAKE2b-256 b742896261653b1d80ee053e7ad927461a04a0c8dea568c48d949c32bb62d354

File details

Details for the file toktkn-0.1.2-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 73bccedfaba8f95afd236b29bdeeef813cdca15c6f6744088e244cf8ba82bb2b
MD5 47732aef8173585cd94e4a82843e7903
BLAKE2b-256 c5cb7ecb0c1535b4dae876a493d0301fb94c4bb902889f980de82c4ac25765d4

File details

Details for the file toktkn-0.1.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 878e3cc26f7da390a0d487bf5f4817d5a6f9c6a685fc7e997894879c90a3183f
MD5 1e134ff405a1cb3920c9fff55e0d92fe
BLAKE2b-256 1b7864561195876614943064dc246581d17c2c67fea511cf99c3fc6d4ec429ed

File details

Details for the file toktkn-0.1.2-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for toktkn-0.1.2-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5a849ca303021b460b0f3ed2de24c1a6db55a36a4f95c5e969d1c52e93336682
MD5 4b91709bd8e7465c8b130fe6fd24e719
BLAKE2b-256 0d5f5e98489bdbc5e218ff38993e200856edf7b1be2068c3584dc34fa678bc51
