Lightweight piece tokenization library

Project description

🥢 Curated Tokenizers

This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currently supported:

| Tokenizer | Binding | Example model |
| --- | --- | --- |
| BPE | sentencepiece | |
| Byte BPE | Native | RoBERTa/GPT-2 |
| Unigram | sentencepiece | XLM-RoBERTa |
| Wordpiece | Native | BERT |
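
To give a feel for what a wordpiece tokenizer does, here is a minimal pure-Python sketch of greedy longest-match-first splitting with the `##` continuation prefix used by BERT-style vocabularies. It is an illustration only and not this library's API: the function name `wordpiece_tokenize`, the `[UNK]` marker, and the toy vocabulary are all assumptions made for the example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position: treat the whole word as unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_tokenize("tokenizers", toy_vocab))
# → ['token', '##izer', '##s']
```

Real models ship vocabularies with tens of thousands of pieces; the matching idea stays the same.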

⚠️ Warning: experimental package

This package is experimental and it is likely that the APIs will change in incompatible ways.

⏳ Install

Curated tokenizers is available through PyPI:

pip install curated_tokenizers

🚀 Quickstart

The best way to get started with curated tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from the Hugging Face Hub.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

curated_tokenizers-2.0.0.tar.gz (2.3 MB)

Uploaded Source

Built Distributions

curated_tokenizers-2.0.0-cp312-cp312-win_amd64.whl (761.3 kB)

Uploaded CPython 3.12 Windows x86-64

curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (775.2 kB)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (746.4 kB)

Uploaded CPython 3.12 manylinux: glibc 2.17+ ARM64

curated_tokenizers-2.0.0-cp312-cp312-macosx_11_0_arm64.whl (741.4 kB)

Uploaded CPython 3.12 macOS 11.0+ ARM64

curated_tokenizers-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl (775.0 kB)

Uploaded CPython 3.12 macOS 10.9+ x86-64

curated_tokenizers-2.0.0-cp311-cp311-win_amd64.whl (760.9 kB)

Uploaded CPython 3.11 Windows x86-64

curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (776.9 kB)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (749.6 kB)

Uploaded CPython 3.11 manylinux: glibc 2.17+ ARM64

curated_tokenizers-2.0.0-cp311-cp311-macosx_11_0_arm64.whl (742.0 kB)

Uploaded CPython 3.11 macOS 11.0+ ARM64

curated_tokenizers-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl (774.6 kB)

Uploaded CPython 3.11 macOS 10.9+ x86-64

curated_tokenizers-2.0.0-cp310-cp310-win_amd64.whl (760.7 kB)

Uploaded CPython 3.10 Windows x86-64

curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (772.8 kB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (745.7 kB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

curated_tokenizers-2.0.0-cp310-cp310-macosx_11_0_arm64.whl (741.8 kB)

Uploaded CPython 3.10 macOS 11.0+ ARM64

curated_tokenizers-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl (773.9 kB)

Uploaded CPython 3.10 macOS 10.9+ x86-64

curated_tokenizers-2.0.0-cp39-cp39-win_amd64.whl (762.3 kB)

Uploaded CPython 3.9 Windows x86-64

curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (774.8 kB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (747.6 kB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

curated_tokenizers-2.0.0-cp39-cp39-macosx_11_0_arm64.whl (743.3 kB)

Uploaded CPython 3.9 macOS 11.0+ ARM64

curated_tokenizers-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl (775.9 kB)

Uploaded CPython 3.9 macOS 10.9+ x86-64
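
The wheel filenames above encode the Python version, ABI, and platform each file targets. As a rough aid for picking the right one, the sketch below splits a filename into those parts. It assumes the common five-part layout `name-version-pythontag-abitag-platformtag.whl` with no build tag; the helper name `parse_wheel_filename` is made up for this example and is not part of this package.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components.

    Assumes the five-part form
    name-version-pythontag-abitag-platformtag.whl (no build tag).
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel name layout: {filename}")
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        # A wheel may list several platform tags joined by dots.
        "platforms": platform_tag.split("."),
    }


info = parse_wheel_filename(
    "curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info["python"], info["platforms"])
# → cp312 ['manylinux_2_17_x86_64', 'manylinux2014_x86_64']
```

In normal use `pip install curated_tokenizers` selects a compatible wheel automatically, so manual parsing like this is only useful for inspection or scripting.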

File details

Details for the file curated_tokenizers-2.0.0.tar.gz.

File metadata

  • Download URL: curated_tokenizers-2.0.0.tar.gz
  • Upload date:
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.7.9

File hashes

Hashes for curated_tokenizers-2.0.0.tar.gz
Algorithm Hash digest
SHA256 0a8f8c527bc93a6404ec0ba8f0df215b8c205edc81a05617fbbe705438804094
MD5 42f6216ca15f18cd509001c21d9ba417
BLAKE2b-256 732223b6fe4e15a788405a64304067ca46ff104283f9c3ff9608b4b5be7fac48

See more details on using hashes here.
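
One way to use the digests listed on this page is to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module; the SHA256 value is the one shown above for the source distribution, while the local path and the helper names `sha256_of_file` and `matches_published_hash` are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for curated_tokenizers-2.0.0.tar.gz.
EXPECTED_SHA256 = "0a8f8c527bc93a6404ec0ba8f0df215b8c205edc81a05617fbbe705438804094"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("curated_tokenizers-2.0.0.tar.gz")
    if archive.exists():
        ok = matches_published_hash(archive, EXPECTED_SHA256)
        print("hash OK" if ok else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can also enforce this itself: a requirements file line such as `curated_tokenizers==2.0.0 --hash=sha256:<digest>` installed with `pip install --require-hashes -r requirements.txt` makes pip reject any file whose digest does not match.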

File details

Details for the file curated_tokenizers-2.0.0-cp312-cp312-win_amd64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 32a6e98498f694bcc3c8b3a752f8cb3327b9001d605974d399ceacf66920f37d
MD5 3f8d7e8430f442861218977cb57e9c1e
BLAKE2b-256 bdf26e9b2915e031f7b6e377bccc8399dab151cc90ef2f1c6f14cce743321fc2

File details

Details for the file curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9b948d299157c317474d7c2c325b073187c6123c5c520b299603548781b8b98a
MD5 1cb2dfa4c793a67344145741ac54e81b
BLAKE2b-256 fe330ce2cad39bdcaa8eeda2a7de7ce4613ea6f5c540d5accf0fbe51c9f5253c

File details

Details for the file curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6122c8db5f239189fd4703047315adf3d623a45922bf15cbeb9970bf5833b8f4
MD5 901c2520f3adf59a6f53dd1d997dc186
BLAKE2b-256 a74ff39fb03f8d43e1c7c7dd16523cce711d7b92f83ba2772ceb7a34e235fad6

File details

Details for the file curated_tokenizers-2.0.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 99562bbb36e6ec54f6a6cc41c73a2adb8002ec723c2670dcfcb2eee7f384f8a5
MD5 b413b112a69cdb4f6dad2f9b3b8127e9
BLAKE2b-256 90e84b5a4f8774f59badf1bc4fadc3ba2aa27b07308556cabe10bd7ab70a6de3

File details

Details for the file curated_tokenizers-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 480f702afd1d0c53d44f8aaef08f379d769d2894c469777d5d0f9030179f21d4
MD5 365e9ee0c8c97b49c4e6c5bf253544ff
BLAKE2b-256 4c135788a0b18c126c85b6a44102df6d61e9f991ac15491b3922b73f3447516e

File details

Details for the file curated_tokenizers-2.0.0-cp311-cp311-win_amd64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 6d56eb806a791b5818cbccdf22fffe39c07eff2f2dd1a32b94f11792c646d343
MD5 62997a2d6da014ee01236faeeefc63d8
BLAKE2b-256 2b6141dc572a610e84b0f87a335f146acd21da6eed646f93bd9c8a27cc99fbfe

File details

Details for the file curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 51d0d86094eb863995c6eb5d0990a064e840c9a7e273358fa4596f7958593d93
MD5 e29561a8cfff9433dc183b4f0facb199
BLAKE2b-256 3b5eecc43ae53535add247556b4768c78ab1895b0f2712901744c851b3ad3dca

File details

Details for the file curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 563fc87e930aa466295cc0dc44697e369559b896db94453702693133c5abbbb8
MD5 517a72be3fdbb15899b33f8235c06770
BLAKE2b-256 7cca0e606ecb19e6938416815ce5a9e0090fe508cb0a1c6c1389686fc2272b19

File details

Details for the file curated_tokenizers-2.0.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 dbad2c2f40f81e9d720f8c272843b786f906712582f39d338a42a361f6b4856e
MD5 4b45d5244c2cfccf25dab048f3ab2586
BLAKE2b-256 b37358be1d08b7738ba76db1f1cd499d9bb8e72c06a89e7d2d9dac2b80605b4e

File details

Details for the file curated_tokenizers-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 fef8dcc273845ca24e3f3eb8ac53a4b4a45cf1a4e64d1a89370bdc95a02e8fb0
MD5 4e71a6b5680f08d196b370b9f3a24f3b
BLAKE2b-256 c480360e34d47922eb70226eeb02b946b48194c1bf41721427880be5b520aa74

File details

Details for the file curated_tokenizers-2.0.0-cp310-cp310-win_amd64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 d21af1ec57faac43f3c0214878a6805abb6a56123e75ab5f011babf4cc9cfd9c
MD5 a84aaf6713efd35d5bc8d930d5cebf5f
BLAKE2b-256 385210d6f8f3ec94ee8d27dc9cf028c792de12a4466beb1eb8c1cb750b32ef0f

File details

Details for the file curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c15dbf7bdf1fe0f4d2d94c8a10c7cf36c420e195d32022b1b26062fd4c61c28e
MD5 e5e43b31ca4b327249edb6a965958f46
BLAKE2b-256 69fff66937ceb5bae34da4f119f260e35066bc683989075252127187599691f6

File details

Details for the file curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 a04525e89b00e163a03caa942b2a2ac8c7b97258057f51f0bb16f8f3c652bae8
MD5 5eb8cc64dec047e5b0415bfa9caf0b8c
BLAKE2b-256 274b7ac743f1cdf94bfa99bff9d4d2873bfa997c881ba00f9e77ec0405a083b0

File details

Details for the file curated_tokenizers-2.0.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 73a77d9ef4619f51fc043b76a8ff6845e7fc908d09fbdbd68043ca9dcfe8a597
MD5 af5651532cc4d3a0a73e57752b1f39e7
BLAKE2b-256 026b2b8774b6a86fb7be3a52a96d23ba1cb751a82adad0a2c0e71a2acd3ae7ff

File details

Details for the file curated_tokenizers-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 55a565b40ddd04a127e7367ebdcb560a61699350e41586c2642481ec1d6b1964
MD5 4793d47b137d73c7f31935a4fa5d07de
BLAKE2b-256 9d412e2523b258b98755a911171434878dfc61cac54b13abc6943e417b5542ba

File details

Details for the file curated_tokenizers-2.0.0-cp39-cp39-win_amd64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 571ee5087323e9478fbf1a3c73e68392fae5908ad12a65d57f985565b7c73f1b
MD5 d130b6b2b201f884c1a346d303080466
BLAKE2b-256 6fa31969c4ca7c2cd7069ce32e0bec1d6da0a71a85636d894e1655a357901daa

File details

Details for the file curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6acd7931ff5ff620a6f84ea8279312ad1d1e87a1e2014d4367c9e58b8ef23e0d
MD5 3ddccaddb09d131c66204ee5e952eab5
BLAKE2b-256 e1eb1cce7280a389cccdf7cb0b105ebd224a6acb43dc971121f71e08e1dd7f10

File details

Details for the file curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 999963a5854f7cf2c90923ea1613a852e6b4474ecb4f34785a9d49805e37856e
MD5 994d293b76881d8c9d49206eaab9b6cb
BLAKE2b-256 57a9e011134ff4f9286646d68cb242275148c3a86dea6d02d06d7c4b9a40ec71

File details

Details for the file curated_tokenizers-2.0.0-cp39-cp39-macosx_11_0_arm64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a63a9cf96a8dc1737c3b5c9f018f0a8d185b1e7e14631b5daf22c3b3420db9c4
MD5 33ff639cf9ba2a637d5690eaa9e74260
BLAKE2b-256 0519c269314f931b444b214a3593d694943e382258d766cdd37254465afd791f

File details

Details for the file curated_tokenizers-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl.

File hashes

Hashes for curated_tokenizers-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 7375dd55bc72b0e4ff43b547bf369f758f710220c0887e76c483842bcbcd24a1
MD5 12c6310c3e2d05b1e3d4c6e9f27dd212
BLAKE2b-256 cbcd583cd4de0990ec9c48d070338b84e96304425f9789a6a78a26b759004277
