Lightweight piece tokenization library
Project description
🥢 Curated Tokenizers
This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currently supported:
Tokenizer | Binding | Example model |
---|---|---|
BPE | sentencepiece | |
Byte BPE | Native | RoBERTa/GPT-2 |
Unigram | sentencepiece | XLM-RoBERTa |
Wordpiece | Native | BERT |
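To make the idea of piece tokenization concrete, here is a minimal sketch of greedy longest-match word piece splitting in plain Python. It does not use this library and is not its implementation. The function name `wordpiece_split`, the `##` continuation prefix, the `[UNK]` token and the toy vocabulary are all assumptions chosen for illustration.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into pieces by greedy longest-match-first lookup.

    Illustrative only: `vocab`, `unk` and `prefix` are hypothetical
    stand-ins, not values taken from curated_tokenizers.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_split("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
```

A word with no matching pieces, such as "xyz" with this toy vocabulary, comes back as the single unknown token.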
⚠️ Warning: experimental package
This package is experimental and it is likely that the APIs will change in incompatible ways.
⏳ Install
Curated Tokenizers is available through PyPI:
pip install curated_tokenizers
🚀 Quickstart
The best way to get started with Curated Tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from Hugging Face Hub.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file curated_tokenizers-2.0.0.tar.gz.
File metadata
- Download URL: curated_tokenizers-2.0.0.tar.gz
- Upload date:
- Size: 2.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0a8f8c527bc93a6404ec0ba8f0df215b8c205edc81a05617fbbe705438804094 |
MD5 | 42f6216ca15f18cd509001c21d9ba417 |
BLAKE2b-256 | 732223b6fe4e15a788405a64304067ca46ff104283f9c3ff9608b4b5be7fac48 |
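The digests listed for each file can be checked against a local download. The sketch below uses only the standard-library `hashlib` module. The helper names `sha256_of_file` and `matches_digest` and the local file path in the usage note are assumptions for illustration, not part of this package.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_digest("curated_tokenizers-2.0.0.tar.gz", "0a8f8c527bc93a6404ec0ba8f0df215b8c205edc81a05617fbbe705438804094")` should return `True` for an intact download.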
File details
Details for the file curated_tokenizers-2.0.0-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 761.3 kB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 32a6e98498f694bcc3c8b3a752f8cb3327b9001d605974d399ceacf66920f37d |
MD5 | 3f8d7e8430f442861218977cb57e9c1e |
BLAKE2b-256 | bdf26e9b2915e031f7b6e377bccc8399dab151cc90ef2f1c6f14cce743321fc2 |
File details
Details for the file curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 775.2 kB
- Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9b948d299157c317474d7c2c325b073187c6123c5c520b299603548781b8b98a |
MD5 | 1cb2dfa4c793a67344145741ac54e81b |
BLAKE2b-256 | fe330ce2cad39bdcaa8eeda2a7de7ce4613ea6f5c540d5accf0fbe51c9f5253c |
File details
Details for the file curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 746.4 kB
- Tags: CPython 3.12, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6122c8db5f239189fd4703047315adf3d623a45922bf15cbeb9970bf5833b8f4 |
MD5 | 901c2520f3adf59a6f53dd1d997dc186 |
BLAKE2b-256 | a74ff39fb03f8d43e1c7c7dd16523cce711d7b92f83ba2772ceb7a34e235fad6 |
File details
Details for the file curated_tokenizers-2.0.0-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 741.4 kB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 99562bbb36e6ec54f6a6cc41c73a2adb8002ec723c2670dcfcb2eee7f384f8a5 |
MD5 | b413b112a69cdb4f6dad2f9b3b8127e9 |
BLAKE2b-256 | 90e84b5a4f8774f59badf1bc4fadc3ba2aa27b07308556cabe10bd7ab70a6de3 |
File details
Details for the file curated_tokenizers-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl
- Upload date:
- Size: 775.0 kB
- Tags: CPython 3.12, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 480f702afd1d0c53d44f8aaef08f379d769d2894c469777d5d0f9030179f21d4 |
MD5 | 365e9ee0c8c97b49c4e6c5bf253544ff |
BLAKE2b-256 | 4c135788a0b18c126c85b6a44102df6d61e9f991ac15491b3922b73f3447516e |
File details
Details for the file curated_tokenizers-2.0.0-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 760.9 kB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6d56eb806a791b5818cbccdf22fffe39c07eff2f2dd1a32b94f11792c646d343 |
MD5 | 62997a2d6da014ee01236faeeefc63d8 |
BLAKE2b-256 | 2b6141dc572a610e84b0f87a335f146acd21da6eed646f93bd9c8a27cc99fbfe |
File details
Details for the file curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 776.9 kB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 51d0d86094eb863995c6eb5d0990a064e840c9a7e273358fa4596f7958593d93 |
MD5 | e29561a8cfff9433dc183b4f0facb199 |
BLAKE2b-256 | 3b5eecc43ae53535add247556b4768c78ab1895b0f2712901744c851b3ad3dca |
File details
Details for the file curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 749.6 kB
- Tags: CPython 3.11, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 563fc87e930aa466295cc0dc44697e369559b896db94453702693133c5abbbb8 |
MD5 | 517a72be3fdbb15899b33f8235c06770 |
BLAKE2b-256 | 7cca0e606ecb19e6938416815ce5a9e0090fe508cb0a1c6c1389686fc2272b19 |
File details
Details for the file curated_tokenizers-2.0.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 742.0 kB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | dbad2c2f40f81e9d720f8c272843b786f906712582f39d338a42a361f6b4856e |
MD5 | 4b45d5244c2cfccf25dab048f3ab2586 |
BLAKE2b-256 | b37358be1d08b7738ba76db1f1cd499d9bb8e72c06a89e7d2d9dac2b80605b4e |
File details
Details for the file curated_tokenizers-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl
- Upload date:
- Size: 774.6 kB
- Tags: CPython 3.11, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fef8dcc273845ca24e3f3eb8ac53a4b4a45cf1a4e64d1a89370bdc95a02e8fb0 |
MD5 | 4e71a6b5680f08d196b370b9f3a24f3b |
BLAKE2b-256 | c480360e34d47922eb70226eeb02b946b48194c1bf41721427880be5b520aa74 |
File details
Details for the file curated_tokenizers-2.0.0-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 760.7 kB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d21af1ec57faac43f3c0214878a6805abb6a56123e75ab5f011babf4cc9cfd9c |
MD5 | a84aaf6713efd35d5bc8d930d5cebf5f |
BLAKE2b-256 | 385210d6f8f3ec94ee8d27dc9cf028c792de12a4466beb1eb8c1cb750b32ef0f |
File details
Details for the file curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 772.8 kB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c15dbf7bdf1fe0f4d2d94c8a10c7cf36c420e195d32022b1b26062fd4c61c28e |
MD5 | e5e43b31ca4b327249edb6a965958f46 |
BLAKE2b-256 | 69fff66937ceb5bae34da4f119f260e35066bc683989075252127187599691f6 |
File details
Details for the file curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 745.7 kB
- Tags: CPython 3.10, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a04525e89b00e163a03caa942b2a2ac8c7b97258057f51f0bb16f8f3c652bae8 |
MD5 | 5eb8cc64dec047e5b0415bfa9caf0b8c |
BLAKE2b-256 | 274b7ac743f1cdf94bfa99bff9d4d2873bfa997c881ba00f9e77ec0405a083b0 |
File details
Details for the file curated_tokenizers-2.0.0-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 741.8 kB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 73a77d9ef4619f51fc043b76a8ff6845e7fc908d09fbdbd68043ca9dcfe8a597 |
MD5 | af5651532cc4d3a0a73e57752b1f39e7 |
BLAKE2b-256 | 026b2b8774b6a86fb7be3a52a96d23ba1cb751a82adad0a2c0e71a2acd3ae7ff |
File details
Details for the file curated_tokenizers-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl
- Upload date:
- Size: 773.9 kB
- Tags: CPython 3.10, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 55a565b40ddd04a127e7367ebdcb560a61699350e41586c2642481ec1d6b1964 |
MD5 | 4793d47b137d73c7f31935a4fa5d07de |
BLAKE2b-256 | 9d412e2523b258b98755a911171434878dfc61cac54b13abc6943e417b5542ba |
File details
Details for the file curated_tokenizers-2.0.0-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 762.3 kB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 571ee5087323e9478fbf1a3c73e68392fae5908ad12a65d57f985565b7c73f1b |
MD5 | d130b6b2b201f884c1a346d303080466 |
BLAKE2b-256 | 6fa31969c4ca7c2cd7069ce32e0bec1d6da0a71a85636d894e1655a357901daa |
File details
Details for the file curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 774.8 kB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6acd7931ff5ff620a6f84ea8279312ad1d1e87a1e2014d4367c9e58b8ef23e0d |
MD5 | 3ddccaddb09d131c66204ee5e952eab5 |
BLAKE2b-256 | e1eb1cce7280a389cccdf7cb0b105ebd224a6acb43dc971121f71e08e1dd7f10 |
File details
Details for the file curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 747.6 kB
- Tags: CPython 3.9, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 999963a5854f7cf2c90923ea1613a852e6b4474ecb4f34785a9d49805e37856e |
MD5 | 994d293b76881d8c9d49206eaab9b6cb |
BLAKE2b-256 | 57a9e011134ff4f9286646d68cb242275148c3a86dea6d02d06d7c4b9a40ec71 |
File details
Details for the file curated_tokenizers-2.0.0-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 743.3 kB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a63a9cf96a8dc1737c3b5c9f018f0a8d185b1e7e14631b5daf22c3b3420db9c4 |
MD5 | 33ff639cf9ba2a637d5690eaa9e74260 |
BLAKE2b-256 | 0519c269314f931b444b214a3593d694943e382258d766cdd37254465afd791f |
File details
Details for the file curated_tokenizers-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl.
File metadata
- Download URL: curated_tokenizers-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl
- Upload date:
- Size: 775.9 kB
- Tags: CPython 3.9, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7375dd55bc72b0e4ff43b547bf369f758f710220c0887e76c483842bcbcd24a1 |
MD5 | 12c6310c3e2d05b1e3d4c6e9f27dd212 |
BLAKE2b-256 | cbcd583cd4de0990ec9c48d070338b84e96304425f9789a6a78a26b759004277 |