Lightweight piece tokenization library
🥢 Curated Tokenizers
This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currently supported:
Tokenizer | Binding | Example model |
---|---|---|
BPE | sentencepiece | |
Byte BPE | Native | RoBERTa/GPT-2 |
Unigram | sentencepiece | XLM-RoBERTa |
Wordpiece | Native | BERT |
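As a rough illustration of what a wordpiece tokenizer does, the sketch below splits a single word into pieces by greedy longest-match-first lookup, the scheme used by BERT-style vocabularies. It is a toy in plain Python, not this library's implementation or API: the function name `wordpiece_split`, the `##` continuation prefix, the `[UNK]` symbol, and the tiny vocabulary are all assumptions made for the example.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_split("tokenizers", toy_vocab))
```

A real vocabulary has tens of thousands of pieces, and a production tokenizer also handles pre-tokenization and special tokens, which this sketch leaves out.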
⚠️ Warning: experimental package
This package is experimental and it is likely that the APIs will change in incompatible ways.
⏳ Install
Curated tokenizers is available through PyPI:
pip install curated_tokenizers
🚀 Quickstart
The best way to get started with curated tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from Huggingface Hub.