Lightweight piece tokenization library
🥢 Curated Tokenizers
This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currently supported:
Tokenizer | Binding | Example model |
---|---|---|
BPE | sentencepiece | |
Byte BPE | Native | RoBERTa/GPT-2 |
Unigram | sentencepiece | XLM-RoBERTa |
Wordpiece | Native | BERT |
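To give a feel for what a wordpiece tokenizer does, here is a minimal pure-Python sketch of greedy longest-match-first splitting. It is an illustration only and does not use or reproduce this library's implementation; the function name `wordpiece_encode`, the `##` continuation prefix, the `[UNK]` marker, and the toy vocabulary are all assumptions chosen for the example.

```python
from typing import List, Set


def wordpiece_encode(
    word: str, vocab: Set[str], unk: str = "[UNK]", prefix: str = "##"
) -> List[str]:
    """Split one word into pieces by greedy longest-match-first lookup.

    Hypothetical helper for illustration; not part of curated_tokenizers.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for this example.
toy_vocab = {"token", "##izer", "##s"}
print(wordpiece_encode("tokenizers", toy_vocab))
```

With this toy vocabulary the word is split into an initial piece followed by prefixed continuation pieces, and a word with no matching pieces falls back to the unknown marker.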
⚠️ Warning: experimental package
This package is experimental and it is likely that the APIs will change in incompatible ways.
⏳ Install
Curated tokenizers is available through PyPI:
pip install curated_tokenizers
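After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata with the standard library. This is a small sketch, assuming Python 3.8 or later; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("curated_tokenizers"))
```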
🚀 Quickstart
The best way to get started with curated tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from the Hugging Face Hub.
Download files
Built Distributions
Hashes for curated_tokenizers-0.9.0-cp311-cp311-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | ce834ae5eb5971bd46dd12c4da638170d9137dda135442a6501bb1e4804aa57f |
MD5 | 298ce491f41a21af95e956329364b181 |
BLAKE2b-256 | e28cdbb6bce9984c4f6454b55b0b864182f9e186aab386d44e26c161576de2e1 |
Hashes for curated_tokenizers-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 7c12d502cb6762ed3bb290d64966cb621d6983cb83821baefa880aa7482cd454 |
MD5 | a213dfa5851acdf0ed6a2f7365c03a2f |
BLAKE2b-256 | bc643e617d33837c5e0ab56dec8fe98b01582f3544137683de7f2c8818e0cd78 |
Hashes for curated_tokenizers-0.9.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 65a52cec8392cc5466afff5130ba0549792d1026d5555aaa609afa1d685687f3 |
MD5 | 53257e0fa03f49264815614ffd385bb9 |
BLAKE2b-256 | 4b357146b611b72accc5e41e2a1979daaf16dd39dd3f460be53fbdc927dee532 |
Hashes for curated_tokenizers-0.9.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm | Hash digest |
---|---|
SHA256 | b13d82ddda85308fba90c21489f1ed4d02b8a9b7bfefb3c6a747f267dee8afc9 |
MD5 | 2e34dbdef7effcd94479913bb8b0273b |
BLAKE2b-256 | ff973aa8f1dfa4e2b4f5c2a74fd2ba37316bf3d372f402eaee3a5ab29978d179 |
Hashes for curated_tokenizers-0.9.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | ea37c18a2b9322d809baed80116c0e90308fa116203b6b8c52267ad06b59ce7a |
MD5 | ca85ea2d863711b111d35fd755f5f746 |
BLAKE2b-256 | cb5c84a73f4d81d5d7513b24a16cfad74b86903b9e70e30b3761575ffef73237 |
Hashes for curated_tokenizers-0.9.0-cp310-cp310-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 6ac2834499d96e35240a2ec9e5dd611317e5d1734e7f8c07c41a873a72ee95e4 |
MD5 | bab35474386953d2b91d58ff1c64e065 |
BLAKE2b-256 | d446c063b4c03720fe0dd00b18b6ff2d8968da4a85600cb210684c21ac81f392 |
Hashes for curated_tokenizers-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | a52dc73753e69a89e08cf1e31c0475ed9762e1f989ef802d943fb7dade78af11 |
MD5 | 2cd41d482debe8d0b7223209dc43dea4 |
BLAKE2b-256 | 7537627d318531728751f41bfa7c7c54eb79570456262404b78219da7b1e97d3 |
Hashes for curated_tokenizers-0.9.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 16fd2165f25326677a49efbc5e3998ebf845cde4e6c3c9f85dffa630883f1d9e |
MD5 | c1a273cf26083f9e8afaa8703d128886 |
BLAKE2b-256 | 375c46fb2d8e5e9a6d6c1397dc8f472a1eb9adf58087882b49fa7efcafeee9b0 |
Hashes for curated_tokenizers-0.9.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm | Hash digest |
---|---|
SHA256 | f8d63839cc94b7f2f2ac62aaade379a0ddb975799f59a862cb02ad0ea585f506 |
MD5 | 3e69cad1ecac77ab658ec8f3acfe3822 |
BLAKE2b-256 | a1c76d080fd3d67357951c04971fcd5cc421b6782dcf65d15c76bd7d9b710892 |
Hashes for curated_tokenizers-0.9.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 15a9019dec0414941f8a116ecd56f50bb3e47c253e402b981af4ff13619eb364 |
MD5 | 10165ed5aba5af911d7ab39369d413b9 |
BLAKE2b-256 | 78151e87404503cec28cfc97e9404cd5a73beb59794d7004869a28402f8d7825 |
Hashes for curated_tokenizers-0.9.0-cp39-cp39-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 5ae432fe572f5033a1f8328656f8f4e377ebde1d5a8b1b98dca0f9d7d756700a |
MD5 | c30eb1d77ea8e06e9878ddf42866f0dc |
BLAKE2b-256 | 3b8f6cf80445a1da5ab5004751dfc4c4fbfceb10820cbff4fbd9660174fa7582 |
Hashes for curated_tokenizers-0.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 0b940f1ffa576b156f80535c50483d3a2588eae3de0f8a3c327c485898dc8b65 |
MD5 | e6350726a3f13146b73b157fe08c873f |
BLAKE2b-256 | eb055654b101dbfbee4f8a09e1bd3833df043e4f948be53238ae83b7ad93d4cc |
Hashes for curated_tokenizers-0.9.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest |
---|---|
SHA256 | c5a094de7891bd67431ebade0974cd3b9e86e196575a41655a130265db6f5fe6 |
MD5 | df2289bdbc75b60b78c572e7f237a1eb |
BLAKE2b-256 | 3c811a4b5aa226b2d3b93b449ca5cebabc353d603b2ba418a7f73e484580e8dd |
Hashes for curated_tokenizers-0.9.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 8568ff2d5ec42cdef8196c6a8c7f83c6ec31336437518b75c49e55cb180e9a75 |
MD5 | 01084378e8d7d8d23e5acf9d90fd688a |
BLAKE2b-256 | 87be3ebfab2a297813d5656b9968f7dc185c63dc2929f12c07a22b38de82c4f3 |
Hashes for curated_tokenizers-0.9.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | a15c3ed496d88d52f9c6eca5634aae84785ed10ffd0bb99335cb81a1505ac3da |
MD5 | 5f6ec89025539491048e2267e6fdd2b9 |
BLAKE2b-256 | 81f4659edff5023c53a1694e1acba44947e6a78442d48b67feab1f92b96a76f7 |
Hashes for curated_tokenizers-0.9.0-cp38-cp38-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 1ab475fd4f5df13fe1b883df21d9428b741db6b99ab636287ad002f91849b92f |
MD5 | 7b26d030958fdc70ec42494f96fe22c1 |
BLAKE2b-256 | 4d5de6285f570191cfc3e07132320f2358b2fd8d88bbb5dc4de63a6a1759f402 |
Hashes for curated_tokenizers-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 2aa8aa9f78a6539668cdaa462cab68c87a7003422d90d2ba276b3be38d88ec09 |
MD5 | 52aedf5b2d9a16b76cfed8e4baf1951c |
BLAKE2b-256 | 208e3fc369008b985722412104ed95ca5a200a6e848164f58acd360a0d16b322 |
Hashes for curated_tokenizers-0.9.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 178044194b187af9ec71bb933999f3f457bf85bf13819ff3158618b1b44359e6 |
MD5 | d26cbbe5d8ac53c94ee4dd7f7ae91761 |
BLAKE2b-256 | 543c66d396a9833987510f4a348166ea9b9dda030a1bab1352787cf83240a5d0 |
Hashes for curated_tokenizers-0.9.0-cp38-cp38-macosx_11_0_arm64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 24993fe871df323e14da419a7a9cba04ba609e3c537e131e7ea6345208377aab |
MD5 | a907f4dcc9bd32a4a1afd2cf5caa492c |
BLAKE2b-256 | 610a6e0b8fdc280fbc85d649dbdbe3316a97b79129d00979a4a3db6dcb52f763 |
Hashes for curated_tokenizers-0.9.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 386228b08fcf7cf234adbe15152dc8381d1c4c7932780cc3f6be009de718a568 |
MD5 | 0a65381dcbc8776441741dc1cd5a3fa6 |
BLAKE2b-256 | 67dc86a8dd800d8de7715aa72ff023ff0c6fd2558ce4bf05411fc229aaf976f3 |
Hashes for curated_tokenizers-0.9.0-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 7dfc5e885f438ec5c1f01b6569cebd5a5e60a806df21b0a55f4a85d1e022d0af |
MD5 | ada2a930f582b6d0072ac769f2beebf5 |
BLAKE2b-256 | 3f59b13e9ccb885d0e9a664f7b062756d483866c0485dcc0e0da19ac340ba696 |
Hashes for curated_tokenizers-0.9.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | ed5b897ff2e7be3518d957722d5118320721715fd029d21c620c5ff2d094cb5f |
MD5 | 749e890def254f8cc231f77658887158 |
BLAKE2b-256 | 9e3d427b7a4df45fd1444cb32e44f39920a5a78f4e45f2243609e25a9b808272 |
Hashes for curated_tokenizers-0.9.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest |
---|---|
SHA256 | fb939e078be41b066d37b971a48ce2d55115f570e1d3f44d0f881c7c4e895ad7 |
MD5 | 31e5f07e9abeb2bf9b1e7363bdb1e558 |
BLAKE2b-256 | 5627f28c93523fea0c8bc7a7162db4b641af73cdeb0767b9e0c8953ce6f9e902 |
Hashes for curated_tokenizers-0.9.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | aba392d5181e0dc331498892dc439b74265d17ee352b5bd5dba7eff4895640be |
MD5 | 140303126d6f2ff0b784604d0f67f950 |
BLAKE2b-256 | 2eedb0258f1b590c71b75d9b68abc855016c67c5c0011cd03a8f23a68275b071 |
Hashes for curated_tokenizers-0.9.0-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | bcb3279daa884a529480265caecd517889a999315917af2783bb519099f0b096 |
MD5 | 9621ad6111e7dc87a1aee47d229c25e2 |
BLAKE2b-256 | c956c2b522f2cf2f8fa72fd2c4ca53c5cc18e805b4176dddf31eb8cb71f514d9 |
Hashes for curated_tokenizers-0.9.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | de548e9e5e6607cd298c1c747ce8e458055a11a031ef023809799aa7bb335df7 |
MD5 | 55db20e97fb7563466352d856aaaff34 |
BLAKE2b-256 | f8684ab0e611ffd0ad6f4d48b433bc6af51b135f28e4bc7636db8df46d46c869 |
Hashes for curated_tokenizers-0.9.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest |
---|---|
SHA256 | dc6221f5eacf8e6e7cf49a6de2d7ed1a28c1cfe772dade52553a6947d7fe4153 |
MD5 | ae823f4c4cd2e1f78058b6903a43dd88 |
BLAKE2b-256 | 7c4a5aeb5e616fa9a6e216323323bbb2d2a5c7e00d24cd26448cd3f38ee0313f |
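The digests listed above can be used to check a wheel after downloading it manually. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches_sha256` are made up for this example, and the file path you pass in is assumed to point at a wheel you have already saved locally.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's SHA256 digest against a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For instance, `matches_sha256("curated_tokenizers-0.9.0-cp311-cp311-win_amd64.whl", "ce834ae5eb5971bd46dd12c4da638170d9137dda135442a6501bb1e4804aa57f")` should return `True` for an intact copy of that wheel in the current directory. When installing with pip, its hash-checking mode (`--require-hashes` with a requirements file) performs the same kind of check automatically.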