Lightweight piece tokenization library

Project description

🥢 Curated Tokenizers

This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currently supported:

Tokenizer   Binding         Example model
BPE         sentencepiece
Byte BPE    Native          RoBERTa/GPT-2
Unigram     sentencepiece   XLM-RoBERTa
Wordpiece   Native          BERT
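
To give a feel for what a wordpiece tokenizer does, here is a minimal pure-Python sketch of greedy longest-match-first splitting in the style used by BERT. It is an illustration only and does not use or reproduce this library's API or implementation: the function name `wordpiece_split`, the `##` continuation prefix, the `[UNK]` fallback and the toy vocabulary are all assumptions made for the example.

```python
from typing import List, Set


def wordpiece_split(
    token: str,
    vocab: Set[str],
    unk: str = "[UNK]",      # assumed unknown-piece marker
    prefix: str = "##",      # assumed continuation prefix (BERT-style)
) -> List[str]:
    """Split one token into pieces by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = token[start:end]
            if start > 0:
                # Pieces that do not start the token carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole token is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_split("tokenizers", toy_vocab))
```

With the toy vocabulary above, `tokenizers` is split into `token`, `##izer` and `##s`. Real models ship vocabularies with tens of thousands of pieces, and the native binding in this package is expected to be considerably faster than a Python loop like this one.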

⚠️ Warning: experimental package

This package is experimental and it is likely that the APIs will change in incompatible ways.

⏳ Install

Curated Tokenizers is available through PyPI:

pip install curated_tokenizers
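
To confirm that the install succeeded, a small check along these lines could be used. It relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this sketch and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "curated_tokenizers") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version())
```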

🚀 Quickstart

The best way to get started with Curated Tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from the Hugging Face Hub.
