A package with common tokenizers in Python and C++
Project description
tokenizers
C++ implementations of various tokenizers (SentencePiece, Tiktoken, etc.). Useful for other PyTorch repositories, such as torchchat and ExecuTorch, to build LLM runners on the ExecuTorch stack or the AOT Inductor stack.
SentencePiece tokenizer
Depends on https://github.com/google/sentencepiece.
Tiktoken tokenizer
Adapted from https://github.com/sewenew/tokenizer.
Hugging Face tokenizer
Compatible with https://github.com/huggingface/tokenizers/.
Llama2.c tokenizer
Adapted from https://github.com/karpathy/llama2.c.
Tekken tokenizer
An implementation of Mistral's Tekken tokenizer (v7) with full support for special tokens, multilingual text, and instruction-tuned conversations. It provides significant efficiency gains for AI workloads:
- Special token recognition: [INST], [/INST], [AVAILABLE_TOOLS], etc. are each encoded as a single token
- Multilingual support: Complete Unicode handling, including emojis and complex scripts
- Production-ready: 100% decode accuracy with comprehensive test coverage
- Python bindings: Full compatibility with the mistral-common ecosystem
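To make "special tokens as single tokens" concrete, the toy sketch below splits a prompt so that each special token stays in one piece while ordinary text is left for a regular tokenizer. It is plain Python for illustration only: the function name `split_on_special_tokens` and the three-entry token list are assumptions, and this is not the package's API or Mistral's implementation.

```python
import re

# Illustrative subset of special tokens; the real Tekken vocabulary has more.
SPECIAL_TOKENS = ["[INST]", "[/INST]", "[AVAILABLE_TOOLS]"]


def split_on_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Split `text` into (piece, is_special) pairs.

    Special tokens are kept whole so they can map to a single id;
    the remaining pieces would go through the ordinary tokenizer.
    """
    # Try longer tokens first so a short token never shadows a longer one.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # The capturing group makes re.split keep the matched special tokens.
    pieces = [p for p in re.split(pattern, text) if p]
    return [(p, p in special_tokens) for p in pieces]


if __name__ == "__main__":
    for piece, is_special in split_on_special_tokens("[INST] Hello 👋 [/INST]"):
        print(repr(piece), "special" if is_special else "text")
```

Without this pre-split, a byte-pair tokenizer would typically break `[INST]` into several sub-word tokens, which is why the markers are matched before ordinary encoding.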
License
tokenizers is released under the BSD 3-Clause license. (Additional code in this distribution is covered by the MIT and Apache open source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.
Download files
Source Distributions
Built Distributions
File details
Details for the file pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
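The tags listed for each file are also encoded in the wheel file name, which follows the standard `{distribution}-{version}(-{build})-{python tag}-{abi tag}-{platform tag}.whl` layout. The sketch below splits a file name into those fields; `WheelTags` and `parse_wheel_filename` are hypothetical helper names written for this example, not part of the package.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No optional build tag present.
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelTags(dist, version, build, py, abi, plat)


if __name__ == "__main__":
    name = "pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl"
    print(parse_wheel_filename(name))
```

Here `cp312` corresponds to CPython 3.12 and `manylinux_2_34_x86_64` to glibc 2.34+ on x86-64, matching the Tags entry above.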
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a40f381a8b90b4d1841823d25bcb2618c76e273b57f8cd3e924b6f6466c4bafb |
| MD5 | 6e24c26ce5309d987ab486c68637c7ce |
| BLAKE2b-256 | 6e9462b05c41c72581b99e28e9b06035505708cbec5181cd2ec86eb08387dbaa |
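A downloaded wheel can be checked against the SHA256 digest listed for it. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the assumption that the wheel sits in the current directory are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path: the wheel downloaded into the current directory.
    wheel = Path("pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl")
    expected = "a40f381a8b90b4d1841823d25bcb2618c76e273b57f8cd3e924b6f6466c4bafb"
    if wheel.exists():
        print("OK" if matches_expected(wheel, expected) else "MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```

The same check applies to the other wheels by substituting the file name and its SHA256 digest.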
File details
Details for the file pytorch_tokenizers-1.0.0-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: pytorch_tokenizers-1.0.0-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.0 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 45a7968cf88d81866f6bc70b85e6bfe879ffe6c23dbde609ea0c8b5027e2612b |
| MD5 | 7d3c3c5b4447fa64c562701720bbada1 |
| BLAKE2b-256 | 47084c40ec8d80b3cfbdcab2e77c96dd39a554e308aa5506a86503f7eaabfa2f |
File details
Details for the file pytorch_tokenizers-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pytorch_tokenizers-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.11, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92c70a522c6f19a60b450eea99282775a5ae50956d8aa0e692a73783d0e9b7d4 |
| MD5 | 8ff98f9b72ce8b887a1e7a12b8f1494a |
| BLAKE2b-256 | 8ccd691b590d935c7e176d93b0be075fe254c2db20234eb5ef8307a546c97038 |
File details
Details for the file pytorch_tokenizers-1.0.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: pytorch_tokenizers-1.0.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.0 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a70ceb4652c108149124e51c79f6233c2ddda842f7f8d38ec41825c4c2306ca3 |
| MD5 | 7e19dc61be52d4500175edcfbe2699c0 |
| BLAKE2b-256 | 1f3043ae544fd9a0149e76f339a7202cdc8e0388ad4124697470696866800126 |
File details
Details for the file pytorch_tokenizers-1.0.0-cp310-cp310-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pytorch_tokenizers-1.0.0-cp310-cp310-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | af10b1eed4227df646346a80248e0a30bdf635d5aa03313c1312e5a76ab6b527 |
| MD5 | 13c03344d68890979b98f8490ee75c12 |
| BLAKE2b-256 | d01980f457f9d71024c0e529dfac376cd2e5036fa87c23800f59ba31e8cd00b6 |
File details
Details for the file pytorch_tokenizers-1.0.0-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: pytorch_tokenizers-1.0.0-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.0 MB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a5213ad1d6912230234453129d84bca39dc73c769836cd2bdddc460c85bed282 |
| MD5 | 4715205a1c4cc891af1faee793abb17e |
| BLAKE2b-256 | f8960ab2b3fbd9274969d99313a8d841e01f5a8f6693a5973d200d87bbb27543 |