A package with common tokenizers in Python and C++

Project description

tokenizers

C++ implementations of various tokenizers (SentencePiece, Tiktoken, etc.). They are useful for other PyTorch repos, such as torchchat and ExecuTorch, that build LLM runners on the ExecuTorch stack or the AOT Inductor stack.

SentencePiece tokenizer

Depends on https://github.com/google/sentencepiece from Google.

Tiktoken tokenizer

Adapted from https://github.com/sewenew/tokenizer.

Hugging Face tokenizer

Compatible with https://github.com/huggingface/tokenizers/.

Llama2.c tokenizer

Adapted from https://github.com/karpathy/llama2.c.

Tekken tokenizer

Mistral's Tekken tokenizer (v7), with full support for special tokens, multilingual text, and instruction-tuned conversations. It provides significant efficiency gains for AI workloads and offers:

  • Special token recognition: [INST], [/INST], [AVAILABLE_TOOLS], etc. as single tokens
  • Multilingual support: Complete Unicode handling including emojis and complex scripts
  • Production-ready: 100% decode accuracy with comprehensive test coverage
  • Python bindings: Full compatibility with the mistral-common ecosystem
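
To illustrate what "special tokens as single tokens" means in practice, here is a minimal sketch of the pre-splitting step a tokenizer might perform before ordinary subword encoding. It is not this package's API: the function name `split_on_special_tokens` and the three-entry token list are assumptions made for illustration, and the real Tekken vocabulary defines its own special tokens.

```python
import re

# Hypothetical special-token set for illustration only;
# the real Tekken vocabulary defines its own list.
SPECIAL_TOKENS = ["[INST]", "[/INST]", "[AVAILABLE_TOOLS]"]


def split_on_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Split text so that each special token becomes its own piece.

    Ordinary text between special tokens is returned unchanged and would
    be passed on to the regular subword encoder.
    """
    # Sort longest first so a longer token wins over any shorter prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    # The capturing group makes re.split keep the matched delimiters.
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    return [piece for piece in re.split(pattern, text) if piece]


pieces = split_on_special_tokens("[INST]Hello 🌍[/INST]")
print(pieces)  # ['[INST]', 'Hello 🌍', '[/INST]']
```

Each special token then maps to exactly one ID, while the text in between (including emoji and other non-ASCII characters) goes through normal encoding.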

License

tokenizers is released under the BSD 3 license. (Additional code in this distribution is covered by the MIT and Apache Open Source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl (1.3 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.34+, x86-64

pytorch_tokenizers-1.0.0-cp312-cp312-macosx_11_0_arm64.whl (1.0 MB)

Uploaded: CPython 3.12, macOS 11.0+, ARM64

pytorch_tokenizers-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl (1.3 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.34+, x86-64

pytorch_tokenizers-1.0.0-cp311-cp311-macosx_11_0_arm64.whl (1.0 MB)

Uploaded: CPython 3.11, macOS 11.0+, ARM64

pytorch_tokenizers-1.0.0-cp310-cp310-manylinux_2_34_x86_64.whl (1.3 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.34+, x86-64

pytorch_tokenizers-1.0.0-cp310-cp310-macosx_11_0_arm64.whl (1.0 MB)

Uploaded: CPython 3.10, macOS 11.0+, ARM64
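
The wheel file names above follow the standard wheel naming convention: distribution name, version, Python tag, ABI tag, and platform tag, separated by hyphens (with an optional build tag after the version). As a rough sketch for picking the right file, the helper below splits a file name into those parts. The function name `parse_wheel_filename` is hypothetical and uses only the standard library; it is not part of this package.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its hyphen-separated components.

    Hypothetical helper for illustration. It handles the two valid layouts:
    name-version-python-abi-platform.whl, or the same with a build tag
    inserted after the version.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename(
    "pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl"
)
print(info["python"], info["platform"])  # cp312 manylinux_2_34_x86_64
```

Here `cp312` means CPython 3.12, and `manylinux_2_34_x86_64` means Linux on x86-64 with glibc 2.34 or newer. In normal use pip selects the matching wheel automatically.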

File details

Details for the file pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 a40f381a8b90b4d1841823d25bcb2618c76e273b57f8cd3e924b6f6466c4bafb
MD5 6e24c26ce5309d987ab486c68637c7ce
BLAKE2b-256 6e9462b05c41c72581b99e28e9b06035505708cbec5181cd2ec86eb08387dbaa
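
The digests listed for each file can be used to check that a downloaded wheel was not corrupted or altered. A minimal sketch using only Python's standard `hashlib` module follows; the helper names `sha256_of_file` and `matches_expected` and the local file path are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest against a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path: the wheel must already be downloaded here.
    wheel = "pytorch_tokenizers-1.0.0-cp312-cp312-manylinux_2_34_x86_64.whl"
    expected = "a40f381a8b90b4d1841823d25bcb2618c76e273b57f8cd3e924b6f6466c4bafb"
    try:
        print("hash OK" if matches_expected(wheel, expected) else "hash MISMATCH")
    except FileNotFoundError:
        print(f"{wheel} not found; download it first")
```

pip can perform the same check at install time when a requirements file pins hashes with `--hash=sha256:...`.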

File details

Details for the file pytorch_tokenizers-1.0.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.0.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 45a7968cf88d81866f6bc70b85e6bfe879ffe6c23dbde609ea0c8b5027e2612b
MD5 7d3c3c5b4447fa64c562701720bbada1
BLAKE2b-256 47084c40ec8d80b3cfbdcab2e77c96dd39a554e308aa5506a86503f7eaabfa2f

File details

Details for the file pytorch_tokenizers-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 92c70a522c6f19a60b450eea99282775a5ae50956d8aa0e692a73783d0e9b7d4
MD5 8ff98f9b72ce8b887a1e7a12b8f1494a
BLAKE2b-256 8ccd691b590d935c7e176d93b0be075fe254c2db20234eb5ef8307a546c97038

File details

Details for the file pytorch_tokenizers-1.0.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.0.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a70ceb4652c108149124e51c79f6233c2ddda842f7f8d38ec41825c4c2306ca3
MD5 7e19dc61be52d4500175edcfbe2699c0
BLAKE2b-256 1f3043ae544fd9a0149e76f339a7202cdc8e0388ad4124697470696866800126

File details

Details for the file pytorch_tokenizers-1.0.0-cp310-cp310-manylinux_2_34_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.0.0-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 af10b1eed4227df646346a80248e0a30bdf635d5aa03313c1312e5a76ab6b527
MD5 13c03344d68890979b98f8490ee75c12
BLAKE2b-256 d01980f457f9d71024c0e529dfac376cd2e5036fa87c23800f59ba31e8cd00b6

File details

Details for the file pytorch_tokenizers-1.0.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.0.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a5213ad1d6912230234453129d84bca39dc73c769836cd2bdddc460c85bed282
MD5 4715205a1c4cc891af1faee793abb17e
BLAKE2b-256 f8960ab2b3fbd9274969d99313a8d841e01f5a8f6693a5973d200d87bbb27543
