A package with common tokenizers in Python and C++

Project description

tokenizers

C++ implementations of various tokenizers (SentencePiece, Tiktoken, etc.). Useful for other PyTorch repositories, such as torchchat and ExecuTorch, that build LLM runners on the ExecuTorch stack or the AOT Inductor stack.

Installation (from source)

git clone git@github.com:meta-pytorch/tokenizers.git
cd tokenizers
git submodule update --init --recursive
pip install -e .
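
To confirm that the install step registered the package, a small check like the following may help. It is only a sketch: it assumes the distribution is named `pytorch_tokenizers`, as the wheel file names on this page suggest, and the helper `installed_version` is a hypothetical name introduced here, not part of the package.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the wheel file names.
    version = installed_version("pytorch_tokenizers")
    print(version if version is not None else "pytorch_tokenizers is not installed")
```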

SentencePiece tokenizer

Depends on https://github.com/google/sentencepiece from Google.

Tiktoken tokenizer

Adapted from https://github.com/sewenew/tokenizer.

Hugging Face tokenizer

Compatible with https://github.com/huggingface/tokenizers/.

Llama2.c tokenizer

Adapted from https://github.com/karpathy/llama2.c.

Tekken tokenizer

Mistral's Tekken tokenizer (v7) with full support for special tokens, multilingual text, and instruction-tuned conversations. Provides significant efficiency gains for AI workloads:

  • Special token recognition: [INST], [/INST], [AVAILABLE_TOOLS], etc. as single tokens
  • Multilingual support: Complete Unicode handling including emojis and complex scripts
  • Production-ready: 100% decode accuracy with comprehensive test coverage
  • Python bindings: Full compatibility with mistral-common ecosystem
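
To illustrate what "special tokens as single tokens" means, here is a toy sketch. It is not the Tekken algorithm or this package's API: the token ids, the byte-level fallback, and the `encode`/`decode` helpers are all assumptions made for illustration. Special markers map to one reserved id each, and everything else falls back to UTF-8 bytes, which is why a round trip can reproduce emojis and other multi-byte text exactly.

```python
import re

# Hypothetical special tokens and ids, chosen only for this illustration.
SPECIAL_TOKENS = {"[INST]": 0, "[/INST]": 1, "[AVAILABLE_TOOLS]": 2}
# Ids for raw bytes start right after the reserved special-token ids.
BYTE_OFFSET = len(SPECIAL_TOKENS)

# Longest-first alternation, wrapped in a capturing group so re.split
# keeps the matched special tokens in its output.
_SPECIAL_RE = re.compile(
    "("
    + "|".join(re.escape(t) for t in sorted(SPECIAL_TOKENS, key=len, reverse=True))
    + ")"
)


def encode(text: str) -> list[int]:
    """Map each special token to one id and all other text to byte ids."""
    ids: list[int] = []
    for piece in _SPECIAL_RE.split(text):
        if not piece:
            continue
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])
        else:
            ids.extend(b + BYTE_OFFSET for b in piece.encode("utf-8"))
    return ids


def decode(ids: list[int]) -> str:
    """Invert encode(): rebuild the byte string, then decode it as UTF-8."""
    id_to_special = {v: k for k, v in SPECIAL_TOKENS.items()}
    out = bytearray()
    for i in ids:
        if i in id_to_special:
            out.extend(id_to_special[i].encode("utf-8"))
        else:
            out.append(i - BYTE_OFFSET)
    return out.decode("utf-8")
```

In this sketch, `encode("[INST]hi[/INST]")` yields four ids: one for each marker and one for each of the two bytes in between. A real tokenizer would merge byte sequences into larger vocabulary entries instead of emitting one id per byte.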

License

tokenizers is released under the BSD 3-Clause license. (Additional code in this distribution is covered by the MIT and Apache open source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.
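
The wheel names listed on this page follow the standard wheel naming convention: distribution, version, an optional build tag, then the Python, ABI, and platform tags. As a rough sketch (the `WheelName` type and `parse_wheel_filename` helper are hypothetical names, and this simple split does not validate the individual tags), a file name can be taken apart like this:

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)
```

For `pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_x86_64.whl`, this gives the Python tag `cp313` (CPython 3.13) and the platform tag `manylinux_2_28_x86_64` (glibc 2.28+ on x86-64), which is the information to match against your interpreter and platform.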

pytorch_tokenizers-1.2.0-cp313-cp313-win_amd64.whl (848.5 kB)

Uploaded: CPython 3.13, Windows x86-64

pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_x86_64.whl (1.6 MB)

Uploaded: CPython 3.13, manylinux (glibc 2.28+) x86-64

pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_aarch64.whl (1.4 MB)

Uploaded: CPython 3.13, manylinux (glibc 2.28+) ARM64

pytorch_tokenizers-1.2.0-cp313-cp313-macosx_12_0_arm64.whl (1.1 MB)

Uploaded: CPython 3.13, macOS 12.0+ ARM64

pytorch_tokenizers-1.2.0-cp312-cp312-win_amd64.whl (848.6 kB)

Uploaded: CPython 3.12, Windows x86-64

pytorch_tokenizers-1.2.0-cp312-cp312-manylinux_2_28_x86_64.whl (1.6 MB)

Uploaded: CPython 3.12, manylinux (glibc 2.28+) x86-64

pytorch_tokenizers-1.2.0-cp312-cp312-manylinux_2_28_aarch64.whl (1.4 MB)

Uploaded: CPython 3.12, manylinux (glibc 2.28+) ARM64

pytorch_tokenizers-1.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.1 MB)

Uploaded: CPython 3.12, macOS 11.0+ ARM64

pytorch_tokenizers-1.2.0-cp311-cp311-win_amd64.whl (849.2 kB)

Uploaded: CPython 3.11, Windows x86-64

pytorch_tokenizers-1.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (1.6 MB)

Uploaded: CPython 3.11, manylinux (glibc 2.28+) x86-64

pytorch_tokenizers-1.2.0-cp311-cp311-manylinux_2_28_aarch64.whl (1.4 MB)

Uploaded: CPython 3.11, manylinux (glibc 2.28+) ARM64

pytorch_tokenizers-1.2.0-cp311-cp311-macosx_11_0_arm64.whl (1.1 MB)

Uploaded: CPython 3.11, macOS 11.0+ ARM64

pytorch_tokenizers-1.2.0-cp310-cp310-win_amd64.whl (847.3 kB)

Uploaded: CPython 3.10, Windows x86-64

pytorch_tokenizers-1.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (1.6 MB)

Uploaded: CPython 3.10, manylinux (glibc 2.28+) x86-64

pytorch_tokenizers-1.2.0-cp310-cp310-manylinux_2_28_aarch64.whl (1.4 MB)

Uploaded: CPython 3.10, manylinux (glibc 2.28+) ARM64

pytorch_tokenizers-1.2.0-cp310-cp310-macosx_11_0_arm64.whl (1.1 MB)

Uploaded: CPython 3.10, macOS 11.0+ ARM64

File details

Details for the file pytorch_tokenizers-1.2.0-cp313-cp313-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 f4304cbc66eb46c98ba21de46ae030ea08a65391ed18cde4ca3f5b48955313f0
MD5 c28d9841f566570e4e1ee3fa168fd72b
BLAKE2b-256 2bd75a2a30613c73a613b03ebd1dd8ecf695e7a9490ede195ffb2862fd9471d6

See more details on using hashes here.
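
The published digests can be used to check a downloaded wheel before installing it. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the expected value is whatever SHA256 digest this page lists for the file you downloaded.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the published one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_published_hash("pytorch_tokenizers-1.2.0-cp313-cp313-win_amd64.whl", "<SHA256 from the table>")` should return True for an intact download. pip can also enforce this at install time through its hash-checking mode, where digests are pinned in a requirements file.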

File details

Details for the file pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 27e954013155ebe38bb25d1f5e5279315e145cdb31b479e9eaf74b6a793fdaa4
MD5 2c933641e6f45255185503a546f74ca4
BLAKE2b-256 18c74f6abd29eb68022458da29845caa12b525300afd885236ceb9a31358fb03

File details

Details for the file pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp313-cp313-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 a3d559e92145ce9d2d48b512beccd5cba8c9da920d23f4f373e376e571fe5647
MD5 9aafea7c8c96facde1ce64b1b866ffc5
BLAKE2b-256 463f7d39b2ad4d1cfcdf2d37a9ac3caaaa9f5a4a87c75b179eee1d0c17f010c1

File details

Details for the file pytorch_tokenizers-1.2.0-cp313-cp313-macosx_12_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp313-cp313-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 9d1f4f2eb0f2ba5ce97c106c3af9ce7b5349028bf99229fae906cb836b249525
MD5 13373c751cdaa0c07ecfacbf8e69f521
BLAKE2b-256 928e600bd39381e620ff8c004a1e1215cec0fc2e9f9dbdbe77275481168c069c

File details

Details for the file pytorch_tokenizers-1.2.0-cp312-cp312-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 97a82db21665023b25afb869c1bf07c31047c442df15d33e036c4712d07cb109
MD5 b7e60a5cc4480798b7ff68a3706b31c9
BLAKE2b-256 68a16fafe2ae2d09b05dea79b7696047f287bc229e366e476e9ea67dfa17ff9e

File details

Details for the file pytorch_tokenizers-1.2.0-cp312-cp312-manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 251f959843fd6a7af204b58f71d11b73c39e6eeb0491720170301da2ad8c251b
MD5 502e4a6be67c7b056dd2f4d931cf04aa
BLAKE2b-256 99ccdd634211e2be38067b77d4dadd59c726f8cfc19bb1ce25fd048e4c8fa0d6

File details

Details for the file pytorch_tokenizers-1.2.0-cp312-cp312-manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp312-cp312-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 7619cf3c9f39d0ddbdfb1e2f34cd2a1e4971d6ea6d5306e3a9efb1cfc73b8c7c
MD5 6689fd70b59e1e79093b6b902f8e874d
BLAKE2b-256 1c305279d5f6415596ae6b768867dd082e4a2af22b6bc9445b0fc9a205c77e1c

File details

Details for the file pytorch_tokenizers-1.2.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a2893a30ad3db3e0d690dc7669b0b6da68dde2fd4d81dcad608e9e66ea8848ec
MD5 9f7b06c46e0a8734d7fd36f8351106d7
BLAKE2b-256 1ecc9d77d52ab1422cacfa0d4ebb144dc85edcf45c230cb89627721f1d357229

File details

Details for the file pytorch_tokenizers-1.2.0-cp311-cp311-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 465dfa86221006e8e8e4d3c557af3ab26a5932de4ebd521fd12da981f6657748
MD5 069820e987b6cea23c1822c41b4dfac3
BLAKE2b-256 4069b6716ae25ea21a1a723fc49612f0d4c6054d2444c98a1663ede0395e054a

File details

Details for the file pytorch_tokenizers-1.2.0-cp311-cp311-manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 47f30045d03ebd8bb52e5212df95819b5d5e707fe01f5dbe0a7ece977b72382a
MD5 3da1721498ae0c2a0925a3fdf18f825d
BLAKE2b-256 e426b16c3ba8b1b5a010e21cf1f8a2f64f68f30ab1d48297e18702b10eb8a5fe

File details

Details for the file pytorch_tokenizers-1.2.0-cp311-cp311-manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp311-cp311-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 ba4a75a5c4850991d3d92429fe07cd5bce5cf9ad2888940a24ebc69179e52966
MD5 8e4e466de201394b8563bc7a60a02538
BLAKE2b-256 5302237089fa07fec83fc44897301cef6293be4ff155390c205d126e6e3e9d60

File details

Details for the file pytorch_tokenizers-1.2.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 72ebc322848e1996b99b9aab44a0c0152f327ba3e8b0866d9cb336b664e3051a
MD5 17665d3a1b46323b00a8aef0ff794d35
BLAKE2b-256 46293eb5500cc1a56cee2c2e34a9e227513b6beca34d9ba7de273edeb9818702

File details

Details for the file pytorch_tokenizers-1.2.0-cp310-cp310-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 2191cb3efa7dc2e8b123b197d0a2cb0028e5c9dd257cb5d68410391a32db63c4
MD5 7eeaf58799758a451acec6205e8a3924
BLAKE2b-256 27f812e0c33a794524b00f1ea1fdc9e4240099348fbd5825fa0d2bbb49128911

File details

Details for the file pytorch_tokenizers-1.2.0-cp310-cp310-manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 237a9c9f9f6473b54574f62058b8e20880bf931673fbc764907c52d5b98a4178
MD5 db2a4eabfda573f27c6abc903f071a32
BLAKE2b-256 2f89fcfc37fabd27832f612a2ac7ae4715dcff2971a38a2ab66ae35c69fc098d

File details

Details for the file pytorch_tokenizers-1.2.0-cp310-cp310-manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp310-cp310-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 d0d70766876780ce95e2dac8b0b90d0093f192699cd42474315c8b6102665f3f
MD5 a1f49dd4241170908b2794c61b455028
BLAKE2b-256 d82f433bec2e29260fa77785c8c1efc4c8101c56c3bda4a2c97980b62ce1d585

File details

Details for the file pytorch_tokenizers-1.2.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.2.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8ef7c9ac07b59aff7e382b489c63613120ec4047c19c7493d60b404649a885c5
MD5 103059101b3a79b484982d5b5a0cbf65
BLAKE2b-256 5a045def6b0f5003b8c1993703e0c9c6d982c9b3d9e93c169a91ca307e3b45c1
