A package with common tokenizers in Python and C++

Project description

tokenizers

C++ implementations of several tokenizers (SentencePiece, Tiktoken, etc.). Useful for other PyTorch repos, such as torchchat and ExecuTorch, that build LLM runners on the ExecuTorch stack or the AOT Inductor stack.

Installation (from source)

git clone git@github.com:meta-pytorch/tokenizers.git
cd tokenizers
git submodule update --init --recursive
pip install -e .
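
After the editable install, one quick way to confirm that Python can see the package is to query its installed metadata. The sketch below uses only the standard library. The distribution name pytorch_tokenizers is an assumption taken from the wheel file names on this page, and installed_version is a hypothetical helper, not part of the package.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pytorch_tokenizers" is assumed from the wheel file names, not confirmed.
    version = installed_version("pytorch_tokenizers")
    print(version if version else "pytorch_tokenizers is not installed")
```

If this reports that the package is not installed, check that pip ran inside the cloned directory and in the same Python environment.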

SentencePiece tokenizer

Depends on https://github.com/google/sentencepiece.

Tiktoken tokenizer

Adapted from https://github.com/sewenew/tokenizer.

Huggingface tokenizer

Compatible with https://github.com/huggingface/tokenizers/.

Llama2.c tokenizer

Adapted from https://github.com/karpathy/llama2.c.

Tekken tokenizer

Mistral's Tekken tokenizer (v7) with full support for special tokens, multilingual text, and instruction-tuned conversations. Provides significant efficiency gains for AI workloads:

  • Special token recognition: [INST], [/INST], [AVAILABLE_TOOLS], etc. as single tokens
  • Multilingual support: Complete Unicode handling including emojis and complex scripts
  • Production-ready: 100% decode accuracy with comprehensive test coverage
  • Python bindings: Full compatibility with mistral-common ecosystem
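
To illustrate what recognizing special tokens "as single tokens" means, here is a minimal, self-contained sketch that splits text on a fixed list of special tokens before any subword encoding would run. It is not this package's implementation or API: split_on_special_tokens and the SPECIAL_TOKENS list are hypothetical names used only for illustration.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical subset of special tokens, taken from the list above.
SPECIAL_TOKENS = ["[INST]", "[/INST]", "[AVAILABLE_TOOLS]"]


def split_on_special_tokens(
    text: str, special_tokens: Sequence[str] = SPECIAL_TOKENS
) -> List[Tuple[str, bool]]:
    """Split text into (piece, is_special) pairs.

    Special tokens are kept whole so that a later step can map each one to a
    single token id; the remaining pieces would go through normal encoding.
    """
    # Try longer tokens first so a long token is not shadowed by a shorter one.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    # re.split with a capturing group keeps the matched delimiters in the result.
    return [(p, p in special_tokens) for p in pattern.split(text) if p]


if __name__ == "__main__":
    for piece, is_special in split_on_special_tokens("[INST] Hi [/INST]"):
        print(repr(piece), "special" if is_special else "text")
```

Without this pre-split step, a byte-pair encoder would typically break a marker such as [INST] into several ordinary tokens.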

License

tokenizers is released under the BSD 3-Clause license. (Additional code in this distribution is covered by the MIT and Apache open source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

pytorch_tokenizers-1.1.0-cp313-cp313-win_amd64.whl (871.6 kB): CPython 3.13, Windows x86-64
pytorch_tokenizers-1.1.0-cp313-cp313-musllinux_1_2_x86_64.whl (2.5 MB): CPython 3.13, musllinux: musl 1.2+ x86-64
pytorch_tokenizers-1.1.0-cp313-cp313-musllinux_1_2_aarch64.whl (2.3 MB): CPython 3.13, musllinux: musl 1.2+ ARM64
pytorch_tokenizers-1.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.5 MB): CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
pytorch_tokenizers-1.1.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl (1.4 MB): CPython 3.13, manylinux: glibc 2.27+ ARM64, manylinux: glibc 2.28+ ARM64
pytorch_tokenizers-1.1.0-cp313-cp313-macosx_11_0_arm64.whl (1.1 MB): CPython 3.13, macOS 11.0+ ARM64
pytorch_tokenizers-1.1.0-cp312-cp312-win_amd64.whl (871.6 kB): CPython 3.12, Windows x86-64
pytorch_tokenizers-1.1.0-cp312-cp312-musllinux_1_2_x86_64.whl (2.5 MB): CPython 3.12, musllinux: musl 1.2+ x86-64
pytorch_tokenizers-1.1.0-cp312-cp312-musllinux_1_2_aarch64.whl (2.3 MB): CPython 3.12, musllinux: musl 1.2+ ARM64
pytorch_tokenizers-1.1.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.5 MB): CPython 3.12, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
pytorch_tokenizers-1.1.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl (1.4 MB): CPython 3.12, manylinux: glibc 2.27+ ARM64, manylinux: glibc 2.28+ ARM64
pytorch_tokenizers-1.1.0-cp312-cp312-macosx_11_0_arm64.whl (1.1 MB): CPython 3.12, macOS 11.0+ ARM64
pytorch_tokenizers-1.1.0-cp311-cp311-win_amd64.whl (871.2 kB): CPython 3.11, Windows x86-64
pytorch_tokenizers-1.1.0-cp311-cp311-musllinux_1_2_x86_64.whl (2.5 MB): CPython 3.11, musllinux: musl 1.2+ x86-64
pytorch_tokenizers-1.1.0-cp311-cp311-musllinux_1_2_aarch64.whl (2.3 MB): CPython 3.11, musllinux: musl 1.2+ ARM64
pytorch_tokenizers-1.1.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.5 MB): CPython 3.11, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
pytorch_tokenizers-1.1.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl (1.4 MB): CPython 3.11, manylinux: glibc 2.27+ ARM64, manylinux: glibc 2.28+ ARM64
pytorch_tokenizers-1.1.0-cp311-cp311-macosx_11_0_arm64.whl (1.1 MB): CPython 3.11, macOS 11.0+ ARM64
pytorch_tokenizers-1.1.0-cp310-cp310-win_amd64.whl (869.5 kB): CPython 3.10, Windows x86-64
pytorch_tokenizers-1.1.0-cp310-cp310-musllinux_1_2_x86_64.whl (2.5 MB): CPython 3.10, musllinux: musl 1.2+ x86-64
pytorch_tokenizers-1.1.0-cp310-cp310-musllinux_1_2_aarch64.whl (2.3 MB): CPython 3.10, musllinux: musl 1.2+ ARM64
pytorch_tokenizers-1.1.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.5 MB): CPython 3.10, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
pytorch_tokenizers-1.1.0-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl (1.4 MB): CPython 3.10, manylinux: glibc 2.27+ ARM64, manylinux: glibc 2.28+ ARM64
pytorch_tokenizers-1.1.0-cp310-cp310-macosx_11_0_arm64.whl (1.1 MB): CPython 3.10, macOS 11.0+ ARM64
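
The wheel file names above follow the standard name-version-python tag-ABI tag-platform tag layout, with an optional build tag after the version. The following sketch shows how such a name can be split into its parts using only the standard library. parse_wheel_filename is a hypothetical helper written for illustration, and it assumes a well-formed file name.

```python
from typing import Dict, List, Optional, Union

WheelInfo = Dict[str, Union[str, None, List[str]]]


def parse_wheel_filename(filename: str) -> WheelInfo:
    """Split a wheel file name into name, version, and tag components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    build: Optional[str] = None
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build,
        # Each tag field may hold several alternatives joined by ".".
        "python_tags": python_tag.split("."),
        "abi_tags": abi_tag.split("."),
        "platform_tags": platform_tag.split("."),
    }


if __name__ == "__main__":
    info = parse_wheel_filename(
        "pytorch_tokenizers-1.1.0-cp313-cp313-"
        "manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl"
    )
    print(info["python_tags"], info["platform_tags"])
```

The dotted platform field explains why some entries list two manylinux targets: a single wheel can declare compatibility with more than one platform tag.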

File details

Details for the file pytorch_tokenizers-1.1.0-cp313-cp313-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 feba0f852ab3fdd2e39ce2c60e06c0b11b4903d1faf9efc08b4f87702955fe82
MD5 c9e87f843716ea4b702cd71afa1fa3a0
BLAKE2b-256 157cbc84a7be7e4052b95e3d7d293aae4549ba02d447cda637d336261ea4d6d7

File details

Details for the file pytorch_tokenizers-1.1.0-cp313-cp313-musllinux_1_2_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp313-cp313-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 04644c64727d41ab77fee4b62bd4eeb1e63f2b236de73ca596fdaefcfcd76a05
MD5 b5807dc63dba09c867948f4e4edf93da
BLAKE2b-256 c58271d0284225429839da6130a185ad850a664a4c1caad89c9c0f9d99ba4cae

File details

Details for the file pytorch_tokenizers-1.1.0-cp313-cp313-musllinux_1_2_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp313-cp313-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 651795114a5508794e2cd964a2a04b9fe0ab39fd57547bcf0b42028859c3eea3
MD5 592143a102b55d3167c563f109b7b1de
BLAKE2b-256 fb7e4a6e060cfe6794d710a617ebfa323b2a327f019f193994b715d1fa2273d4

File details

Details for the file pytorch_tokenizers-1.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0b482d4094f83eaa0bfabf759320a6c3604143684fc738970c3e136d3d689974
MD5 28a6cc9ee151033bb0b9c7bdaa0993b2
BLAKE2b-256 632bda9c484b53d76f21febc17c8120409eaf1f5661199f11c079151061449c8

File details

Details for the file pytorch_tokenizers-1.1.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 c9d76b173c530afde492d965dec22bd76c8a5c8793a39cf63aad46c394486f8b
MD5 ada0b83ab585282c101c72caf78dcf27
BLAKE2b-256 0790726bc941395d1a26f433c90f197932d774b48005608755c3fc904e7afb67

File details

Details for the file pytorch_tokenizers-1.1.0-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 216f4b4c02c6b913c95c5846cdc5a9e8c2321a7057f9d441d20e983f44e664b2
MD5 20b30f899565c0723303176fcc88c029
BLAKE2b-256 0d7ed3ef50e9dbc5f22cc79bba765ab2d1822aa3b09a2c88b8607cf12b81394a

File details

Details for the file pytorch_tokenizers-1.1.0-cp312-cp312-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 00510b449bfb9fe43581f1e8e48d3f50684323a65c0a7f02075d33a6e66dfeae
MD5 952c401788552ccb6d3b4e810de2da04
BLAKE2b-256 a8879edccc2865944cf5bfd6db483ca82c52d3af37a52690afb1c65ede049e1b

File details

Details for the file pytorch_tokenizers-1.1.0-cp312-cp312-musllinux_1_2_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp312-cp312-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 ac5d9d260d75e3965e673811d7a53493b3e5ff53c8eae8b81db3357e21c08f2b
MD5 cb5223c913ab7aaad60f65840c975129
BLAKE2b-256 2c26071e911ecb78ac22fb78d9c516b97c45fb5bef3f0af9bb64913a022c4eff

File details

Details for the file pytorch_tokenizers-1.1.0-cp312-cp312-musllinux_1_2_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp312-cp312-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 c9db2821751b3901dc41762365b1b9a46955d70170d5fab93ad5f740b6d70262
MD5 02197a16ed956393da18f622d0abc532
BLAKE2b-256 66008bd2ed185220b74a9d90ea1f2937f46e0a542e00d3e2d4410abcaae59c1b

File details

Details for the file pytorch_tokenizers-1.1.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e67aac3fbf94f4d8bdc49d7bea5806839ff9103a6aa8cf1166ff6c3485604edd
MD5 060bc098c85c356a800304434bae3f60
BLAKE2b-256 38c17601bdab70e167ba25f00228870c2436d6d0b9f0ec16b363a493a56eb4b3

File details

Details for the file pytorch_tokenizers-1.1.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 975bcf81576501ad1c84949b4b8cf46724e6f7fbb942987583f134c72ab5d483
MD5 ffaf758ae29133d3e429498b606020f0
BLAKE2b-256 4f4313c11676e3efa2fc5ed47892b5ede1a35eb9a8fb50541d1dadc58414c2f2

File details

Details for the file pytorch_tokenizers-1.1.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8f4e04ec96b005406d02e9561a42c1669706cf6f5c4972f26cde9bb493543ddc
MD5 f782644f200d4c255f17b6c02b61e622
BLAKE2b-256 b10eac6a28588a627b85a64fa90180c710ae95f8f671d7083472260f50b4a3d7

File details

Details for the file pytorch_tokenizers-1.1.0-cp311-cp311-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 5b0ce806f8136275c979a665780fb207831a237e390c2d152c2234de51efdaaa
MD5 6c5427dbe5b7dc89c489ff4bd0476676
BLAKE2b-256 8d07dcc4c136e624a17bc3a9c8460fef5c782d9db6b530c81b95df1d38b0794b

File details

Details for the file pytorch_tokenizers-1.1.0-cp311-cp311-musllinux_1_2_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp311-cp311-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 ae3b7b412b8532e5cb2b00430dfed87c52159166d7e3f973c5043c443fcaffb9
MD5 cc1dbc2f16c58fc745d258d572171d0a
BLAKE2b-256 e6a382a82d311f6d4984b480e0260b8ffc44234404e67d9df34c28c82ea93ada

File details

Details for the file pytorch_tokenizers-1.1.0-cp311-cp311-musllinux_1_2_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp311-cp311-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 a4527cc796822c175280eb3ebdb8bdba4a1b499f7267f4abdd4485daca934e56
MD5 511b00180c8c31a405a2fbf010cf2bba
BLAKE2b-256 b3602e8c157012ab8179ad4df191e89f473b49cde87a5faf5ef19e4b30ebb2ef

File details

Details for the file pytorch_tokenizers-1.1.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 660faaae7f341c7a2c55bc9a50b0bba1805d6adfb3bf750e594c499a46856e1e
MD5 6c400b10f7ae5a0a05189b7fd83d9374
BLAKE2b-256 9da0a83cc8001aeac365916e03f33f908588b01ef73b843edb8069e3271e1739

File details

Details for the file pytorch_tokenizers-1.1.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 3061b907d6dd4b2f31d555a925896c28fe12dffc1e80e9e3896ef8f9faa94310
MD5 b72acd26cc2b27a3e60359bb239957fc
BLAKE2b-256 3ee6ba41f3ac74faee30e8e7147bfa644227cdeaca5000e97390f7dd34da8993

File details

Details for the file pytorch_tokenizers-1.1.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2b7cb438d04769c774ccd5e273e1b52040c9aeb9f122783c082d2d4a5922a12f
MD5 f9c928df2533ec961fbad804015b3c64
BLAKE2b-256 6b5655a76b71c8204445db58cba3b58fedcbb33e06032b524d7530e9d94912af

File details

Details for the file pytorch_tokenizers-1.1.0-cp310-cp310-win_amd64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 9be7396a28eccc8ebc0dbf60d072babe729b63efc1b0cbc39de7c647fd612ca1
MD5 d05318cb3d879b7f6c26d581cc12cbad
BLAKE2b-256 f3c4194ae535cb598ea6ad4bca067cf28cf8a38413b191ff95f9ac743f41328c

File details

Details for the file pytorch_tokenizers-1.1.0-cp310-cp310-musllinux_1_2_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp310-cp310-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 55b4b7be4a487627010eac06edf49726fc1c84f258a71a6a172c501f36fc7796
MD5 19f44567d86611d372f50cbcaa39fdf3
BLAKE2b-256 de9d70caa8d1239d168f95567a90d8ab2615ac2097455c5bf3bd04978d67bcb1

File details

Details for the file pytorch_tokenizers-1.1.0-cp310-cp310-musllinux_1_2_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp310-cp310-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 908670eba673362c731e73225af7438d72cc29405ae6c0879818055b70f41c44
MD5 03a87dbbb29e9e3da0fdef7199d05fa7
BLAKE2b-256 6115ffc6031b63c166be55f116625e4d0edc4dcabb0b9f1d942a31526da176bc

File details

Details for the file pytorch_tokenizers-1.1.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0b41cdb2d437e46e41d1ad5a15da8cf15c1a5586647ecd5ba3c49c573d427e40
MD5 67db80fa166df07d2f583c209cc70e03
BLAKE2b-256 734e3b59d994a7d899992cd0b563c1460a673a61ce1f8a87d08b6d7ec83bb77a

File details

Details for the file pytorch_tokenizers-1.1.0-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 5f2ec5ce1afdfc54db4815ace4bdbafa3f2b49489ea3e9b4c4406062cfc1d884
MD5 6c17828a1a5173a65cb02f27a222aaef
BLAKE2b-256 acaf6934eac4538fec6ebb9193f9caee7c3b78d3d93c243a35a365b94f2b3f2a

File details

Details for the file pytorch_tokenizers-1.1.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for pytorch_tokenizers-1.1.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e49b88eeb1eaa6417f6a8e753b348b0b588551a6602d630162139e6cca8d7dfe
MD5 0c8bade0bc7ecb4de1bd0304bcd1ba66
BLAKE2b-256 938ed9ace07200fb03d41cbf569c692af509f9b06bcc0020fe674a1e042c8326
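
The digests listed above can be used to check a downloaded wheel before installing it. The sketch below computes all three digests with Python's hashlib and compares one of them against a published value. file_digests and matches_published are hypothetical helper names, and BLAKE2b-256 is taken here to mean BLAKE2b with a 32-byte digest. On systems where MD5 is disabled (for example, FIPS mode), the hashlib.md5() call may fail.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union


def file_digests(path: Union[str, Path], chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Return hex digests of a file, keyed by the algorithm names used above."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # 32 bytes = 256 bits, matching the BLAKE2b-256 label.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with Path(path).open("rb") as fh:
        # Read in chunks so large wheels are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path: Union[str, Path], algorithm: str, expected_hex: str) -> bool:
    """Compare one computed digest against a published hex digest."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()
```

For example, calling matches_published with the path to a downloaded wheel, "SHA256", and the SHA256 value listed for that file should return True for an intact download. pip can also enforce hashes itself through its --require-hashes mode.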
