Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize, using the 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for both research and production.
  • Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
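As a quick sketch of the alignment tracking and padding features above (the tiny WordLevel vocab here is invented purely for the demo; any trained tokenizer behaves the same way):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# A tiny illustrative tokenizer; the vocab is hypothetical, for demo only
tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "feel": 1, "magic": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

sentence = "feel magic"
encoding = tokenizer.encode(sentence)

# Alignment tracking: each token's offsets point back into the original text
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    assert sentence[start:end] == token

# Pre-processing: pad every encoding to a fixed length
tokenizer.enable_padding(pad_id=0, pad_token="[UNK]", length=4)
padded = tokenizer.encode(sentence)
print(len(padded.ids))  # 4
```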

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Free-threaded Python (3.14t)

tokenizers ships dedicated wheels for the free-threaded build of CPython (python3.14t). These wheels declare Py_MOD_GIL_NOT_USED, so importing tokenizers does not force the GIL back on — multi-threaded code stays GIL-free.

The full mutable API works on 3.14t — the same as on regular CPython. Setters are thread-safe: the inner tokenizer state is wrapped in a std::sync::RwLock, so concurrent tokenizer.X = … from multiple threads serialize correctly and concurrent encode operations take a read guard that blocks writers only briefly.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import ByteLevel

tok = Tokenizer(BPE())
tok.pre_tokenizer = Whitespace()                 # ✅ thread-safe on 3.14t
tok.post_processor = ByteLevel(trim_offsets=True)
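And since encodes only take a read guard, concurrent tokenization from a thread pool is safe as well. A minimal sketch (the WordLevel vocab below is made up for the demo; on GIL-ful builds the same code runs, just without the parallelism):

```python
from concurrent.futures import ThreadPoolExecutor

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocab, just to have something to encode
tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Each encode takes a read guard on the inner RwLock, so on the
# free-threaded build the calls can proceed in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tok.encode, ["hello world"] * 16))

assert all(r.ids == [1, 2] for r in results)
```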

Caveat — compound mutations are not atomic. Statements like tokenizer.post_processor.special_tokens = X evaluate in two steps from Python's point of view (read attribute → set attribute on the result). If another thread swaps tokenizer.post_processor between those steps, the mutation lands on an orphaned component. This is the same class of race as dict[k] = v interleaved with dict.clear() — coordinate with a Python lock if you need the compound to be atomic.
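A minimal, tokenizers-free sketch of that coordination pattern (the Holder and Component classes are stand-ins invented for the demo; with a real Tokenizer the same lock would guard the compound attribute access):

```python
import threading

class Component:
    """Stand-in for a post-processor."""
    def __init__(self):
        self.special_tokens = None

class Holder:
    """Stand-in for a tokenizer that owns a swappable component."""
    def __init__(self):
        self.post_processor = Component()

holder = Holder()
lock = threading.Lock()

def compound_set(tokens):
    # `holder.post_processor.special_tokens = tokens` is two steps:
    # LOAD the attribute, then STORE on the result. Holding the lock
    # across both makes the pair atomic w.r.t. other lock holders.
    with lock:
        holder.post_processor.special_tokens = tokens

def swap_component():
    with lock:  # the swap takes the same lock, so it cannot interleave
        holder.post_processor = Component()

threads = [threading.Thread(target=compound_set, args=(["<s>"],)) for _ in range(4)]
threads += [threading.Thread(target=swap_component) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The live component is always in a consistent state: its tokens are
# either untouched or fully set, never a half-applied compound mutation
assert holder.post_processor.special_tokens in (None, ["<s>"])
```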

For the full thread-safety analysis, see docs/free-threading-audit.md.

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them using a vocab.json and a merges.txt file:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")

Typing support and stub generation

The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as Any. To provide full typing support, we use a two-step stub generation process:

  1. Rust introspection (tools/stub-gen/): Uses pyo3-introspection to analyze the compiled extension and generate .pyi stub files
  2. Python enrichment (stub.py): Adds docstrings from the runtime module and generates forwarding __init__.py shims

Running stub generation

The easiest way to regenerate stubs is via make style:

cd bindings/python
make style

This will:

  1. Build the extension with maturin develop --release
  2. Run introspection to generate .pyi files
  3. Enrich stubs with docstrings via stub.py
  4. Format with ruff

Running manually

To run the stub generator directly:

cd bindings/python
cargo run --manifest-path tools/stub-gen/Cargo.toml
python stub.py

The stub generator automatically:

  • Builds the extension using maturin
  • Copies the built .so to the project root for introspection
  • Detects and sets PYTHONHOME for embedded Python (handles uv/venv environments)
  • Generates stubs to py_src/tokenizers/

Troubleshooting

If you encounter Python initialization errors, you can manually set PYTHONHOME:

export PYTHONHOME=$(python3 -c 'import sys; print(sys.base_prefix)')
cargo run --manifest-path tools/stub-gen/Cargo.toml


Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.23.1rc0.tar.gz (365.8 kB): Source

Built Distributions


  • tokenizers-0.23.1rc0-cp310-abi3-win_arm64.whl (2.7 MB): CPython 3.10+, Windows ARM64
  • tokenizers-0.23.1rc0-cp310-abi3-win_amd64.whl (2.8 MB): CPython 3.10+, Windows x86-64
  • tokenizers-0.23.1rc0-cp310-abi3-win32.whl (2.5 MB): CPython 3.10+, Windows x86
  • tokenizers-0.23.1rc0-cp310-abi3-musllinux_1_2_x86_64.whl (10.1 MB): CPython 3.10+, musl 1.2+, x86-64
  • tokenizers-0.23.1rc0-cp310-abi3-musllinux_1_2_i686.whl (10.0 MB): CPython 3.10+, musl 1.2+, i686
  • tokenizers-0.23.1rc0-cp310-abi3-musllinux_1_2_armv7l.whl (9.6 MB): CPython 3.10+, musl 1.2+, ARMv7l
  • tokenizers-0.23.1rc0-cp310-abi3-musllinux_1_2_aarch64.whl (9.8 MB): CPython 3.10+, musl 1.2+, ARM64
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_31_riscv64.whl (3.4 MB): CPython 3.10+, glibc 2.31+, riscv64
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.10+, glibc 2.17+, x86-64
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.5 MB): CPython 3.10+, glibc 2.17+, s390x
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.8 MB): CPython 3.10+, glibc 2.17+, ppc64le
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.6 MB): CPython 3.10+, glibc 2.17+, i686
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (3.2 MB): CPython 3.10+, glibc 2.17+, ARMv7l
  • tokenizers-0.23.1rc0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.4 MB): CPython 3.10+, glibc 2.17+, ARM64
  • tokenizers-0.23.1rc0-cp310-abi3-macosx_11_0_arm64.whl (3.1 MB): CPython 3.10+, macOS 11.0+, ARM64
  • tokenizers-0.23.1rc0-cp310-abi3-macosx_10_12_x86_64.whl (3.1 MB): CPython 3.10+, macOS 10.12+, x86-64

File details

Details for the file tokenizers-0.23.1rc0.tar.gz.

File metadata

  • Download URL: tokenizers-0.23.1rc0.tar.gz
  • Size: 365.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.1

File hashes

Hashes for tokenizers-0.23.1rc0.tar.gz:

  • SHA256: 114d59535428f456a4b17979652e81377800207c13b67ef0eb82a42cd9ae3579
  • MD5: 8e4cb2a8cf527ae8109708286bd63b5b
  • BLAKE2b-256: 0b1d13694ce0dc689df0465d328040b5dd7bf4c9abd95f018c78c8b9e63258e1

