Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
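The alignment tracking mentioned above can be illustrated with a minimal, library-independent sketch (plain Python, not the actual Rust implementation): each token carries (start, end) character offsets into the original string, so the original span is always recoverable.

```python
def whitespace_tokenize_with_offsets(text):
    """Toy pre-tokenizer: split on whitespace, keeping character offsets."""
    tokens, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append((text[start:i], (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((text[start:], (start, len(text))))
    return tokens

sentence = "I can feel the magic"
for token, (s, e) in whitespace_tokenize_with_offsets(sentence):
    # Each offset pair maps the token back to the original sentence
    assert sentence[s:e] == token
```

The real library keeps such offsets through normalization and model tokenization as well, which is what makes the alignment tracking useful in practice.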

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Free-threaded Python (3.14t)

tokenizers ships dedicated wheels for the free-threaded build of CPython (python3.14t). These wheels declare Py_MOD_GIL_NOT_USED, so importing tokenizers does not force the GIL back on — multi-threaded code stays GIL-free.

The full mutable API works on 3.14t — the same as on regular CPython. Setters are thread-safe: the inner tokenizer state is wrapped in a std::sync::RwLock, so concurrent tokenizer.X = … from multiple threads serialize correctly and concurrent encode operations take a read guard that blocks writers only briefly.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import ByteLevel

tok = Tokenizer(BPE())
tok.pre_tokenizer = Whitespace()                 # ✅ thread-safe on 3.14t
tok.post_processor = ByteLevel(trim_offsets=True)

Caveat — compound mutations are not atomic. Statements like tokenizer.post_processor.special_tokens = X evaluate in two steps from Python's point of view (read attribute → set attribute on the result). If another thread swaps tokenizer.post_processor between those steps, the mutation lands on an orphaned component. This is the same class of race as dict[k] = v interleaved with dict.clear() — coordinate with a Python lock if you need the compound to be atomic.
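The suggested coordination can be sketched with a stand-in object rather than a real Tokenizer (the classes here are hypothetical stand-ins; only the locking pattern is the point): hold one Python lock across both steps of the compound mutation, and take the same lock when swapping the component.

```python
import threading

class Component:
    """Stand-in for a post-processor component (illustration only)."""
    def __init__(self):
        self.special_tokens = None

class FakeTokenizer:
    """Stand-in for a tokenizers.Tokenizer (illustration only)."""
    def __init__(self):
        self.post_processor = Component()

tok = FakeTokenizer()
lock = threading.Lock()

def set_special_tokens(value):
    # Hold the lock across BOTH steps (read the attribute, then set on the
    # result), so another thread cannot swap tok.post_processor in between.
    with lock:
        tok.post_processor.special_tokens = value

def swap_post_processor():
    # Swapping the component takes the same lock, so it can never interleave
    # with the two-step mutation above.
    with lock:
        tok.post_processor = Component()

set_special_tokens("<CLS>")
```

With this pattern, the mutation always lands on the component that is still attached to the tokenizer, never on an orphaned one.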

For the full thread-safety analysis, see docs/free-threading-audit.md.

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these from vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
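For intuition about what BPE training does under the hood, here is a toy sketch (not the library's Rust implementation): each training step counts adjacent symbol pairs across the corpus and merges the most frequent one into a new vocabulary symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in words.items()}

# A tiny corpus: space-separated symbols with their frequencies
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
pair = most_frequent_pair(words)   # ("w", "e") is the most frequent pair here
words = merge_pair(words, pair)
```

The real trainers repeat this loop until the target vocab_size is reached, with many optimizations on top.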

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer, by putting all the different parts you need together. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")

Typing support and stub generation

The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as Any. To provide full typing support, we use a two-step stub generation process:

  1. Rust introspection (tools/stub-gen/): Uses pyo3-introspection to analyze the compiled extension and generate .pyi stub files
  2. Python enrichment (stub.py): Adds docstrings from the runtime module and generates forwarding __init__.py shims

Running stub generation

The easiest way to regenerate stubs is via make style:

cd bindings/python
make style

This will:

  1. Build the extension with maturin develop --release
  2. Run introspection to generate .pyi files
  3. Enrich stubs with docstrings via stub.py
  4. Format with ruff

Running manually

To run the stub generator directly:

cd bindings/python
cargo run --manifest-path tools/stub-gen/Cargo.toml
python stub.py

The stub generator automatically:

  • Builds the extension using maturin
  • Copies the built .so to the project root for introspection
  • Detects and sets PYTHONHOME for embedded Python (handles uv/venv environments)
  • Generates stubs to py_src/tokenizers/

Troubleshooting

If you encounter Python initialization errors, you can manually set PYTHONHOME:

export PYTHONHOME=$(python3 -c 'import sys; print(sys.base_prefix)')
cargo run --manifest-path tools/stub-gen/Cargo.toml
