Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
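To make the alignment-tracking idea concrete, here is a minimal pure-Python sketch (illustration only, not the library's implementation) of a whitespace tokenizer that records character offsets, so every token can be mapped back to the span of the original sentence it came from:

```python
# Illustrative sketch: track (start, end) character offsets while
# tokenizing, so each token maps back to its original substring.

def tokenize_with_offsets(text):
    """Split on whitespace, recording (start, end) character offsets."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append((text[start:i], (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((text[start:], (start, len(text))))
    return tokens

text = "I can feel the magic"
for token, (start, end) in tokenize_with_offsets(text):
    # The offsets let us recover the original substring for each token.
    assert text[start:end] == token
```

The real library does the same bookkeeping through every normalization and pre-tokenization step, which is why offsets stay valid even after the text has been transformed.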

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
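To see what BPE training actually does under the hood, here is a toy pure-Python sketch of a single merge step (an illustration of the algorithm only, not the library's Rust implementation): count adjacent symbol pairs across the corpus and merge the most frequent pair into a new symbol.

```python
# Minimal sketch of one BPE training step (illustration only):
# count adjacent symbol pairs and merge the most frequent one.
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = "".join(pair)
    out = {}
    for symbols, freq in words.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            nxt = symbols[i + 1] if i + 1 < len(symbols) else None
            if (symbols[i], nxt) == pair:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        out[tuple(new_symbols)] = freq
    return out

# A toy corpus: each word is a tuple of symbols with a frequency.
words = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("h", "u", "g", "s"): 5}
pair = most_frequent_pair(words)   # ("u", "g") is seen 20 times
words = merge_pair(words, pair)    # "u" and "g" become the symbol "ug"
```

Real training repeats this until the target vocabulary size is reached; the recorded sequence of merges is what ends up in merges.txt.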

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
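The point of saving to a single file is that the whole pipeline (model, pre-tokenizer, decoder, post-processor) is described by one JSON document. Here is a toy sketch of that idea; the real tokenizer.json schema is richer than this, and the keys below are simplified stand-ins:

```python
# Toy sketch of single-file serialization: the entire pipeline is one
# JSON document, so loading it back restores every component at once.
import json

config = {
    "model": {"type": "BPE", "vocab": {"h": 0, "ug": 1}, "merges": ["u g"]},
    "pre_tokenizer": {"type": "ByteLevel", "add_prefix_space": True},
    "decoder": {"type": "ByteLevel"},
}

serialized = json.dumps(config)    # what would be written to disk
restored = json.loads(serialized)  # what a loader would read back
assert restored == config
```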

Typing support and stub generation

The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as Any. To provide full typing support, we use a two-step stub generation process:

  1. Rust introspection (tools/stub-gen/): Uses pyo3-introspection to analyze the compiled extension and generate .pyi stub files
  2. Python enrichment (stub.py): Adds docstrings from the runtime module and generates forwarding __init__.py shims
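For a sense of what the pipeline produces, here is a hypothetical, heavily simplified excerpt of the kind of .pyi content the generated stubs contain (the class and method names follow the public API, but these exact signatures are illustrative, not the generator's actual output):

```python
# Hypothetical excerpt of a generated .pyi stub (simplified).
# Bodies are `...` as usual in stub files; only signatures matter.

class Encoding:
    @property
    def ids(self) -> "list[int]": ...
    @property
    def tokens(self) -> "list[str]": ...

class Tokenizer:
    def encode(self, sequence: str) -> Encoding: ...
    @staticmethod
    def from_file(path: str) -> "Tokenizer": ...
```

With stubs like this in py_src/tokenizers/, editors and type checkers resolve concrete types instead of falling back to Any.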

Running stub generation

The easiest way to regenerate stubs is via make style:

cd bindings/python
make style

This will:

  1. Build the extension with maturin develop --release
  2. Run introspection to generate .pyi files
  3. Enrich stubs with docstrings via stub.py
  4. Format with ruff

Running manually

To run the stub generator directly:

cd bindings/python
cargo run --manifest-path tools/stub-gen/Cargo.toml
python stub.py

The stub generator automatically:

  • Builds the extension using maturin
  • Copies the built .so to the project root for introspection
  • Detects and sets PYTHONHOME for embedded Python (handles uv/venv environments)
  • Generates stubs to py_src/tokenizers/

Troubleshooting

If you encounter Python initialization errors, you can manually set PYTHONHOME:

export PYTHONHOME=$(python3 -c 'import sys; print(sys.base_prefix)')
cargo run --manifest-path tools/stub-gen/Cargo.toml



