Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
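The last two points can be sketched in a few lines. This is an illustrative example, not taken from the library docs: the tiny in-memory training corpus and the `[UNK]`/`[PAD]` token names are arbitrary choices.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a tiny BPE tokenizer in memory so the example is self-contained
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[PAD]"], vocab_size=200)
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 10, trainer=trainer)

# Alignment tracking: every token knows its (start, end) span in the input
sentence = "I can feel the magic"
encoding = tokenizer.encode(sentence)
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    # Each token's offsets point back to the original text
    assert sentence[start:end] == token

# Built-in pre-processing: truncation and padding
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"), length=8)
padded = tokenizer.encode(sentence)
print(len(padded.ids))  # 8
```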

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them using its vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
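They all also support training from an in-memory iterator of strings instead of files; the one-sentence corpus below is just an illustration:

```python
from tokenizers import ByteLevelBPETokenizer

# Train directly from an iterator of strings, no files needed
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(["I can feel the magic, can you?"], vocab_size=300)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
print(tokenizer.decode(encoded.ids))
```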

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
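To convince yourself the save/load round trip is lossless, you can compare encodings before and after reloading. This sketch rebuilds a small byte-level BPE in memory and writes it to a throwaway temporary path rather than reusing the trained file above:

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Build and train a small byte-level BPE (same pipeline as above)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["I can feel the magic, can you?"], trainer=trainer)

# Save, reload with from_file, and check both produce identical encodings
path = os.path.join(tempfile.mkdtemp(), "bpe.tokenizer.json")
tokenizer.save(path, pretty=True)
reloaded = Tokenizer.from_file(path)

text = "I can feel the magic, can you?"
assert reloaded.encode(text).ids == tokenizer.encode(text).ids
```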


Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.22.1.tar.gz (363.1 kB): Source

Built Distributions

  • tokenizers-0.22.1-cp39-abi3-win_amd64.whl (2.7 MB): CPython 3.9+, Windows x86-64
  • tokenizers-0.22.1-cp39-abi3-win32.whl (2.5 MB): CPython 3.9+, Windows x86
  • tokenizers-0.22.1-cp39-abi3-musllinux_1_2_x86_64.whl (9.7 MB): CPython 3.9+, musllinux: musl 1.2+, x86-64
  • tokenizers-0.22.1-cp39-abi3-musllinux_1_2_i686.whl (9.5 MB): CPython 3.9+, musllinux: musl 1.2+, i686
  • tokenizers-0.22.1-cp39-abi3-musllinux_1_2_armv7l.whl (9.3 MB): CPython 3.9+, musllinux: musl 1.2+, ARMv7l
  • tokenizers-0.22.1-cp39-abi3-musllinux_1_2_aarch64.whl (9.3 MB): CPython 3.9+, musllinux: musl 1.2+, ARM64
  • tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.9+, manylinux: glibc 2.17+, x86-64
  • tokenizers-0.22.1-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.4 MB): CPython 3.9+, manylinux: glibc 2.17+, s390x
  • tokenizers-0.22.1-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.7 MB): CPython 3.9+, manylinux: glibc 2.17+, ppc64le
  • tokenizers-0.22.1-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.5 MB): CPython 3.9+, manylinux: glibc 2.17+, i686
  • tokenizers-0.22.1-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (3.2 MB): CPython 3.9+, manylinux: glibc 2.17+, ARMv7l
  • tokenizers-0.22.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.3 MB): CPython 3.9+, manylinux: glibc 2.17+, ARM64
  • tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl (2.9 MB): CPython 3.9+, macOS 11.0+, ARM64
  • tokenizers-0.22.1-cp39-abi3-macosx_10_12_x86_64.whl (3.1 MB): CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file tokenizers-0.22.1.tar.gz.

File metadata

  • Download URL: tokenizers-0.22.1.tar.gz
  • Upload date:
  • Size: 363.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for tokenizers-0.22.1.tar.gz
Algorithm Hash digest
SHA256 61de6522785310a309b3407bac22d99c4db5dba349935e99e4d15ea2226af2d9
MD5 6b3f4c4c96bf540a9786cdf7b5cf8892
BLAKE2b-256 1c46fb6854cec3278fbfa4a75b50232c77622bc517ac886156e6afbfa4d8fc6e

See more details on using hashes here.

