Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation: it takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for both research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
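As a small illustration of the alignment tracking, here is a sketch using an in-memory WordLevel model (the vocabulary below is made up for the example): token offsets always point back into the original, un-normalized input, even after normalization has lowercased it.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# A tiny, made-up vocabulary, just for illustration
vocab = {"[UNK]": 0, "i": 1, "can": 2, "feel": 3, "the": 4, "magic": 5}

tokenizer = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

sentence = "I can feel the magic"
encoded = tokenizer.encode(sentence)

# Each token's offsets index into the ORIGINAL sentence,
# even though normalization lowercased it before tokenization
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])
```

Slicing the original sentence with `encoded.offsets[0]` recovers the capitalized `"I"`, even though the token itself is the normalized `"i"`.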

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them using its vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte-level version of BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece

All of these can be used and trained as explained above!
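If your data lives in memory rather than in files on disk, recent versions of the library also let you train these tokenizers straight from a Python iterator via `train_from_iterator`. A minimal sketch (the tiny repeated corpus is made up for the example):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train directly from an in-memory iterator instead of files on disk
corpus = ["I can feel the magic, can you?"] * 100
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=2)

encoded = tokenizer.encode("I can feel the magic")
print(encoded.tokens)

# Byte-level BPE round-trips the input losslessly
print(tokenizer.decode(encoded.ids))
```

This is handy for quick experiments; for large corpora, training from files as shown above avoids holding everything in memory at once.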

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer, by putting all the different parts you need together. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
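The pre-processing mentioned in the features list (truncation, padding) is enabled directly on the Tokenizer via `enable_truncation` and `enable_padding`. A minimal sketch, again with a made-up WordLevel vocabulary:

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Made-up vocabulary, just for the example
vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "can": 3, "feel": 4, "the": 5, "magic": 6}
tokenizer = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Truncate every sequence to 4 tokens, and pad shorter ones
# up to the longest sequence in the batch
tokenizer.enable_truncation(max_length=4)
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")

batch = tokenizer.encode_batch(["i can feel the magic", "i can"])
for encoding in batch:
    print(encoding.tokens)
```

The first sequence is truncated to 4 tokens and the second is padded to match it, so every encoding in the batch comes out the same length.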


Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.21.4.tar.gz (351.3 kB, Source)

Built Distributions

  • tokenizers-0.21.4-cp39-abi3-win_amd64.whl (2.5 MB, CPython 3.9+, Windows x86-64)
  • tokenizers-0.21.4-cp39-abi3-win32.whl (2.3 MB, CPython 3.9+, Windows x86)
  • tokenizers-0.21.4-cp39-abi3-musllinux_1_2_x86_64.whl (9.5 MB, CPython 3.9+, musl 1.2+ x86-64)
  • tokenizers-0.21.4-cp39-abi3-musllinux_1_2_i686.whl (9.3 MB, CPython 3.9+, musl 1.2+ i686)
  • tokenizers-0.21.4-cp39-abi3-musllinux_1_2_armv7l.whl (9.1 MB, CPython 3.9+, musl 1.2+ ARMv7l)
  • tokenizers-0.21.4-cp39-abi3-musllinux_1_2_aarch64.whl (9.1 MB, CPython 3.9+, musl 1.2+ ARM64)
  • tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB, CPython 3.9+, glibc 2.17+ x86-64)
  • tokenizers-0.21.4-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.2 MB, CPython 3.9+, glibc 2.17+ s390x)
  • tokenizers-0.21.4-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.4 MB, CPython 3.9+, glibc 2.17+ ppc64le)
  • tokenizers-0.21.4-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.2 MB, CPython 3.9+, glibc 2.17+ i686)
  • tokenizers-0.21.4-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.9 MB, CPython 3.9+, glibc 2.17+ ARMv7l)
  • tokenizers-0.21.4-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.0 MB, CPython 3.9+, glibc 2.17+ ARM64)
  • tokenizers-0.21.4-cp39-abi3-macosx_11_0_arm64.whl (2.7 MB, CPython 3.9+, macOS 11.0+ ARM64)
  • tokenizers-0.21.4-cp39-abi3-macosx_10_12_x86_64.whl (2.9 MB, CPython 3.9+, macOS 10.12+ x86-64)

File details

Details for the file tokenizers-0.21.4.tar.gz.

File metadata

  • Download URL: tokenizers-0.21.4.tar.gz
  • Size: 351.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.2

File hashes

Hashes for tokenizers-0.21.4.tar.gz

Algorithm   Hash digest
SHA256      fa23f85fbc9a02ec5c6978da172cdcbac23498c3ca9f3645c5c68740ac007880
MD5         21c2f42d341a85ac94c5483173f1fb29
BLAKE2b-256 c22f402986d0823f8d7ca139d969af2917fefaa9b947d1fb32f6168c509f2492

