
Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main Rust repository (https://github.com/huggingface/tokenizers).

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for both research and production.
  • Normalization comes with alignment tracking: it is always possible to get the part of the original sentence that corresponds to a given token (see the sketch after this list).
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
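
For example, the offsets stored in an Encoding let you map any token back to its span in the original text, and truncation/padding can be switched on directly on the tokenizer. Here is a minimal sketch, assuming you use the pretrained tokenizer loaded from the Hub (as shown further below; downloading it requires network access):

from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hub (see the dedicated section below)
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Each token carries (start, end) character offsets into the original sentence.
# Special tokens such as [CLS] and [SEP] map to the empty (0, 0) span.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])

# Truncation and padding are enabled directly on the tokenizer
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))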

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use your own as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
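
Once installed, a quick sanity check confirms the compiled extension imports correctly (an optional step, run inside the same virtual env):

# Verify the installation by importing the package and printing its version
import tokenizers

print(tokenizers.__version__)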

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
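
The returned object is a regular Tokenizer, so, continuing from the snippet above, you can encode text right away (a small usage sketch; fetching the tokenizer requires network access to the Hub):

# Encode a sentence and inspect the resulting tokens and ids
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
print(encoded.ids)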

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using the corresponding vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
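
An Encoding can also be turned back into text with decode. Continuing from the example above, a minimal sketch (the exact output depends on the tokenizer's decoder):

# Map the ids back to a string; special tokens are skipped by default
decoded = tokenizer.decode(encoded.ids)
print(decoded)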

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
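
The saved JSON file can later be reloaded with the generic Tokenizer.from_file, the same mechanism shown for the byte-level BPE further below; a minimal sketch:

from tokenizers import Tokenizer

# Reload the tokenizer that was just saved to disk
tokenizer = Tokenizer.from_file("./path/to/directory/my-bpe.tokenizer.json")
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)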

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece

All of these can be used and trained as explained above!
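
For instance, BertWordPieceTokenizer is initialized from a single vocab.txt file rather than a vocab.json/merges.txt pair. A small sketch, using a placeholder path:

from tokenizers import BertWordPieceTokenizer

# Initialize from an existing WordPiece vocabulary (one token per line)
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt", lowercase=False)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)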

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
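
If you have many sentences to tokenize, encode_batch processes a whole list at once. Continuing from the snippet above, a brief sketch:

# Encode several sentences in one call
encodings = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog",
])
for encoding in encodings:
    print(encoding.tokens)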


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.21.2.tar.gz (351.5 kB): Source

Built Distributions

  • tokenizers-0.21.2-cp39-abi3-win_amd64.whl (2.5 MB): CPython 3.9+, Windows x86-64
  • tokenizers-0.21.2-cp39-abi3-win32.whl (2.3 MB): CPython 3.9+, Windows x86
  • tokenizers-0.21.2-cp39-abi3-musllinux_1_2_x86_64.whl (9.5 MB): CPython 3.9+, musllinux: musl 1.2+ x86-64
  • tokenizers-0.21.2-cp39-abi3-musllinux_1_2_i686.whl (9.3 MB): CPython 3.9+, musllinux: musl 1.2+ i686
  • tokenizers-0.21.2-cp39-abi3-musllinux_1_2_armv7l.whl (9.1 MB): CPython 3.9+, musllinux: musl 1.2+ ARMv7l
  • tokenizers-0.21.2-cp39-abi3-musllinux_1_2_aarch64.whl (9.1 MB): CPython 3.9+, musllinux: musl 1.2+ ARM64
  • tokenizers-0.21.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+ x86-64
  • tokenizers-0.21.2-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.2 MB): CPython 3.9+, manylinux: glibc 2.17+ s390x
  • tokenizers-0.21.2-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.5 MB): CPython 3.9+, manylinux: glibc 2.17+ ppc64le
  • tokenizers-0.21.2-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.2 MB): CPython 3.9+, manylinux: glibc 2.17+ i686
  • tokenizers-0.21.2-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.9 MB): CPython 3.9+, manylinux: glibc 2.17+ ARMv7l
  • tokenizers-0.21.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.0 MB): CPython 3.9+, manylinux: glibc 2.17+ ARM64
  • tokenizers-0.21.2-cp39-abi3-macosx_11_0_arm64.whl (2.7 MB): CPython 3.9+, macOS 11.0+ ARM64
  • tokenizers-0.21.2-cp39-abi3-macosx_10_12_x86_64.whl (2.9 MB): CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file tokenizers-0.21.2.tar.gz.

File metadata

  • Download URL: tokenizers-0.21.2.tar.gz
  • Upload date:
  • Size: 351.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.0

File hashes

  • SHA256: fdc7cffde3e2113ba0e6cc7318c40e3438a4d74bbc62bf04bcc63bdfb082ac77
  • MD5: bab3398f4c622a2628e68b2511ef6b3d
  • BLAKE2b-256: ab2db0fce2b8201635f60e8c95990080f58461cc9ca3d5026de2e900f38a7f21

File details

Details for the file tokenizers-0.21.2-cp39-abi3-win_amd64.whl.

File hashes

  • SHA256: 58747bb898acdb1007f37a7bbe614346e98dc28708ffb66a3fd50ce169ac6c98
  • MD5: 5fa3b436e0cbcf3cbd1b31aa77d062dc
  • BLAKE2b-256: 13c3cc2755ee10be859c4338c962a35b9a663788c0c0b50c0bdd8078fb6870cf

File details

Details for the file tokenizers-0.21.2-cp39-abi3-win32.whl.

File metadata

  • Download URL: tokenizers-0.21.2-cp39-abi3-win32.whl
  • Upload date:
  • Size: 2.3 MB
  • Tags: CPython 3.9+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.0

File hashes

  • SHA256: cabda5a6d15d620b6dfe711e1af52205266d05b379ea85a8a301b3593c60e962
  • MD5: 716a8c81d443a5dfe1b807b145bd31ed
  • BLAKE2b-256: d8a5896e1ef0707212745ae9f37e84c7d50269411aef2e9ccd0de63623feecdf

File details

Details for the file tokenizers-0.21.2-cp39-abi3-musllinux_1_2_x86_64.whl.

File hashes

  • SHA256: 106746e8aa9014a12109e58d540ad5465b4c183768ea96c03cbc24c44d329958
  • MD5: e0918e896c63d212b027853dac8e727a
  • BLAKE2b-256: a4d2faa1acac3f96a7427866e94ed4289949b2524f0c1878512516567d80563c

File details

Details for the file tokenizers-0.21.2-cp39-abi3-musllinux_1_2_i686.whl.

File hashes

  • SHA256: 0e73770507e65a0e0e2a1affd6b03c36e3bc4377bd10c9ccf51a82c77c0fe365
  • MD5: bf65879c181f34f25f3f9b4727763636
  • BLAKE2b-256: 637b5440bf203b2a5358f074408f7f9c42884849cd9972879e10ee6b7a8c3b3d

File details

Details for the file tokenizers-0.21.2-cp39-abi3-musllinux_1_2_armv7l.whl.

File hashes

  • SHA256: ed21dc7e624e4220e21758b2e62893be7101453525e3d23264081c9ef9a6d00d
  • MD5: 91f14070c84924b6d1a81a35d27f3752
  • BLAKE2b-256: 6cbdac386d79c4ef20dc6f39c4706640c24823dca7ebb6f703bfe6b5f0292d88

File details

Details for the file tokenizers-0.21.2-cp39-abi3-musllinux_1_2_aarch64.whl.

File hashes

  • SHA256: 2c41862df3d873665ec78b6be36fcc30a26e3d4902e9dd8608ed61d49a48bc19
  • MD5: 9f8917a01196c1a317bfd5892dca7f26
  • BLAKE2b-256: 3c6abc220a11a17e5d07b0dfb3b5c628621d4dcc084bccd27cfaead659963016

File details

Details for the file tokenizers-0.21.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

  • SHA256: fed9a4d51c395103ad24f8e7eb976811c57fbec2af9f133df471afcd922e5020
  • MD5: 1687c605e98622a6c7e8ed1fa00e00bf
  • BLAKE2b-256: c574f41a432a0733f61f3d21b288de6dfa78f7acff309c6f0f323b2833e9189f

File details

Details for the file tokenizers-0.21.2-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File hashes

  • SHA256: b1b9405822527ec1e0f7d8d2fdb287a5730c3a6518189c968254a8441b21faae
  • MD5: 9c7df6206897936f07d37a258e1fbfd4
  • BLAKE2b-256: 385f959f3a8756fc9396aeb704292777b84f02a5c6f25c3fc3ba7530db5feb2c

File details

Details for the file tokenizers-0.21.2-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File hashes

  • SHA256: 514cd43045c5d546f01142ff9c79a96ea69e4b5cda09e3027708cb2e6d5762ab
  • MD5: d7b25e6086fa863a9f3d03aceba62273
  • BLAKE2b-256: 001579713359f4037aa8f4d1f06ffca35312ac83629da062670e8830917e2153

File details

Details for the file tokenizers-0.21.2-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File hashes

  • SHA256: 5e9944e61239b083a41cf8fc42802f855e1dca0f499196df37a8ce219abac6eb
  • MD5: 1a145aeaaa396b41989a1a7aeb30ac6b
  • BLAKE2b-256: a52e53e8fd053e1f3ffbe579ca5f9546f35ac67cf0039ed357ad7ec57f5f5af0

File details

Details for the file tokenizers-0.21.2-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File hashes

  • SHA256: 8bd8999538c405133c2ab999b83b17c08b7fc1b48c1ada2469964605a709ef91
  • MD5: 7115df53491d926cb64f8085b2222474
  • BLAKE2b-256: 0515fd2d8104faa9f86ac68748e6f7ece0b5eb7983c7efc3a2c197cb98c99030

File details

Details for the file tokenizers-0.21.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

  • SHA256: 4a32cd81be21168bd0d6a0f0962d60177c447a1aa1b1e48fa6ec9fc728ee0b12
  • MD5: d736db2bee4be16feca2e85f5af87362
  • BLAKE2b-256: 332b1791eb329c07122a75b01035b1a3aa22ad139f3ce0ece1b059b506d9d9de

File details

Details for the file tokenizers-0.21.2-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

  • SHA256: 126df3205d6f3a93fea80c7a8a266a78c1bd8dd2fe043386bafdd7736a23e45f
  • MD5: 5debc4095b99f43c621b2eae897294f8
  • BLAKE2b-256: 6ce633f41f2cc7861faeba8988e7a77601407bf1d9d28fc79c5903f8f77df587

File details

Details for the file tokenizers-0.21.2-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

  • SHA256: 342b5dfb75009f2255ab8dec0041287260fed5ce00c323eb6bab639066fef8ec
  • MD5: f1053c3a3bec21e779a5b9069f9cd69f
  • BLAKE2b-256: 1dcc2936e2d45ceb130a21d929743f1e9897514691bec123203e10837972296f
