Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main tokenizers repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token (see the sketch after this list).
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
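
As a quick illustration of alignment tracking, here is a minimal sketch (it assumes a pretrained tokenizer from the Hub, loaded as shown further down) that prints the original-text span behind each token:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# encoded.offsets holds one (start, end) character pair per token,
# pointing back into the original sentence
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])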

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install Rust with rustup:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use your own as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
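
You can check that the build succeeded by importing the package from the same virtual env (the version printed depends on the commit you checked out):

python -c "import tokenizers; print(tokenizers.__version__)"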

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
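
The returned Tokenizer is ready to use right away. As a minimal sketch of the pre-processing mentioned above (max_length=128 and the "[PAD]" token are illustrative choices for this example, not required values):

# Truncate long inputs and pad batches to a common length
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(pad_token="[PAD]")

# encode_batch tokenizes several inputs at once
encodings = tokenizer.encode_batch(["Hello, y'all!", "How are you?"])
print([e.tokens for e in encodings])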

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
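
Continuing the snippet above, the ids can also be turned back into a readable string with decode:

# Decode the ids back to a string
print(tokenizer.decode(encoded.ids))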

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
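
For instance, here is the WordPiece variant following the same loading pattern (the vocab path below is a placeholder, and lowercase=False is just an illustrative choice for a cased vocabulary):

from tokenizers import BertWordPieceTokenizer

# Load from an existing WordPiece vocabulary (placeholder path)
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt", lowercase=False)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)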

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
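
Since the decoder was set to ByteLevel when this tokenizer was built, you can also map the ids straight back to text. Continuing the snippet above:

# The ByteLevel decoder restores the original text from the ids
print(tokenizer.decode(encoded.ids))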


