Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
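The alignment tracking mentioned above can be illustrated with a toy pre-tokenizer in plain Python (a conceptual sketch only, not the library's implementation): each token carries the `(start, end)` span of the original string it came from, so it can always be mapped back.

```python
def pretokenize_with_offsets(text):
    """Split on whitespace, recording each token's (start, end) span
    in the original string so tokens can be mapped back later."""
    tokens, start = [], 0
    for i, ch in enumerate(text + " "):  # sentinel space flushes the last token
        if ch == " ":
            if i > start:
                tokens.append((text[start:i], (start, i)))
            start = i + 1
    return tokens

text = "I can feel the magic"
spans = pretokenize_with_offsets(text)
# Every recorded span slices back to exactly its token:
assert all(text[s:e] == tok for tok, (s, e) in spans)
```

The real library keeps these offsets consistent even through normalization (lowercasing, accent stripping, etc.), which is the harder part.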

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
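As a rough intuition for what BPE training does (a simplified sketch in plain Python, nothing like the library's optimized Rust implementation): it repeatedly finds the most frequent adjacent symbol pair in the corpus and merges it into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: "low" x5, "lower" x2, as symbol tuples
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
pair = most_frequent_pair(words)  # ("l", "o"), 7 occurrences (tied with ("o", "w"))
words = merge_pair(words, ("l", "o"))
# words is now {("lo", "w"): 5, ("lo", "w", "e", "r"): 2}
```

Training just repeats this loop until the target vocabulary size is reached; the character-level vs byte-level variants differ in what the initial symbols are.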

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.
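The pipeline idea behind "putting parts together" can be sketched in plain Python (a toy illustration of the architecture, not the library's API): a tokenizer is just a normalizer, a pre-tokenizer, and a model chained in sequence, and each stage is swappable.

```python
class ToyTokenizer:
    """Conceptual sketch: each pipeline stage is a plain callable,
    assembled the way the library lets you plug components together."""

    def __init__(self, normalizer, pre_tokenizer, model):
        self.normalizer = normalizer        # text -> text
        self.pre_tokenizer = pre_tokenizer  # text -> list of words
        self.model = model                  # word -> list of tokens

    def encode(self, text):
        text = self.normalizer(text)
        tokens = []
        for word in self.pre_tokenizer(text):
            tokens.extend(self.model(word))
        return tokens

tok = ToyTokenizer(
    normalizer=str.lower,
    pre_tokenizer=str.split,
    model=lambda w: [w],  # identity "model": one token per word
)
print(tok.encode("I can FEEL the magic"))  # ['i', 'can', 'feel', 'the', 'magic']
```

Swapping `model` for a trained BPE or WordPiece (and adding a post-processor for special tokens) gives you the real pipeline, which is exactly what the next example assembles with the library's components.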

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
