
Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main tokenizers repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the short sketch after this list).
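For instance, here is a short sketch of the last two points, using the pretrained "bert-base-cased" tokenizer that is loaded later on this page. The exact tokens you get back depend on the tokenizer you use:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

# Offsets map every token back to its span in the original string
text = "Hello, y'all! How are you?"
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", text[start:end])

# Truncation and padding are handled by the tokenizer itself
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(length=8)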

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
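
To check that the build worked, one simple option (with the virtual env still active) is to import the freshly installed package and print its version:

# Quick sanity check of the freshly built bindings
python -c "import tokenizers; print(tokenizers.__version__)"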

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
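
As an illustration (the file paths below are placeholders), the BertWordPieceTokenizer follows exactly the same pattern:

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer on plain-text files, then save it
tokenizer = BertWordPieceTokenizer()
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])
tokenizer.save("./path/to/directory/my-wordpiece.tokenizer.json")

# Or load one from an existing vocabulary file
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt")
encoded = tokenizer.encode("I can feel the magic, can you?")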

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
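
Since a decoder was set when the tokenizer was built, you can also map the ids back to a string. A minimal follow-up to the snippet above:

# Inspect the encoding and decode it back to text
print(encoded.tokens)
print(encoded.ids)
print(tokenizer.decode(encoded.ids))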


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.20.3rc0.tar.gz (343.8 kB, Source)

Built Distributions

  • tokenizers-0.20.3rc0-cp39-abi3-win_amd64.whl (2.4 MB, CPython 3.9+, Windows x86-64)
  • tokenizers-0.20.3rc0-cp39-abi3-win32.whl (2.2 MB, CPython 3.9+, Windows x86)
  • tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_x86_64.whl (9.3 MB, CPython 3.9+, musllinux: musl 1.2+ x86-64)
  • tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_i686.whl (9.1 MB, CPython 3.9+, musllinux: musl 1.2+ i686)
  • tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_armv7l.whl (8.9 MB, CPython 3.9+, musllinux: musl 1.2+ ARMv7l)
  • tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_aarch64.whl (9.0 MB, CPython 3.9+, musllinux: musl 1.2+ ARM64)
  • tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB, CPython 3.9+, manylinux: glibc 2.17+ x86-64)
  • tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.4 MB, CPython 3.9+, manylinux: glibc 2.17+ s390x)
  • tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.1 MB, CPython 3.9+, manylinux: glibc 2.17+ ppc64le)
  • tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.1 MB, CPython 3.9+, manylinux: glibc 2.17+ i686)
  • tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.8 MB, CPython 3.9+, manylinux: glibc 2.17+ ARMv7l)
  • tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.9 MB, CPython 3.9+, manylinux: glibc 2.17+ ARM64)
  • tokenizers-0.20.3rc0-cp39-abi3-macosx_11_0_arm64.whl (2.6 MB, CPython 3.9+, macOS 11.0+ ARM64)
  • tokenizers-0.20.3rc0-cp39-abi3-macosx_10_12_x86_64.whl (2.6 MB, CPython 3.9+, macOS 10.12+ x86-64)

File details

Details for the file tokenizers-0.20.3rc0.tar.gz.

File metadata

  • Download URL: tokenizers-0.20.3rc0.tar.gz
  • Upload date:
  • Size: 343.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.4

File hashes

Hashes for tokenizers-0.20.3rc0.tar.gz
Algorithm Hash digest
SHA256 df4fbb4dcba540ff08542064c952954aa7c3047ce926745c43d0d877958354fc
MD5 d4152c6338e0d0894fce2368a84a5918
BLAKE2b-256 719c787243d765573d5d896a7c262bcfb31b53540df09399cbfde016ecb9f603
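
If you want to verify a downloaded archive against the SHA256 digest listed above, one option is a short check with Python's standard library (the filename is whichever file you downloaded):

import hashlib

# Compute the SHA256 of the downloaded file and compare it with the published digest
with open("tokenizers-0.20.3rc0.tar.gz", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())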


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 cbad047a7d4634670bce00d5c4518e15c02819e1d9e520f878dcdffcb576d9a4
MD5 1cfb10ae5fdf821fdf0ad45e74fcd9eb
BLAKE2b-256 2e4cb036de282531da16445ea127703222c1c32b46e19ff5b1f31e9ac46b4142


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-win32.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-win32.whl
Algorithm Hash digest
SHA256 c5cc9cef9ec6be42e727d6e01638c02ec75db243b19d88114b0cee38f5da7db0
MD5 7b481ab8a353103e8c78fed78b9f37f8
BLAKE2b-256 6817f1be526b2ee9ec02f0589764f89edf066810dfcb8d89afef1f7082d454bb


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 641fcddaf586328b7a9225babb16d4335a94ba8f40ce20b90ed0ffe521b893be
MD5 856517f24b8987b93ecbc3d3cd9843bd
BLAKE2b-256 c108d39d5bd89a7e44e5c68fbebc6601c5016283dc6d29b75454dd203a85331b


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_i686.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 e45e1ee8f31f7d74f20ae9cdcb678f3513a331d03986ce399422cc3959a9473c
MD5 3275feb39fa621c0f47e61ad7be3893b
BLAKE2b-256 640a38247ee89079ff87ff8f7c341851b70bf082e6fe6145e6dfdef7469d140e


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_armv7l.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 1b7447e4e29b68b96709f60b8c6e3c05f0019a0db5c1f7014be5520b95a4eabd
MD5 a8e6af9379e42daf3929208b8fa7c808
BLAKE2b-256 27ac56f55d330abb7bdebc1bee6242e03e49b49401001b41d5313494f28ef6ea


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_aarch64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 5969ae008e83a7d1dffeecdf7c3f5c436e0198913763f268cb44090f808496a8
MD5 da33ec3a1ca77a1d0909b7678e5a930f
BLAKE2b-256 a92ff2f115afc2ad0b56d85af31687b460c8a4ad63cd21233a65131930e164eb


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 229f6a0a883fb531ec9d4f9c4f8e8458cdc6500ee25fb5287424e3de41d04b35
MD5 8f8e1cc62c0f475cd657db257888f2f3
BLAKE2b-256 aaca40dc572885df8fcfa8263f72850c8d21b0c6ad10d013c4c91d8c7f472f83


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 658e448552ce0297c32a15022ff4a47b57d92c935b48b346e5f38890e01dbdb8
MD5 9fe13e8a042a0b9bed28f9d585255e1f
BLAKE2b-256 2860cca64d6a764d385ed366869e91736bd15731c3773f18a5183eb004414aa1


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 6dccb398d7bd6c9088aefea9533f50b457e31b16e96b274aeb57cbfd4067bfa8
MD5 eb95921182a94e5cf20316c0904f6d5f
BLAKE2b-256 054fe48623d61b01ea60b52169049fe852136f74775f11aee0529e2891453490


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 8959d96a4b545434918fc0d61657ce6f06982b0688a06e4d164ec8053dad7c10
MD5 1c59ab517528fb830f024d02568f5144
BLAKE2b-256 79058ba144cf1202c68776fcb7a827a08396533e2114baa50946d8d107c06dd2


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 f8ada9ce1a5b4eb2a0c2878f7b437ce79d4547d31650d11e5bbb7a2580dbfd7b
MD5 1d696b5346a832064decd3deebd8a26d
BLAKE2b-256 9bf6699e88c71af25d6b5dbe243d386eec4261cf43a12d258a048c187f7a25cd


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 de7bc4434aad9776ed346667dea03755823b70facc310a3c29de847a54402385
MD5 debf8eb7fb787974b6088376a97fd4b1
BLAKE2b-256 79a7343b170465614ebe8ceb383a8a704df19caacf33095cc107dd029ac5a9ee


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2f036ed0059b1518900e8131ec40fc0d8608f18ec058a0204ec518c29cd5d259
MD5 9b60cc6034586fc6d4b0db8f046853d2
BLAKE2b-256 ad120f3d50188f4dd910cf745cf2e722569c5457297a2d00a39568df2e04a851


File details

Details for the file tokenizers-0.20.3rc0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.20.3rc0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 79e146f2761e0de993b169858fdafdea81a63b8d481a077805e762694c5ec06c
MD5 f9c454fca74b7aea4c0fc4bbe11081a9
BLAKE2b-256 9857eec65e3f5a906bd063675058a1c0e9183d5cf65e4196fecf0eb4300ab1f7

