Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
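The alignment tracking mentioned above can be sketched as follows. This is an illustrative example: the tiny in-memory corpus and the `train_from_iterator` call are just for demonstration, not a recommended training setup.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a tiny BPE tokenizer on an in-memory corpus (illustrative only)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], vocab_size=100)
tokenizer.train_from_iterator(["I can feel the magic, can you?"], trainer=trainer)

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Each token carries (start, end) offsets into the original sentence,
# so you can always recover the exact span a token came from
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])
```

Because no normalizer rewrites the input here, each token's offsets point at exactly the substring it was produced from.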

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
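All four share the same train/encode interface. As a minimal sketch, here is the byte-level variant trained with `train_from_iterator` on a toy in-memory corpus (purely for illustration, in place of real training files):

```python
from tokenizers import ByteLevelBPETokenizer

text = "I can feel the magic, can you?"

# Train a byte-level BPE directly from an in-memory iterator (toy corpus)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator([text], vocab_size=300, min_frequency=1)

encoded = tokenizer.encode(text)
print(encoded.tokens)

# Byte-level tokenization is lossless: decoding recovers the input exactly
print(tokenizer.decode(encoded.ids))
```

Byte-level BPE never falls back to an unknown token, since every input byte is covered by the initial alphabet; that is what makes the decode round-trip exact.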

Build your own

Whenever the provided tokenizers don't give you enough flexibility, you can build your own by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Using this tokenizer later is then as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
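Putting save and reload together, here is an end-to-end sketch. The tokenizer is trained on a toy in-memory corpus and saved to a temp-file path, both purely for illustration:

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer, Tokenizer

text = "I can feel the magic, can you?"

# Train a small byte-level BPE in memory (illustrative corpus)
trained = ByteLevelBPETokenizer()
trained.train_from_iterator([text], vocab_size=300, min_frequency=1)

# Save to a single JSON file, then reload with the generic Tokenizer class
path = os.path.join(tempfile.mkdtemp(), "demo.tokenizer.json")
trained.save(path)
tokenizer = Tokenizer.from_file(path)

encoded = tokenizer.encode(text)
print(tokenizer.decode(encoded.ids))
```

The saved JSON file carries the full pipeline (pre-tokenizer, model, decoder), so the reloaded `Tokenizer` behaves identically to the one that was trained.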
