Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them using a vocab.json and a merges.txt file:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
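As a quick end-to-end sketch, the snippet below trains one of the listed tokenizers (ByteLevelBPETokenizer) on a small throwaway corpus written to a temporary file; the corpus and parameters are made up for the example:

```python
import tempfile
from tokenizers import ByteLevelBPETokenizer

# Write a tiny throwaway corpus to disk, since train() expects file paths
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 50)
    corpus = f.name

# Train a byte-level BPE on it (vocab_size is tiny on purpose:
# the 256-byte initial alphabet plus a handful of merges)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train([corpus], vocab_size=300, min_frequency=2)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
```

Because byte-level BPE covers every possible byte, decoding the ids reproduces the original string exactly, with no unknown tokens.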

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
