Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings to the Rust implementation. If you are interested in the high-level design, you can find it in the main tokenizers repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT's WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation: it takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for both research and production.
  • Normalization comes with alignment tracking: it is always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
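As a small illustration of the alignment tracking mentioned above, the following sketch trains a throwaway BPE tokenizer on an in-memory corpus (the sentence and training parameters are arbitrary choices for the example) and maps each token back to its span in the original sentence:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build and train a tiny, throwaway tokenizer entirely in memory
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 10, trainer=trainer)

sentence = "I can feel the magic"
output = tokenizer.encode(sentence)

# Each offset is a (start, end) character span into the original sentence
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, repr(sentence[start:end]))
```

Since no normalizer rewrites the text here, slicing the sentence with a token's offsets gives back exactly that token.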

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
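Either way, a quick sanity check that the package imports correctly (the version string will vary):

```shell
python -c "import tokenizers; print(tokenizers.__version__)"
```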

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using their vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
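They also all support training from an in-memory iterator instead of files. For instance, here is a hypothetical training run (toy corpus and vocabulary size chosen purely for illustration) with the byte-level version:

```python
from tokenizers import ByteLevelBPETokenizer

# Train on an in-memory iterator instead of files (toy corpus for illustration)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 10, vocab_size=300)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
```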

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
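For a self-contained sanity check of the whole save / load round trip (temporary file and toy vocabulary, all values here are illustrative): byte-level BPE is lossless, so decoding the ids recovers the original text.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Train a tiny byte-level BPE in memory (toy corpus, small vocabulary)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 5, trainer=trainer)

# Save to a temporary file, reload, and check the round trip
path = os.path.join(tempfile.mkdtemp(), "demo.tokenizer.json")
tokenizer.save(path)

reloaded = Tokenizer.from_file(path)
encoded = reloaded.encode("I can feel the magic, can you?")
decoded = reloaded.decode(encoded.ids)
print(decoded.strip())
```

The strip() only removes the leading space that add_prefix_space=True inserts before the first word.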
