Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
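To make the last two points concrete, here is a minimal sketch. It trains a throwaway BPE on a few repeated in-memory sentences via `train_from_iterator` (hypothetical data, just so there is a working vocabulary to demonstrate with), then shows that truncation and padding are opt-in one-liners and that each token carries its character span in the original text:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a throwaway BPE on a handful of in-memory sentences
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[PAD]"], vocab_size=100)
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 10, trainer=trainer)

# Truncation and padding are opt-in, one call each
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=8
)

text = "I can feel the magic, can you?"
encoding = tokenizer.encode(text)
print(len(encoding.ids))  # always 8: truncated or padded to the fixed length

# Alignment tracking: each token knows its (start, end) span in the input
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, "->", repr(text[start:end]))
```

Padding tokens get the empty span `(0, 0)`, so the offsets of real tokens always point back into the original sentence.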

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
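As a sketch of that shared interface, here is the same train-then-encode flow with ByteLevelBPETokenizer, using `train_from_iterator` (the in-memory counterpart of `train`) on a hypothetical handful of repeated sentences instead of files:

```python
from tokenizers import ByteLevelBPETokenizer

# Same workflow as the CharBPETokenizer example above, just byte-level
tokenizer = ByteLevelBPETokenizer()

# train_from_iterator accepts any iterator of strings, so no files are needed
tokenizer.train_from_iterator(
    ["I can feel the magic, can you?"] * 10,
    vocab_size=300,  # must leave room for the 256-byte base alphabet
    min_frequency=2,
)

text = "I can feel the magic, can you?"
encoded = tokenizer.encode(text)
print(encoded.tokens)

# Byte-level BPE is lossless, so decoding round-trips the input
print(tokenizer.decode(encoded.ids))
```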

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer, by putting all the different parts you need together. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
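Putting the two halves together, here is a self-contained round trip. This is a sketch: it trains the same byte-level BPE from an in-memory iterator instead of the dataset files above, and saves to a temporary directory rather than the working directory:

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Build and train a tiny byte-level BPE in memory (same pieces as above)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 10, trainer=trainer)

# Save, reload with from_file, and use the reloaded tokenizer
path = os.path.join(tempfile.mkdtemp(), "byte-level-bpe.tokenizer.json")
tokenizer.save(path, pretty=True)

reloaded = Tokenizer.from_file(path)
encoded = reloaded.encode("I can feel the magic, can you?")
decoded = reloaded.decode(encoded.ids)
print(encoded.tokens)
print(decoded)
```

The saved JSON file carries the whole pipeline (pre-tokenizer, decoder, post-processor included), which is why a single `from_file` call is enough to reproduce the original behavior.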

