Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
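The alignment tracking and pre-processing above work on any Tokenizer. Here is a minimal sketch, assuming a tiny made-up word-level vocab purely for illustration (the pre-made tokenizers configure all of this for you):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Tiny, made-up word-level vocab purely for illustration
vocab = {"[UNK]": 0, "i": 1, "can": 2, "feel": 3, "the": 4, "magic": 5}
tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

sentence = "I can feel the magic"
encoding = tokenizer.encode(sentence)

# Offsets point back into the ORIGINAL sentence, even though
# normalization lowercased the text before tokenization:
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, "->", repr(sentence[start:end]))

# Padding (and, likewise, truncation) is handled by the tokenizer itself:
tokenizer.enable_padding(pad_id=0, pad_token="[UNK]", length=8)
padded = tokenizer.encode(sentence)  # padded to 8 tokens
```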

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
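If your training data is already in memory, you don't need files on disk: the provided tokenizers also accept any iterator of strings via train_from_iterator (the tiny corpus and sizes below are arbitrary, for the sketch):

```python
from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer()

# Train directly from an in-memory iterator of strings
corpus = ["I can feel the magic, can you?", "The magic is real."]
tokenizer.train_from_iterator(corpus, vocab_size=200, min_frequency=1)

encoded = tokenizer.encode("I can feel the magic")
print(encoded.tokens)
```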

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
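For a quick sanity check without touching the filesystem, the same pipeline can be trained from an in-memory iterator and round-tripped through the decoder (the corpus and vocab_size below are arbitrary, for the sketch):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=300,  # arbitrary small size for the sketch
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train_from_iterator(["I can feel the magic, can you?"], trainer=trainer)

# Byte-level BPE is lossless: decoding recovers the input text
encoded = tokenizer.encode("I can feel the magic")
print(tokenizer.decode(encoded.ids))
```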

Typing support and stub.py

The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as Any. The stub.py helper walks the loaded extension modules, renders .pyi stub files (plus minimal forwarding __init__.py shims), and formats them so that tools like mypy/pyright can understand the public API. Run python stub.py whenever you change the Python-visible surface to keep the generated stubs in sync.
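The core idea is plain runtime introspection. Here is a rough, hypothetical sketch of the approach (this is not the actual stub.py, which also renders real signatures, docstrings, and the forwarding shims):

```python
import inspect
import types

def render_stub(module: types.ModuleType) -> str:
    """Render a very rough .pyi-style listing of a module's public API."""
    lines = []
    for name, member in inspect.getmembers(module):
        if name.startswith("_"):
            continue  # skip private/dunder attributes
        if inspect.isclass(member):
            lines.append(f"class {name}: ...")
        elif inspect.isroutine(member):  # plain and built-in functions alike
            lines.append(f"def {name}(*args, **kwargs): ...")
    return "\n".join(lines)

# Usage on a throwaway module standing in for the compiled extension:
m = types.ModuleType("demo")
m.Encoder = dict
m.tokenize = len
print(render_stub(m))
# prints:
# class Encoder: ...
# def tokenize(*args, **kwargs): ...
```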

