Fast and Customizable Tokenizers
Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
These are bindings over the Rust implementation. If you are interested in the high-level design, you can go check it out there.
Otherwise, let's dive in!
Main features:
- Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
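That last pre-processing step is easiest to picture in plain Python. The sketch below is purely illustrative of what the library automates for you; the token ids and the `[CLS]`/`[SEP]`/`[PAD]` convention are the BERT ones, chosen here as an assumption, not a fixed part of the library:

```python
def prepare(ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Truncate, add special tokens, then pad to max_len (BERT-style sketch)."""
    # Reserve two slots for [CLS] and [SEP], truncating the rest
    ids = ids[: max_len - 2]
    out = [cls_id] + ids + [sep_id]
    # Pad on the right up to max_len
    out += [pad_id] * (max_len - len(out))
    return out

print(prepare([7, 8, 9], max_len=6))  # [101, 7, 8, 9, 102, 0]
```

In the library itself all of this happens inside the Rust core, so you get the final, model-ready sequences directly from `encode`.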
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have Rust installed:
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
Once Rust is installed, you can compile the bindings as follows:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
Using the provided Tokenizers
Using a pre-trained tokenizer is really simple:
from tokenizers import BPETokenizer
# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
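Because normalization tracks alignments, each token can be mapped back to a span of the original sentence. A minimal pure-Python sketch of what those character offsets give you (the `(start, end)` pairs below are written by hand for illustration, standing in for what an aligned tokenizer would report):

```python
sentence = "I can feel the magic, can you?"
# Hypothetical character offsets for the first few tokens
offsets = [(0, 1), (2, 5), (6, 10)]
# Slicing the original sentence with each offset recovers the exact text
pieces = [sentence[start:end] for start, end in offsets]
print(pieces)  # ['I', 'can', 'feel']
```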
And you can train yours just as simply:
from tokenizers import BPETokenizer
# Initialize a tokenizer
tokenizer = BPETokenizer()
# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")
# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
Provided Tokenizers
- BPETokenizer: The original BPE
- ByteLevelBPETokenizer: The byte-level version of the BPE
- SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
- BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
All of these can be used and trained as explained above!
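To make the BPE variants above more concrete, here is a toy pure-Python sketch of how a learned merge list is applied to a word. The merge rules are made up for illustration, and the real work happens in the Rust core; this only shows the idea:

```python
def apply_bpe(word, merges):
    """Apply learned merge rules, in priority order, to a word's symbols."""
    symbols = list(word)
    for a, b in merges:  # merges are ordered by training priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                # Replace the adjacent pair with the merged symbol
                symbols[i : i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges))  # ['low', 'er']
```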
Build your own
You can also easily build your own tokenizers, by putting all the different parts you need together:
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
"I can feel the magic, can you?",
"The quick brown fox jumps over the lazy dog"
])
print(encoded)
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
"./path/to/dataset/1.txt",
"./path/to/dataset/2.txt",
"./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
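What the trainer does can also be sketched in miniature: BPE training repeatedly counts adjacent symbol pairs across the corpus and merges the most frequent one. The toy pure-Python version below illustrates only that core loop; real training additionally handles word frequencies, alphabets, and special tokens:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, return the top one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        a, b = pair
        # Apply the new merge everywhere before counting again
        for symbols in words:
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i : i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(train_bpe(["low", "lower", "lowest"], num_merges=2))
```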