Fast and Customizable Tokenizers
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High-level design: see the master branch at https://github.com/huggingface/tokenizers
This API is currently in the process of being stabilized. We may introduce breaking changes quite often in the coming days/weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From source:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, you can do the following.
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
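If the build succeeds, a quick sanity check is to import, from the same virtual environment, the modules used in the examples below; nothing beyond the package itself is assumed here.
# Minimal sanity check: run inside the virtual env where `maturin develop` was executed.
# These are the same modules imported in the usage examples below.
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
print("tokenizers built and importable")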
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
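Because a ByteLevel decoder was attached above, the token ids can in principle be mapped back to text. The sketch below is an assumption based on later releases of the library (an encoding result exposing its ids, and a Tokenizer.decode method accepting a list of ids); check the bindings of this exact version for the real accessor names.
# Hypothetical round-trip, assuming `.ids` and `tokenizer.decode(...)` behave
# as in later releases; the names may differ in 0.0.9.
single = tokenizer.encode("I can feel the magic, can you?")
decoded = tokenizer.decode(single.ids)
print(decoded)  # should recover the original text via the ByteLevel decoder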
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
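The freshly trained tokenizer can then be used exactly like the pre-trained one above, for example to encode a batch of sentences with the same pre-tokenizer and decoder configured before training:
# Reuse the tokenizer trained above with the same calls as in the
# pre-trained example; encode_batch returns one result per input sentence.
encoded_batch = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded_batch)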