Fast and Customizable Tokenizers
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High Level design: master
This API is still being stabilized. Breaking changes may land frequently over the coming days or weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, build and install the Python bindings:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog",
])
print(encoded)
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
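To give a feel for what the `vocab_size` and `min_frequency` arguments control during training, here is a toy byte-pair-encoding trainer in pure Python. It is only an illustrative sketch and not the library's Rust implementation: the function name `train_toy_bpe` is hypothetical, it works on characters rather than bytes, it skips pre-tokenization, and it breaks ties alphabetically, which the real trainer may not do.

```python
from collections import Counter


def train_toy_bpe(words, vocab_size, min_frequency=2):
    """Toy BPE trainer (illustration only, not the tokenizers implementation).

    Repeatedly merges the most frequent adjacent symbol pair until the
    vocabulary reaches `vocab_size` or no pair occurs at least
    `min_frequency` times.
    """
    # Represent each distinct word as a tuple of symbols with its count.
    corpus = Counter(tuple(word) for word in words)
    vocab = {symbol for word in corpus for symbol in word}
    merges = []

    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, count in corpus.items():
            for left, right in zip(word, word[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break

        # Highest count wins; ties are broken alphabetically (an assumption).
        (left, right), freq = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break

        merges.append((left, right))
        vocab.add(left + right)

        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for word, count in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus

    return vocab, merges


vocab, merges = train_toy_bpe(["low", "low", "lower", "lowest"], vocab_size=20)
print(merges)
```

With this tiny corpus, training stops well before 20 entries because no remaining pair occurs at least `min_frequency` times. A larger `vocab_size` only matters when the data contains enough frequent pairs to merge.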
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
tokenizers-0.0.10.tar.gz (30.7 kB)
Built Distributions
Hashes for tokenizers-0.0.10-cp38-cp38-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 64e3a5a335c46aca22ea68e99ba8d006dd537ac1d9d5f17c5f84e88436a2a673
MD5 | c5e4034229762be850d0610de4655764
BLAKE2b-256 | dcdd7577be71c73d1e78e50cdd37726b3344ba2bc762692f3d8608d357c9565c
Hashes for tokenizers-0.0.10-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 648d6ba6e54973a21be5faa992ef54b615e0cb93d902dbbce36c1061a525a369
MD5 | 34c67509f6d44c60ab4e9b6f6b9be2fe
BLAKE2b-256 | d3bbbcfca7a3e4b6e8440aebef2f21f0ffb9e25782b57e418ff24f6fffb1646a
Hashes for tokenizers-0.0.10-cp38-cp38-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ba92c859fd4191e7827fa4c1e745681e943f5be7ea21dd2308932da860b2016f
MD5 | 0c87305da872848746cb339f53269106
BLAKE2b-256 | c342914d6f7ec7bdeff91211bd14423bac16956a08b78e4a4b869283a19b61cb
Hashes for tokenizers-0.0.10-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 9088655cee1b04d0ffb069f32b39f02d1ca4182f078cdeef8b73361334ad0775
MD5 | ff72f6ccb9c76eeec3faec1d7eddf41e
BLAKE2b-256 | bdc3827e25d5dd5dc2849dfcd0619bea3c8e07100436ce5c503cdc0a88c433aa
Hashes for tokenizers-0.0.10-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 26193a5a648236c5f6b5dafee43c132cf81ca0cd12ddff8676be44116e9cfbc3
MD5 | fa2cf080c4271f9d95f8b9cd4cb6d93c
BLAKE2b-256 | 3dd89bbdc3cbcd6c41b397aa3847b466f0547127c5ba31de6d6ed42e194ffd74
Hashes for tokenizers-0.0.10-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 44e2a1eb48889615f17541deef013e7b881eb2c1e3d8c14889b792903bcdef76
MD5 | fdc129c352d649ba1bb58a8bfea84e53
BLAKE2b-256 | 2ab4a5295582ff3749315fbab604d340614ff170405d698f62e60ae9520ade4b
Hashes for tokenizers-0.0.10-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | bf855f993f71e35e5c37d4ef87715a1a5505d8bf1fb9e91db21e10dd2dc9693f
MD5 | 6e7d6fd436444343aeb8505886cbeee0
BLAKE2b-256 | 86bfdfc90f077b74c97565a5e41b46209042de2118b5724ceaacca75f63645ba
Hashes for tokenizers-0.0.10-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 84555c0af0105a59b6c9d92e53a38e5221651d7a7cb167b34c72d56dbd5d820c
MD5 | 5b1b66e45b10d13457fc92a778c26f7f
BLAKE2b-256 | 34a690ee8652a9976ae1f3fea081390312dcbe6d629c3ecaa2bf014b0e7ce95b
Hashes for tokenizers-0.0.10-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | d5df32b11d0b0a19ee1ebd4f8b32c2f056eaf7c63540908615ae4573c76c375a
MD5 | 7f39366b74eb5df83cf7dd5e0e4bdb9f
BLAKE2b-256 | 4ec26052a4f293a44b22f357011d1f5917076ec0f2881d5fa57657ba0b8167ea
Hashes for tokenizers-0.0.10-cp35-cp35m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 55395ba33d6d00935db13b4e27bcb5946040a206809f6efebe010c7d98861013
MD5 | d877d4623065b17b09c34c607ca1e945
BLAKE2b-256 | 8fda41fce30ba3756b7f035c44998419fa3abc9fe65d131d355c6061e5e1eb6a
Hashes for tokenizers-0.0.10-cp35-cp35m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 54bd2ffda27853c99cd41f8df8e585361481be043b0d6245d003d6f71c205e46
MD5 | 6db22572d6901172abeb0521003e61db
BLAKE2b-256 | 4e2e9a33edc51253b183852dee5de7f89566f0e53a5cf425ad1575964d83379d
Hashes for tokenizers-0.0.10-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 892713d929593a077c00167f9fe8a108fb2b5ce83b7a6d1324e417caf7629da6
MD5 | 54b104039424a3620b8af27c7c256a04
BLAKE2b-256 | df0c3aa923245aeea88cd0d8ddf6dec2bc82344c798cefee877dd3ee51354c5b
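The digests listed above can be checked against a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`. The helper names `file_digest` and `matches_published` are hypothetical, and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    `algorithm` may be "sha256", "md5", or "blake2b-256" (the three
    algorithms shown in the hash tables).
    """
    if algorithm == "blake2b-256":
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex digest, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches_published("tokenizers-0.0.10-cp38-cp38-manylinux1_x86_64.whl", "<SHA256 value from the table>")` should return `True` for an intact download, assuming the wheel sits in the current directory.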