Fast and Customizable Tokenizers
Tokenizers
A fast, easy-to-use implementation of today's most widely used tokenizers.
- High Level design: master
This API is still being stabilized. Breaking changes may land frequently in the coming days and weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From source:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, build and install the Python bindings:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
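After either installation route, it can help to confirm that the package is visible to the interpreter of the active virtual env. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is an illustration, not part of `tokenizers`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version found in the current environment, if any.
print("tokenizers:", installed_version("tokenizers") or "not installed")
```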
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
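To make the roles of `vocab.json` and `merges.txt` more concrete, here is a toy sketch in pure Python of how a BPE model can use them. It assumes the GPT-2 style layout (a JSON map from token to id, and one space-separated merge pair per line after a `#version` header); the file contents and the helpers `load_merges` and `apply_bpe` are hypothetical and are not the library's implementation, which also handles byte-level pre-tokenization.

```python
import json

# Hypothetical, hand-written file contents in the assumed GPT-2 style layout.
VOCAB_JSON = '{"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}'
MERGES_TXT = """#version: 0.2
l o
lo w
e r
"""


def load_merges(text):
    """Return {(left, right): rank}; a lower rank means the merge was learned earlier."""
    ranks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#version"):
            continue
        left, right = line.split()
        ranks[(left, right)] = len(ranks)
    return ranks


def apply_bpe(word, ranks):
    """Repeatedly merge the best-ranked adjacent pair until no known pair is left."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


vocab = json.loads(VOCAB_JSON)
ranks = load_merges(MERGES_TXT)
tokens = apply_bpe("lower", ranks)
ids = [vocab[token] for token in tokens]
print(tokens, ids)
```

The merge order in `merges.txt` matters: earlier lines take priority, which is why the ranks are derived from line position.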
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
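For intuition about what `vocab_size` and `min_frequency` control, the following toy trainer sketches the general BPE training loop in pure Python. It is an illustration under simplifying assumptions (whitespace-split words, character-level symbols, deterministic tie-breaking) and does not reproduce `trainers.BpeTrainer`; the function names `merge_pair` and `train_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, vocab_size, min_frequency):
    """Learn merges until the vocabulary holds `vocab_size` tokens or no
    adjacent pair occurs at least `min_frequency` times."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    word_counts = Counter(tuple(word) for word in words)
    vocab = sorted({symbol for word in word_counts for symbol in word})
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties by lexicographic order.
        best, freq = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        vocab.append(best[0] + best[1])
        new_counts = Counter()
        for symbols, count in word_counts.items():
            new_counts[merge_pair(symbols, best)] += count
        word_counts = new_counts
    return vocab, merges


corpus = "low low low lower lower newest newest newest widest".split()
vocab, merges = train_bpe(corpus, vocab_size=14, min_frequency=2)
print(merges)
```

Raising `min_frequency` stops training earlier because rare pairs are never merged, while `vocab_size` caps the total number of tokens, initial characters included.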