Fast and Customizable Tokenizers
Project description
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High-level design: master
This API is currently being stabilized. We may introduce breaking changes frequently in the coming days/weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
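To confirm the install worked, a simple import check is enough:
python -c "import tokenizers"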
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
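# You can confirm which toolchain is currently active with:
rustup show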
Once Rust is installed and the right toolchain is selected, you can do the following.
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `maturin` and build/install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
"I can feel the magic, can you?",
"The quick brown fox jumps over the lazy dog"
])
print(encoded)
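`encode` returns an Encoding object and `encode_batch` returns a list of them. The exact accessors vary by release; the sketch below assumes the `.ids` and `.tokens` attributes found in later versions of the library, so treat the attribute names as illustrative rather than guaranteed for 0.0.x.
# Inspect the results (`.ids`/`.tokens` are assumed here and may differ in 0.0.x)
single = tokenizer.encode("I can feel the magic, can you?")
print(single.ids)      # integer token ids
print(single.tokens)   # string tokens
batch = tokenizer.encode_batch(["first sentence", "second sentence"])
print(batch[0].tokens) # tokens of the first sentence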
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
"./path/to/dataset/1.txt",
"./path/to/dataset/2.txt",
"./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
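Once training finishes, the trained tokenizer behaves exactly like a pre-trained one, so batch encoding works the same way as shown above:
# Encode several sentences with the freshly trained tokenizer
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)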
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
tokenizers-0.0.5.tar.gz (28.8 kB)
Built Distributions
Hashes for tokenizers-0.0.5-cp38-cp38-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 92e573ac02df00237eb269338158db46b2548d70dab11be57393807763fb6b34
MD5 | 4002190e1810c6b6d7abc7ec996d1ebc
BLAKE2b-256 | 9322df5900a481ec54dddee81f858efb04a2225119e62bb811e02267be9afa50

Hashes for tokenizers-0.0.5-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | bcc0c17b480863c2734389c287da5cd25368ef985db019677374f2021c2a772f
MD5 | 7d886759c57f8e723f620c4601a26a2d
BLAKE2b-256 | db50e4c27d20553538452b431a595a7a113e2e552eaf352dfddf00e8b372841d

Hashes for tokenizers-0.0.5-cp38-cp38-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1e6c28d7ab133b0c1ad59f4d41991fd90109f486f96a703be5abcba6f700be68
MD5 | 82115505ad81e62312ced4c0d5e6de47
BLAKE2b-256 | 20110278c108c1053c90539075a72433bdf73cc59a0b2498460b25f2261439f1

Hashes for tokenizers-0.0.5-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 4bc35c7596f2aa435864eedb338ffe750128a4fa2b3885e6e54d2c2012488516
MD5 | 3fbe4af80120e976868cc5afc1a51768
BLAKE2b-256 | ac178ea9baf991c9724ae4d02856fcc3d360b0fb6e861f509d1a454a6fd5aca0

Hashes for tokenizers-0.0.5-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 0f5932f7d1afe977e976276c151cc6e447a0240f32db184712c6c49028b89374
MD5 | ad90d96c0853230d982a3187c9be30bf
BLAKE2b-256 | 139acd1afd4b9da0095a09ee45c06287d2172d4463fc6a8579722499eee46f4c

Hashes for tokenizers-0.0.5-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ab948374106284500b7991b498ead3d55e49af54f23389080ec5ed5e5ee1d666
MD5 | cc7e4941c24cfc5e1a784dbf9f5a3a3f
BLAKE2b-256 | bb899bec8b246dc9eb9d0a35a4551cab40d3bc69d3af9252b8bd393059eae601

Hashes for tokenizers-0.0.5-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | b47ab70875795944a54adbf8e5c0594b23c650b218133af3faa6afa8a89b35e5
MD5 | 79a54295b88e76016ae6d5103f385767
BLAKE2b-256 | 435e58c3612467145d409dc80ed581d8e4327998bb1d1b65d77ffa474a5fd887

Hashes for tokenizers-0.0.5-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 6afc29e591ba6f879c4d3bce9c40e1769b6cd59b90c53a63b36f4ecc8c6ef305
MD5 | 49164276c3501c71fdf9da6afc39e4bc
BLAKE2b-256 | a376e9cf639200332268ae351e746790076c1ca6497a175acef0fff6247ceebe

Hashes for tokenizers-0.0.5-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | b2fa0ee1df3563f991064e1f10e919ef0d66f93452cb9e1f5e627bea704931ba
MD5 | 85ca9e847f8c81f119563f0851cb03fb
BLAKE2b-256 | 7afc479f0c9698507d985e292a4d28f42a3eabd611db9c397fc44e26f2bcef22

Hashes for tokenizers-0.0.5-cp35-cp35m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | e2c9779774cd7892436df6795e865fa1a7a2479ce843fd89efe9d2b997d9ab8d
MD5 | 2dc1a077b59993abbd4cfa6e81df7b24
BLAKE2b-256 | 38b28272414625004a2391b65fa2d1a76b453408e2d10be84b2d9e5e93948845

Hashes for tokenizers-0.0.5-cp35-cp35m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 93dc0a8837b87691cdda03a3795dacf1ebd58590cf5eae0c4d1cd8038b2114cd
MD5 | 02c270cc5086f60fe9a39c049be22776
BLAKE2b-256 | a084aa8cd48cdbcb7ff471c4979dba890e9f8c19534106e7cc42638fde51c173

Hashes for tokenizers-0.0.5-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f0686aa2ee7fb21698747ceb18d446e35320e2a41576aa319df0282308690ec6
MD5 | 5e24461d5f89a8a3f2f524e5a2b1fab2
BLAKE2b-256 | be3edcf4eb23916c7943ebf6bd6783097885bb3d8b901ec7efa4d438646dffd9