Fast and Customizable Tokenizers
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
This API is currently in the process of being stabilized. We might introduce breaking changes quite often in the coming days/weeks, so use at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is active, you can do the following:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
"I can feel the magic, can you?",
"The quick brown fox jumps over the lazy dog"
])
print(encoded)
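Under the hood, a BPE model splits each input word into characters and then repeatedly merges adjacent symbol pairs according to the priorities learned in merges.txt (earlier lines have higher priority). The following is a minimal pure-Python sketch of that merge loop, for intuition only: it is not the library's actual implementation, and the `bpe_tokenize` helper and toy merge list are made up for this example.

```python
def bpe_tokenize(word, merges):
    """Split a word into characters, then greedily apply BPE merges.

    merges: ordered list of symbol pairs; earlier pairs have higher priority.
    """
    symbols = list(word)
    # Rank of each pair: lower rank = merged first.
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) rank.
        candidates = [(ranks.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge left
        # Replace the pair with its merged symbol.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_tokenize("lower", merges))  # ['low', 'er']
```

Note how "lower" first becomes `lo`, then `low`, then `er` is merged, mirroring the priority order of the merge list.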
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
"./path/to/dataset/1.txt",
"./path/to/dataset/2.txt",
"./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
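Training works the other way around: count adjacent symbol pairs across the corpus, merge the most frequent pair everywhere, record it, and repeat until the desired vocabulary size is reached. Below is a toy pure-Python sketch of that loop, for intuition only; the real trainer is implemented in Rust, and the `train_bpe` helper and toy word counts here are invented for illustration.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """word_freqs: mapping word -> frequency. Returns the learned merge list."""
    # Start with every word split into characters.
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append((a, b))
        # Apply the new merge to every word in the corpus.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

print(train_bpe({"lot": 5, "low": 3}, num_merges=2))  # [('l', 'o'), ('lo', 't')]
```

In this toy corpus, `("l", "o")` is the most frequent pair (8 occurrences), so it is merged first; `("lo", "t")` then wins over `("lo", "w")` (5 vs. 3). `min_frequency` in the real trainer simply stops merges whose pair count falls below the threshold.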
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
tokenizers-0.0.3.tar.gz (21.5 kB)
Built Distributions
Hashes for tokenizers-0.0.3-cp38-cp38-win_amd64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | cd0ec2faf6f847d1c2dddc3110a587812e4bc4dd37fb0ad71749b4223b68886b |
| MD5 | 9e52bbac5d486e099989bffa92c68764 |
| BLAKE2b-256 | 549b1a5a7ff200440bd5334734b9c0daf746ff54bce3b5adabe52b867f0ad218 |

Hashes for tokenizers-0.0.3-cp38-cp38-manylinux1_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | e74d504603f76fb8a4e4866b7a93ecdc6fe6242fa78fb819d9f02a6f3ba72c8f |
| MD5 | 8584a2bc3c399adffb438c75b8719fad |
| BLAKE2b-256 | facb6dab2a81b129beac78f51bdc9dbf3917a027065bc7aa8ef5070b26d7dd43 |

Hashes for tokenizers-0.0.3-cp38-cp38-macosx_10_13_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ea392806ce1c4358b9e2daa6bda1632c997583d014c4b9ddaf2cd7e2de3191fc |
| MD5 | 03389707f3c85c5d10d630d487b01890 |
| BLAKE2b-256 | 349acc9ec164571a5b67965df23c079cbbcd51971be0791a94c0f259907bb7c7 |

Hashes for tokenizers-0.0.3-cp37-cp37m-win_amd64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7fd9e4cf35bc131106c1435ec466a7a0e03dbc8a2e6f8307c39c2eac35ceec81 |
| MD5 | 78ca6a453d65a5d20c2edbfbbd2cbbee |
| BLAKE2b-256 | 7854d5ffca295ae65da78e304fbf7f74890c66f381d7997e2ea4879afb89a812 |

Hashes for tokenizers-0.0.3-cp37-cp37m-manylinux1_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7dc257e0369440510954983d72fc161395da62601522ae03fb4a5d915f68a6b6 |
| MD5 | 38a25841d2f776c9a91c91212dac8093 |
| BLAKE2b-256 | 4d0a6c0710260e8cc09892eac161bbeeed53b2e64b9fa18d9e4492ecaa318977 |

Hashes for tokenizers-0.0.3-cp37-cp37m-macosx_10_13_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 93dc62cc1794051363798ae999b76f47510a3cb21acaac05d7a5a9da0b1e2152 |
| MD5 | bfdf6c9965c54b07b151cfc7cc2b3d30 |
| BLAKE2b-256 | 6555c6440067d81b01faf8526eb54672b091ad60493045104724d8cf9a2631dd |

Hashes for tokenizers-0.0.3-cp36-cp36m-win_amd64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 29d935e32c38c0a56e906e5f58458ac1b72e63e6276d6a5741a189409e5f3f0d |
| MD5 | 575c5b8f2bb29d02871926bd057e313c |
| BLAKE2b-256 | abe5e3aeb19175290cd23c79db1585d70153d0b83fd18ee4b581dfc83451af59 |

Hashes for tokenizers-0.0.3-cp36-cp36m-manylinux1_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | be850b7bc57fb3276cfd65467f1de024edfee789be0c29ab5e08a8cb323bae41 |
| MD5 | 43098ecbbbf3929c940459cd4d12786a |
| BLAKE2b-256 | e97df9ba934026b0764400bf79103addd02d05aab1dc9adfcc6bd7be8a575f38 |

Hashes for tokenizers-0.0.3-cp36-cp36m-macosx_10_13_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | a0bf5a21a21cdb8d0c1386716ad872cc1445987f6abcd1de8e53427c9b25bd59 |
| MD5 | 134bb1134c2ad7672c762956d6bf37ff |
| BLAKE2b-256 | 7b9238790ae7ca2d17d22cc14efb647c304a08a4e3fd278e8535440471c8c5f9 |

Hashes for tokenizers-0.0.3-cp35-cp35m-win_amd64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ed6fbcd386ee38fa77f6ebc2751bbc78846936b98b876f1343a921863c8d48dd |
| MD5 | 7b16828311826669ca55a15ddb51b6d8 |
| BLAKE2b-256 | 593eb5739837d7437bc3f8d0b751fff20d072630db03ea7a195fa5e321193ecb |

Hashes for tokenizers-0.0.3-cp35-cp35m-manylinux1_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ea810773ee1d8ef968b5b3cb0fd7e4cd5f441fb7dd343f26e6366c0f2062d03 |
| MD5 | 1d216bfbf127961179e616b3de9652ba |
| BLAKE2b-256 | b647f2eb2a19c737346bec86890ed860d91f43becf3c19d5073c48e671ae47cd |

Hashes for tokenizers-0.0.3-cp35-cp35m-macosx_10_13_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 35c6b5e942f67f18cacc5c802c32a853957ea67143b925066b9b618cff413625 |
| MD5 | 77ffff298d4f8995813e29aca08046bc |
| BLAKE2b-256 | b731994e0351f14594638eb7e6ba100f72c34bef22d998affae61023c97dbbcf |