Fast and Customizable Tokenizers
Tokenizers
A fast, easy-to-use implementation of today's most widely used tokenizers.
This API is still being stabilized. Breaking changes may land frequently over the coming days or weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From source:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, build and install the Python bindings:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
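The ByteLevel pre-tokenizer and decoder used above operate on UTF-8 bytes rather than on characters. As a rough illustration of the idea, here is a pure-Python sketch of the byte-to-character mapping popularized by GPT-2, which ByteLevel is assumed to follow. The helper names (`bytes_to_unicode`, `byte_level_encode`, `byte_level_decode`) are hypothetical and not part of `tokenizers`, and the sketch omits the word-splitting step that a real pre-tokenizer also performs.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a distinct printable character."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    mapping = {b: chr(b) for b in keep}
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_encode(text, add_prefix_space=True):
    """Turn text into a string of byte-level symbols (hypothetical helper)."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_decode(symbols):
    """Invert byte_level_encode (the prefix space, if added, is kept)."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


symbols = byte_level_encode("can you?")
print(symbols)                     # each space becomes the character U+0120
print(byte_level_decode(symbols))  # round-trips to " can you?"
```

Because every byte has a symbol, any input string can be represented without an unknown token, which is the main reason to pair a byte-level pre-tokenizer with a matching decoder.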
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
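To give some intuition for what a BPE trainer does with a setting like `min_frequency`, here is a toy pure-Python sketch of BPE merge learning. It is not the library's implementation: `learn_bpe_merges` is a hypothetical helper, it works on a list of words instead of files, and its `num_merges` argument is only a stand-in for the budget that `vocab_size` would impose.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges, min_frequency=2):
    """Learn up to `num_merges` BPE merges from an iterable of words."""
    # Each distinct word starts as a tuple of characters, weighted by its count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # nothing left that is frequent enough to merge
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += count
        corpus = merged
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=10)
print(merges)  # [('o', 'w'), ('l', 'ow'), ('low', 'e')]
```

In this toy corpus, learning stops after three merges because every remaining pair occurs only once, which is below `min_frequency=2`.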