Fast and Customizable Tokenizers
Project description
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High-level design: master
This API is currently being stabilized. We may introduce breaking changes frequently in the coming days/weeks, so use at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed with the right toolchain selected, you can do the following.
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
"I can feel the magic, can you?",
"The quick brown fox jumps over the lazy dog"
])
print(encoded)
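For intuition about what the BPE model is doing with the vocab and merges files, here is a minimal pure-Python sketch of BPE segmentation (a conceptual illustration only, not the library's actual implementation, which is written in Rust): each merge learned earlier has higher priority, and the best-ranked adjacent pair is merged repeatedly until no learned merge applies.

```python
def bpe_encode(word, merges):
    """Greedily apply BPE merges to a word, best-ranked merge first.

    `merges` is an ordered list of symbol pairs, as in merges.txt:
    pairs learned earlier take precedence.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies anymore
        # Merge every occurrence of the best pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # ['low', 'er']
```

The real model additionally maps each resulting symbol to an id via vocab.json, and the ByteLevel pre-tokenizer first rewrites the input at the byte level so that any string can be segmented.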
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
"./path/to/dataset/1.txt",
"./path/to/dataset/2.txt",
"./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
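Conceptually, the trainer learns the merges by counting adjacent symbol pairs across the corpus, merging the most frequent pair, and repeating. The following pure-Python sketch of that classic BPE learning loop is for illustration only; the actual BpeTrainer is a fast Rust implementation with controls such as vocab_size and min_frequency.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict (conceptual sketch)."""
    # Represent each word as a tuple of symbols, keeping its corpus count
    corpus = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for syms, count in corpus.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge everywhere in the corpus
        new_corpus = {}
        for syms, count in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + count
        corpus = new_corpus
    return merges

merges = learn_bpe_merges({"low": 5, "lower": 2, "lowest": 3}, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

A `min_frequency` control, as in the BpeTrainer above, would simply stop merging once the most frequent pair falls below the threshold.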
Project details
Download files
Download the file for your platform.
Source Distribution
tokenizers-0.0.8.tar.gz (32.9 kB)
Built Distributions
Hashes for tokenizers-0.0.8-cp38-cp38-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | ebfa81384b525415bbed3a119cc502ee73a7d65662853dfcdfe557a85d4e9615
MD5 | aac4c3edf97d42a9e7fbe47d0610aa3a
BLAKE2b-256 | 472022b26d47a33fbeb892a4945036391ae6e93a7e221d63a10ce21c8d84c978

Hashes for tokenizers-0.0.8-cp38-cp38-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | b800444868ff318f741507aaeff416abb8495e29609778ad2aedeff39e88ae1b
MD5 | a57a8fd97f0928c24a78382635c8947a
BLAKE2b-256 | 6c134b8f1dba204cd820b894037425efe7b4e8dcda384769e8820daecb1dec0b

Hashes for tokenizers-0.0.8-cp38-cp38-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | edbe284f8908beb07f28e7900be228b9762e88bcee5dc7b273be2fa2da757bd0
MD5 | 7914c7b46287ddf338ceb51af98d78fa
BLAKE2b-256 | 89d8ec3625737c382677bbd56fe503709651d98f5ce8fc145bf4e29b3c48d5ec

Hashes for tokenizers-0.0.8-cp37-cp37m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 7289f3abb939de67e70d7c8f96e175b6a780ddfbc06077957f060cfbc85aa735
MD5 | 0f327ed9abbf83f386421e6bc6dd2606
BLAKE2b-256 | 2726c488bf0e9a9aa3aca3e56a850c0284a6f54f1503a7ff0feb826b4d569796

Hashes for tokenizers-0.0.8-cp37-cp37m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 3130479965533d1e3f45dfa75455eeff7b8fc31cf66bceed05ffbfa79b0f0000
MD5 | b6fcd06e75b7e353ad803d06991d4734
BLAKE2b-256 | 79f3e6d275ad5b7ef5065b5bfd56cd82f0ffe2a7d55f77afb34bcb2c3cd7245a

Hashes for tokenizers-0.0.8-cp37-cp37m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 8ed9fbbe0d78c28b1ac0cac7f11f82c745586582154bba0cc97aeaab0bff54cf
MD5 | d26ca9679d39365ac337390082e4b5e2
BLAKE2b-256 | 81c668dbb5a9ca8232a43e402ede5b5e692b6d8055a062373b1c6d6504cf1070

Hashes for tokenizers-0.0.8-cp36-cp36m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 7aff1e8cbb8b12f838a3b64daa0c9d3b9f2349f865c0a85532a9eb94c7247e2a
MD5 | 17bc817686b39ef7f65194fcb9aee6d4
BLAKE2b-256 | c6d4f32e33727bdc699a960fec596c9ddb38deb4ef92e3c5a0c67b231bbd7b3f

Hashes for tokenizers-0.0.8-cp36-cp36m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | bc929724d940ca6e5e7e003f13c02154dc604ab6285a6bd08cc02f76e09e1684
MD5 | d0c82fafb1c90fd66408ec37c9ca623d
BLAKE2b-256 | a74497380ce7b1f328aa6c4bc945cc16a8abf528784eb2bb15024880f4aa8749

Hashes for tokenizers-0.0.8-cp36-cp36m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | aa875b3a66ea28c5ea3695fcaf89bd9ec075768b3fefce9e612292a91ec18f26
MD5 | 3e7ae6d9bf2226f918beb5192e16d77a
BLAKE2b-256 | df28f38a9bc5441b96edb05c88f99a5e233e19497ddc5c0cbbf0c23a62cc68fb

Hashes for tokenizers-0.0.8-cp35-cp35m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 24b3ced23cbf63a74423312260114c71f690040815c8a93750c670245f6c68b7
MD5 | d7786e98b5faccbad712e8a733ea28b5
BLAKE2b-256 | 01895e0d029315a530cbeea52229a3a63e3d9f6459723dfb27ab5e00d771b268

Hashes for tokenizers-0.0.8-cp35-cp35m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | cefbe50c0219a9829bf245cd3b53488a10004d044863c84f099e6760e64da515
MD5 | 7c89200065c4b4fa91b275e00a7bd1d9
BLAKE2b-256 | 17f1aca99f99370894cf21c16346571c9f2b5fc31e8a0cfc547a1d3616ad5fea

Hashes for tokenizers-0.0.8-cp35-cp35m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | b49731bd169e893912bdbd5e887eeb5dc660487edb4940f7b1635d6149875611
MD5 | 706fb6b89e8226ca8028c087dd1e16e5
BLAKE2b-256 | bcc24f5fc753cd789425ac2e5abdd947c6e0676231b9c50660d6b4470ae164ed