Fast and Customizable Tokenizers
Project description
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High-level design: master
This API is currently in the process of being stabilized. We may introduce breaking changes frequently in the coming days/weeks, so use at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, you can do the following:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
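The ByteLevel pre-tokenizer used above works by mapping every one of the 256 possible byte values to a printable unicode character (a leading space becomes the visible marker "Ġ", for example), so the model never encounters an unknown character. Here is a minimal pure-Python sketch of that byte-to-character mapping, in the style popularized by GPT-2; the function name and details here are illustrative only, not part of the `tokenizers` API:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character.

    Printable ASCII and Latin-1 bytes map to themselves; control and
    whitespace bytes are shifted up into the U+0100+ range so that every
    byte has a visible, unambiguous representation.
    """
    # Bytes that are already printable keep their own character.
    byte_values = (list(range(ord("!"), ord("~") + 1))
                   + list(range(ord("\u00a1"), ord("\u00ac") + 1))
                   + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    char_codes = byte_values[:]
    n = 0
    for b in range(256):
        if b not in byte_values:
            byte_values.append(b)
            char_codes.append(256 + n)  # shift into a printable range
            n += 1
    return {b: chr(c) for b, c in zip(byte_values, char_codes)}

mapping = bytes_to_unicode()
# The leading space byte (0x20) becomes the visible marker "Ġ"
visible = "".join(mapping[b] for b in " magic".encode("utf-8"))
print(visible)  # Ġmagic
```

This is why `add_prefix_space=True` matters: with the space encoded as part of the token, "magic" at the start of a sentence and " magic" mid-sentence tokenize consistently.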
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
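Conceptually, what `BpeTrainer` does is repeat a simple loop: count every adjacent pair of symbols across the corpus, merge the most frequent pair into a new symbol, and keep going until the vocabulary budget (`vocab_size`) is reached. The following is a toy pure-Python sketch of that loop over a tiny hand-made word-frequency table — illustrative only, not the library's actual (Rust) implementation:

```python
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    a, b = pair
    out = {}
    for word, freq in vocab.items():
        symbols = word.split()
        new, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                new.append(a + b)
                i += 2
            else:
                new.append(symbols[i])
                i += 1
        out[" ".join(new)] = freq
    return out

# Words pre-split into characters, with corpus frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
merges = []
for _ in range(3):  # a real trainer runs until vocab_size is reached
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)
```

The `min_frequency=2` option in the real trainer simply stops a pair from being merged unless it occurs at least that many times.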
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
tokenizers-0.0.13.tar.gz (55.6 kB)
Built Distributions

Hashes for tokenizers-0.0.13-cp38-cp38-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 3e63d26c519ca9ea34809d1264ae75b34838886d78b2e8a68aa7f205b48a4832
MD5 | 172264ece869b56d6c44ab66aa7ee2bd
BLAKE2b-256 | 075f5b2ae7f059b45bf6361eb98bee80d5ec91ef228e9085f659b8af40b0af6d

Hashes for tokenizers-0.0.13-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 5224f7b2b6a7334ff8ffae7eb4f1e7c7e258712839e836a057e420397ccae76d
MD5 | 04639e814647284161789324866f4349
BLAKE2b-256 | 7adf3611d00388037608bdc725ca19642df4b5ceecdc98be600bfd06048645c0

Hashes for tokenizers-0.0.13-cp38-cp38-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3e81d6ab8d3c7d4b5a52a9516cc463f2a8d75d064d971d1fc7a6aae624e7417f
MD5 | 029c9a6b1cbec9468c0f5383980f61c8
BLAKE2b-256 | c6ac2d69d67a4c1f07ff48c9f5d4f718559bea8d364d61ead627cf3591bec941

Hashes for tokenizers-0.0.13-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 399bd78d99b30315e4e3b48f75bc255a3d288c5e15646a60846312a342b7a368
MD5 | 6cd36a1f2ee178ab9abff6c504ea60d6
BLAKE2b-256 | b2b87694ab18219214c7a5cd49babfba470beb2d9f9eb158fbf682ad5c6ccf47

Hashes for tokenizers-0.0.13-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2fe9b59c90b8b877e73a421e17b524db8c76ef207ded34a50763c44790cf6a29
MD5 | ae522127078ceb6e01cd69e46ed030bd
BLAKE2b-256 | a70957508f0aca9668bf31fd78291a97a0624d59a8aa1f9d8c9f09c4f9201cd8

Hashes for tokenizers-0.0.13-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 657e7528fc2f1518fa8623af312b98b5ece9f40e885066844865914624c90e99
MD5 | 8e19d33335bdacdf6af3188f7d33ea99
BLAKE2b-256 | c1d4f1cd1f55a5c66f7c8b050d636dcb571b3bd150f97384cd82a24f245d91ad

Hashes for tokenizers-0.0.13-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 7726fdcc50bacad246ee78290111aaa644aa657935861735443a19785e46467f
MD5 | e8086cd1af15924954d3e5f6cd1d8bb2
BLAKE2b-256 | 0b99fcd7a30ceafa55f2dfd9f4c9251ce542dab97843517f52e8fdb006c858ca

Hashes for tokenizers-0.0.13-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | b1e4a82c4383e86206b54feb92f8ff6fa6099b948be34193e5d41edac576131d
MD5 | 2aeb4fdc6d2d0d7e16bab43c5dfc4ab9
BLAKE2b-256 | 3ff0bfcb982b7dfa35f38c527cead4fb25c9ef608ca75cebfb7402631004bb03

Hashes for tokenizers-0.0.13-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f2770a1a7bc14bfcad637bb0b0995094b66a3e28e2c89fa0cd80b74806003729
MD5 | 04c869a144880b2161765d9af3ea20c6
BLAKE2b-256 | 750837487fd09519ea1801f7f73810186f45adb741e233b00437e52987caf8ca

Hashes for tokenizers-0.0.13-cp35-cp35m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | ad7f865e3ba96278a6fea3bc0fa866986e006cea3ad72815d33db8c9ac90d808
MD5 | e531bd2fdbef3b88865006f7db1afa63
BLAKE2b-256 | d18e2c50c8f5c9764379a25680c529ea81fc11f74b9d5600d0f6957e23958850

Hashes for tokenizers-0.0.13-cp35-cp35m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 40fcdd16bb1ff188fc4af579cac05fbdb504e807008bdc48ee2061e2b666e377
MD5 | 7d7967392aa0c35e48c3425acfc9e8e1
BLAKE2b-256 | d42713b564de37f8f891e4821567842e1b60002cad78c3a9c252ce382525388c

Hashes for tokenizers-0.0.13-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | b8bbc871f90968ef5b18f9e6cb7fd74c692c14b623d5167ac91bebe99da739dd
MD5 | 8ec16806aa490ef4ffcd9f1d32ff2416
BLAKE2b-256 | ffa4763f5871bdd7ab404836cc341885acad814ccd750be43fb94bcec4f68348