Fast and Customizable Tokenizers
Project description
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High Level design: master
This API is still being stabilized. Breaking changes may land frequently over the coming days or weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, you can build and install the Python bindings:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
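To give a feel for what a BPE model built from a vocabulary and a merges file does at encode time, here is a minimal pure-Python sketch. It is an illustration only, not this library's implementation: the `apply_merges` helper and the toy merge list are hypothetical, and byte-level pre-tokenization is left out.

```python
from typing import List, Tuple


def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split `word` into characters, then apply learned merges by priority."""
    # A lower rank means the merge was learned earlier and is applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is a known merge
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge list (hypothetical), in the order the merges were learned.
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
print(apply_merges("magic", toy_merges))
print(apply_merges("mag", toy_merges))
```

A real merges file plays the role of `toy_merges` here, and the vocabulary file then maps each resulting symbol to an integer id.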
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
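The `vocab_size` and `min_frequency` arguments above are easier to reason about with a toy trainer in hand. The following pure-Python sketch shows one common way BPE training uses such limits; it is an assumption-laden illustration, not the code behind `BpeTrainer`, and the `train_toy_bpe` function and its tie-breaking rule are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_toy_bpe(words: List[str], vocab_size: int, min_frequency: int):
    """Learn merges until the vocabulary reaches `vocab_size` or no
    adjacent pair occurs at least `min_frequency` times."""
    # Each distinct word is stored as a tuple of symbols with its count.
    corpus: Dict[Tuple[str, ...], int] = {
        tuple(w): c for w, c in Counter(words).items()
    }
    vocab = sorted({ch for w in words for ch in w})
    merges: List[Tuple[str, str]] = []
    while len(vocab) < vocab_size:
        pair_counts: Counter = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair first; ties broken lexicographically (assumption).
        best, freq = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab.append(token)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_corpus: Dict[Tuple[str, ...], int] = {}
        for symbols, count in corpus.items():
            out = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(token)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return vocab, merges


vocab, merges = train_toy_bpe(["aa", "aa", "ab"], vocab_size=10, min_frequency=2)
print(vocab, merges)
```

In this sketch, raising `min_frequency` stops training earlier on small corpora, while `vocab_size` caps the number of symbols regardless of how frequent the remaining pairs are.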
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file tokenizers-0.0.11.tar.gz.
File metadata
- Download URL: tokenizers-0.0.11.tar.gz
- Upload date:
- Size: 30.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4b7c42b644a1c5705a59b14c53c84b50b8f0b9c0f5f952a8a91a350403e7615f
MD5 | adbbf2f9b95714d90b883ceb819cfe95
BLAKE2b-256 | 6c510eb780144128a7e7e108b507077b3a8099c908a8f5c1942db07cd8c312d1
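The digests listed for each file can be used to check a download before installing it. Below is a minimal sketch using Python's standard `hashlib`; the helper names and the local file path are hypothetical, and the expected value is the SHA256 digest listed for the source distribution.

```python
import hashlib
import os


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Hypothetical local path; adjust to wherever the archive was downloaded.
archive = "tokenizers-0.0.11.tar.gz"
expected = "4b7c42b644a1c5705a59b14c53c84b50b8f0b9c0f5f952a8a91a350403e7615f"
if os.path.exists(archive):
    print("digest matches:", matches_expected(archive, expected))
else:
    print("archive not found, nothing to verify")
```

The same helper works for any of the wheels by swapping in the file name and its SHA256 value. MD5 is listed too, but SHA256 is the safer choice for integrity checks.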
File details
Details for the file tokenizers-0.0.11-cp38-cp38-win_amd64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 797.0 kB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 503418d5195ae1a483ced0257a0d2f4583456aa49bdfe0014c8605babf244ac5
MD5 | 089be5cf90db2cebf1f8452b17dac566
BLAKE2b-256 | 5d46b3a08e93b905bca11cb83a1e9bdc2b76c470125b168c546f753bf3603e14
File details
Details for the file tokenizers-0.0.11-cp38-cp38-manylinux1_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp38-cp38-manylinux1_x86_64.whl
- Upload date:
- Size: 6.3 MB
- Tags: CPython 3.8
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5ba2c6eaac2e8e0a2d839c0420d16707496b5e93b1454029d19487c5dd8c9b62
MD5 | 348f59558d1622ad06ce253f51ed122b
BLAKE2b-256 | f0788425a69ada57481d10e0f8ba293499b0bfa4a508d4cc29d02de9056991c1
File details
Details for the file tokenizers-0.0.11-cp38-cp38-macosx_10_13_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp38-cp38-macosx_10_13_x86_64.whl
- Upload date:
- Size: 869.5 kB
- Tags: CPython 3.8, macOS 10.13+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7de28e0bebd0904b990560a1f14c3c5600da29be287e544bdf19e6970ea11d54
MD5 | 91475ae989af64d61f2fb7d76b6ee281
BLAKE2b-256 | a0121ab2c816115df5f19ef7cd716e39475daf1f2d8134e0f221fa2fac60903d
File details
Details for the file tokenizers-0.0.11-cp37-cp37m-win_amd64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp37-cp37m-win_amd64.whl
- Upload date:
- Size: 796.5 kB
- Tags: CPython 3.7m, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3ebe7f0bff9e30ab15dec4846c54c9085e02e47711eb7253d36a6777eadc2948
MD5 | c275538d4c9583894b36000088f656af
BLAKE2b-256 | 6ad3af5629cf53fac268dadcc69fd4db3096eda17e617ecfa9011787820dd59f
File details
Details for the file tokenizers-0.0.11-cp37-cp37m-manylinux1_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp37-cp37m-manylinux1_x86_64.whl
- Upload date:
- Size: 4.7 MB
- Tags: CPython 3.7m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 08e08027564194e16aa647d180837d292b2c9c5ef772fed15badcc88e2474a8f
MD5 | 91a40a17a6303582c899acdd0bc14c6d
BLAKE2b-256 | 5fcb3e8902d528538972873d0e9e4e47a31d1849a98e057009e9d383637c96fb
File details
Details for the file tokenizers-0.0.11-cp37-cp37m-macosx_10_13_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp37-cp37m-macosx_10_13_x86_64.whl
- Upload date:
- Size: 869.5 kB
- Tags: CPython 3.7m, macOS 10.13+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | ce75c75430a3dfc33a10c90c1607d44b172c6d2ea19d586692b6cc9ba6ec5e14
MD5 | c4e2c9c37169c64c2b17c4f9a53abe75
BLAKE2b-256 | cef3cafb6b6b814d5b044c5dbb9bf3fd189367fdf0cd44c5aa49a298dfe1aaaf
File details
Details for the file tokenizers-0.0.11-cp36-cp36m-win_amd64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp36-cp36m-win_amd64.whl
- Upload date:
- Size: 796.8 kB
- Tags: CPython 3.6m, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 82e8c3b13a66410358753b7e48776749935851cdb49a3d0c139a046178ec4f49
MD5 | 08738f7080da019f4fe487ae5f61b72c
BLAKE2b-256 | 24d8deab989b6ca8bc12344515e6dd14206d4e9d17d08d48399817c41e00fd16
File details
Details for the file tokenizers-0.0.11-cp36-cp36m-manylinux1_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp36-cp36m-manylinux1_x86_64.whl
- Upload date:
- Size: 3.1 MB
- Tags: CPython 3.6m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | bb44fa1b268d1bbdf2bb14cd82da6ffb93d19638157c77f9e17e246928f0233f
MD5 | e53facf2be7629611c5808e8ae2895df
BLAKE2b-256 | 5e367af38d572c935f8e0462ec7b4f7a46d73a2b3b1a938f50a5e8132d5b2dc5
File details
Details for the file tokenizers-0.0.11-cp36-cp36m-macosx_10_13_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp36-cp36m-macosx_10_13_x86_64.whl
- Upload date:
- Size: 869.6 kB
- Tags: CPython 3.6m, macOS 10.13+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | a66ff87c32a221a126904d7ec972e7c8e0033486b24f8777c0f056aedbc09011
MD5 | 4bb41a16e332a9573ed1ce7631104417
BLAKE2b-256 | bda375ee3ee28ead743d05fe854fce0e2549ccdbbf01b6453e4a1d7ef6a32aa4
File details
Details for the file tokenizers-0.0.11-cp35-cp35m-win_amd64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp35-cp35m-win_amd64.whl
- Upload date:
- Size: 796.7 kB
- Tags: CPython 3.5m, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | a7f5e43674dd5b012ad29b79a32f0652ecfff3a3ed1c04f9073038c4bf63829d
MD5 | a44c12cf99986a43ebbea13d91c1cf89
BLAKE2b-256 | 55f68354c1e3037d6a2ea6ec57a471e77a226c9b9b4a6d05373806d9079b3aa3
File details
Details for the file tokenizers-0.0.11-cp35-cp35m-manylinux1_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp35-cp35m-manylinux1_x86_64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.5m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | a4d1ef6ee9221e7f9c1a4c122a15e93f0961977aaae2813b7b405c778728dcee
MD5 | 66128f74d4dd2b5174fa64b7fabfc0d8
BLAKE2b-256 | cd9c460a5476a8bbffa08a1617bc834b456a3559c0b169ae46559b6c5f0b8399
File details
Details for the file tokenizers-0.0.11-cp35-cp35m-macosx_10_13_x86_64.whl.
File metadata
- Download URL: tokenizers-0.0.11-cp35-cp35m-macosx_10_13_x86_64.whl
- Upload date:
- Size: 869.5 kB
- Tags: CPython 3.5m, macOS 10.13+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1385deb90ec76cbee59b50298c8d2dc5909cda080a706d263e4f81c8474ba53d
MD5 | 18aacb7156747905c3c02fd38a443825
BLAKE2b-256 | 23055f11f8b4874d5649af4f740af72f29cfff4c97c3f67fecc74f96869e723c