Fast text tokenizer for the CLIP neural network
Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network
Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for OpenAI's CLIP model. It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.
In addition to being usable as a Rust crate, it includes Python bindings built with PyO3, so it can be used as a native Python module.
For the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
Using the library
Rust
[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }
Python (>= 3.9)
pip install instant-clip-tokenizer
Using the library requires numpy >= 1.16.0 installed in your Python environment (e.g., via pip install numpy).
Examples
use instant_clip_tokenizer::{Token, Tokenizer};

fn main() {
    let tokenizer = Tokenizer::new();
    // `encode` writes the tokens for the input text into the given vector.
    let mut tokens = Vec::new();
    tokenizer.encode("A person riding a motorcycle", &mut tokens);
    // Convert each `Token` to its numeric id.
    let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
    println!("{:?}", tokens);
    // -> [320, 2533, 6765, 320, 10297]
}
import instant_clip_tokenizer
tokenizer = instant_clip_tokenizer.Tokenizer()
tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)
# -> [320, 2533, 6765, 320, 10297]
batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)
# -> [[49406 320 2533 6765 49407]
# [49406 1883 997 49407 0]]
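The batch output above suggests a fixed row layout: a start marker, the token ids (truncated if necessary), an end marker, then zero padding up to context_length. The following is a minimal pure-Python sketch of that layout, inferred only from the example output and not taken from the library's source. The helper name pad_to_context is hypothetical, and the marker values and the ids for "Hi there" are read off the printed batch.

```python
# Sketch of the row layout suggested by the tokenize_batch example above.
# The constants are read off that example's output; the helper name
# pad_to_context is hypothetical and not part of instant_clip_tokenizer.
START_OF_TEXT = 49406
END_OF_TEXT = 49407
PAD = 0


def pad_to_context(token_ids, context_length):
    """Wrap token ids in start/end markers, then truncate or zero-pad."""
    if context_length < 2:
        raise ValueError("context_length must leave room for both markers")
    # Keep at most context_length - 2 tokens so the markers always fit.
    body = list(token_ids)[: context_length - 2]
    row = [START_OF_TEXT, *body, END_OF_TEXT]
    # Fill the remainder of the row with padding.
    return row + [PAD] * (context_length - len(row))


rows = [
    pad_to_context([320, 2533, 6765, 320, 10297], 5),  # "A person riding a motorcycle"
    pad_to_context([1883, 997], 5),                    # "Hi there"
]
print(rows)
# -> [[49406, 320, 2533, 6765, 49407], [49406, 1883, 997, 49407, 0]]
```

With context_length=5 only three tokens of the first sentence survive, which is why choosing a context length that matches the target model matters.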
Testing
To run the tests, use:
cargo test --all-features
You can also test the Python bindings with:
make test-python
Acknowledgements
The vocabulary file and original Python tokenizer code included in this repository are copyright (c) 2021 OpenAI (MIT License).