
Fast text tokenizer for the CLIP neural network

Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network

Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for OpenAI's CLIP model. It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.

Besides being usable as a Rust crate, it includes Python bindings built with PyO3, so it can also be used as a native Python module.

In the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
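To make the idea of such a comparison concrete, here is a minimal timing sketch using only the Python standard library. It is not the benchmark code from this repository: `time_encode` and `speedup` are hypothetical helper names, and the two stand-in encoders exist only so the sketch runs without any tokenizer installed. Pass the real `encode` callables to compare actual tokenizers.

```python
import time
from typing import Callable, Iterable


def time_encode(encode: Callable[[str], object], texts: Iterable[str], repeats: int = 5) -> float:
    """Return the best wall-clock time (in seconds) for encoding every text once."""
    texts = list(texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Stand-in encoders (assumptions for illustration, not real tokenizers).
    sample = ["A person riding a motorcycle", "Hi there"] * 1000
    slow = time_encode(lambda s: [ord(c) for c in s.lower()], sample)
    fast = time_encode(str.split, sample)
    print(f"speedup: {speedup(slow, fast):.1f}x")
```

Taking the best of several repeats reduces noise from other processes; results will still vary by machine and input.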

Using the library

Rust

[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }

Python (>= 3.9)

pip install instant-clip-tokenizer

The Python module requires numpy >= 1.16.0 to be installed in your Python environment (e.g., via pip install numpy).
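If you are unsure whether an environment satisfies the numpy requirement, a check along the following lines may help. This is a sketch using only the standard library; `parse_release`, `installed_version`, and `meets_minimum` are hypothetical helper names, and the version parsing is deliberately simplified (it looks only at the leading numeric release segment).

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release, e.g. '1.26.4' -> (1, 26, 4)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            # A suffix such as 'rc1' ends the numeric release segment.
            break
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed: Optional[str], minimum: str) -> bool:
    """True if the installed release is at least the minimum release."""
    if installed is None:
        return False
    have, need = parse_release(installed), parse_release(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '1.16' compares equal to '1.16.0'.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    print(meets_minimum(installed_version("numpy"), "1.16.0"))
```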

Examples

Rust

use instant_clip_tokenizer::{Token, Tokenizer};

fn main() {
    let tokenizer = Tokenizer::new();

    // `encode` writes the tokens for the input text into the provided vector.
    let mut tokens = Vec::new();
    tokenizer.encode("A person riding a motorcycle", &mut tokens);

    // Convert each `Token` to its numeric id for printing.
    let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
    println!("{:?}", tokens);
    // -> [320, 2533, 6765, 320, 10297]
}

Python

import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)

# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)

# -> [[49406   320  2533  6765 49407]
#     [49406  1883   997 49407     0]]
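The batch output above suggests a fixed row layout: a start token (49406), the text tokens, an end token (49407), then zero padding up to `context_length`, with overlong inputs cut so that the end token still fits. The sketch below reproduces that layout in plain Python. It is inferred from the example output only and is not the library's implementation; `layout_batch` is a hypothetical name, and it returns nested lists rather than a numpy array.

```python
from typing import List, Sequence

START_TOKEN = 49406  # first value of every row in the example output
END_TOKEN = 49407    # value that closes the text in every row
PAD_TOKEN = 0        # fills the remainder of rows shorter than context_length


def layout_batch(token_lists: Sequence[Sequence[int]], context_length: int) -> List[List[int]]:
    """Arrange already-encoded token ids into fixed-width rows."""
    if context_length < 2:
        # Assumption of this sketch: each row needs room for start and end tokens.
        raise ValueError("context_length must leave room for start and end tokens")
    rows = []
    for tokens in token_lists:
        # Truncate so that the start and end tokens always fit.
        kept = list(tokens[: context_length - 2])
        row = [START_TOKEN, *kept, END_TOKEN]
        row += [PAD_TOKEN] * (context_length - len(row))
        rows.append(row)
    return rows


if __name__ == "__main__":
    encoded = [
        [320, 2533, 6765, 320, 10297],  # "A person riding a motorcycle"
        [1883, 997],                    # "Hi there", ids taken from the batch output
    ]
    for row in layout_batch(encoded, context_length=5):
        print(row)
```

With `context_length=5`, the first input keeps only its first three tokens, which matches the first row of the batch output shown above.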

Testing

Run the test suite with:

cargo test --all-features

You can also test the Python bindings with:

make test-python

Acknowledgements

The vocabulary file and original Python tokenizer code included in this repository are copyright (c) 2021 OpenAI (MIT-License).

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

instant_clip_tokenizer-0.1.1-cp312-none-win_amd64.whl (2.1 MB): CPython 3.12, Windows x86-64
instant_clip_tokenizer-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.12, manylinux (glibc 2.17+) x86-64
instant_clip_tokenizer-0.1.1-cp312-cp312-macosx_11_0_arm64.whl (2.3 MB): CPython 3.12, macOS 11.0+ ARM64
instant_clip_tokenizer-0.1.1-cp312-cp312-macosx_10_12_x86_64.whl (2.3 MB): CPython 3.12, macOS 10.12+ x86-64
instant_clip_tokenizer-0.1.1-cp311-none-win_amd64.whl (2.1 MB): CPython 3.11, Windows x86-64
instant_clip_tokenizer-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.11, manylinux (glibc 2.17+) x86-64
instant_clip_tokenizer-0.1.1-cp311-cp311-macosx_11_0_arm64.whl (2.3 MB): CPython 3.11, macOS 11.0+ ARM64
instant_clip_tokenizer-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl (2.3 MB): CPython 3.11, macOS 10.12+ x86-64
instant_clip_tokenizer-0.1.1-cp311-cp311-macosx_10_7_x86_64.whl (2.3 MB): CPython 3.11, macOS 10.7+ x86-64
instant_clip_tokenizer-0.1.1-cp310-none-win_amd64.whl (2.1 MB): CPython 3.10, Windows x86-64
instant_clip_tokenizer-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.10, manylinux (glibc 2.17+) x86-64
instant_clip_tokenizer-0.1.1-cp310-cp310-macosx_11_0_arm64.whl (2.3 MB): CPython 3.10, macOS 11.0+ ARM64
instant_clip_tokenizer-0.1.1-cp310-cp310-macosx_10_7_x86_64.whl (2.3 MB): CPython 3.10, macOS 10.7+ x86-64
instant_clip_tokenizer-0.1.1-cp39-none-win_amd64.whl (2.1 MB): CPython 3.9, Windows x86-64
instant_clip_tokenizer-0.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.9, manylinux (glibc 2.17+) x86-64
instant_clip_tokenizer-0.1.1-cp39-cp39-macosx_11_0_arm64.whl (2.3 MB): CPython 3.9, macOS 11.0+ ARM64
instant_clip_tokenizer-0.1.1-cp39-cp39-macosx_10_12_x86_64.whl (2.3 MB): CPython 3.9, macOS 10.12+ x86-64
instant_clip_tokenizer-0.1.1-cp38-none-win_amd64.whl (2.1 MB): CPython 3.8, Windows x86-64
instant_clip_tokenizer-0.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.8, manylinux (glibc 2.17+) x86-64
instant_clip_tokenizer-0.1.1-cp37-none-win_amd64.whl (2.1 MB): CPython 3.7, Windows x86-64
instant_clip_tokenizer-0.1.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.7m, manylinux (glibc 2.17+) x86-64
