Fast text tokenizer for the CLIP neural network

Project description

Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network

Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for OpenAI's CLIP model. It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.

In addition to being usable as a Rust crate, the library includes Python bindings built with PyO3, so it can also be used as a native Python module.

For the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).

Using the library

Rust

[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }

Python (>= 3.9)

pip install instant-clip-tokenizer

Using the library requires numpy >= 1.16.0 to be installed in your Python environment (e.g., via pip install numpy).

Examples

use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();

let mut tokens = Vec::new();
tokenizer.encode("A person riding a motorcycle", &mut tokens);
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
println!("{:?}", tokens);

// -> [320, 2533, 6765, 320, 10297]
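
When the optional `ndarray` feature is enabled, the Rust crate can also tokenize whole batches of texts into a single padded 2-D array, mirroring the Python `tokenize_batch` example below. The sketch that follows is illustrative only: the `tokenize_batch` name and signature shown here are assumed from the crate documentation and may differ between versions.

use instant_clip_tokenizer::Tokenizer;

let tokenizer = Tokenizer::new();

// Requires building with `features = ["ndarray"]`. Each input text becomes one
// row, padded or truncated to the given context length, with the start-of-text
// and end-of-text markers added automatically. (Method assumed; see the crate docs.)
let batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], 5);
println!("{:?}", batch);

// Expected token ids per row (same values as in the Python example below):
// [[49406, 320, 2533, 6765, 49407],
//  [49406, 1883, 997, 49407, 0]]

The same operations are available from Python:
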
import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)

# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)

# -> [[49406   320  2533  6765 49407]
#     [49406  1883   997 49407     0]]

Testing

To run the tests, run:

cargo test --all-features

You can also test the Python bindings with:

make test-python

Acknowledgements

The vocabulary file and the original Python tokenizer code included in this repository are Copyright (c) 2021 OpenAI (MIT License).

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

instant_clip_tokenizer-0.1.1-cp312-none-win_amd64.whl (2.1 MB): CPython 3.12, Windows x86-64
instant_clip_tokenizer-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
instant_clip_tokenizer-0.1.1-cp312-cp312-macosx_11_0_arm64.whl (2.3 MB): CPython 3.12, macOS 11.0+, ARM64
instant_clip_tokenizer-0.1.1-cp312-cp312-macosx_10_12_x86_64.whl (2.3 MB): CPython 3.12, macOS 10.12+, x86-64
instant_clip_tokenizer-0.1.1-cp311-none-win_amd64.whl (2.1 MB): CPython 3.11, Windows x86-64
instant_clip_tokenizer-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
instant_clip_tokenizer-0.1.1-cp311-cp311-macosx_11_0_arm64.whl (2.3 MB): CPython 3.11, macOS 11.0+, ARM64
instant_clip_tokenizer-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl (2.3 MB): CPython 3.11, macOS 10.12+, x86-64
instant_clip_tokenizer-0.1.1-cp311-cp311-macosx_10_7_x86_64.whl (2.3 MB): CPython 3.11, macOS 10.7+, x86-64
instant_clip_tokenizer-0.1.1-cp310-none-win_amd64.whl (2.1 MB): CPython 3.10, Windows x86-64
instant_clip_tokenizer-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.10, manylinux (glibc 2.17+), x86-64
instant_clip_tokenizer-0.1.1-cp310-cp310-macosx_11_0_arm64.whl (2.3 MB): CPython 3.10, macOS 11.0+, ARM64
instant_clip_tokenizer-0.1.1-cp310-cp310-macosx_10_7_x86_64.whl (2.3 MB): CPython 3.10, macOS 10.7+, x86-64
instant_clip_tokenizer-0.1.1-cp39-none-win_amd64.whl (2.1 MB): CPython 3.9, Windows x86-64
instant_clip_tokenizer-0.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.9, manylinux (glibc 2.17+), x86-64
instant_clip_tokenizer-0.1.1-cp39-cp39-macosx_11_0_arm64.whl (2.3 MB): CPython 3.9, macOS 11.0+, ARM64
instant_clip_tokenizer-0.1.1-cp39-cp39-macosx_10_12_x86_64.whl (2.3 MB): CPython 3.9, macOS 10.12+, x86-64
instant_clip_tokenizer-0.1.1-cp38-none-win_amd64.whl (2.1 MB): CPython 3.8, Windows x86-64
instant_clip_tokenizer-0.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64
instant_clip_tokenizer-0.1.1-cp37-none-win_amd64.whl (2.1 MB): CPython 3.7, Windows x86-64
instant_clip_tokenizer-0.1.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB): CPython 3.7m, manylinux (glibc 2.17+), x86-64
