
A Numpy implementation of OpenAI's CLIP image and text encoding deep neural network

Project description

Clippie: A little inference-only implementation of CLIP

Clippie is a simple, CPU-based, pure Numpy implementation of (the forward pass of) OpenAI's CLIP image and text encoding deep neural network.

Usage

Get started like so:

>>> from clippie import encode_text, encode_image

>>> # Encode some text
>>> text_vectors = encode_text([
...     "a toddler looking inside a computer",
...     "people walking along a mountain ridge",
...     "a beautiful lake",
... ])
>>> text_vectors.shape  # (input_index, vector_dimension)
(3, 512)

>>> # Encode some images
>>> from PIL import Image
>>> image_vectors = encode_image([
...     Image.open("toddler.jpg"),
...     Image.open("mountain.jpg"),
...     Image.open("lake.jpg"),
... ])
>>> image_vectors.shape  # (input_index, vector_dimension)
(3, 512)

>>> # Compute cosine similarity
>>> import numpy as np
>>> text_vectors /= np.linalg.norm(text_vectors, axis=1, keepdims=True)
>>> image_vectors /= np.linalg.norm(image_vectors, axis=1, keepdims=True)
>>> similarity = text_vectors @ image_vectors.T

>>> # Note that the matching text/image pairs (on the diagonal) have the
>>> # highest values as you would hope.
>>> similarity
array([[0.29675007, 0.0999563 , 0.12603459],
       [0.09451606, 0.25567788, 0.18573087],
       [0.1604508 , 0.17910984, 0.2590417 ]], dtype=float32)

Generating a Weights File

By default, Clippie will automatically download and use a copy of the CLIP ViT-B-32 weights (preconverted into Clippie's format) from the corresponding GitHub release. (See weightie for the weight storage format and auto-download mechanism.)

Alternatively, the clippie-convert-weights-file script can be used to convert a PyTorch weights file from the reference CLIP implementation into Clippie's native weights format:

$ pip install path/to/clippie[convert]  # Extra packages needed for weights file conversion
$ clippie-convert-weights-file /path/to/ViT-B-32.pt ViT-B-32.weights

The conversion script requires extra packages (including PyTorch) to be installed in order to unpack the PyTorch weights file format. After conversion, these dependencies are no longer required.

You can then provide the path to the converted weights file (as a Path object) to clippie.load to load the needed weights and pass them to the encoder functions:

>>> from pathlib import Path
>>> from clippie import load, encode_text, encode_image

>>> # Load the weights...
>>> weights = load(Path("/path/to/converted.weights"))

>>> # Use them...
>>> text_vectors = encode_text([...], weights.text_encoder)
>>> image_vectors = encode_image([...], weights.image_encoder)

The converted weights file will typically be larger than the source CLIP weights file because all values are expanded to float32 so that they can be directly memory mapped by Clippie. Clippie is float32-only since most CPUs only natively support down to float32 (and not float16).
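
To illustrate why the float32 expansion matters, here is a sketch of the general memory-mapping technique Numpy provides (this is not Clippie's actual loader, and the file name is made up): a file of raw float32 values can be mapped directly, so the operating system pages values in on demand rather than copying everything into RAM up front.

>>> import numpy as np

>>> # Write a few raw float32 values to stand in for a weights file (the real
>>> # weightie format also stores tensor names and shapes).
>>> np.arange(16, dtype=np.float32).tofile("example.weights")

>>> # Memory-map the file: values stored in any other dtype would first need
>>> # expanding to float32, which is why converted weights files grow in size.
>>> weights = np.memmap("example.weights", dtype=np.float32, mode="r")
>>> weights[:4]
memmap([0., 1., 2., 3.], dtype=float32)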

Preemptive FAQ

Why does Clippie exist?

I wanted a decent search facility for my personal photo collection without relying on a 3rd party service (e.g. Google Photos). Based on the impressive results reported by various other projects, it became clear that OpenAI's CLIP model could work well, in spite of photo search being explicitly out-of-scope for CLIP. In fact, in my experience so far, search quality is substantially better than Google Photos' search function.

To ensure I could build my photo search system on something which would remain stable for some years, I wanted to avoid using anything based on a cutting-edge deep learning framework -- ruling out the reference implementation and other open source options. I am not in this for the research: I just want the tasty, tasty search results!

Finally, I'd been looking for a reason to learn more about deep learning and this application was a good excuse. As you might expect from a learning exercise, there is a perhaps slightly excessive quantity of commentary in the code...

Why not the CLIP reference implementation?

By contrast with the reference implementation, Clippie has only a few comparatively light-weight and stable dependencies (chiefly Numpy and Pillow). As such, the largest download needed is a copy of the weights, not gigabytes of software. Furthermore, unlike most deep learning libraries -- which cater to a fast moving field -- all of the dependencies used have been stable and well supported for many years and are likely to remain so for many more years.

Separately, the reference CLIP implementation makes some slightly quirky choices from a software engineering point of view (e.g. its vocabulary binary encoding format). As such, in several places, Clippie does things slightly differently and, hopefully, a little more clearly.

Why CPU only?

The smallest ViT-B/32 model can process an image or text string on my laptop's CPU in 50-100 milliseconds and gives subjectively good quality results when searching my collection of ~100k photos. This is plenty fast enough for (my) personal use cases, so there is no need to buy (or manage) a GPU!

Clippie's Numpy-based implementation runs approximately as fast as the PyTorch-based reference implementation on a CPU: Numpy appears to make fairly effective use of the available SIMD and multi-core facilities. That said, I'm confident performance could be improved given a little profiling and effort. For instance, no attention has been paid to memory layout or the effects of batch sizes.
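
If you want a rough sense of throughput on your own hardware, a simple wall-clock measurement like the sketch below is enough. The image file name is just a placeholder, and the first call is used as a warm-up so that weight loading (and the initial download) is not included in the timing.

>>> import time
>>> from PIL import Image
>>> from clippie import encode_image

>>> image = Image.open("mountain.jpg")
>>> _ = encode_image([image])  # Warm-up: loads (and possibly downloads) the weights

>>> start = time.perf_counter()
>>> _ = encode_image([image])
>>> elapsed = time.perf_counter() - start  # Typically in the 50-100 ms range quoted above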

Why float32 only?

Whilst some of CLIP's Vision Transformer weights are given as 16-bit values, most CPUs (with the exception of some recent ARM systems) only support efficient float32 arithmetic. Expanding everything to float32 inflates memory usage somewhat but is faster on most systems in practice.

Why ViT-only?

The CLIP authors reported that their Vision Transformer (ViT)-based image encoder worked as well as, or better than, the ResNet-based alternative. Since the text encoder already uses a Transformer, I'd already done most of the work needed to implement ViT and didn't fancy implementing the ResNet too.

Why inference only?

Since I'm only interested in using CLIP, and the published weights work well, I had no need to implement training. That said, some of the limitations of OpenAI's training set (presumably imposed in the name of limiting potential abuse) do leave some gaps in functionality. For example, the published weights are incapable of finding pictures of breast feeding.

Separately, I'm especially keen to avoid falling down the rabbit hole of model training lest I get sucked into deep learning research :).

Does Clippie implement vector search?

No.

Whilst various fancy libraries and services exist which implement (screamingly) fast approximate nearest neighbour search on millions of vectors, they simply aren't necessary at the scale of (my) personal photo collection. Naively using Numpy as in the example code above can compute similarity of a search vector against 100k image vectors in about 50ms on my laptop.
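
For example, continuing from the usage example above (where image_vectors has already been normalised), a brute-force search is just a matrix-vector product followed by a sort. The query string here is one of the example captions from earlier.

>>> import numpy as np
>>> from clippie import encode_text

>>> # Embed and normalise the search query.
>>> query = encode_text(["people walking along a mountain ridge"])[0]
>>> query /= np.linalg.norm(query)

>>> # Cosine similarity against every (already normalised) image vector, then
>>> # the indices of the matching images, best first.
>>> scores = image_vectors @ query
>>> best_first = np.argsort(scores)[::-1]
>>> best_first[:3]  # The mountain photo comes out on top
array([1, 2, 0])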

Does this reuse any code from the CLIP reference implementation?

No -- though it does use its model weights and byte-pair encoding data.

This software is a from-scratch reimplementation of CLIP based almost entirely on the descriptions in the original papers. However, to ensure weight-compatibility, some parts of Clippie necessarily mimic the reference implementation -- though no code has been reused or adapted.

Clippie does, however, re-use the data published alongside the reference CLIP implementation:

  • The vocabulary and byte-pair-encoding data included in the (MIT Licensed) CLIP repository is also included in Clippie (albeit in a different format).
  • The model weights provided with the reference CLIP implementation are also used (again, after format conversion).

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mossblaser_clippie-0.0.2.tar.gz (572.0 kB)


Built Distribution

mossblaser_clippie-0.0.2-py3-none-any.whl (553.9 kB)


File details

Details for the file mossblaser_clippie-0.0.2.tar.gz.

File metadata

  • Download URL: mossblaser_clippie-0.0.2.tar.gz
  • Upload date:
  • Size: 572.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.1.dev0+g94f810c.d20240510 CPython/3.12.6

File hashes

Hashes for mossblaser_clippie-0.0.2.tar.gz
  • SHA256: a17727c0fa1fec7a0612d1cef962d6346d1134a6160dea3f48b97d7a8c5b293f
  • MD5: 7f4d87953251a5981b97528241a57554
  • BLAKE2b-256: 9ac97365f8de3119bcd9566e4f2b1fcadf50802fc773e2a376dfc1ed51d82d0b


File details

Details for the file mossblaser_clippie-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: mossblaser_clippie-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 553.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.1.dev0+g94f810c.d20240510 CPython/3.12.6

File hashes

Hashes for mossblaser_clippie-0.0.2-py3-none-any.whl
  • SHA256: 3232d9b16a37aeaa44298a70322420bf2c109c623edf196548dc6fb9cae9c9cd
  • MD5: 34a6191a8f9199fa18015b3fa45ad925
  • BLAKE2b-256: 222b48ac3676d3305f844cf13a216ec54d1f6f6aebb7119815b76616f2aa24cc

