CinemaCLIP — MobileCLIP-S1 fine-tuned for cinema language understanding, with 23 CinemaNet classifier heads.
CinemaCLIP-1.0.0
CinemaCLIP is a MobileCLIP-S1 fine-tune specialized for understanding the visual language of cinema at a frame level. It is a hybrid CLIP model with 23 classifier heads that represent a comprehensive taxonomy built with domain experts. For more info, see our launch blog post.
This repository ships three serialized forms of the same model:
- Torch (model.safetensors): load via the cinemaclip Python package.
- CoreML (ImageEncoder.mlmodel, ImageEncoder.mlpackage and TextEncoder.mlpackage): for on-device Apple Neural Engine inference.
- ONNX (ImageEncoder.onnx, TextEncoder.onnx, plus _fp16 variants): for cross-platform inference.
Install
pip install cinemaclip # core
pip install "cinemaclip[coreml]" # CoreML export/inference
pip install "cinemaclip[onnx]" # ONNX export/inference
Usage (PyTorch)
from PIL import Image
from cinemaclip import CinemaCLIP
model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()
# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"] # Classifier predictions
predictions["clip_image_embedding"]
# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True) # [1, 512]
# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True) # [1, 512]
The CinemaCLIP.predict_image method demonstrates how to get post-processed classifier outputs from the model. It is neither optimized nor production-ready; treat it as a reference implementation.
Usage (CoreML)
import coremltools as ct
from PIL import Image
img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"] # [512]
probabilities = out["probabilities"] # [101] — concat of 23 per-category outputs
# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")
Usage (ONNX)
from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T
img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),  # yields float tensor in [0, 1]; no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()
session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})
Output structure
probabilities is a flat [101] vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped CinemaNetSchema.json:
import json
# Use a context manager so the file handle is closed after loading
with open("CinemaNetSchema.json") as f:
    schema = json.load(f)
label_names = schema["probabilities_labels"] # len == 101
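Once the label names are loaded, pairing them with the flat vector is a matter of zipping the two sequences. The sketch below is illustrative only: top_labels is a hypothetical helper, not part of the cinemaclip package, and the label names and probability values are made-up stand-ins for the real schema and model output.

```python
def top_labels(label_names, probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(label_names) != len(probabilities):
        raise ValueError("label_names and probabilities must have the same length")
    # Cast to float so NumPy scalars and plain floats behave the same way.
    pairs = zip(label_names, (float(p) for p in probabilities))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up stand-ins; the real inputs are the 101 schema labels and model outputs.
demo_labels = ["label_a", "label_b", "label_c", "label_d"]
demo_probs = [0.10, 0.70, 0.05, 0.15]
demo_top = top_labels(demo_labels, demo_probs, k=2)
```

Because the vector mixes softmax and sigmoid outputs from different heads, a global ranking like this is mainly useful for quick inspection rather than as a final prediction.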
The classifier heads are a mix of 3 types of classifiers:
- Single label (softmax activation)
- Multi label (sigmoid activation)
- Binary (sigmoid activation)
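How a head's slice of the vector should be read depends on which of these types it is. The following is a minimal sketch, assuming the values are already post-activation (as stated above). The decode_head helper, the kind strings, the 0.5 threshold, and the example labels are all assumptions made for illustration; none of them come from the cinemaclip package or the schema file.

```python
def decode_head(labels, probs, kind, threshold=0.5):
    """Turn one head's post-activation outputs into a list of predicted labels.

    kind is a placeholder string for this sketch: "single_label",
    "multi_label", or "binary". The 0.5 threshold is an assumption.
    """
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    if kind == "single_label":
        # Softmax outputs compete with each other, so keep only the best one.
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [labels[best]]
    if kind in ("multi_label", "binary"):
        # Sigmoid outputs are independent, so threshold each one separately.
        return [label for label, p in zip(labels, probs) if p >= threshold]
    raise ValueError(f"unknown head kind: {kind}")


# Hypothetical labels and values, for illustration only.
single = decode_head(["wide", "medium", "close"], [0.2, 0.7, 0.1], "single_label")
multi = decode_head(["left", "right", "top"], [0.8, 0.3, 0.6], "multi_label")
```

A single-label head always yields exactly one label, while a multi-label or binary head may yield none when every output falls below the threshold.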
Evaluation
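On the cinematic understanding benchmark below, CinemaCLIP outperforms both the largest existing CLIP models (up to 28x larger) and the leading 4B VLMs we benchmarked against: its mean score is 87.6 with the classifier heads and 82.9 zero-shot, compared with 57.6 for the strongest baseline.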
Two inference modes are reported for CinemaCLIP:
- Classifier — the shipped supervised heads on the CinemaCLIP image embedding.
- 0-shot — zero-shot text/image similarity using CinemaCLIP's own text encoder.
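The 0-shot mode can be illustrated with the generic CLIP-style recipe: embed one text prompt per candidate label, take the cosine similarity of each with the image embedding, and apply a softmax over the scaled similarities. This is a sketch of that general technique, not necessarily the exact protocol behind the table. The zero_shot_probs helper, the temperature of 100, and the toy 3-dimensional vectors (real embeddings are 512-dimensional) are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def zero_shot_probs(image_emb, text_embs, temperature=100.0):
    """Softmax over temperature-scaled cosine similarities (temperature is assumed)."""
    image_emb = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(image_emb, l2_normalize(text_emb)))
        for text_emb in text_embs
    ]
    logits = [temperature * s for s in sims]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vectors standing in for one image embedding and two prompt embeddings.
toy_image = [1.0, 0.0, 0.0]
toy_texts = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_probs = zero_shot_probs(toy_image, toy_texts)
```

With real embeddings, image_emb would come from encode_image and each entry of text_embs from encode_text on a prompt describing one candidate label.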
| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 82.9 | 87.6 | 57.6 | 56.7 | 55.3 | 55.3 | 45.9 | 45.2 | 44.8 | 44.2 | 39.0 | 38.7 | 36.5 |
| Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
| Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
| Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
| Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
| Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
| Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
| Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
| Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
| Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
| Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
| Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
| Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
| Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 |
| Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
| Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
| Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
| Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
| Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
| Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
| Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
| Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |
The shot.lighting.direction head ships with the other classifier heads but is excluded from the table above because it is a multi-label classifier.
Citation
@misc{cinemaclip2026,
title = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
author = {Somani, Rahul and Marini, Anton and Stewart, Damian},
year = {2026},
publisher = {Hugging Face},
doi = {10.57967/hf/8539},
howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
note = {Model weights and taxonomy}
}
File details
Details for the file cinemaclip-0.0.2.tar.gz.
File metadata
- Download URL: cinemaclip-0.0.2.tar.gz
- Upload date:
- Size: 62.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04d6b090ca03e450f10eab6014fded357e3197b216e0735f24305d921225d32f |
| MD5 | de1558d8174acecb5f29c466e907daee |
| BLAKE2b-256 | aacfa04f84caeb26719c2cd94a1446af55d4d8ecbec0c113dda6f6f190c598cb |
File details
Details for the file cinemaclip-0.0.2-py3-none-any.whl.
File metadata
- Download URL: cinemaclip-0.0.2-py3-none-any.whl
- Upload date:
- Size: 32.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 488ffa8525eb811038135942bbd087fce4f639f376bce6a48f9b0fded7bd17d9 |
| MD5 | cdfab4e51f602eb9ac02570212ef3e6b |
| BLAKE2b-256 | c0fb1f60c7e7a9c01dd042203e044996c611f8b61b655c0b263c5bfcb7b3bc6d |