CinemaCLIP — MobileCLIP-S1 fine-tuned for cinema language understanding, with 23 CinemaNet classifier heads.

library_name: cinemaclip
pipeline_tag: zero-shot-image-classification
tags:
  • clip
  • mobile-clip
  • cinema
  • film
  • movies
  • multi-task
  • hybrid
  • cinematography
  • domain-specific
  • image-classification
  • zero-shot
base_model: apple/MobileCLIP-S1-OpenCLIP
base_model_relation: finetune
license: other
license_name: cinemaclip-openrail-m
license_link: LICENSE

CinemaCLIP-1.0.0

CinemaCLIP is a MobileCLIP-S1 fine-tune specialized for understanding the visual language of cinema at the frame level. It is a hybrid CLIP model with 23 classifier heads that cover a comprehensive taxonomy built with domain experts. For more info, see our launch blog post.

This repository ships three serialized forms of the same model:

  • Torch (model.safetensors) — load via the cinemaclip Python package.
  • CoreML (ImageEncoder.mlmodel, ImageEncoder.mlpackage and TextEncoder.mlpackage) — for on-device Apple Neural Engine inference.
  • ONNX (ImageEncoder.onnx, TextEncoder.onnx, plus _fp16 variants) — for cross-platform inference.

Install

pip install cinemaclip            # core
pip install "cinemaclip[coreml]"  # CoreML export/inference
pip install "cinemaclip[onnx]"    # ONNX export/inference

Usage (PyTorch)

from PIL import Image
from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"]       # post-processed classifier predictions
predictions["clip_image_embedding"]   # CLIP image embedding

# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True)   # [1, 512]

# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True)  # [1, 512]

The CinemaCLIP.predict_image method is a reference implementation that shows how to obtain post-processed classifier outputs from the model. It is not optimized for efficiency and is not production ready, so treat it as a reference above all else.

Usage (CoreML)

import coremltools as ct
from PIL import Image

img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"]    # [512]
probabilities = out["probabilities"]       # [101] — concat of 23 per-category outputs

# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")

Usage (ONNX)

from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T

img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),   # yields float tensor in [0, 1] — no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()

session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})

Output structure

probabilities is a flat [101] vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped CinemaNetSchema.json:

import json

with open("CinemaNetSchema.json") as f:
    schema = json.load(f)
label_names = schema["probabilities_labels"]  # len == 101
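
To make the flat vector readable, each entry can be paired with its label. The sketch below assumes probabilities_labels is a flat list of 101 strings aligned index by index with the probabilities output; it uses placeholder labels and random values so that it runs without the model files, and top_labels is a hypothetical helper, not part of the cinemaclip package.

```python
import numpy as np

# Stand-ins for the real objects: in practice `label_names` comes from
# CinemaNetSchema.json and `probs` from the image encoder output.
label_names = [f"label_{i}" for i in range(101)]  # hypothetical names
probs = np.random.default_rng(0).random(101).astype(np.float32)


def top_labels(probs, label_names, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = np.asarray(probs).reshape(-1)  # accept [101] or [1, 101]
    if probs.shape[0] != len(label_names):
        raise ValueError("probabilities and labels must have the same length")
    order = np.argsort(probs)[::-1][:k]    # indices sorted by descending value
    return [(label_names[i], float(probs[i])) for i in order]


top5 = top_labels(probs, label_names)
for name, p in top5:
    print(f"{name}: {p:.3f}")
```

Because the vector mixes outputs from different heads, a global ranking like this is only a quick inspection aid; values from different heads are not directly comparable.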

The classifier heads are a mix of 3 types of classifiers:

  • Single label (softmax activation)
  • Multi label (sigmoid activation)
  • Binary (sigmoid activation)
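
The three head types call for different decoding rules: argmax for single label heads, and a threshold for multi label and binary heads. The following is a minimal sketch under stated assumptions: the head layout, the label names, and the 0.5 threshold are hypothetical placeholders, and the real slice boundaries and confidence thresholds should be read from CinemaNetSchema.json.

```python
import numpy as np

# Hypothetical head layout for illustration only; the real head names,
# label order and thresholds live in CinemaNetSchema.json.
HEADS = [
    {"name": "single_label_head", "kind": "single", "labels": ["a", "b", "c"]},
    {"name": "multi_label_head", "kind": "multi", "labels": ["x", "y"]},
    {"name": "binary_head", "kind": "binary", "labels": ["yes"]},
]


def decode(probs, heads, threshold=0.5):
    """Split a flat probability vector into per-head predictions."""
    out, offset = {}, 0
    for head in heads:
        labels = head["labels"]
        chunk = probs[offset:offset + len(labels)]
        offset += len(labels)
        if head["kind"] == "single":
            # Softmax head: pick the most probable label.
            out[head["name"]] = labels[int(np.argmax(chunk))]
        elif head["kind"] == "multi":
            # Sigmoid head: keep every label at or above the threshold.
            out[head["name"]] = [l for l, p in zip(labels, chunk) if p >= threshold]
        else:
            # Binary sigmoid head: a single yes/no decision.
            out[head["name"]] = bool(chunk[0] >= threshold)
    return out


probs = np.array([0.1, 0.7, 0.2, 0.9, 0.3, 0.4])
decoded = decode(probs, HEADS)
print(decoded)
# {'single_label_head': 'b', 'multi_label_head': ['x'], 'binary_head': False}
```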

Evaluation

CinemaCLIP outperforms both the largest existing CLIP models (up to 28x larger) and leading VLMs on cinematic understanding tasks. The VLM baselines are the leading 4B VLMs.

Two inference modes are reported for CinemaCLIP:

  • Classifier — the shipped supervised heads on the CinemaCLIP image embedding.
  • 0-shot — zero-shot text/image similarity using CinemaCLIP's own text encoder.

| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 82.9 | 87.6 | 57.6 | 56.7 | 55.3 | 55.3 | 45.9 | 45.2 | 44.8 | 44.2 | 39.0 | 38.7 | 36.5 |
| Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
| Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
| Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
| Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
| Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
| Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
| Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
| Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
| Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
| Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
| Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
| Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
| Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 |
| Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
| Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
| Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
| Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
| Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
| Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
| Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
| Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |

The shot.lighting.direction head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.
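
For the 0-shot mode, a prediction comes from comparing the image embedding against one text embedding per candidate label. The sketch below illustrates the general CLIP-style recipe with NumPy and synthetic embeddings; it is not the exact evaluation protocol behind the table. The logit scale of 100 is a common CLIP default used here as an assumption, and in practice the embeddings would come from encode_image and encode_text.

```python
import numpy as np


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)  # shape [N]
    logits = logits - logits.max()                  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


# Synthetic stand-ins: three "prompt" embeddings and an image embedding
# that lies close to the second prompt.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)

p = zero_shot_probs(image_emb, text_embs)
print(int(np.argmax(p)))  # index of the best-matching prompt
```

If the embeddings were produced with normalize=True, the normalization step is redundant but harmless.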

Files in this repo

| File | Purpose |
|---|---|
| model.safetensors | Blended (α=0.75) torch weights; the CinemaCLIP.from_pretrained target |
| config.json | Autogenerated __init__ kwargs for CinemaCLIP |
| CinemaNetSchema.json | Schema detailing classifier head metadata, confidence thresholds, preprocessing info |
| ImageEncoder.mlmodel | CoreML "neuralnetwork" ImageEncoder (unified embedding + probabilities) |
| ImageEncoder.mlpackage | CoreML ImageEncoder (unified embedding + probabilities) |
| TextEncoder.mlpackage | CoreML TextEncoder |
| ImageEncoder.onnx / _fp16.onnx | ONNX ImageEncoder |
| TextEncoder.onnx / _fp16.onnx | ONNX TextEncoder |

Citation

@misc{cinemaclip2026,
  title  = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
  author = {Somani, Rahul and Marini, Anton and Stewart, Damian},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/ozu-technology/CinemaCLIP}}
}
