CinemaCLIP — MobileCLIP-S1 fine-tuned for cinema language understanding, with 23 CinemaNet classifier heads.

library_name: cinemaclip
pipeline_tag: zero-shot-image-classification
tags:
- clip
- mobile-clip
- cinema
- film
- movies
- multi-task
- hybrid
- cinematography
- domain-specific
- image-classification
- zero-shot
base_model: apple/MobileCLIP-S1-OpenCLIP
base_model_relation: finetune
license: other
license_name: cinemaclip-openrail-m
license_link: LICENSE

CinemaCLIP-1.0.0

CinemaCLIP is a MobileCLIP-S1 fine-tune specialized for understanding the visual language of cinema at the frame level. It is a hybrid CLIP model: in addition to the image and text encoders, it has 23 classifier heads that cover a comprehensive taxonomy built with domain experts. For more info, see our launch blog post.

This repository ships three serialized forms of the same model:

Torch (model.safetensors) — load via the cinemaclip Python package.
CoreML (ImageEncoder.mlmodel, ImageEncoder.mlpackage and TextEncoder.mlpackage) — for on-device Apple Neural Engine inference.
ONNX (ImageEncoder.onnx, TextEncoder.onnx, plus _fp16 variants) — for cross-platform inference.

Install

pip install cinemaclip            # core
pip install "cinemaclip[coreml]"  # CoreML export/inference
pip install "cinemaclip[onnx]"    # ONNX export/inference

Usage (PyTorch)

from PIL import Image
from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"]  # Classifier predictions
predictions["clip_image_embedding"]

# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True)   # [1, 512]

# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True)  # [1, 512]

The CinemaCLIP.predict_image method is a reference implementation that shows how to get post-processed classifier outputs from the model. It is neither optimized for speed nor production-ready, so treat it as a reference rather than as a deployment path.

Usage (CoreML)

import coremltools as ct
from PIL import Image

img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"]    # [512]
probabilities = out["probabilities"]       # [101] — concat of 23 per-category outputs

# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")

Usage (ONNX)

from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T

img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),   # yields float tensor in [0, 1] — no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()

session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})

Output structure

probabilities is a flat [101] vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped CinemaNetSchema.json:

import json
schema = json.load(open("CinemaNetSchema.json"))
label_names = schema["probabilities_labels"]  # len == 101
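Since label_names and probabilities share the same ordering, pairing them is a one-liner. The helper below is a minimal sketch with toy data; top_labels is a hypothetical name, not part of the cinemaclip package, and the toy labels stand in for the real 101 entries.

```python
def top_labels(probabilities, label_names, k=5):
    """Return the k highest-scoring (label, probability) pairs."""
    if len(probabilities) != len(label_names):
        raise ValueError("probabilities and label_names must have the same length")
    pairs = zip(label_names, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins: real inputs would be schema["probabilities_labels"]
# and the flat [101] probabilities output.
toy_labels = ["label_a", "label_b", "label_c"]
toy_probs = [0.10, 0.85, 0.05]
print(top_labels(toy_probs, toy_labels, k=2))
```

Note that a ranking across the whole vector mixes outputs from different heads, so scores are only directly comparable within a single head.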

The classifier heads are a mix of 3 types of classifiers:

Single label (softmax activation)
Multi label (sigmoid activation)
Binary (sigmoid activation)
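The difference between the activations affects how each head's outputs should be read: softmax scores compete and sum to 1, while sigmoid scores are independent. The following is an illustrative sketch only, with made-up logits and an assumed 0.5 cutoff; the shipped probabilities are already post-activation, and the actual per-head thresholds live in CinemaNetSchema.json.

```python
import math


def softmax(logits):
    """Single-label head: scores compete and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Multi-label or binary head: each score is independent."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # avoids overflow for large negative logits
    return z / (1.0 + z)


# Made-up logits, for illustration only.
single_label = softmax([2.0, 0.5, -1.0])
multi_label = [sigmoid(v) for v in [1.5, -0.3, 0.8]]

# Single label: pick the argmax. Multi label: keep everything above a cutoff.
best_index = max(range(len(single_label)), key=single_label.__getitem__)
THRESHOLD = 0.5  # assumed cutoff, not taken from the schema
active = [i for i, p in enumerate(multi_label) if p >= THRESHOLD]
print(best_index, active)
```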

Evaluation

On cinematic understanding tasks, CinemaCLIP outperforms both the largest existing CLIP models (up to 28x larger) and the leading 4B-parameter VLMs we benchmarked against. Per-category scores are in the table below.

Two inference modes are reported for CinemaCLIP:

Classifier — the shipped supervised heads on the CinemaCLIP image embedding.
0-shot — zero-shot text/image similarity using CinemaCLIP's own text encoder.
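For the 0-shot mode, the usual CLIP recipe is to take dot products between a normalized image embedding and one normalized text embedding per candidate prompt, then apply a softmax. The sketch below assumes that recipe; zero_shot_probs is a hypothetical helper, the logit_scale of 100.0 is a common CLIP convention rather than a documented CinemaCLIP value, and the 2-D vectors stand in for the real 512-D embeddings.

```python
import math


def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled dot products of unit-length embeddings."""
    sims = [
        sum(a * b for a, b in zip(image_embedding, text))
        for text in text_embeddings
    ]
    scaled = [logit_scale * s for s in sims]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy unit vectors; real inputs would come from encode_image / encode_text
# with normalize=True.
image = [1.0, 0.0]
prompts = [[0.6, 0.8], [0.8, 0.6]]
probs = zero_shot_probs(image, prompts)
print(probs.index(max(probs)))  # index of the best-matching prompt
```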

Category	CinemaCLIP 0-shot	CinemaCLIP Classifier	Qwen3.5-4B	Gemma4-4B	InternVL3.5-4B	Molmo2-4B	DFN ViT-H-14	MetaCLIP PE-bigG	OpenAI ViT-L-14	MobileCLIP-S1	DFN ViT-L-14	SigLIP2 SO400M	SigLIP2 ViT-gopt
Mean	82.9	87.6	57.6	56.7	55.3	55.3	45.9	45.2	44.8	44.2	39.0	38.7	36.5
Color Contrast	89.6	86.8	33.7	35.3	33.7	35.3	34.0	33.1	49.4	38.7	37.1	57.7	25.2
Color Key	84.9	92.9	78.1	78.1	80.3	64.3	58.2	50.2	53.2	59.4	48.3	22.8	52.6
Color Saturation	82.6	82.6	66.5	65.4	72.1	45.9	55.1	61.8	58.1	35.8	46.8	33.3	31.8
Color Theory	71.3	72.7	54.0	51.7	50.7	48.7	54.7	51.7	50.7	47.3	47.7	31.3	31.7
Color Tones	86.0	86.5	50.2	62.6	70.6	62.1	58.5	50.2	52.0	55.7	47.2	24.0	17.7
Lighting Cast	85.9	90.4	38.3	53.3	39.8	35.7	25.4	29.3	28.8	35.7	22.8	37.8	18.2
Lighting Contrast	93.9	95.3	29.8	39.1	38.7	46.1	35.3	35.5	32.6	39.0	39.4	48.4	37.6
Lighting Edge	87.6	90.4	22.8	38.8	31.2	40.4	22.4	31.6	41.6	34.0	21.2	26.0	25.6
Lighting Silhouette	88.4	93.1	80.9	63.0	48.9	48.8	66.6	67.1	67.4	58.4	43.5	46.2	78.9
Shot Angle	73.4	82.3	41.9	49.2	33.2	49.9	28.0	13.7	19.0	19.6	25.9	21.3	17.2
Shot Composition	95.5	96.0	46.0	54.5	55.7	60.5	27.8	24.3	21.3	22.0	25.2	31.4	11.4
Shot Dutch Angle	61.9	78.5	62.2	65.1	46.7	49.3	27.3	44.5	38.4	56.6	25.9	47.6	68.7
Shot Focus	71.3	71.2	19.9	26.6	26.3	25.1	32.9	31.2	24.4	31.3	37.3	48.2	12.6
Shot Framing	79.2	83.8	38.0	29.6	40.1	34.6	33.6	24.9	23.5	23.9	33.0	7.3	9.8
Shot Height	90.5	91.8	38.1	37.4	41.2	53.0	37.6	33.7	28.9	24.0	33.6	29.6	23.9
Shot Lens Size	67.9	70.6	49.6	28.0	43.6	46.6	32.1	28.0	34.5	30.1	25.7	30.1	17.6
Shot Location	90.9	93.9	81.0	82.2	81.5	79.2	73.0	68.4	68.0	75.6	66.1	65.0	46.7
Shot Symmetry	88.3	92.9	90.2	86.7	76.0	80.2	76.6	78.0	54.0	39.3	24.9	46.0	82.4
Shot Time of Day	69.2	89.0	75.1	66.1	70.7	70.7	68.1	69.6	60.3	73.7	71.2	48.5	42.7
Shot Type	81.8	90.5	81.3	61.2	57.0	57.4	52.8	40.4	36.5	35.7	56.7	46.5	29.7
Shot Type - Crowd	91.5	99.6	97.2	88.2	94.3	94.8	55.9	69.1	68.6	77.2	37.3	52.4	69.3
Shot Type - OTS	92.0	95.5	92.5	85.0	83.9	87.6	53.2	57.0	73.9	60.3	42.1	50.5	51.2

The shot.lighting.direction head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.
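The source does not state how the Mean row is computed. As a quick check, the unweighted average of the 22 per-category scores in the CinemaCLIP Classifier column reproduces the reported 87.6, which suggests a simple mean over categories:

```python
# CinemaCLIP Classifier column, copied from the table in row order
# (Color Contrast through Shot Type - OTS).
classifier_scores = [
    86.8, 92.9, 82.6, 72.7, 86.5, 90.4, 95.3, 90.4, 93.1, 82.3, 96.0,
    78.5, 71.2, 83.8, 91.8, 70.6, 93.9, 92.9, 89.0, 90.5, 99.6, 95.5,
]

mean_score = sum(classifier_scores) / len(classifier_scores)
print(len(classifier_scores), round(mean_score, 1))  # prints: 22 87.6
```

The 22 rows plus the excluded shot.lighting.direction head account for all 23 classifier heads.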

Files in this repo

File	Purpose
`model.safetensors`	Blended (α=0.75) torch weights — `CinemaCLIP.from_pretrained` target
`config.json`	Autogenerated `__init__` kwargs for `CinemaCLIP`
`CinemaNetSchema.json`	Schema detailing classifier head metadata, confidence thresholds, preprocessing info
`ImageEncoder.mlmodel`	CoreML `"neuralnetwork"` ImageEncoder (unified embedding + probabilities)
`ImageEncoder.mlpackage`	CoreML ImageEncoder (unified embedding + probabilities)
`TextEncoder.mlpackage`	CoreML TextEncoder
`ImageEncoder.onnx` / `_fp16.onnx`	ONNX ImageEncoder
`TextEncoder.onnx` / `_fp16.onnx`	ONNX TextEncoder

Citation

@misc{cinemaclip2026,
  title  = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
  author = {Somani, Rahul and Marini, Anton and Stewart, Damian},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/ozu-technology/CinemaCLIP}}
}
