Multi-Modal Transformers library for Semantic Search and other Vision-Language tasks
UForm
Pocket-Sized Multimodal AI
For Content Understanding and Generation
Welcome to UForm, a multimodal AI library that's as versatile as it is efficient. UForm's tiny embedding models will help you understand and search visual and textual content across various languages. UForm's small generative models, on the other hand, not only support conversational and chat use-cases, but are also capable of image captioning and Visual Question Answering (VQA). With compact custom pre-trained transformer models, this can run anywhere from your server farm down to your smartphone.
Features
- Throughput: Thanks to the small size, the inference speed is 2-4x faster than competitors.
- Tiny Embeddings: 256-dimensional vectors are 2-3x quicker to search than those from CLIP-like models.
- Quantization Aware: Downcast embeddings from f32 to i8 without losing much recall.
- Multilingual: Trained on a balanced dataset, the recall is great across over 20 languages.
- Hardware Friendly: Whether it's Apple's CoreML or ONNX, we've got you covered.
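To make the quantization point concrete, here is a minimal sketch of downcasting f32 embeddings to i8 and checking how well cosine similarity survives. The random vectors are stand-ins for real UForm embeddings, and the scale-by-127 scheme is an assumption chosen for illustration, not necessarily the scheme the library itself uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for 256-dimensional embeddings; real ones would come from the model.
embeddings = rng.standard_normal((1000, 256)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def quantize_i8(vectors: np.ndarray) -> np.ndarray:
    """Map unit-length f32 vectors to int8 by scaling into [-127, 127]."""
    return np.clip(np.round(vectors * 127.0), -127, 127).astype(np.int8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, computed in f32 regardless of the input dtype."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

quantized = quantize_i8(embeddings)

# Compare similarities before and after quantization for 50 disjoint pairs.
max_error = max(
    abs(cosine(embeddings[i], embeddings[i + 1]) - cosine(quantized[i], quantized[i + 1]))
    for i in range(0, 100, 2)
)
print(f"bytes per vector: {embeddings[0].nbytes} -> {quantized[0].nbytes}")
print(f"largest cosine error over 50 pairs: {max_error:.4f}")
```

Each vector shrinks to a quarter of its original size, which is where most of the storage and search-speed benefit comes from.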
Models
Embedding Models
Model | Parameters | Languages | Architecture |
---|---|---|---|
uform-vl-english | 143M | 1 | 2 text layers, ViT-B/16, 2 multimodal layers |
uform-vl-multilingual-v2 | 206M | 21 | 8 text layers, ViT-B/16, 4 multimodal layers |
uform-vl-multilingual | 206M | 12 | 8 text layers, ViT-B/16, 4 multimodal layers |
Generative Models
Model | Parameters | Purpose | Architecture |
---|---|---|---|
uform-gen2-qwen-500m | 1.2B | Chat, Image Captioning, VQA | qwen1.5-0.5B, ViT-H/14 |
uform-gen | 1.5B | Image Captioning, VQA | llama-1.3B, ViT-B/16 |
Quick Start
Once you pip install uform, fetching the models is as easy as:
import uform
model = uform.get_model('unum-cloud/uform-vl-english') # Just English
model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') # 21 Languages
Producing Embeddings
from PIL import Image
import torch.nn.functional as F
text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')
image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)
similarity = F.cosine_similarity(image_embedding, text_embedding)
To search for similar items, the embeddings can be compared using cosine similarity.
The resulting value will fall within the range of -1 to 1, where 1 indicates a high likelihood of a match.
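The search step described above can be sketched as a brute-force top-k lookup. This is only an illustration under stated assumptions: the 4-dimensional vectors are toy values rather than real model output, and `top_k_cosine` is a hypothetical helper, not part of the UForm API.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query  # cosine similarity, each value in [-1, 1]
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy "embeddings"; real ones would come from encode_image / encode_text.
corpus = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])

indices, scores = top_k_cosine(query, corpus, k=2)
print(indices, scores)
```

For large collections a dedicated vector index would normally replace the full matrix product, but the scoring rule stays the same.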
Once the list of nearest neighbors (best matches) is obtained, the joint multimodal embeddings, created from both text and image features, can be used to better rerank (reorder) the list.
The model can calculate a "matching score" that falls within the range of [0, 1], where 1 indicates a high likelihood of a match.
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
score = model.get_matching_scores(joint_embedding)
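Once a matching score is available for every candidate, reranking is just a sort by that score. A minimal sketch follows, assuming hypothetical file names and hard-coded scores in place of real `get_matching_scores` output; `rerank` is an illustrative helper, not a library function.

```python
def rerank(candidates, matching_scores):
    """Reorder first-stage candidates by descending matching score."""
    if len(candidates) != len(matching_scores):
        raise ValueError("each candidate needs exactly one matching score")
    paired = sorted(zip(candidates, matching_scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in paired]

# Hypothetical nearest neighbors, in the order returned by the cosine-similarity search.
candidates = ["img_17.jpg", "img_03.jpg", "img_42.jpg"]
# Hypothetical scores in [0, 1], standing in for model.get_matching_scores(...).
matching_scores = [0.62, 0.91, 0.15]

reranked = rerank(candidates, matching_scores)
print(reranked)  # ['img_03.jpg', 'img_17.jpg', 'img_42.jpg']
```

Because the joint embedding is costlier to compute than a dot product, it is typically applied only to the short list that the first-stage search returns.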
Chat, Image Captioning and Question Answering
The generative model can be used to caption images and answer questions about them. It is also suitable for multimodal chat.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)

prompt = "Question or Instruction"
image = Image.open("image.jpg")

inputs = processor(text=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Drop the prompt tokens so that only the newly generated text is decoded.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
You can check examples of different prompts in our demo space.
Image Captioning and Question Answering
These are the instructions for the first version of the UForm-Gen model. We highly recommend using the newer model, whose instructions you can find above.
The generative model can be used to caption images, summarize their content, or answer questions about them. The exact behavior is controlled by prompts.
import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# Example prompts; the [cap] and [vqa] prefixes select captioning or question answering:
# [cap] Narrate the contents of the image with precision.
# [cap] Summarize the visual content of the image.
# [vqa] What is the main subject of the image?
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("zebra.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=128,
        eos_token_id=32001,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Drop the prompt tokens so that only the newly generated text is decoded.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
Multimodal Chat
The generative models can be used for chat-like experiences, where the user can provide both text and images as input. To use that feature, you can start with the following CLI command:
uform-chat --model unum-cloud/uform-gen-chat --image=zebra.jpg
uform-chat --model unum-cloud/uform-gen-chat \
--image="https://bit.ly/3tIVg9M" \
--device="cuda:0" \
--fp16
Multi-GPU
To achieve higher throughput, you can launch UForm on multiple GPUs.
For that, pick the encoder of the model you want to run in parallel (text_encoder or image_encoder), and wrap it in nn.DataParallel (or nn.parallel.DistributedDataParallel).
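import torch
import torch.nn as nn
import uform

model = uform.get_model('unum-cloud/uform-vl-english')
model_image = nn.DataParallel(model.image_encoder)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_image.to(device)

# `images` is assumed to be a batch of already preprocessed images (see model.preprocess_image).
_, res = model_image(images, 0)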
Evaluation
Embedding Models
Few retrieval benchmarks exist for multimodal embeddings.
The most famous ones for English are "MS-COCO" and "Flickr30k".
Evaluating the uform-vl-english model, one can expect the following numbers for search quality.
Dataset | Recall @ 1 | Recall @ 5 | Recall @ 10 |
---|---|---|---|
Flickr | 0.727 | 0.915 | 0.949 |
MS-COCO¹ | 0.510 | 0.761 | 0.838 |
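For readers unfamiliar with the metric, Recall @ K is the fraction of queries whose correct match appears among the top K results. The sketch below shows one way to compute it from a similarity matrix. It assumes exactly one correct candidate per query, placed at the same index, which simplifies real benchmarks where one image can have several captions; the matrix values are made up.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct item (same index) ranks in the top k.

    similarity[i, j] is the score between query i and candidate j,
    and candidate i is assumed to be the single correct match for query i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())

similarity = np.array([
    [0.9, 0.1, 0.3],  # query 0: correct item ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct item ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct item ranked 3rd
])

for k in (1, 2, 3):
    print(f"Recall @ {k}: {recall_at_k(similarity, k):.3f}")
```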
For multilingual benchmarks, we've created the unum-cloud/coco-sm repository².
Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following metrics for text-to-image search, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model.
Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
---|---|---|---|---|---|---|---|
English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
All languages.
Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
---|---|---|---|---|---|---|---|
Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
Armenian 🇦🇲 | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
German 🇩🇪 | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
Hebrew 🇮🇱 | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
Indonesian 🇮🇩 | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
Italian 🇮🇹 | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
Japanese 🇯🇵 | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
Korean 🇰🇷 | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
Persian 🇮🇷 | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
Polish 🇵🇱 | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
Portuguese 🇵🇹 | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
Russian 🇷🇺 | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
Thai 🇹🇭 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
Turkish 🇹🇷 | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
Ukrainian 🇺🇦 | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
Vietnamese 🇻🇳 | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
Mean | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
Google Translate | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
Meta NLLB | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |
Generative Models
Model | LLM Size | SQA | MME | MMBench | Average¹ |
---|---|---|---|---|---|
UForm-Gen2-Qwen-500m | 0.5B | 45.5 | 880.1 | 42.0 | 29.31 |
MobileVLM v2 | 1.4B | 52.1 | 1302.8 | 57.7 | 36.81 |
LLaVA-Phi | 2.7B | 68.4 | 1335.1 | 59.8 | 42.95 |
For captioning evaluation we measure CLIPScore and RefCLIPScore³.
Model | Size | Caption Length | CLIPScore | RefCLIPScore |
---|---|---|---|---|
llava-hf/llava-1.5-7b-hf | 7B | Long | 0.878 | 0.529 |
llava-hf/llava-1.5-7b-hf | 7B | Short | 0.886 | 0.531 |
Salesforce/instructblip-vicuna-7b | 7B | Long | 0.902 | 0.534 |
Salesforce/instructblip-vicuna-7b | 7B | Short | 0.848 | 0.523 |
unum-cloud/uform-gen | 1.5B | Long | 0.847 | 0.523 |
unum-cloud/uform-gen | 1.5B | Short | 0.842 | 0.522 |
unum-cloud/uform-gen-chat | 1.5B | Long | 0.860 | 0.525 |
unum-cloud/uform-gen-chat | 1.5B | Short | 0.858 | 0.525 |
Results for VQAv2 evaluation.
Model | Size | Accuracy |
---|---|---|
llava-hf/llava-1.5-7b-hf | 7B | 78.5 |
unum-cloud/uform-gen | 1.5B | 66.5 |
¹ Train split was in training data.
² Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaging the scores across all sets, and breaking them down in the bottom section.
³ We used the apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.
Speed
On Nvidia RTX 3090, the following performance is expected on text encoding.
Model | Multilingual | Speed | Speedup |
---|---|---|---|
bert-base-uncased | No | 1'612 sequences/second | |
distilbert-base-uncased | No | 3'174 sequences/second | x 1.96 |
sentence-transformers/all-MiniLM-L12-v2 | Yes | 3'604 sequences/second | x 2.24 |
unum-cloud/uform-vl-multilingual-v2 | Yes | 6'809 sequences/second | x 4.22 |
On Nvidia RTX 3090, the following performance is expected on text token generation using float16, equivalent PyTorch settings, and greedy decoding.
Model | Size | Speed | Speedup |
---|---|---|---|
llava-hf/llava-1.5-7b-hf | 7B | ~ 40 tokens/second | |
Salesforce/instructblip-vicuna-7b | 7B | ~ 40 tokens/second | |
unum-cloud/uform-gen | 1.5B | ~ 140 tokens/second | x 3.5 |
Given the small size of the model, it also works well on mobile devices. On Apple M2 Arm chips the energy efficiency of inference can exceed that of the RTX 3090 GPU and other Ampere-generation cards.
Device | Speed | Device TDP | Efficiency |
---|---|---|---|
Nvidia RTX 3090 | ~ 140 tokens/second | < 350W | 0.40 tokens/joule |
Apple M2 Pro unplugged | ~ 19 tokens/second | < 20W | 0.95 tokens/joule |
Apple M2 Max unplugged | ~ 38 tokens/second | < 36W | 1.06 tokens/joule |
Apple M2 Max plugged | ~ 56 tokens/second | < 89W | 0.63 tokens/joule |
Warning: The above numbers are for reference only and are not guaranteed to be accurate.
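The efficiency column follows from dividing throughput by power draw: one watt is one joule per second, so tokens/second divided by watts gives tokens/joule. The short sketch below reproduces the column from the speed and TDP figures in the table, treating each TDP upper bound as the actual power draw, which is an approximation.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: tokens/second divided by joules/second (watts)."""
    return tokens_per_second / watts

# (device, approximate tokens/second, TDP upper bound in watts), taken from the table above.
devices = [
    ("Nvidia RTX 3090", 140, 350),
    ("Apple M2 Pro unplugged", 19, 20),
    ("Apple M2 Max unplugged", 38, 36),
    ("Apple M2 Max plugged", 56, 89),
]

for name, speed, tdp in devices:
    print(f"{name}: {tokens_per_joule(speed, tdp):.2f} tokens/joule")
```

Since TDP is an upper bound on power draw, the resulting values are best read as conservative estimates of efficiency.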
License
All models come under the same license as the code - Apache 2.0.