
clip-text-decoder

Generate text captions for images from their CLIP embeddings.

Train an image captioner with 0.323 BLEU on COCO Captions in under one hour! (0.352 BLEU with beam search 🙂)

Generates text captions for images from their embeddings. Now includes BLIP as an available vision backbone!

Example Predictions

Computed using the pretrained model mentioned below.

  • "A man riding a wave on top of a surfboard."
  • "A baseball player is swinging a bat at a ball."
  • "A dog jumping in the air to catch a frisbee."

Installation

Using pip:

pip install "clip @ git+https://github.com/openai/CLIP.git"
pip install "lavis @ git+https://github.com/salesforce/LAVIS.git"
pip install clip-text-decoder

From source:

pip install "clip @ git+https://github.com/openai/CLIP.git"
pip install "lavis @ git+https://github.com/salesforce/LAVIS.git"
git clone https://github.com/fkodom/clip-text-decoder.git
cd clip-text-decoder
pip install .

Inference

Pretrained Model

from PIL import Image
import torch

from clip_text_decoder.model import ImageCaptionInferenceModel

model = ImageCaptionInferenceModel.download_pretrained()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("path/to/image.jpeg")
# The beam_size argument is optional. Larger beam_size is slower, but has
# slightly higher accuracy. Recommend using beam_size <= 3.
caption = model(image, beam_size=1)

To cache the pretrained model locally so that it isn't re-downloaded on each run:

model = ImageCaptionInferenceModel.download_pretrained("path/to/model.pt")
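
The same call can be wrapped in a loop to caption a whole folder of images. A minimal sketch, assuming a directory of JPEG files (the directory path and file pattern are placeholders):

from pathlib import Path

import torch
from PIL import Image

from clip_text_decoder.model import ImageCaptionInferenceModel

device = "cuda" if torch.cuda.is_available() else "cpu"
# Pass a path so the weights are cached locally instead of re-downloaded.
model = ImageCaptionInferenceModel.download_pretrained("path/to/model.pt").to(device)

# Placeholder directory -- point this at your own images.
for path in sorted(Path("path/to/images").glob("*.jpeg")):
    image = Image.open(path).convert("RGB")
    caption = model(image, beam_size=1)
    print(f"{path.name}: {caption}")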

Custom Trained Model

Training produces a model.pt archive, containing a Tokenizer and model parameters. To reload the trained inference model:

import torch

from clip_text_decoder.model import ImageCaptionInferenceModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ImageCaptionInferenceModel.load("path/to/model.pt").to(device)
# Load an image and get predictions just like the pretrained example above...

Ablation: Beam Size

This ablation measures the BLEU-4 score for different beam_size arguments. By default, the inference model uses a beam size of 1:

from clip_text_decoder.model import ImageCaptionInferenceModel

model = ImageCaptionInferenceModel.load("path/to/model.pt")
caption = model(image, beam_size=1)

Using a larger beam_size gives a better BLEU score at the cost of slower inference. The metrics below were collected from the same model, which uses a BLIP vision backbone and was trained for 10 epochs (roughly 1 hour on a T4 GPU):

Beam size     BLEU-4
1 (default)   0.323
2             0.343
3             0.350
4             0.352
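
To run this kind of comparison on your own data, you can loop over beam sizes and time each call. A minimal sketch, assuming a trained checkpoint and a sample image (both paths are placeholders); it only compares captions and per-image latency, since computing BLEU requires the COCO reference captions:

import time

from PIL import Image

from clip_text_decoder.model import ImageCaptionInferenceModel

model = ImageCaptionInferenceModel.load("path/to/model.pt")
image = Image.open("path/to/image.jpeg")

for beam_size in (1, 2, 3, 4):
    start = time.perf_counter()
    caption = model(image, beam_size=beam_size)
    elapsed = time.perf_counter() - start
    print(f"beam_size={beam_size} ({elapsed:.2f}s): {caption}")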

Training

Launch your own training session using train.py:

python train.py --max-epochs 10

Training CLI arguments, along with their default values:

--vision-backbone blip:base  # (str)
--language-model distilgpt2  # (str)
--max-epochs 10  # (int)
--beam-size 1  # (int)
--batch-size 32  # (int)
--accumulate-grad-batches 4  # (int)
--precision 16  # (16 or 32)
--seed 0  # (int)
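
For reference, spelling out all of the defaults above gives a command equivalent to the shorter one shown earlier:

python train.py \
    --vision-backbone blip:base \
    --language-model distilgpt2 \
    --max-epochs 10 \
    --beam-size 1 \
    --batch-size 32 \
    --accumulate-grad-batches 4 \
    --precision 16 \
    --seed 0

With --batch-size 32 and --accumulate-grad-batches 4, gradients are accumulated over 4 batches before each optimizer step, for an effective batch size of 128.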

One epoch takes about 5-6 minutes using a T4 GPU, which is usually free in Google Colab (depending on availability). After about 10 training epochs, you'll reach a BLEU-4 score just over 0.30 (without beam search). So, in under an hour, you can train a pretty good image captioning model. 😎

Notes

BLEU doesn't increase much beyond 1 hour of training. Training and validation loss will continue to decrease, but the resulting image captions are effectively equivalent.

This appears to be a limitation of the image embeddings, rather than a limitation of the language model. Changing the vision backbone gives the biggest improvement in BLEU score. (BLIP gets 5-10% better BLEU than CLIP backbones using the same language model head.) Larger language models (e.g. GPT-2 Large) don't improve the BLEU score by much.

TODO

  • Train on Conceptual Captions for more general-purpose image captioning.
