VQAScore: Evaluating Text-to-Visual Generation with Image-to-Text Generation [Project Page]

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan. In arXiv, 2024.

VQAScore allows researchers to automatically evaluate text-to-image/video/3D models using one line of Python code!

VQAScore significantly outperforms previous metrics such as CLIPScore and PickScore on compositional text prompts, and it is much simpler than prior art (e.g., ImageReward, HPSv2, TIFA, Davidsonian, VPEval, VIEScore) that relies on human feedback or proprietary models like ChatGPT and GPT-4Vision.

Quick start

Install the package via:

git clone https://github.com/linzhiqiu/t2v_metrics
cd t2v_metrics

conda create -n t2v python=3.10 -y
conda activate t2v
conda install pip -y

pip install torch torchvision torchaudio
pip install git+https://github.com/openai/CLIP.git
pip install -e . # local pip install

Or you can install via pip install t2v-metrics.

Now, the following Python code is all you need to compute the VQAScore for image-text alignment (higher scores indicate greater similarity):

import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl') # our recommended scoring model

### For a single (image, text) pair
image = "images/0.png" # an image path in string format
text = "someone talks on the phone angrily while another person sits happily"
score = clip_flant5_score(images=[image], texts=[text])

### Alternatively, if you want to calculate the pairwise similarity scores 
### between M images and N texts, run the following to return a M x N score tensor.
images = ["images/0.png", "images/1.png"]
texts = ["someone talks on the phone angrily while another person sits happily",
         "someone talks on the phone happily while another person sits angrily"]
scores = clip_flant5_score(images=images, texts=texts) # scores[i][j] is the score between image i and text j
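
The call above returns an M x N score matrix. Assuming it behaves like a standard torch tensor (as the shape description above suggests), you can inspect it and, for example, pick the best-matching image for each text:

print(scores)  # full M x N score matrix
best_image_per_text = scores.argmax(dim=0)  # index of the highest-scoring image for each text
print(best_image_per_text)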

Notes on GPU and cache

  • GPU usage: By default, this code uses the first CUDA device on your machine. We recommend 40GB GPUs for the largest VQAScore models such as clip-flant5-xxl and llava-v1.5-13b. If you have limited GPU memory, consider smaller models such as clip-flant5-xl and llava-v1.5-7b. To run on a different GPU, see the sketch after this list.
  • Cache directory: You can change the cache folder which saves all model checkpoints (default is ./hf_cache/) by updating HF_CACHE_DIR in t2v_metrics/constants.py.
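
If you need to pin the metric to a specific GPU, one simple option is to restrict the visible CUDA devices before importing the package. This is a minimal sketch using the standard CUDA_VISIBLE_DEVICES environment variable (a general PyTorch/CUDA mechanism, not a t2v_metrics-specific setting):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only GPU 1; it then appears as the first CUDA device
# Set this before importing t2v_metrics so torch picks it up when initializing CUDA.

import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')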

Advanced Usage

Batch processing for more image-text pairs

With a large batch of M images x N texts, you can speed up evaluation using the batch_forward() function.

import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
  {'images': ["images/0/DALLE3.png", "images/0/Midjourney.jpg", "images/0/SDXL.jpg", "images/0/DeepFloyd.jpg"], 'texts': ["The brown dog chases the black dog around the tree."]},
  {'images': ["images/1/DALLE3.png", "images/1/Midjourney.jpg", "images/1/SDXL.jpg", "images/1/DeepFloyd.jpg"], 'texts': ["Two cats sit at the window, the blue one intently watching the rain, the red one curled up asleep."]},
  #...
]
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16) # (n_sample, 4, 1) tensor
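
Because the returned tensor for this dataset has shape (n_sample, 4, 1), you can, for instance, select the best of the four generated images for each prompt. A small sketch, assuming the output is a torch tensor as described above:

best = scores.squeeze(-1).argmax(dim=1)  # (n_sample,) index of the best image per text
for sample, idx in zip(dataset, best.tolist()):
    print(sample['texts'][0], '->', sample['images'][idx])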

Check all supported models

We currently support running VQAScore with CLIP-FlanT5, LLaVA-1.5, and InstructBLIP. For ablation, we also include CLIPScore, BLIPv2Score, PickScore, HPSv2Score, and ImageReward:

llava_score = t2v_metrics.VQAScore(model='llava-v1.5-13b')
instructblip_score = t2v_metrics.VQAScore(model='instructblip-flant5-xxl')
clip_score = t2v_metrics.CLIPScore(model='openai:ViT-L-14-336')
blip_itm_score = t2v_metrics.ITMScore(model='blip2-itm') 
pick_score = t2v_metrics.CLIPScore(model='pickscore-v1')
hpsv2_score = t2v_metrics.CLIPScore(model='hpsv2') 
image_reward_score = t2v_metrics.ITMScore(model='image-reward-v1') 
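
These metric objects follow the same calling convention shown in the Quick start, so they can be compared on the same image-text pair. A minimal sketch, assuming the CLIPScore and ITMScore objects accept the same images=/texts= keyword arguments as VQAScore (reusing the example image from the Quick start):

image = "images/0.png"
text = "someone talks on the phone angrily while another person sits happily"
for name, metric in [("VQAScore (LLaVA-1.5)", llava_score),
                     ("CLIPScore", clip_score),
                     ("ITMScore (BLIP-2)", blip_itm_score)]:
    score = metric(images=[image], texts=[text])  # 1 x 1 score matrix
    print(f"{name}: {float(score[0][0]):.4f}")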

You can list all supported models by running the commands below:

print("VQAScore models:")
t2v_metrics.list_all_vqascore_models()

print("ITMScore models:")
t2v_metrics.list_all_itmscore_models()

print("CLIPScore models:")
t2v_metrics.list_all_clipscore_models()

Customizing the question and answer template (for VQAScore)

The question and answer slightly affect the final score, as shown in the Appendix of our paper. We provide a simple default template for each model and do not recommend changing it for the sake of reproducibility. However, we do want to point out that the question and answer can be easily modified. For example, CLIP-FlanT5 and LLaVA-1.5 use the following template, which can be found at t2v_metrics/models/vqascore_models/clip_t5_model.py:

# {} will be replaced by the caption
default_question_template = 'Does this figure show "{}"? Please answer yes or no.'
default_answer_template = 'Yes'

You can customize the template by passing the question_template and answer_template parameters into the forward() or batch_forward() functions:

# Use a different question for VQAScore
scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template='Is this figure showing "{}"? Please answer yes or no.',
                           answer_template='Yes')

You may also compute P(caption | image) (VisualGPTScore) instead of P(answer | image, question):

scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template="", # no question
                           answer_template="{}") # this computes P(caption | image)

Reproducing paper results

Our eval.py allows you to easily run 10 image/video/3D alignment benchmarks (e.g., Winoground/TIFA160/SeeTrue/StanfordT23D/T2VScore):

python eval.py --model clip-flant5-xxl # for VQAScore
python eval.py --model openai:ViT-L-14 # for CLIPScore

# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question "Is the figure showing '{}'?" --answer "Yes"

Implementing your own scoring metric

You can easily implement your own scoring metric. For example, if you have a VQA model that you believe is more effective, you can incorporate it into the directory at t2v_metrics/models/vqascore_models. For guidance, please refer to our example implementations of LLaVA-1.5 and InstructBLIP as starting points.
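
As a rough illustration of the shape such a metric might take, here is a hedged skeleton. The class and method names below are assumptions for illustration only; consult the LLaVA-1.5 and InstructBLIP implementations in t2v_metrics/models/vqascore_models for the actual base class and interface:

# Hypothetical skeleton of a custom VQA-based scoring model (illustrative, not the package's real API).
import torch

class MyVQAScoreModel:
    def __init__(self, device='cuda'):
        self.device = device
        # Load your own VQA model and processor here, e.g. from HuggingFace Transformers.

    def forward(self, images, texts):
        # Return a tensor of P("Yes" | image, question) scores, one per (image, text) pair.
        scores = []
        for image_path, text in zip(images, texts):
            question = f'Does this figure show "{text}"? Please answer yes or no.'
            prob_yes = 0.0  # placeholder: replace with your model's probability of answering "Yes"
            scores.append(prob_yes)
        return torch.tensor(scores)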

Citation

If you find this repository useful for your research, please cite it using the following BibTeX entry.

@article{lin2024evaluating,
  title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
  author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
  journal={arXiv preprint arXiv:2404.01291},
  year={2024}
}

Acknowledgements

This repository is inspired by the Perceptual Metric (LPIPS) repository by Richard Zhang for automatic evaluation of image quality.
