The first reverse image RAG API for image captioning and visual question answering with GPT-4V.
Reverse Image RAG (RIR)
Synopsis:
We build an API to retrieval-augment vision-language models with visual context retrieved from the web.
Concretely, given a query image and a query text (e.g. a question), we use reverse image search to find the most similar images and their titles / captions.
The final product is a VLM API that automatically applies this reverse-image-search-based retrieval augmentation.
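The prompt-formatting step described above can be pictured with a small sketch. This is not the package's actual implementation: the function name `build_visual_context_prompt`, the `max_captions` parameter, the prompt wording, and the example captions are all hypothetical and only illustrate how retrieved titles / captions might be combined with the query text.

```python
from typing import List


def build_visual_context_prompt(
    query_text: str, captions: List[str], max_captions: int = 5
) -> str:
    """Combine a question with captions of visually similar web images.

    Hypothetical helper for illustration only; the prompt format used
    inside rir_api may differ.
    """
    # Drop empty captions, then keep only the first few results.
    cleaned = [c.strip() for c in captions if c and c.strip()][:max_captions]
    if not cleaned:
        # No retrieved context: fall back to the plain question.
        return query_text
    context = "\n".join(f"- {c}" for c in cleaned)
    return (
        "Titles/captions of visually similar images found on the web:\n"
        f"{context}\n\n"
        f"Question: {query_text}"
    )


if __name__ == "__main__":
    # Made-up captions standing in for reverse image search results.
    print(
        build_visual_context_prompt(
            "What is in this image?",
            ["Golden Gate Bridge at sunset", "San Francisco landmark"],
        )
    )
```

Keeping the retrieved captions in a clearly labeled block, separate from the question, lets the VLM treat them as supporting context rather than as part of the query.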
Usage:
pip install rir_api
import rir_api
# openai_api_key: your OpenAI API key, as a string
api = rir_api.RIR_API(openai_api_key)
image_url = "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSgN8RDkURVE8mgOf-n02TqJdC2l1o5cVFA32NpZtuVp8MaFfZY"
query_text = "What is in this image?"
response = api.query_with_image(image_url, query_text)
# >> runs reverse image search
# >> formats visual context prompt
# >> queries VLM with full query
(see run.py for minimal example)
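To run several queries in a row, a thin wrapper around `query_with_image` may be convenient. The sketch below relies only on the `api.query_with_image(image_url, query_text)` call shown above. The helper name `query_many` and the shape of the returned dictionaries are assumptions made for illustration and are not part of rir_api.

```python
from typing import Any, Dict, Iterable, List, Tuple


def query_many(api: Any, items: Iterable[Tuple[str, str]]) -> List[Dict[str, Any]]:
    """Call api.query_with_image for each (image_url, query_text) pair.

    Hypothetical convenience wrapper, not part of rir_api. Failures are
    recorded per item so that one bad query does not abort the batch.
    """
    results: List[Dict[str, Any]] = []
    for image_url, query_text in items:
        try:
            response = api.query_with_image(image_url, query_text)
            results.append(
                {"image_url": image_url, "response": response, "error": None}
            )
        except Exception as exc:  # web search or the VLM call may fail
            results.append(
                {"image_url": image_url, "response": None, "error": str(exc)}
            )
    return results
```

Usage would look like `results = query_many(api, [(image_url, query_text)])`, with `api` created as in the example above.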
Debug mode:
For debugging, you can make API calls that display the browser GUI (headless=False) and plot the image search result (show_result=True):
response = api.query_with_image(image_url, query_text, show_result=True, delay=3, headless=False)
Next steps
- modularized API interface
- information extraction from search results
Feel free to ping me at mdmoor[at]cs.stanford.edu if you're interested in contributing.
Reference:
@misc{Moor2024,
author = {Michael Moor},
title = {Reverse Image RAG~(RIR)},
year = {2024},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/mi92/reverse-image-rag}},
}
More teaser examples: