# ColPali: Efficient Document Retrieval with Vision Language Models 👀

The code used to train and run inference with the ColPali architecture.
[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]
> [!TIP]
> For production usage in your RAG pipelines, we recommend using the `byaldi` package, which is a lightweight wrapper around the `colpali-engine` package, developed by the author of the popular RAGatouille repository. 🐭
## Associated Paper
This repository contains the code used for training the vision retrievers in the ColPali: Efficient Document Retrieval with Vision Language Models paper. In particular, it contains the code for training the ColPali model, which is a vision retriever based on the ColBERT architecture and the PaliGemma model.
## Introduction
With our new model ColPali, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document.
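For intuition, here is a minimal, self-contained sketch of the ColBERT-style late-interaction (MaxSim) scoring described above. The dimensions and the projection layer below are illustrative assumptions, not ColPali's exact configuration; in practice, `colpali-engine` handles the projection and scoring for you (see the quick start below).

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score between one query and one document.

    query_emb: (n_query_tokens, dim) multi-vector query embedding.
    doc_emb:   (n_doc_patches, dim) multi-vector document embedding.
    """
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_patches)
    return sim.max(dim=1).values.sum()   # best-matching patch per query token, summed

# Illustrative shapes only: a linear head projects VLM outputs to a small embedding dim.
vlm_dim, embedding_dim = 2048, 128       # assumed values, not necessarily ColPali's config
projection = torch.nn.Linear(vlm_dim, embedding_dim)

doc_patches = torch.randn(1024, vlm_dim)   # ViT output patches for one page
query_tokens = torch.randn(12, vlm_dim)    # LM output embeddings for one query

score = maxsim_score(projection(query_tokens), projection(doc_patches))
```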
## List of ColVision models

| Model | Score on ViDoRe 🏆 | License | Comments | Currently supported |
|---|---|---|---|---|
| vidore/colpali | 81.3 | Gemma | • Based on `google/paligemma-3b-mix-448`.<br>• Checkpoint used in the ColPali paper. | ❌ |
| vidore/colpali-v1.1 | 81.5 | Gemma | • Based on `google/paligemma-3b-mix-448`. | ✅ |
| vidore/colpali-v1.2 | 83.1 | Gemma | • Based on `google/paligemma-3b-mix-448`. | ✅ |
| vidore/colqwen2-v0.1 | 86.6 | Apache 2.0 | • Based on `Qwen/Qwen2-VL-2B-Instruct`.<br>• Supports dynamic resolution.<br>• Trained using 768 image patches per page. | ✅ |
## Setup
We used Python 3.11.6 and PyTorch 2.2.2 to train and test our models, but the codebase is compatible with Python >=3.9 and recent PyTorch versions. To install the package, run:
```bash
pip install colpali-engine
```
> [!WARNING]
> For ColPali versions above v1.0, make sure to install the `colpali-engine` package from source or with a version above v0.2.0.
## Usage

### Quick start
```python
import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "Are Benjamin, Antoine, Merve, and Jo best friends?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
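`score_multi_vector` returns a 2D tensor of shape `(n_queries, n_images)`, where higher scores mean higher relevance. As a small follow-up sketch (the variable names below are ours, not part of the API), you can rank the images for each query directly from this tensor:

```python
# scores: (n_queries, n_images) tensor of late-interaction scores.
best_image_per_query = scores.argmax(dim=1)        # top image index per query
rankings = scores.argsort(dim=1, descending=True)  # full per-query ranking
```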
### Inference

You can find an example here. If you need an indexing system, we recommend using `byaldi` - RAGatouille's little sister 🐭 - which shares a similar API and leverages our `colpali-engine` package.
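For reference, indexing and searching with byaldi looks roughly like the sketch below. The path and index name are placeholders, and the method signatures reflect byaldi's documented API at the time of writing; consult the byaldi README for the authoritative version.

```python
from byaldi import RAGMultiModalModel

# Load a ColPali checkpoint through byaldi's wrapper.
model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

# Build an index over a local folder of documents ("docs/" is a placeholder path).
model.index(input_path="docs/", index_name="my_index", overwrite=True)

# Retrieve the top-3 most relevant pages for a query.
results = model.search("How does ColPali embed document pages?", k=3)
```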
### Benchmarking

To benchmark ColPali and reproduce the results on the ViDoRe leaderboard, we recommend using the `vidore-benchmark` package.
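As a sketch only (the CLI flags below are taken from the vidore-benchmark documentation at the time of writing and may have changed; treat them as assumptions and check that package's README), a typical evaluation run looks like:

```bash
pip install vidore-benchmark

# "<collection-name>" is a placeholder for the Hugging Face collection holding the ViDoRe datasets.
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.2 \
    --collection-name "<collection-name>" \
    --split test
```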
### Interpretability with similarity maps
By superimposing the late interaction similarity maps on top of the original image, we can visualize the most salient image patches with respect to each term of the query, yielding interpretable insights into model focus zones.
To use the `interpretability` module, you need to install the `colpali-engine[interpretability]` optional dependencies:

```bash
pip install "colpali-engine[interpretability]"
```
Then, after generating your embeddings with ColPali, use the following code to plot the similarity maps for each query token:
```python
import torch
from PIL import Image

from colpali_engine.interpretability import (
    get_similarity_maps_from_embeddings,
    plot_all_similarity_maps,
)
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.utils.torch_utils import get_torch_device

model_name = "vidore/colpali-v1.2"
device = get_torch_device("auto")

# Load the model
model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

# Load the processor
processor = ColPaliProcessor.from_pretrained(model_name)

# Load the image and query
image = Image.open("shift_kazakhstan.jpg")
# "What share of Kazakhstan's oil production comes from offshore fields?"
query = "Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?"

# Preprocess inputs
batch_images = processor.process_images([image]).to(device)
batch_queries = processor.process_queries([query]).to(device)

# Forward passes
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Get the number of image patches
n_patches = processor.get_n_patches(image_size=image.size, patch_size=model.patch_size)

# Get the tensor mask to filter out the embeddings that are not related to the image
image_mask = processor.get_image_mask(batch_images)

# Generate the similarity maps
batched_similarity_maps = get_similarity_maps_from_embeddings(
    image_embeddings=image_embeddings,
    query_embeddings=query_embeddings,
    n_patches=n_patches,
    image_mask=image_mask,
)

# Get the similarity map for our (only) input image
similarity_maps = batched_similarity_maps[0]  # (query_length, n_patches_x, n_patches_y)

# Tokenize the query
query_tokens = processor.tokenizer.tokenize(query)

# Plot and save the similarity maps for each query token
plots = plot_all_similarity_maps(
    image=image,
    query_tokens=query_tokens,
    similarity_maps=similarity_maps,
)
for idx, (fig, ax) in enumerate(plots):
    fig.savefig(f"similarity_map_{idx}.png")
```
For a more detailed example, you can refer to the interpretability notebooks from the ColPali Cookbooks 👨🏻‍🍳 repository.
## Training

To keep the repository lightweight, only the essential packages are installed by default. To use the training script for ColPali, you must install the additional training dependencies:

```bash
pip install "colpali-engine[train]"
```
All the model configs used can be found in `scripts/configs/` and rely on the configue package for straightforward configuration. They should be used with the `train_colbert.py` script.
### Example 1: Local training

```bash
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml
```

or using `accelerate`:

```bash
accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml
```
### Example 2: Training on a SLURM cluster

```bash
sbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1 -p gpua100 \
    --job-name=colidefics --output=colidefics.out --error=colidefics.err \
    --wrap="accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"

sbatch --nodes=1 --time=5:00:00 -A cad15443 --gres=gpu:8 --constraint=MI250 --job-name=colpali \
    --wrap="python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"
```
### Paper result reproduction

To reproduce the results from the paper, you should check out the `v0.1.1` tag or install the corresponding `colpali-engine` package release using:

```bash
pip install colpali-engine==0.1.1
```
## Citation
ColPali: Efficient Document Retrieval with Vision Language Models
Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```