
The code used to train and run inference with the ColPali architecture.

Project description

ColPali: Efficient Document Retrieval with Vision Language Models 👀



[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]

Associated Paper

This repository contains the code used for training the vision retrievers in the ColPali: Efficient Document Retrieval with Vision Language Models paper. In particular, it contains the code for training the ColPali model, which is a vision retriever based on the ColBERT architecture and the PaliGemma model.

Introduction

With our new model ColPali, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.

Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines: a single model can take into account both the textual and visual content (layout, charts, ...) of a document.
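
For intuition, the ColBERT-style late-interaction (MaxSim) scoring described above can be sketched in a few lines. This is a minimal illustration, assuming L2-normalized multi-vector embeddings; in practice, the batched scorer shipped with the package (processor.score_multi_vector, shown in the quick start below) should be used instead.

import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (n_query_tokens, dim) and doc_emb: (n_doc_patches, dim), both L2-normalized.
    # For each query token, keep the similarity of its best-matching document patch (MaxSim),
    # then sum over query tokens to obtain the document's relevance score.
    token_patch_sim = query_emb @ doc_emb.T  # (n_query_tokens, n_doc_patches)
    return token_patch_sim.max(dim=1).values.sum()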

ColPali Architecture

List of ColVision models

| Model | Score on ViDoRe 🏆 | License | Comments | Currently supported |
| --- | --- | --- | --- | --- |
| vidore/colpali | 81.3 | Gemma | Based on google/paligemma-3b-mix-448. Checkpoint used in the ColPali paper. | |
| vidore/colpali-v1.1 | 81.5 | Gemma | Based on google/paligemma-3b-mix-448. | |
| vidore/colpali-v1.2 | 83.1 | Gemma | Based on google/paligemma-3b-mix-448. | |
| vidore/colqwen2-v0.1 | 86.6 | Apache 2.0 | Based on Qwen/Qwen2-VL-2B-Instruct. Supports dynamic resolution. Trained using 768 image patches per page. | |

Setup

We used Python 3.11.6 and PyTorch 2.2.2 to train and test our models, but the codebase is compatible with Python >=3.9 and recent PyTorch versions. To install the package, run:

pip install colpali-engine

[!WARNING] For ColPali versions above v1.0, make sure to install the colpali-engine package from source or with a version above v0.2.0.
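
For example, assuming the upstream repository is hosted at illuin-tech/colpali on GitHub, installing from source looks like:

pip install git+https://github.com/illuin-tech/colpali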

Usage

Quick start

import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "Are Benjamin, Antoine, Merve, and Jo best friends?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
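
The returned scores form a tensor with one row per query and one column per image, where higher values indicate higher relevance. As a small follow-up, assuming the tensor produced above, the best-matching image for each query can be retrieved with:

# scores has shape (n_queries, n_images); higher means more relevant
best_image_per_query = scores.argmax(dim=1)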

Inference

You can find an example here.

Benchmarking

To benchmark ColPali and reproduce the results on the ViDoRe leaderboard, it is recommended to use the vidore-benchmark package.
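
The package is published on PyPI and, assuming a standard Python environment, can be installed with:

pip install vidore-benchmark

Refer to the vidore-benchmark documentation for the evaluation commands and supported datasets.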

Interpretability with similarity maps

By superimposing the late interaction similarity maps on top of the original image, we can visualize the most salient image patches with respect to each term of the query, yielding interpretable insights into model focus zones.

To use the interpretability module, you need to install the colpali-engine[interpretability] package:

pip install "colpali-engine[interpretability]"

Then, after generating your embeddings with ColPali, use the following code to plot the similarity maps for each query token:

import torch
from PIL import Image

from colpali_engine.interpretability import (
    get_similarity_maps_from_embeddings,
    plot_all_similarity_maps,
)
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.utils.torch_utils import get_torch_device

model_name = "vidore/colpali-v1.2"
device = get_torch_device("auto")

# Load the model
model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

# Load the processor
processor = ColPaliProcessor.from_pretrained(model_name)

# Load the image and query
image = Image.open("shift_kazakhstan.jpg")
query = "Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?"

# Preprocess inputs
batch_images = processor.process_images([image]).to(device)
batch_queries = processor.process_queries([query]).to(device)

# Forward passes
with torch.no_grad():
    image_embeddings = model.forward(**batch_images)
    query_embeddings = model.forward(**batch_queries)

# Get the number of image patches
n_patches = processor.get_n_patches(image_size=image.size, patch_size=model.patch_size)

# Get the tensor mask to filter out the embeddings that are not related to the image
image_mask = processor.get_image_mask(batch_images)

# Generate the similarity maps
batched_similarity_maps = get_similarity_maps_from_embeddings(
    image_embeddings=image_embeddings,
    query_embeddings=query_embeddings,
    n_patches=n_patches,
    image_mask=image_mask,
)

# Get the similarity map for our (only) input image
similarity_maps = batched_similarity_maps[0]  # (query_length, n_patches_x, n_patches_y)

# Tokenize the query
query_tokens = processor.tokenizer.tokenize(query)

# Plot and save the similarity maps for each query token
plots = plot_all_similarity_maps(
    image=image,
    query_tokens=query_tokens,
    similarity_maps=similarity_maps,
)
for idx, (fig, ax) in enumerate(plots):
    fig.savefig(f"similarity_map_{idx}.png")

For a more detailed example, you can refer to the interpretability notebooks from the ColPali Cookbooks 👨🏻‍🍳 repository.

Training

To keep the repository lightweight, only the essential packages are installed by default. To use the training script for ColPali, install the additional training dependencies with the following command:

pip install "colpali-engine[train]"

All the model configs used can be found in scripts/configs/ and rely on the configue package for straightforward configuration. They should be used with the train_colbert.py script.

Example 1: Local training

USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml

or using accelerate:

accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml

Example 2: Training on a SLURM cluster

sbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1  -p gpua100 --job-name=colidefics --output=colidefics.out --error=colidefics.err --wrap="accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"

sbatch --nodes=1  --time=5:00:00 -A cad15443 --gres=gpu:8  --constraint=MI250 --job-name=colpali --wrap="python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"

Community Projects

Several community projects and resources have been developed around ColPali to facilitate its usage. Feel free to reach out if you want to add your project to this list!

Libraries 📚

| Library Name | Description |
| --- | --- |
| Byaldi | Byaldi is RAGatouille's equivalent for ColPali, leveraging the colpali-engine package to facilitate indexing and storing embeddings. |
| PyVespa | PyVespa allows interaction with Vespa, a production-grade vector database, with detailed ColPali support. |
| Candle | Candle enables ColPali inference with an efficient ML framework for Rust. |
| DocAI | DocAI uses ColPali with GPT-4o and Langchain to extract structured information from documents. |
| VARAG | VARAG uses ColPali in a vision-only and a hybrid RAG pipeline. |
| ColBERT Live! | ColBERT Live! enables ColPali usage with vector databases supporting large datasets, compression, and non-vector predicates. |

Notebooks 📙

| Notebook Title | Author & Link |
| --- | --- |
| ColPali Cookbooks | Tony's Cookbooks (ILLUIN) 🙋🏻 |
| Vision RAG Tutorial | Manu's Vision RAG Tutorial (ILLUIN) 🙋🏻 |
| ColPali + Qwen2-VL for RAG | Merve's Notebook (HuggingFace 🤗) |
| Weaviate Tutorial | Connor's ColPali POC (Weaviate) |
| Data Generation | Daniel's Notebook (HuggingFace 🤗) |
| Indexing ColPali with Qdrant | Daniel's Notebook (HuggingFace 🤗) |
| Finance Report Analysis with ColPali and Gemini | Jaykumaran (LearnOpenCV) |

Other resources

  • 📝 = blog post
  • 📋 = PDF / slides
  • 📹 = video
| Title | Author & Link |
| --- | --- |
| LlamaIndex Webinar: ColPali - Efficient Document Retrieval with Vision Language Models | LlamaIndex's YouTube video 📹 |
| PDF Retrieval with Vision Language Models | Jo's blog post #1 (Vespa) 📝 |
| Scaling ColPali to billions of PDFs with Vespa | Jo's blog post #2 (Vespa) 📝 |
| Multimodal Document RAG with Llama 3.2 Vision and ColQwen2 | Zain's blog post (Together AI) 📝 |
| ColPali: Document Retrieval with Vision Language Models | Antaripa's Notion blog post 📝 |
| Minimalist diagrams explaining ColPali | Leonie's ColPali diagrams on X 📝 |
| Multimodal RAG with ColPali and Gemini: Financial Report Analysis Application | Jaykumaran's blog post (LearnOpenCV) 📝 |
| Implement Multimodal RAG with ColPali and Vision Language Model Groq (Llava) and Qwen2-VL | Plaban's blog post 📝 |
| State of AI report 2024 | Nathan's report 📋 |
| multimodal AI. open-source. in a nutshell. | Merve's YouTube video 📹 |
| Technology Radar Volume 31 (October 2024) | Thoughtworks's report 📋 |
| Remove Complexity from Your RAG Applications | Kyryl's blog post (KOML) 📝 |
| Late interaction & efficient Multi-modal retrievers need more than a vector index | Ayush Chaurasia (LanceDB) 📝 |

Paper result reproduction

To reproduce the results from the paper, you should check out the v0.1.1 tag or install the corresponding colpali-engine package release using:

pip install colpali-engine==0.1.1

Citation

ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)

@misc{faysse2024colpaliefficientdocumentretrieval,
      title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
      author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
      year={2024},
      eprint={2407.01449},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.01449}, 
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

colpali_engine-0.3.3.tar.gz (128.2 kB)

Uploaded Source

Built Distribution

colpali_engine-0.3.3-py3-none-any.whl (43.6 kB)

Uploaded Python 3

File details

Details for the file colpali_engine-0.3.3.tar.gz.

File metadata

  • Download URL: colpali_engine-0.3.3.tar.gz
  • Upload date:
  • Size: 128.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for colpali_engine-0.3.3.tar.gz
Algorithm Hash digest
SHA256 43e49ba43dcf2640727e67c8451eec03b60b15674ee574e8b397201ca83022f3
MD5 68270817490cf0708e42929663a93fab
BLAKE2b-256 997c9b95c67644b81e8e039a7a73f5a09b78e95d99136458e2d0d93db67a5cb6


File details

Details for the file colpali_engine-0.3.3-py3-none-any.whl.

File metadata

File hashes

Hashes for colpali_engine-0.3.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9d8b16de6da7e3482dda1c4b3f525323adc3623f68662a445db2441e90ad0fac
MD5 fcece05d39e0dea66bed5b20f8c70393
BLAKE2b-256 04cc993eb1eecbceb494335c0e9612ca6e01fe7dcf5f6c89e3a503b1dc11cd9a

