
Vision Document Retrieval (ViDoRe): Benchmark 👀

Evaluation code for the ColPali paper.



[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]

Approach

The Visual Document Retrieval Benchmark (ViDoRe) is introduced to evaluate the performance of document retrieval systems on visually rich documents across various tasks, domains, languages, and settings. It was used to evaluate ColPali, a VLM-powered retriever that efficiently retrieves documents based on their visual content and textual queries using a late-interaction mechanism.
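
For intuition, late-interaction (MaxSim) scoring matches every query token embedding against its most similar document patch embedding and sums the similarities. Below is a minimal PyTorch sketch of that scoring rule; it is illustrative only, not the package's internal implementation:

import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    query_emb: (n_query_tokens, dim) L2-normalized query token embeddings.
    doc_emb:   (n_doc_patches, dim)  L2-normalized document patch embeddings.
    """
    # Cosine similarity between every query token and every document patch.
    sim = query_emb @ doc_emb.T  # (n_query_tokens, n_doc_patches)
    # For each query token, keep its best-matching patch, then sum over tokens.
    return sim.max(dim=1).values.sum()

# Toy example with random embeddings.
q = F.normalize(torch.randn(16, 128), dim=-1)
d = F.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(q, d))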

ViDoRe Examples

[!TIP] If you want to fine-tune ColPali for your specific use case, you should check the colpali repository. It contains the whole codebase used to train the model presented in our paper.

Setup

We used Python 3.11.6 and PyTorch 2.2.2 to train and test our models, but the codebase is expected to be compatible with Python >=3.9 and recent PyTorch versions.

The eval codebase depends on a few Python packages, which can be installed with the following command:

pip install vidore-benchmark

To keep the package lightweight, only the essential dependencies are installed by default. In particular, you must install the extra dependencies for the specific non-Transformers models you want to run (see the list in pyproject.toml). For instance, if you are going to evaluate the BGE-M3 retriever:

pip install "vidore-benchmark[bge-m3]"

Or if you want to evaluate all the off-the-shelf retrievers:

pip install "vidore-benchmark[all-retrievers]"

Finally, if you want to reproduce the results from the ColPali paper, you should clone the repository, check out the 3.3.0 tag (or an earlier one), and use the requirements-dev.txt file to install the dependencies used at test time:

pip install -r requirements-dev.txt
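
Putting these steps together, the sequence looks roughly like this (the repository URL is shown for illustration; use the official ViDoRe benchmark repository on GitHub):

git clone https://github.com/illuin-tech/vidore-benchmark.git
cd vidore-benchmark
git checkout 3.3.0  # or any earlier tag
pip install -r requirements-dev.txt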

Available retrievers

The list of available retrievers can be found here. Read this section to learn how to create, use, and evaluate your own retriever.

Command-line usage

Evaluate a retriever on ViDoRe

You can evaluate any off-the-shelf retriever on the ViDoRe benchmark. For instance, you can evaluate the ColPali model on the ViDoRe benchmark to reproduce the results from our paper.

vidore-benchmark evaluate-retriever \
    --model-name vidore/colpali-v1.2 \
    --collection-name "vidore/vidore-benchmark-667173f98e70a1c0fa4db00d" \
    --split test

Note: You should get a warning about some non-initialized weights. This is a known issue with ColPali and will cause the metrics to differ slightly from those reported in the paper. We are working on fixing this issue.

Alternatively, you can evaluate your model on a single dataset. If your retriever uses visual embeddings, you can use any dataset path from the ViDoRe Benchmark collection, e.g.:

vidore-benchmark evaluate-retriever \
    --model-name vidore/colpali-v1.2 \
    --dataset-name vidore/docvqa_test_subsampled \
    --split test

If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the ViDoRe Chunk OCR (baseline) collection instead:

vidore-benchmark evaluate-retriever \
    --model-name BAAI/bge-m3 \
    --dataset-name vidore/docvqa_test_subsampled_tesseract \
    --split test

Both commands will generate a JSON file at outputs/{model_name}_all_metrics.json. Follow the instructions on the ViDoRe Leaderboard to compare your model with the others.
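
The metrics file is plain JSON, so you can also inspect it directly. A quick sketch (the file name and the exact key layout are assumptions; check your outputs/ folder for the actual file):

import json

# Hypothetical path; substitute the file actually written to outputs/.
with open("outputs/vidore_colpali-v1.2_all_metrics.json") as f:
    metrics = json.load(f)

# Print NDCG@5 per dataset, assuming one entry per dataset keyed by name.
for dataset_name, dataset_metrics in metrics.items():
    print(dataset_name, dataset_metrics.get("ndcg_at_5"))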

Evaluate a retriever using token pooling

You can use token pooling to reduce the length of the document embeddings. In production, this will significantly reduce the memory footprint of the retriever, thus reducing costs and increasing speed. You can use the --use-token-pooling flag to enable this feature:

vidore-benchmark evaluate-retriever \
    --model-name vidore/colpali-v1.2 \
    --dataset-name vidore/docvqa_test_subsampled \
    --split test \
    --use-token-pooling \
    --pool-factor 3
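
For intuition, a pool factor of 3 shrinks a page's multi-vector embedding to roughly a third of its original length. The sketch below mean-pools consecutive tokens; it is illustrative only, and the package's actual pooling may instead group tokens by similarity:

import torch
import torch.nn.functional as F

def pool_embeddings(doc_emb: torch.Tensor, pool_factor: int = 3) -> torch.Tensor:
    """Reduce (n_tokens, dim) embeddings to ~n_tokens / pool_factor vectors
    by mean-pooling consecutive groups of tokens (illustrative strategy)."""
    n_tokens, dim = doc_emb.shape
    pad = (-n_tokens) % pool_factor  # pad so the token count divides evenly
    if pad:
        doc_emb = torch.cat([doc_emb, doc_emb[-1:].expand(pad, dim)])
    pooled = doc_emb.view(-1, pool_factor, dim).mean(dim=1)
    # Re-normalize so MaxSim scores stay on the cosine-similarity scale.
    return F.normalize(pooled, dim=-1)

d = torch.randn(1024, 128)
print(pool_embeddings(d, pool_factor=3).shape)  # torch.Size([342, 128])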

Retrieve the top-k documents from a HuggingFace dataset

vidore-benchmark retrieve-on-dataset \
    --model-name vidore/colpali-v1.2 \
    --query "Which hour of the day had the highest overall electricity generation in 2019?" \
    --k 5 \
    --dataset-name vidore/syntheticDocQA_energy_test \
    --split test

Retrieve the top-k documents from a collection of PDF documents

vidore-benchmark retrieve-on-pdfs \
    --model-name google/siglip-so400m-patch14-384 \
    --query "Which hour of the day had the highest overall electricity generation in 2019?" \
    --k 5 \
    --data-dirpath data/my_folder_with_pdf_documents/

Documentation

To get more information about the available options, run:

โฏ vidore-benchmark --help
                                                                                                                      
 Usage: vidore-benchmark [OPTIONS] COMMAND [ARGS]...                                                                       
                                                                                                                      
 CLI for evaluating retrievers on the ViDoRe benchmark.                                                               
                                                                                                                      
โ•ญโ”€ Options โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ --install-completion          Install completion for the current shell.                                            โ”‚
โ”‚ --show-completion             Show completion for the current shell, to copy it or customize the installation.     โ”‚
โ”‚ --help                        Show this message and exit.                                                          โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€ Commands โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ evaluate-retriever    Evaluate the retriever on the given dataset or collection. The metrics are saved to a JSON   โ”‚
โ”‚                       file.                                                                                        โ”‚
โ”‚ retrieve-on-dataset   Retrieve the top-k documents according to the given query.                                   โ”‚
โ”‚ retrieve-on-pdfs      This script is used to ask a query and retrieve the top-k documents from a given folder      โ”‚
โ”‚                       containing PDFs. The PDFs will be converted to a dataset of image pages and then used for    โ”‚
โ”‚                       retrieval.                                                                                   โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Python usage

Quickstart example

from typing import cast

from datasets import Dataset, load_dataset
from dotenv import load_dotenv
from vidore_benchmark.evaluation.evaluate import evaluate_dataset
from vidore_benchmark.retrievers.jina_clip_retriever import JinaClipRetriever

load_dotenv(override=True)

def main():
    """
    Example script for a Python usage of the ViDoRe benchmark.
    """
    my_retriever = JinaClipRetriever()
    dataset = cast(Dataset, load_dataset("vidore/syntheticDocQA_dummy", split="test"))
    metrics = evaluate_dataset(my_retriever, dataset, batch_query=4, batch_doc=4)
    print(metrics)


if __name__ == "__main__":
    main()

Implement your own retriever

If you need to evaluate your own model on the ViDoRe benchmark, you can create your own instance of VisionRetriever to use it with the evaluation scripts in this package. You can find the detailed instructions here.
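
The general shape is a subclass that embeds queries, embeds documents, and scores all query-document pairs. The skeleton below is only a sketch: the import path and the method names (forward_queries, forward_documents, get_scores) are assumptions based on this pattern, so follow the linked instructions for the exact abstract interface.

from typing import Any, List

import torch
from vidore_benchmark.retrievers.vision_retriever import VisionRetriever  # assumed path

class MyRetriever(VisionRetriever):
    """Illustrative skeleton; check the base class for the exact abstract methods."""

    @property
    def use_visual_embedding(self) -> bool:
        # True if the retriever embeds page images rather than OCR text.
        return True

    def forward_queries(self, queries: List[str], batch_size: int, **kwargs: Any) -> torch.Tensor:
        raise NotImplementedError  # embed the text queries with your model

    def forward_documents(self, documents: List[Any], batch_size: int, **kwargs: Any) -> torch.Tensor:
        raise NotImplementedError  # embed the page images (or OCR chunks)

    def get_scores(self, query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor, **kwargs: Any) -> torch.Tensor:
        # Return an (n_queries, n_documents) similarity matrix, e.g. a dot
        # product for single-vector embeddings.
        return query_embeddings @ doc_embeddings.T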

Compare retrievers using the EvalManager

To easily process, visualize and compare the evaluation metrics of multiple retrievers, you can use the EvalManager class. Assume you have a list of previously generated JSON metric files, e.g.:

data/metrics/
โ”œโ”€โ”€ bisiglip.json
โ””โ”€โ”€ colpali.json

The data is stored in eval_manager.data as a multi-column DataFrame. Use the get_df_for_metric, get_df_for_dataset, and get_df_for_model methods to get the subset of the data you are interested in. For instance:

from vidore_benchmark.evaluation.eval_manager import EvalManager

eval_manager = EvalManager.from_dir("data/metrics/")
df = eval_manager.get_df_for_metric("ndcg_at_5")
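
From there, the usual pandas tooling applies, assuming the returned object is a standard pandas DataFrame with one entry per model/dataset pair:

# Round for readability, then export the comparison for later inspection.
print(df.round(3))
df.to_csv("ndcg_at_5_comparison.csv")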

Citation

ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)

@misc{faysse2024colpaliefficientdocumentretrieval,
      title={ColPali: Efficient Document Retrieval with Vision Language Models},
      author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
      year={2024},
      eprint={2407.01449},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.01449},
}
