Visual Document Retrieval (ViDoRe): Benchmark
Reason this release was yanked:
Typo in PR#26.
Vision Document Retrieval (ViDoRe): Benchmark
[Model card] [ViDoRe Benchmark] [ViDoRe Leaderboard] [Demo] [Blog Post]
Main contributors: Manuel Faysse, Hugues Sibille, Tony Wu
Approach
The Visual Document Retrieval Benchmark (ViDoRe) is introduced to evaluate and enhance the performance of document retrieval systems on visually rich documents across various tasks, domains, languages, and settings. It was used to evaluate the ColPali model, a VLM-powered retriever that efficiently retrieves documents based on their visual content and textual queries.
[!TIP] If you want to fine-tune ColPali for your specific use case, you should check the colpali repository. It contains the whole codebase used to train the model presented in our paper.
Setup
We used Python 3.11.6 and PyTorch 2.2.2 to train and test our models, but the codebase is expected to be compatible with Python >=3.9 and recent PyTorch versions.
The eval codebase depends on a few Python packages, which can be installed using the following command:
pip install vidore-benchmark
To keep the package lightweight, only the essential dependencies are installed by default. In particular, you must specify the dependencies for the specific non-Transformers models you want to run (see the list in pyproject.toml). For instance, if you are going to evaluate the BGE-M3 retriever:
pip install "vidore-benchmark[bge-m3]"
Or if you want to evaluate all the off-the-shelf retrievers:
pip install "vidore-benchmark[all-retrievers]"
Finally, if you want to reproduce the results from the ColPali paper, you should clone the repository, check out the 3.3.0 tag or below, and use the requirements-dev.txt file to install the dependencies used at test time:
pip install -r requirements-dev.txt
Available retrievers
The list of available retrievers can be found here. Read this section to learn how to create, use, and evaluate your own retriever.
Command-line usage
Evaluate a retriever on ViDoRe
You can evaluate any off-the-shelf retriever on the ViDoRe benchmark. For instance, you can evaluate the ColPali model on the ViDoRe benchmark to reproduce the results from our paper.
vidore-benchmark evaluate-retriever \
--model-name vidore/colpali \
--collection-name "vidore/vidore-benchmark-667173f98e70a1c0fa4db00d" \
--split test
Note: You should get a warning about some non-initialized weights. This is a known issue in ColPali and will cause the metrics to be slightly different from the ones reported in the paper. We are working on fixing this issue.
Alternatively, you can evaluate your model on a single dataset. If your retriever uses visual embeddings, you can use any dataset path from the ViDoRe Benchmark collection, e.g.:
vidore-benchmark evaluate-retriever \
--model-name vidore/colpali \
--dataset-name vidore/docvqa_test_subsampled \
--split test
If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the ViDoRe Chunk OCR (baseline) instead:
vidore-benchmark evaluate-retriever \
--model-name BAAI/bge-m3 \
--dataset-name vidore/docvqa_test_subsampled_tesseract \
--split test
Both commands will generate a JSON file in outputs/{model_name_all_metrics.json}. Follow the instructions on the ViDoRe Leaderboard to compare your model with the others.
Evaluate a retriever using embedding compression techniques
Binarization (experimental)
Binarization (or binary quantization) converts the float32 values in an embedding into 1-bit values, leading to a 32x decrease in memory and storage requirements. See this HuggingFace blog post for more information on binarization. To apply binarization on your embeddings, you can use the --quantization binarize flag:
vidore-benchmark evaluate-retriever \
--model-name vidore/colpali \
--dataset-name vidore/docvqa_test_subsampled \
--split test \
--quantization binarize
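To make the 32x figure concrete, here is a minimal sketch of binary quantization using only the Python standard library. It thresholds each value at zero, which is an assumption (a common convention, not necessarily what the --quantization binarize flag does internally), and the binarize helper is hypothetical rather than part of vidore-benchmark.

```python
from array import array


def binarize(embedding: array) -> bytes:
    """Pack one bit per dimension: 1 if the value is > 0, else 0 (assumed threshold)."""
    packed = bytearray((len(embedding) + 7) // 8)
    for i, value in enumerate(embedding):
        if value > 0:
            packed[i // 8] |= 1 << (7 - i % 8)  # most significant bit first
    return bytes(packed)


# A toy 128-dimensional float32 embedding (alternating signs, for illustration only).
embedding = array("f", [0.5 if i % 2 == 0 else -0.5 for i in range(128)])
binary = binarize(embedding)

float_bytes = len(embedding) * embedding.itemsize  # 4 bytes per float32 value
print(float_bytes, len(binary), float_bytes // len(binary))
```

Each float32 value takes 32 bits and is replaced by a single bit, which is where the 32x reduction comes from.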
Int8 quantization (experimental)
Int8 quantization maps the continuous float32 value range to a discrete set of int8 values, capable of representing 256 distinct levels (ranging from -128 to 127). The mapping bins are computed using the minimum and maximum values for each embedding dimension. See this HuggingFace blog post for more information on int8 quantization. To apply int8 quantization on your embeddings, you can use the --quantization int8 flag:
vidore-benchmark evaluate-retriever \
--model-name vidore/colpali \
--dataset-name vidore/docvqa_test_subsampled \
--split test \
--quantization int8
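The per-dimension min/max mapping described above can be sketched as follows. This is an illustration under assumptions: the ranges are computed from the same small batch that gets quantized, the rounding scheme is a simple linear one, and quantize_int8 is a hypothetical helper, not the function used by the package.

```python
def quantize_int8(embeddings: list[list[float]]) -> list[list[int]]:
    """Map each dimension's [min, max] range linearly onto the int8 range [-128, 127]."""
    columns = list(zip(*embeddings))  # one tuple of values per embedding dimension
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    quantized = []
    for vector in embeddings:
        row = []
        for value, lo, hi in zip(vector, mins, maxs):
            span = hi - lo
            # Position of the value in [0, 1]; a constant dimension maps to the middle.
            unit = (value - lo) / span if span > 0 else 0.5
            row.append(round(unit * 255) - 128)
        quantized.append(row)
    return quantized


# Three toy 3-dimensional embeddings (made-up values).
embeddings = [
    [-1.0, 0.0, 2.0],
    [0.0, 0.5, 4.0],
    [1.0, 1.0, 6.0],
]
print(quantize_int8(embeddings))
```

The smallest value seen in each dimension lands on -128 and the largest on 127, so every value needs 1 byte instead of 4.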
Token pooling
You can use token pooling to reduce the length of the document embeddings. In production, this will significantly reduce the memory footprint of the retriever, thus reducing costs and increasing speed. You can use the --use-token-pooling flag to enable this feature:
vidore-benchmark evaluate-retriever \
--model-name vidore/colpali \
--dataset-name vidore/docvqa_test_subsampled \
--split test \
--use-token-pooling \
--pool-factor 3
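To illustrate what a pool factor of 3 means for the sequence length, here is a deliberately simplified sketch that averages every run of 3 consecutive token vectors. This is a stand-in technique chosen for brevity: the way --use-token-pooling actually groups tokens may differ (for example, by grouping similar tokens rather than adjacent ones), and mean_pool_tokens is a hypothetical helper.

```python
def mean_pool_tokens(token_embeddings: list[list[float]], pool_factor: int) -> list[list[float]]:
    """Average each run of `pool_factor` consecutive token vectors into one vector."""
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    pooled = []
    for start in range(0, len(token_embeddings), pool_factor):
        group = token_embeddings[start:start + pool_factor]
        # Column-wise mean over the vectors in this group.
        pooled.append([sum(column) / len(group) for column in zip(*group)])
    return pooled


# Six toy token vectors of dimension 2; a pool factor of 3 leaves two vectors.
tokens = [[0.0, 3.0], [3.0, 3.0], [6.0, 3.0], [1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
pooled = mean_pool_tokens(tokens, pool_factor=3)
print(len(tokens), "->", len(pooled), pooled)
```

Whatever the grouping strategy, the document embedding shrinks by roughly the pool factor, and so does the memory needed to store it.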
Retrieve the top-k documents from a HuggingFace dataset
vidore-benchmark retrieve-on-dataset \
--model-name vidore/colpali \
--query "Which hour of the day had the highest overall electricity generation in 2019?" \
--k 5 \
--dataset-name vidore/syntheticDocQA_energy_test \
--split test
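Conceptually, top-k retrieval scores the query against every document and keeps the k best. A minimal sketch of that last step is shown below; the scores and page identifiers are made up, and top_k is a hypothetical helper rather than a function exposed by the package.

```python
import heapq


def top_k(scores: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Return the k (document id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical query-document scores, keyed by made-up page identifiers.
scores = {"page_1": 0.12, "page_2": 0.87, "page_3": 0.45, "page_4": 0.91}
print(top_k(scores, k=2))
```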
Retrieve the top-k documents from a collection of PDF documents
vidore-benchmark retrieve-on-pdfs \
--model-name google/siglip-so400m-patch14-384 \
--query "Which hour of the day had the highest overall electricity generation in 2019?" \
--k 5 \
--data-dirpath data/my_folder_with_pdf_documents/
Documentation
To get more information about the available options, run:
❯ vidore-benchmark --help
Usage: vidore-benchmark [OPTIONS] COMMAND [ARGS]...
CLI for evaluating retrievers on the ViDoRe benchmark.
Options:
  --install-completion   Install completion for the current shell.
  --show-completion      Show completion for the current shell, to copy it or customize the installation.
  --help                 Show this message and exit.
Commands:
  evaluate-retriever     Evaluate the retriever on the given dataset or collection. The metrics are saved to a JSON file.
  retrieve-on-dataset    Retrieve the top-k documents according to the given query.
  retrieve-on-pdfs       This script is used to ask a query and retrieve the top-k documents from a given folder containing PDFs. The PDFs will be converted to a dataset of image pages and then used for retrieval.
Python usage
Quickstart example
from typing import cast

from datasets import Dataset, load_dataset
from dotenv import load_dotenv

from vidore_benchmark.evaluation.evaluate import evaluate_dataset
from vidore_benchmark.retrievers.jina_clip_retriever import JinaClipRetriever

# Load environment variables (e.g. access tokens) from a local .env file.
load_dotenv(override=True)


def main():
    """
    Example script for a Python usage of the Vidore Benchmark.
    """
    my_retriever = JinaClipRetriever()
    # Small dummy dataset, convenient for a quick smoke test.
    dataset = cast(Dataset, load_dataset("vidore/syntheticDocQA_dummy", split="test"))
    metrics = evaluate_dataset(my_retriever, dataset, batch_query=4, batch_doc=4)
    print(metrics)


if __name__ == "__main__":
    main()
Implement your own retriever
If you need to evaluate your own model on the ViDoRe benchmark, you can create your own subclass of VisionRetriever to use it with the evaluation scripts in this package. You can find the detailed instructions here.
Compare retrievers using the EvalManager
To easily process, visualize and compare the evaluation metrics of multiple retrievers, you can use the EvalManager class. Assume you have a list of previously generated JSON metric files, e.g.:
data/metrics/
├── bisiglip.json
└── colpali.json
The data is stored in eval_manager.data as a multi-column DataFrame. Use the get_df_for_metric, get_df_for_dataset, and get_df_for_model methods to get the subset of the data you are interested in. For instance:
from vidore_benchmark.evaluation.eval_manager import EvalManager
eval_manager = EvalManager.from_dir("data/metrics/")
df = eval_manager.get_df_for_metric("ndcg_at_5")
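The metric name ndcg_at_5 used above refers to NDCG truncated at rank 5. For readers unfamiliar with it, here is a minimal sketch using the standard log2 discount. It assumes every relevant document appears somewhere in the ranked list; ndcg_at_k is a hypothetical helper, and the package's own metric computation may differ in details such as tie handling.

```python
import math


def ndcg_at_k(ranked_relevances: list[int], k: int) -> float:
    """NDCG@k for one query, given relevance labels in ranked order (best rank first)."""

    def dcg(relevances: list[int]) -> float:
        # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# The single relevant page was ranked third out of five candidates.
print(ndcg_at_k([0, 0, 1, 0, 0], k=5))
```

A score of 1.0 means the relevant pages are ranked first; the score decays as they move down the list.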
Show the similarity maps for interpretability
By superimposing the late interaction heatmap on top of the original image, we can visualize the most salient image patches with respect to each term of the query, yielding interpretable insights into model focus zones.
You can generate similarity maps using the generate-similarity-maps command. For instance, you can reproduce the similarity maps from the paper using the images from data/interpretability_examples and running the following command. You can also feed multiple documents and queries at once to generate multiple similarity maps.
generate-similarity-maps \
--documents "data/interpretability_examples/energy_electricity_generation.jpeg" \
--queries "Which hour of the day had the highest overall electricity generation in 2019?" \
--documents "data/interpretability_examples/shift_kazakhstan.jpg" \
--queries "Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?"
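The idea behind these maps can be sketched in a few lines: for one query token, compute its similarity with every image patch embedding and lay the scores out on the patch grid. The example below uses toy 2-dimensional embeddings and plain dot products; both helper functions are hypothetical and are not the code run by generate-similarity-maps.

```python
def similarity_map(query_token: list[float], patches: list[list[float]], grid_width: int) -> list[list[float]]:
    """Dot product of one query-token embedding with every patch, reshaped to the patch grid."""
    scores = [sum(q * p for q, p in zip(query_token, patch)) for patch in patches]
    return [scores[i:i + grid_width] for i in range(0, len(scores), grid_width)]


def late_interaction_score(query_tokens: list[list[float]], patches: list[list[float]]) -> float:
    """Sum over query tokens of the best-matching patch similarity (MaxSim)."""
    return sum(
        max(sum(q * p for q, p in zip(token, patch)) for patch in patches)
        for token in query_tokens
    )


# Toy 2x2 patch grid with 2-dimensional embeddings, in row-major order.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
query_tokens = [[1.0, 0.0], [0.0, 2.0]]

for token in query_tokens:
    print(similarity_map(token, patches, grid_width=2))
print(late_interaction_score(query_tokens, patches))
```

Each per-token grid is what gets upsampled and overlaid on the page image as a heatmap, while the late interaction score aggregates the per-token maxima into a single query-document score.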
Citation
ColPali: Efficient Document Retrieval with Vision Language Models
- First authors: Manuel Faysse*, Hugues Sibille*, Tony Wu* (*Equal Contribution)
- Contributors: Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}