End-to-end visual document retrieval with ColPali, featuring two-stage pooling for scalable search
Visual RAG Toolkit
End-to-end visual document retrieval toolkit featuring fast multi-stage retrieval (prefetch with pooled vectors + exact MaxSim reranking).
Try the Live Demo - Upload PDFs, index to Qdrant, and query with visual retrieval.
Watch the Tutorial - Video walkthrough of the toolkit in action.
This repo contains:
- a Python package (visual_rag)
- a Streamlit demo app (demo/)
- benchmark & evaluation scripts for ViDoRe v2 (benchmarks/)
Key Features
- Modular: PDF → images, embedding, Qdrant indexing, retrieval can be used independently.
- Multi-stage retrieval: two-stage and three-stage retrieval modes built for Qdrant named vectors.
- Model-aware embedding: ColSmol, ColPali, and ColQwen2/2.5 support behind a single VisualEmbedder interface.
- Configurable pooling: adaptive mean-pooling cap for ColQwen2.5 (--max-mean-pool-vectors), and experimental pooling stored as Qdrant named vectors (experimental_pooling (ColQwen Gaussian alias), experimental_pooling_gaussian, experimental_pooling_triangular, experimental_pooling_{k} (ColPali), experimental_pooling_2d (ColSmol)).
- Single-stage ablations: direct search modes over experimental pooled vectors (tokens-vs-doc and pooled-query-vs-doc) for fast storage-reduction experiments.
- Token hygiene: query special-token filtering by default for more stable MaxSim behavior.
- Practical pipelines: robust indexing, retries, optional Cloudinary image URLs, evaluation reporting.
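The token-hygiene feature above can be illustrated with a minimal sketch. It is not the toolkit's implementation: the function name `filter_special_tokens`, the special-token IDs, and the fallback behavior are all assumptions made for illustration, and plain Python lists stand in for embedding tensors. The idea is that special or padding tokens are dropped from the query before MaxSim, so they do not each contribute a term to the score.

```python
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]


def filter_special_tokens(
    token_ids: Sequence[int],
    embeddings: Sequence[Vector],
    special_ids: Set[int],
) -> Tuple[List[int], List[Vector]]:
    """Drop query tokens whose ID is in `special_ids` (hypothetical helper)."""
    kept = [
        (tid, emb)
        for tid, emb in zip(token_ids, embeddings)
        if tid not in special_ids
    ]
    if not kept:
        # Assumption: fall back to the unfiltered query rather than return nothing.
        return list(token_ids), list(embeddings)
    ids, embs = zip(*kept)
    return list(ids), list(embs)


# Hypothetical IDs: 0 = padding, 1 = begin-of-sequence, 2 = end-of-sequence.
SPECIAL_IDS = {0, 1, 2}
token_ids = [1, 57, 912, 2, 0, 0]
embeddings = [[9.0, 9.0], [0.1, 0.2], [0.3, 0.4], [9.0, 9.0], [9.0, 9.0], [9.0, 9.0]]

kept_ids, kept_embeddings = filter_special_tokens(token_ids, embeddings, SPECIAL_IDS)
print(kept_ids)  # [57, 912]
```

Only the two content tokens remain, so a later MaxSim sum has two terms instead of six.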
Installation
# Core package (minimal dependencies)
pip install visual-rag-toolkit
# With specific features
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "visual-rag-toolkit[ui]"          # Streamlit demo dependencies
pip install "visual-rag-toolkit[qdrant]"      # Vector database
pip install "visual-rag-toolkit[embedding]"   # ColSmol/ColPali/ColQwen2(.5) embedding support
pip install "visual-rag-toolkit[cloudinary]"  # Image CDN
# All dependencies
pip install "visual-rag-toolkit[all]"
System dependencies (PDF)
pdf2image requires Poppler.
- macOS: brew install poppler
- Ubuntu/Debian: sudo apt-get update && sudo apt-get install -y poppler-utils
ColQwen2.5 note: vidore/colqwen2.5-v0.2 requires transformers>=4.45.0 and colpali-engine>=0.3.7 (installing ColPali/ColQwen from source may be required for the latest processors).
Quick Start
Minimal: embed a query and run two-stage search (server-side)
from qdrant_client import QdrantClient
from visual_rag import VisualEmbedder, TwoStageRetriever
client = QdrantClient(url="https://YOUR_QDRANT", api_key="YOUR_KEY")
collection_name = "your_collection"
# Embed query tokens
embedder = VisualEmbedder(model_name="vidore/colpali-v1.3")
q = embedder.embed_query("What is the budget allocation?")
# Fast path: all stages computed in Qdrant (prefetch + exact rerank)
retriever = TwoStageRetriever(client, collection_name)
results = retriever.search_server_side(
    query_embedding=q,
    top_k=10,
    prefetch_k=256,
    stage1_mode="tokens_vs_experimental_pooling",  # or: tokens_vs_standard_pooling / pooled_query_vs_standard_pooling / pooled_query_vs_global
)
for r in results[:3]:
    print(r["id"], r["score_final"])
End-to-end: ingest PDFs (with cropping) → index in Qdrant
This is the "SDK-style" pipeline: PDF → images → optional crop → embed → store vectors + payload in Qdrant.
import os
from pathlib import Path
import numpy as np
import torch
from visual_rag import VisualEmbedder
from visual_rag.indexing import ProcessingPipeline, QdrantIndexer
QDRANT_URL = os.environ["QDRANT_URL"]
QDRANT_KEY = os.getenv("QDRANT_API_KEY", "")
collection = "my_visual_docs"
embedder = VisualEmbedder(
    model_name="vidore/colSmol-500M",
    torch_dtype=torch.float16,
    output_dtype=np.float16,
    batch_size=8,
)
indexer = QdrantIndexer(
    url=QDRANT_URL,
    api_key=QDRANT_KEY,
    collection_name=collection,
    prefer_grpc=True,
    vector_datatype="float16",
)
# Creates collection + required payload indexes (e.g., "filename" for skip_existing)
indexer.create_collection(force_recreate=False)
pipeline = ProcessingPipeline(
    embedder=embedder,
    indexer=indexer,
    embedding_strategy="all",  # store full tokens + pooled vectors in one pass
    crop_empty=True,
    crop_empty_percentage_to_remove=0.99,  # kept for traceability
    crop_empty_remove_page_number=True,
    crop_empty_preserve_border_px=1,
    crop_empty_uniform_rowcol_std_threshold=3.0,
)
pdfs = [Path("docs/a.pdf"), Path("docs/b.pdf")]
for pdf_path in pdfs:
    result = pipeline.process_pdf(
        pdf_path,
        skip_existing=True,  # Skip pages already in Qdrant (uses filename index)
        upload_to_cloudinary=False,
        upload_to_qdrant=True,
    )
# Logs automatically shown:
# [10:23:45] Processing PDF: a.pdf
# [10:23:45] Converting PDF to images...
# [10:23:46] Converted 12 pages
# [10:23:46] Processing pages 1-8/12
# [10:23:46] Generating embeddings for 8 pages...
# [10:23:48] Uploading batch of 8 pages...
# [10:23:48] Uploaded 8 points to Qdrant
# [10:23:48] Processing pages 9-12/12
# [10:23:48] Generating embeddings for 4 pages...
# [10:23:50] Uploading batch of 4 pages...
# [10:23:50] Uploaded 4 points to Qdrant
# [10:23:50] Completed a.pdf: 12 uploaded, 0 skipped, 0 failed
CLI equivalent:
export QDRANT_URL="https://YOUR_QDRANT"
export QDRANT_API_KEY="YOUR_KEY"
visual-rag process \
--reports-dir ./docs \
--collection my_visual_docs \
--model vidore/colSmol-500M \
--strategy all \
--batch-size 8 \
--qdrant-vector-dtype float16 \
--prefer-grpc \
--crop-empty \
--crop-empty-remove-page-number
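The example log above shows a 12-page PDF processed as pages 1-8 and then 9-12 with a batch size of 8. The sketch below reproduces that page batching in plain Python. `page_batches` is a hypothetical helper written for illustration, not a function from the toolkit, and the pipeline's real batching logic may differ.

```python
from typing import Iterator, Tuple


def page_batches(num_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, num_pages + 1, batch_size):
        # The final batch is shorter when num_pages is not a multiple of batch_size.
        yield first, min(first + batch_size - 1, num_pages)


batches = list(page_batches(num_pages=12, batch_size=8))
for first, last in batches:
    print(f"Processing pages {first}-{last}/12 ({last - first + 1} pages)")
```

With these inputs the two ranges are (1, 8) and (9, 12), matching the 8-page and 4-page batches in the log.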
Process a PDF into images (no embedding, no vector DB)
from pathlib import Path
from visual_rag import PDFProcessor
processor = PDFProcessor(dpi=140)
images, texts = processor.process_pdf(Path("report.pdf"))
print(len(images), "pages")
Multi-stage Retrieval (Two-stage / Three-stage)
Traditional ColBERT-style MaxSim scoring compares all query tokens vs all document tokens, which becomes expensive at scale.
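To make the cost concrete, here is a minimal MaxSim sketch in plain Python. It uses lists instead of tensors and is not the toolkit's code. For each query token it takes the best dot product against any document token, then sums those maxima, so scoring one document needs `len(query) * len(doc)` dot products.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum over query tokens of the maximum dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]

score = maxsim(query, doc)
comparisons = len(query) * len(doc)
print(score, comparisons)  # 1.5 6
```

The first query token matches the second document token best (1.0) and the second query token matches the first document token best (0.5), giving 1.5.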
Our approach:
Stage 1: Fast prefetch with tile-level pooled vectors
    ├── Pool each tile (64 patches) → num_tiles vectors
    ├── Use HNSW index for O(log N) retrieval
    └── Retrieve top-K candidates (e.g., 200)
Stage 2: Exact MaxSim reranking on candidates
    ├── Load full multi-vector embeddings
    ├── Compute exact ColBERT MaxSim scores
    └── Return top-k results (e.g., 10)
Three-stage extends this with an additional "cheap prefetch" stage before stage 2.
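The two-stage flow can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the toolkit's implementation: stage 1 here scores every document by brute force over mean-pooled vectors instead of using an HNSW index in Qdrant, the toy `group_size` of 2 stands in for a 64-patch tile, and the names `mean_pool` and `two_stage_search` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def mean_pool(tokens: List[Vector], group_size: int) -> List[List[float]]:
    """Average each consecutive group of `group_size` token vectors."""
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return pooled


def two_stage_search(
    query: List[Vector],
    docs: Dict[str, List[Vector]],
    group_size: int,
    prefetch_k: int,
    top_k: int,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scores from pooled vectors only, keep the best prefetch_k.
    coarse = {
        doc_id: maxsim(query, mean_pool(tokens, group_size))
        for doc_id, tokens in docs.items()
    }
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:prefetch_k]
    # Stage 2: exact MaxSim over full token embeddings, candidates only.
    exact = {doc_id: maxsim(query, docs[doc_id]) for doc_id in candidates}
    return sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "b": [[1.0, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]],
    "c": [[0.5, 0.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0]],
    "d": [[-1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, -1.0]],
}

results = two_stage_search(query, docs, group_size=2, prefetch_k=3, top_k=2)
print(results)  # [('a', 2.0), ('b', 1.5)]
```

Document "d" is dropped in stage 1, so exact MaxSim runs on three documents instead of four. Pooling also changes the scores: "b" scores 0.75 on pooled vectors but 1.5 on full tokens, which is why the exact rerank in stage 2 decides the final order.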
Package Structure
visual-rag-toolkit/
├── visual_rag/              # Import as: from visual_rag import ...
│   ├── embedding/           # VisualEmbedder, pooling functions
│   ├── indexing/            # PDFProcessor, QdrantIndexer, CloudinaryUploader
│   ├── retrieval/           # TwoStageRetriever
│   ├── visualization/       # Saliency maps
│   ├── cli/                 # Command-line: visual-rag process/search
│   └── config.py            # load_config, get, get_section
│
├── benchmarks/              # ViDoRe evaluation scripts
└── examples/                # Usage examples
Configuration
Configure via environment variables or YAML:
# Qdrant credentials (preferred names used by the demo + scripts)
export QDRANT_URL="https://your-cluster.qdrant.io"
export QDRANT_API_KEY="your-api-key"
# Special token handling (default: filter them out)
export VISUALRAG_INCLUDE_SPECIAL_TOKENS=true # Include special tokens
Or use a config file (visual_rag.yaml):
model:
  name: "vidore/colSmol-500M"
  batch_size: 4
qdrant:
  url: "https://your-cluster.qdrant.io"
  collection: "my_documents"
search:
  strategy: "two_stage"  # or "multi_vector", "pooled"
  prefetch_k: 200
  top_k: 10
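As a rough illustration of how the two configuration sources could be combined, the sketch below merges an already-parsed config mapping with environment variables. Several things here are assumptions and not documented behavior: that environment variables take precedence over the file, the accepted spellings of a "true" flag, the fallback values of 200 and 10, and the helper names `env_flag` and `resolve_settings`. The toolkit's own `load_config` may behave differently.

```python
from typing import Any, Dict, Mapping

# Assumption: these spellings count as "true" for boolean environment flags.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def resolve_settings(file_config: Mapping[str, Any], env: Mapping[str, str]) -> Dict[str, Any]:
    """Merge parsed YAML with environment variables (env wins; an assumption)."""
    qdrant = dict(file_config.get("qdrant", {}))
    search = dict(file_config.get("search", {}))
    return {
        "qdrant_url": env.get("QDRANT_URL", qdrant.get("url")),
        "qdrant_api_key": env.get("QDRANT_API_KEY"),
        "collection": qdrant.get("collection"),
        "prefetch_k": int(search.get("prefetch_k", 200)),
        "top_k": int(search.get("top_k", 10)),
        "include_special_tokens": env_flag(env, "VISUALRAG_INCLUDE_SPECIAL_TOKENS"),
    }


# Mirrors the YAML example; in real use this would come from a YAML parser.
file_config = {
    "qdrant": {"url": "https://your-cluster.qdrant.io", "collection": "my_documents"},
    "search": {"strategy": "two_stage", "prefetch_k": 200, "top_k": 10},
}
# In real use, pass os.environ here.
env = {"QDRANT_URL": "https://override.example", "VISUALRAG_INCLUDE_SPECIAL_TOKENS": "true"}

settings = resolve_settings(file_config, env)
print(settings["qdrant_url"], settings["include_special_tokens"])
```

Passing the environment as an explicit mapping keeps the function easy to test without touching the real process environment.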
Demo (Streamlit)
pip install "visual-rag-toolkit[ui,qdrant,embedding,pdf]"
# Option A: from Python
python -c "import visual_rag; visual_rag.demo()"
# Option B: CLI launcher
visual-rag-demo
Benchmark Evaluation
Run ViDoRe benchmark evaluation:
# Example: evaluate a collection against ViDoRe BEIR datasets in Qdrant
python -m benchmarks.vidore_beir_qdrant.run_qdrant_beir \
--datasets vidore/esg_reports_v2 vidore/biomedical_lectures_v2 \
--collection YOUR_COLLECTION \
--mode two_stage \
--stage1-mode tokens_vs_experimental_pooling \
--prefetch-k 256 \
--top-k 100 \
--evaluation-scope union
More commands (including multi-stage variants and cropping configs) live in:
examples/COMMANDS.md
Development
git clone https://github.com/Ara-Yeroyan/visual-rag-toolkit
cd visual-rag-toolkit
pip install -e ".[dev]"
pytest tests/ -v
Citation
If you use this toolkit in your research, please cite:
@software{visual_rag_toolkit,
title = {Visual RAG Toolkit: Scalable Visual Document Retrieval with 1D Convolutional Pooling},
author = {Ara Yeroyan},
year = {2026},
url = {https://github.com/Ara-Yeroyan/visual-rag-toolkit}
}
License
MIT License - see LICENSE for details.
Acknowledgments