End-to-end visual document retrieval with ColPali, featuring two-stage pooling for scalable search

Maintainers

Ara_Yeroyan_2002

Project description

Visual RAG Toolkit

End-to-end visual document retrieval toolkit featuring fast multi-stage retrieval (prefetch with pooled vectors + exact MaxSim reranking).

Try the Live Demo - Upload PDFs, index to Qdrant, and query with visual retrieval.

This repo contains:

a Python package (visual_rag)
a Streamlit demo app (demo/)
benchmark & evaluation scripts for ViDoRe v2 (benchmarks/)

🎯 Key Features

Modular: PDF → images, embedding, Qdrant indexing, retrieval can be used independently.
Multi-stage retrieval: two-stage and three-stage retrieval modes built for Qdrant named vectors.
Model-aware embedding: ColSmol + ColPali support behind a single VisualEmbedder interface.
Token hygiene: query special-token filtering by default for more stable MaxSim behavior.
Practical pipelines: robust indexing, retries, optional Cloudinary image URLs, evaluation reporting.
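
To illustrate the "token hygiene" idea, here is a minimal sketch of dropping special-token rows from a query embedding before scoring. The function name, the token ids, and the set of special ids are hypothetical and only serve the example; the toolkit's own filtering logic may differ.

```python
import numpy as np

def filter_special_tokens(token_embeddings, token_ids, special_ids):
    """Drop embedding rows whose token id is listed in special_ids.

    token_embeddings: (num_tokens, dim) array
    token_ids:        (num_tokens,) integer array aligned with the rows
    special_ids:      ids to remove (hypothetical; depends on the tokenizer)
    """
    keep = ~np.isin(np.asarray(token_ids), list(special_ids))
    return token_embeddings[keep]

emb = np.arange(12, dtype=np.float32).reshape(4, 3)
ids = np.array([0, 17, 42, 2])  # pretend 0 and 2 are BOS/EOS-like ids
filtered = filter_special_tokens(emb, ids, special_ids={0, 2})
print(filtered.shape)  # (2, 3)
```

Because MaxSim sums one maximum per query token, every special token left in the query contributes a term that is unrelated to the question itself.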

📦 Installation

# Core package (minimal dependencies)
pip install visual-rag-toolkit

# With specific features (quote the extras so shells such as zsh do not expand the brackets)
pip install "visual-rag-toolkit[ui]"           # Streamlit demo dependencies
pip install "visual-rag-toolkit[qdrant]"       # Vector database
pip install "visual-rag-toolkit[embedding]"    # ColSmol/ColPali embedding support
pip install "visual-rag-toolkit[cloudinary]"   # Image CDN

# All dependencies
pip install "visual-rag-toolkit[all]"

System dependencies (PDF)

pdf2image requires Poppler.

macOS: brew install poppler
Ubuntu/Debian: sudo apt-get update && sudo apt-get install -y poppler-utils
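
A quick way to confirm Poppler is reachable before converting PDFs is to look for its command-line tools on PATH. The helper name below is made up for this sketch; it only relies on the fact that pdf2image shells out to Poppler's pdftoppm / pdftocairo binaries.

```python
import shutil

def poppler_available():
    """Return True if one of Poppler's converters is on PATH."""
    return any(shutil.which(tool) for tool in ("pdftoppm", "pdftocairo"))

if not poppler_available():
    print("Poppler not found on PATH; install it before converting PDFs.")
```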

🚀 Quick Start

Minimal: embed a query and run two-stage search (server-side)

from qdrant_client import QdrantClient
from visual_rag import VisualEmbedder, TwoStageRetriever

client = QdrantClient(url="https://YOUR_QDRANT", api_key="YOUR_KEY")
collection_name = "your_collection"

# Embed query tokens
embedder = VisualEmbedder(model_name="vidore/colpali-v1.3")
q = embedder.embed_query("What is the budget allocation?")

# Fast path: all stages computed in Qdrant (prefetch + exact rerank)
retriever = TwoStageRetriever(client, collection_name)
results = retriever.search_server_side(
    query_embedding=q,
    top_k=10,
    prefetch_k=256,
    stage1_mode="tokens_vs_experimental",  # or: tokens_vs_tiles / pooled_query_vs_tiles / pooled_query_vs_global
)

for r in results[:3]:
    print(r["id"], r["score_final"])

End-to-end: ingest PDFs (with cropping) → index in Qdrant

This is the "SDK-style" pipeline: PDF → images → optional crop → embed → store vectors + payload in Qdrant.

import os
from pathlib import Path

import numpy as np
import torch

from visual_rag import VisualEmbedder
from visual_rag.indexing import ProcessingPipeline, QdrantIndexer

QDRANT_URL = os.environ["QDRANT_URL"]
QDRANT_KEY = os.getenv("QDRANT_API_KEY", "")

collection = "my_visual_docs"

embedder = VisualEmbedder(
    model_name="vidore/colSmol-500M",
    torch_dtype=torch.float16,
    output_dtype=np.float16,
    batch_size=8,
)

indexer = QdrantIndexer(
    url=QDRANT_URL,
    api_key=QDRANT_KEY,
    collection_name=collection,
    prefer_grpc=True,
    vector_datatype="float16",
)

# Creates collection + required payload indexes (e.g., "filename" for skip_existing)
indexer.create_collection(force_recreate=False)

pipeline = ProcessingPipeline(
    embedder=embedder,
    indexer=indexer,
    embedding_strategy="all",  # store full tokens + pooled vectors in one pass
    crop_empty=True,
    crop_empty_percentage_to_remove=0.99,  # kept for traceability
    crop_empty_remove_page_number=True,
    crop_empty_preserve_border_px=1,
    crop_empty_uniform_rowcol_std_threshold=3.0,
)

pdfs = [Path("docs/a.pdf"), Path("docs/b.pdf")]
for pdf_path in pdfs:
    result = pipeline.process_pdf(
        pdf_path,
        skip_existing=True,  # Skip pages already in Qdrant (uses filename index)
        upload_to_cloudinary=False,
        upload_to_qdrant=True,
    )
    # Logs automatically shown:
    # [10:23:45] 📚 Processing PDF: a.pdf
    # [10:23:45] 🖼️ Converting PDF to images...
    # [10:23:46]    ✅ Converted 12 pages
    # [10:23:46] 📦 Processing pages 1-8/12
    # [10:23:46] 🤖 Generating embeddings for 8 pages...
    # [10:23:48] 📤 Uploading batch of 8 pages...
    # [10:23:48]    ✅ Uploaded 8 points to Qdrant
    # [10:23:48] 📦 Processing pages 9-12/12
    # [10:23:48] 🤖 Generating embeddings for 4 pages...
    # [10:23:50] 📤 Uploading batch of 4 pages...
    # [10:23:50]    ✅ Uploaded 4 points to Qdrant
    # [10:23:50] ✅ Completed a.pdf: 12 uploaded, 0 skipped, 0 failed

CLI equivalent:

export QDRANT_URL="https://YOUR_QDRANT"
export QDRANT_API_KEY="YOUR_KEY"

visual-rag process \
  --reports-dir ./docs \
  --collection my_visual_docs \
  --model vidore/colSmol-500M \
  --strategy all \
  --batch-size 8 \
  --qdrant-vector-dtype float16 \
  --prefer-grpc \
  --crop-empty \
  --crop-empty-remove-page-number
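
The cropping options above suggest that empty margins are detected from how uniform each pixel row and column is. The following is only a sketch of that idea under stated assumptions: the function name and the exact rule (trim leading and trailing rows/columns whose standard deviation is at or below a threshold, then keep a small border) are hypothetical and are not taken from the toolkit's source.

```python
import numpy as np

def crop_uniform_margins(gray, std_threshold=3.0, preserve_border_px=1):
    """Trim leading/trailing rows and columns that look uniform.

    gray: 2-D array (height, width) of grayscale intensities.
    A row or column counts as content when its pixel std exceeds std_threshold.
    """
    rows = np.where(gray.std(axis=1) > std_threshold)[0]
    cols = np.where(gray.std(axis=0) > std_threshold)[0]
    if rows.size == 0 or cols.size == 0:
        return gray  # blank page: nothing to anchor the crop on
    top = max(rows[0] - preserve_border_px, 0)
    bottom = min(rows[-1] + 1 + preserve_border_px, gray.shape[0])
    left = max(cols[0] - preserve_border_px, 0)
    right = min(cols[-1] + 1 + preserve_border_px, gray.shape[1])
    return gray[top:bottom, left:right]

page = np.full((100, 80), 255.0)   # white page
page[20:60, 10:50] = 0.0           # one dark content block
cropped = crop_uniform_margins(page)
print(cropped.shape)  # (42, 42)
```

Cropping before embedding means fewer patches are spent on blank margins, which is presumably why the pipeline exposes it as an option.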

Process a PDF into images (no embedding, no vector DB)

from pathlib import Path
from visual_rag import PDFProcessor

processor = PDFProcessor(dpi=140)
images, texts = processor.process_pdf(Path("report.pdf"))
print(len(images), "pages")

🔬 Multi-stage Retrieval (Two-stage / Three-stage)

Traditional ColBERT-style MaxSim scoring compares every query token against every document token, so the cost grows with both the number of pages and the number of patch tokens per page. This becomes expensive at scale.
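
For reference, MaxSim itself is small: for each query token take the best-matching document token, then sum those maxima. A minimal NumPy sketch (the function name is chosen for this example and is not the toolkit's API):

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the max dot product against any document token."""
    sim = query_tokens @ doc_tokens.T  # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]])
print(maxsim(q, d))  # 1.5
```

The similarity matrix has one entry per (query token, document token) pair, which is the part that gets costly when every page in a collection has to be scored this way.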

Our approach:

Stage 1: Fast prefetch with tile-level pooled vectors
         ├── Pool each tile (64 patches) → num_tiles vectors
         ├── Use the HNSW index for approximately O(log N) retrieval
         └── Retrieve the top prefetch_k candidates (e.g., 200)

Stage 2: Exact MaxSim reranking on candidates
         ├── Load full multi-vector embeddings
         ├── Compute exact ColBERT MaxSim scores
         └── Return top-k results (e.g., 10)
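
The two stages can be sketched in plain NumPy. This is an illustration under assumptions, not the toolkit's implementation: tiles are mean-pooled here (the actual pooling method may differ), stage 1 is a brute-force loop standing in for the HNSW lookup, and all function names are hypothetical.

```python
import numpy as np

def pool_tiles(doc_tokens, patches_per_tile=64):
    """Mean-pool consecutive groups of patches into one vector per tile.

    Assumes the page has at least one full tile; leftover patches are ignored.
    """
    num_tiles = doc_tokens.shape[0] // patches_per_tile
    dim = doc_tokens.shape[1]
    usable = doc_tokens[: num_tiles * patches_per_tile]
    return usable.reshape(num_tiles, patches_per_tile, dim).mean(axis=1)

def maxsim(query_tokens, doc_vectors):
    return float((query_tokens @ doc_vectors.T).max(axis=1).sum())

def two_stage_search(query_tokens, docs, prefetch_k=200, top_k=10):
    # Stage 1: cheap score on pooled tile vectors (an ANN index would replace this loop)
    pooled_scores = np.array([maxsim(query_tokens, pool_tiles(d)) for d in docs])
    candidates = np.argsort(-pooled_scores)[:prefetch_k]
    # Stage 2: exact MaxSim on full token embeddings, candidates only
    exact = [(int(i), maxsim(query_tokens, docs[i])) for i in candidates]
    exact.sort(key=lambda pair: pair[1], reverse=True)
    return exact[:top_k]

rng = np.random.default_rng(0)
docs = [rng.normal(size=(256, 16)) for _ in range(50)]  # 50 pages, 256 patches each
query = rng.normal(size=(8, 16))
results = two_stage_search(query, docs, prefetch_k=20, top_k=5)
print(results[0])  # (page index, exact MaxSim score) of the best hit
```

With 64 patches per tile, stage 1 compares against 4 vectors per page instead of 256, and the full comparison only runs on the prefetched candidates.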

Three-stage extends this with an additional "cheap prefetch" stage before stage 2.
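
One way such a cascade could look is sketched below. The choice of stages is an assumption made for illustration (a single global mean vector per page as the cheapest filter, then pooled tiles, then exact MaxSim); the toolkit's actual three-stage mode may order or define its stages differently, and the function names are hypothetical.

```python
import numpy as np

def maxsim(query_tokens, doc_vectors):
    return float((query_tokens @ doc_vectors.T).max(axis=1).sum())

def top_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return [int(i) for i in np.argsort(-np.asarray(scores))[:k]]

def three_stage_search(query_tokens, docs, tiles, k_cheap=100, k_prefetch=20, top_k=5):
    pooled_query = query_tokens.mean(axis=0)
    # Cheapest stage: one dot product per page (pooled query vs global mean vector)
    global_scores = [float(pooled_query @ d.mean(axis=0)) for d in docs]
    stage_a = top_indices(global_scores, k_cheap)
    # Middle stage: query tokens vs pooled tile vectors, survivors only
    tile_scores = [maxsim(query_tokens, tiles[i]) for i in stage_a]
    stage_b = [stage_a[j] for j in top_indices(tile_scores, k_prefetch)]
    # Final stage: exact MaxSim over full token embeddings
    final = sorted(
        ((i, maxsim(query_tokens, docs[i])) for i in stage_b),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return final[:top_k]

rng = np.random.default_rng(1)
docs = [rng.normal(size=(128, 16)) for _ in range(200)]
tiles = [d.reshape(2, 64, 16).mean(axis=1) for d in docs]  # 2 tiles of 64 patches
query = rng.normal(size=(6, 16))
hits = three_stage_search(query, docs, tiles)
print(len(hits))  # 5
```

Each stage shrinks the candidate set before the next, more expensive comparison runs.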

📁 Package Structure

visual-rag-toolkit/
├── visual_rag/              # Import as: from visual_rag import ...
│   ├── embedding/           # VisualEmbedder, pooling functions
│   ├── indexing/            # ProcessingPipeline, PDFProcessor, QdrantIndexer, CloudinaryUploader
│   ├── retrieval/           # TwoStageRetriever
│   ├── visualization/       # Saliency maps
│   ├── cli/                 # Command-line: visual-rag process/search
│   └── config.py            # load_config, get, get_section
│
├── benchmarks/              # ViDoRe evaluation scripts
└── examples/                # Usage examples

⚙️ Configuration

Configure via environment variables or YAML:

# Qdrant credentials (preferred names used by the demo + scripts)
export QDRANT_URL="https://your-cluster.qdrant.io"
export QDRANT_API_KEY="your-api-key"

# Special token handling (default: filter them out)
export VISUALRAG_INCLUDE_SPECIAL_TOKENS=true  # Include special tokens

Or use a config file (visual_rag.yaml):

model:
  name: "vidore/colSmol-500M"
  batch_size: 4
  
qdrant:
  url: "https://your-cluster.qdrant.io"
  collection: "my_documents"
  
search:
  strategy: "two_stage"  # or "multi_vector", "pooled"
  prefetch_k: 200
  top_k: 10
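
If you combine both sources in your own scripts, a small resolver can make the precedence explicit. The sketch below is not the package's load_config; it assumes environment variables override file values, uses a plain dict in place of the parsed YAML, and its function names are hypothetical.

```python
import os

# Stand-in for the parsed visual_rag.yaml shown above
FILE_CONFIG = {
    "qdrant": {"url": "https://your-cluster.qdrant.io", "collection": "my_documents"},
    "search": {"strategy": "two_stage", "prefetch_k": 200, "top_k": 10},
}

def env_flag(name, default=False, environ=os.environ):
    """Parse a boolean-like environment variable."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

def resolve_settings(file_config, environ=os.environ):
    """Overlay environment variables on values loaded from the config file."""
    settings = {section: dict(values) for section, values in file_config.items()}
    settings.setdefault("qdrant", {})
    if environ.get("QDRANT_URL"):
        settings["qdrant"]["url"] = environ["QDRANT_URL"]
    if environ.get("QDRANT_API_KEY"):
        settings["qdrant"]["api_key"] = environ["QDRANT_API_KEY"]
    settings["include_special_tokens"] = env_flag(
        "VISUALRAG_INCLUDE_SPECIAL_TOKENS", environ=environ
    )
    return settings

example_env = {"QDRANT_URL": "https://example.invalid", "VISUALRAG_INCLUDE_SPECIAL_TOKENS": "true"}
resolved = resolve_settings(FILE_CONFIG, environ=example_env)
print(resolved["qdrant"]["url"], resolved["include_special_tokens"])
# https://example.invalid True
```

Keeping the API key out of the YAML file and in the environment avoids committing credentials alongside the rest of the configuration.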

🖥️ Demo (Streamlit)

pip install "visual-rag-toolkit[ui,qdrant,embedding,pdf]"

# Option A: from Python
python -c "import visual_rag; visual_rag.demo()"

# Option B: CLI launcher
visual-rag-demo

📊 Benchmark Evaluation

Run ViDoRe benchmark evaluation:

# Example: evaluate a collection against ViDoRe BEIR datasets in Qdrant
python -m benchmarks.vidore_beir_qdrant.run_qdrant_beir \
  --datasets vidore/esg_reports_v2 vidore/biomedical_lectures_v2 \
  --collection YOUR_COLLECTION \
  --mode two_stage \
  --stage1-mode tokens_vs_experimental \
  --prefetch-k 256 \
  --top-k 100 \
  --evaluation-scope union

More commands (including multi-stage variants and cropping configs) live in:

examples/COMMANDS.md

🔧 Development

git clone https://github.com/Ara-Yeroyan/visual-rag-toolkit
cd visual-rag-toolkit
pip install -e ".[dev]"
pytest tests/ -v

📄 Citation

If you use this toolkit in your research, please cite:

@software{visual_rag_toolkit,
  title = {Visual RAG Toolkit: Scalable Visual Document Retrieval with 1D Convolutional Pooling},
  author = {Ara Yeroyan},
  year = {2026},
  url = {https://github.com/Ara-Yeroyan/visual-rag-toolkit}
}

📝 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

Qdrant - Vector database with multi-vector support
ColPali - Visual document retrieval models
ViDoRe - Benchmark dataset

Release history

0.2.0 - Feb 12, 2026
0.1.6 - Feb 5, 2026
0.1.5 - Feb 5, 2026 (this version)
0.1.4 - Feb 5, 2026
0.1.3 - Feb 5, 2026
0.1.2 - Feb 5, 2026
0.1.1 - Feb 5, 2026

Download files

Source Distribution

visual_rag_toolkit-0.1.5.tar.gz (122.3 kB), uploaded Feb 5, 2026, Source

Built Distribution

visual_rag_toolkit-0.1.5-py3-none-any.whl (142.8 kB), uploaded Feb 5, 2026, Python 3

File details

Details for the file visual_rag_toolkit-0.1.5.tar.gz.

File metadata

Download URL: visual_rag_toolkit-0.1.5.tar.gz
Upload date: Feb 5, 2026
Size: 122.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for visual_rag_toolkit-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`0fa8e3d88dd79a52fa7dd5524b7dfe8677d0f1d3eec76f8710379911e24bd5bd`
MD5	`67e37f3c5fee5a07877fcf87cc5e683f`
BLAKE2b-256	`f57bf3f0961e3420d8a39acfafde08d997e8f6a8b44b94e6fa9a685409e68340`

See more details on using hashes here.
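
To check a downloaded archive against the SHA256 digest listed above, a short standard-library script is enough. The helper name is made up for this sketch, and the file path assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for visual_rag_toolkit-0.1.5.tar.gz
EXPECTED_SHA256 = "0fa8e3d88dd79a52fa7dd5524b7dfe8677d0f1d3eec76f8710379911e24bd5bd"

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

archive = Path("visual_rag_toolkit-0.1.5.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
```

pip can perform the same check at install time via its hash-checking mode when the digest is pinned in a requirements file.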

Provenance

The following attestation bundles were made for visual_rag_toolkit-0.1.5.tar.gz:

Publisher: publish_pypi.yaml on Ara-Yeroyan/visual-rag-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: visual_rag_toolkit-0.1.5.tar.gz
- Subject digest: 0fa8e3d88dd79a52fa7dd5524b7dfe8677d0f1d3eec76f8710379911e24bd5bd
- Sigstore transparency entry: 919578455
- Sigstore integration time: Feb 5, 2026
Source repository:
- Permalink: Ara-Yeroyan/visual-rag-toolkit@b277688cc918ddd2feea2d484985958760059c3e
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/Ara-Yeroyan
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_pypi.yaml@b277688cc918ddd2feea2d484985958760059c3e
- Trigger Event: push

File details

Details for the file visual_rag_toolkit-0.1.5-py3-none-any.whl.

File metadata

Download URL: visual_rag_toolkit-0.1.5-py3-none-any.whl
Upload date: Feb 5, 2026
Size: 142.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for visual_rag_toolkit-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2a2e268030fdc8ff4790fbafc9209fa5f1a73ff03dcadfade80539db1b841139`
MD5	`0832e171ddcf6d23023659d29358b462`
BLAKE2b-256	`cc2067ce211e294d53a9576862b86158adf502c45b3dad32aefc49aa42e6ed8e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for visual_rag_toolkit-0.1.5-py3-none-any.whl:

Publisher: publish_pypi.yaml on Ara-Yeroyan/visual-rag-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: visual_rag_toolkit-0.1.5-py3-none-any.whl
- Subject digest: 2a2e268030fdc8ff4790fbafc9209fa5f1a73ff03dcadfade80539db1b841139
- Sigstore transparency entry: 919578456
- Sigstore integration time: Feb 5, 2026
Source repository:
- Permalink: Ara-Yeroyan/visual-rag-toolkit@b277688cc918ddd2feea2d484985958760059c3e
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/Ara-Yeroyan
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_pypi.yaml@b277688cc918ddd2feea2d484985958760059c3e
- Trigger Event: push
