
SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models



Fork Notice: This repository is a fork of colpali-engine by Illuin Technology. We extend the original codebase with additional model architectures for document retrieval using vision language models.

Overview

SauerkrautLM-ColPali provides model implementations and processors for multi-vector vision retrieval based on the ColPali architecture. This package includes support for several VLM backbones:

  • ColQwen3 - Based on Qwen3-VL (2B, 4B, 8B)
  • ColLFM2 - Based on LFM2 (Liquid Foundation Models 2, ~450M parameters)
  • ColMinistral3 - Based on Ministral-3B-Instruct with Pixtral vision encoder

Models

| Model | Parameters | VRAM (bf16) | Max Tokens | Base Model | License |
|---|---|---|---|---|---|
| SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1 | 1.7B | ~3.4 GB | 262K | Qwen3-VL-1.7B | Apache 2.0 |
| SauerkrautLM-ColQwen3-2b-v0.1 | 2.2B | ~4.4 GB | 262K | Qwen3-VL-2B | Apache 2.0 |
| SauerkrautLM-ColQwen3-4b-v0.1 | 4B | ~8 GB | 262K | Qwen3-VL-4B | Apache 2.0 |
| SauerkrautLM-ColQwen3-8b-v0.1 | 8B | ~16 GB | 262K | Qwen3-VL-8B | Apache 2.0 |
| SauerkrautLM-ColLFM2-450M-v0.1 | 450M | ~0.9 GB | 32K | LFM2 | Apache 2.0 |
| SauerkrautLM-ColMinistral3-3b-v0.1 | 3B | ~6 GB | 262K | Ministral-3B | Apache 2.0 |
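
As a sanity check on the VRAM column: bf16 stores two bytes per parameter, so weight memory is roughly `2 × params`. A minimal sketch (weights only; real usage adds activations, image tokens, and framework overhead):

```python
# Rough bf16 weight-memory estimate: 2 bytes per parameter.
# Real VRAM usage is higher once activations and image tokens are included.
def bf16_weight_gb(num_params: float) -> float:
    return num_params * 2 / 1e9  # bytes -> gigabytes

for name, params in [
    ("ColQwen3-1.7b-Turbo", 1.7e9),
    ("ColQwen3-2b", 2.2e9),
    ("ColLFM2-450M", 450e6),
]:
    print(f"{name}: ~{bf16_weight_gb(params):.1f} GB")  # 3.4 / 4.4 / 0.9 GB
```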

Supported Languages: English, German, French, Spanish, Italian, Portuguese

🎯 Why Visual Document Retrieval?

Traditional document retrieval relies on an OCR + text-search pipeline, which has significant limitations:

  • ❌ Loses layout information, tables, charts, and images
  • ❌ OCR errors compound in downstream tasks
  • ❌ Struggles with handwriting and low-quality scans
  • ❌ Cannot understand visual elements (logos, diagrams)

Visual Document Retrieval solves these problems by:

  • Direct visual understanding - No OCR errors, preserves full document context
  • Layout-aware - Understands tables, forms, multi-column layouts
  • Multimodal - Combines text and visual elements naturally
  • End-to-end - Single model for retrieval, no pipeline complexity


🏆 Benchmark Results

Our models achieve state-of-the-art performance on the ViDoRe (Visual Document Retrieval) benchmarks while maintaining a compact 128-dimensional embedding space for efficient retrieval.

Key Highlights

| Achievement | Model | Score | Comparison |
|---|---|---|---|
| 🥇 #1 ViDoRe v1 (128-dim) | ColQwen3-8b | 91.08 | Beats all 128-dim models |
| 🥇 #1 ViDoRe v3 (128-dim) | ColQwen3-8b | 58.55 | Best 128-dim model |
| 🥇 #1 Small Model (<1B) | ColLFM2-450M | 83.56 | Beats colSmol-500M with fewer params |
| 🥇 #1 Medium (1-3B, 128-dim) | ColQwen3-2b | 90.24 | Best 128-dim in 1-3B class |
| Most Efficient | All models | 128 dim | Same dim as ColPali, 2.5-24x smaller than high-dim competitors |

128-dim Models Comparison (Fair Comparison)

When comparing only models with the same 128-dimensional embedding space:

| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| **SauerkrautLM-ColQwen3-8b-v0.1** | 8.0B | 91.08 (#1) | 82.91 (#2) | 58.55 (#1) |
| EvoQwen2.5-VL-Retriever-7B-v1 | 7.0B | 90.68 (#3) | 83.41 (#1) | - |
| **SauerkrautLM-ColQwen3-4b-v0.1** | 4.0B | 90.80 (#2) | 81.97 (#4) | 56.03 (#4) |
| EvoQwen2.5-VL-Retriever-3B-v1 | 3.0B | 90.67 (#4) | 82.76 (#3) | - |
| **SauerkrautLM-ColQwen3-2b-v0.1** | 2.2B | 90.24 (#5) | 81.02 (#6) | 54.32 (#5) |
| colnomic-embed-multimodal-7b | 7.0B | 89.72 (#7) | 81.30 (#5) | 57.64 (#2) |
| colnomic-embed-multimodal-3b | 3.0B | 89.86 (#6) | 80.09 (#7) | 56.40 (#3) |
| colqwen2-v1.0 | 2.2B | 89.23 (#8) | 79.74 (#8) | 44.18 (#7) |
| **SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1** | 1.7B | 88.89 (#9) | 77.94 (#9) | 48.76 (#6) |
| colpali-v1.3 | 2.9B | 84.75 (#10) | 76.17 (#10) | 42.95 (#9) |
| **SauerkrautLM-ColLFM2-450M-v0.1** | 450M | 83.56 | 74.33 | 43.32 (#8) |

Rankings are among 128-dim models only. Bold = our models.

Size Category Comparison (128-dim Models)

Small Models (<1B):

| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| ⭐ **SauerkrautLM-ColLFM2-450M-v0.1** | 450M | 83.56 | 74.33 | 43.32 |
| colSmol-500M | 500M | 82.49 | 71.17 | - |
| colSmol-256M | 256M | 79.74 | 66.90 | 20.73 |

Medium Models (1-3B):

| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| ⭐ **SauerkrautLM-ColQwen3-2b-v0.1** | 2.2B | 90.24 | 81.02 | 54.32 |
| colqwen2-v1.0 | 2.2B | 89.23 | 79.74 | 44.18 |
| **SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1** | 1.7B | 88.89 | 77.94 | 48.76 |

Large Models (3-5B):

| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| ⭐ **SauerkrautLM-ColQwen3-4b-v0.1** | 4.0B | 90.80 | 81.97 | 56.03 |
| EvoQwen2.5-VL-Retriever-3B-v1 | 3.0B | 90.67 | 82.76 | - |
| colnomic-embed-multimodal-3b | 3.0B | 89.86 | 80.09 | 56.40 |

XLarge Models (5-10B):

| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| ⭐ **SauerkrautLM-ColQwen3-8b-v0.1** | 8.0B | 91.08 | 82.91 | 58.55 |
| EvoQwen2.5-VL-Retriever-7B-v1 | 7.0B | 90.68 | 83.41 | - |
| colnomic-embed-multimodal-7b | 7.0B | 89.72 | 81.30 | 57.64 |

⭐ = Best 128-dim model in category

Why Choose Our Models?

  1. 🏆 #1 in 128-dim Class: Our ColQwen3-8b beats ALL other 128-dim models on ViDoRe v1 and v3

  2. ⚡ Compact Embeddings: All our models use 128 dimensions - same as ColPali/ColQwen2/colSmol

    • No storage overhead compared to standard ColPali models
    • 2.5x smaller than tomoro (320 dim)
    • 16-24x smaller than llama-nemoretriever (2048-3072 dim)
  3. 💰 Best-in-Class for Every Size:

    • Small (<1B): ColLFM2-450M beats colSmol-500M with 10% fewer parameters
    • Medium (1-3B): ColQwen3-2b beats colqwen2-v1.0 by +1.01 points
    • Large (3-5B): ColQwen3-4b achieves 90.80, only -0.20 behind much larger llama-nemo
    • XLarge (5-10B): ColQwen3-8b achieves the highest 128-dim score ever
  4. 🌍 Multilingual: Trained on 6 languages (EN, DE, FR, ES, IT, PT)

  5. 🔧 Easy Integration: MTEB-compatible for standardized evaluation
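
The embedding-dimension argument in point 2 is easy to make concrete. A sketch of per-page index size, assuming a hypothetical 768 stored patch vectors per page (the real count depends on image resolution and model) and bf16 storage:

```python
# Multi-vector index size per document page. VECTORS_PER_PAGE is an
# illustrative assumption; the real count depends on resolution and model.
VECTORS_PER_PAGE = 768

def index_mb_per_page(dim: int, dtype_bytes: int = 2) -> float:
    return VECTORS_PER_PAGE * dim * dtype_bytes / 1e6

mb_128 = index_mb_per_page(128)    # ~0.2 MB/page for all our models
mb_320 = index_mb_per_page(320)    # 2.5x larger (a 320-dim model)
mb_3072 = index_mb_per_page(3072)  # 24x larger (a 3072-dim model)
print(mb_128, mb_320 / mb_128, mb_3072 / mb_128)
```

Whatever the exact vector count, the ratios hold: a 320-dim index is 2.5x larger and a 3072-dim index 24x larger than a 128-dim one.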

Training

Hardware

| Model Size | GPUs | Effective Batch Size |
|---|---|---|
| 450M - 4B | 4x NVIDIA RTX 6000 Ada (48 GB) | 256 |
| 8B | 4x NVIDIA A100 SXM (80 GB) | 256 |

Training Datasets

Our models were trained on a diverse mix of public and proprietary datasets:

| Dataset | Type | Description |
|---|---|---|
| vidore/colpali_train_set | Public | Original ColPali training data with document-query pairs |
| openbmb/VisRAG-Ret-Train-In-domain-data | Public | Visual RAG retrieval training data |
| llamaindex/vdr-multilingual-train | Public | Multilingual document retrieval data |
| unicamp-dl/mmarco | Public | Multilingual MS MARCO (used for recovery training) |
| VAGO Multilingual Dataset 1 | In-house | Proprietary multilingual document-query pairs |
| VAGO Multilingual Dataset 2 | In-house | Proprietary multilingual document-query pairs |

Special Training Techniques

| Model | Technique | Description |
|---|---|---|
| ColLFM2-450M | Curriculum Learning | Progressive-difficulty training across 4 stages |
| ColLFM2-450M | Hierarchical Merge | Combined mMARCO specialist with retrieval model |
| ColQwen3-1.7b-Turbo | Structured Pruning | Layer + intermediate-size pruning (-23% params) |
| ColQwen3-1.7b-Turbo | mMARCO Recovery | Pre-training to heal the pruned model |

Installation

```shell
# From source (recommended)
pip install git+https://github.com/VAGOsolutions/sauerkrautlm-colpali

# For ColMinistral3 models (requires transformers 5.0.0rc0)
pip install "sauerkrautlm-colpali[ministral]"
```

Note: ColMinistral3 requires transformers>=5.0.0rc0. Install with pip install "sauerkrautlm-colpali[ministral]" or manually install the RC version.
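
One way to respect that requirement in application code is to check the installed transformers version before importing ColMinistral3. A minimal sketch; `release_tuple` and `supports_colministral3` are illustrative helpers (not part of this package), and release-candidate builds like 5.0.0rc0 are deliberately counted as 5.0.0:

```python
import re
from importlib.metadata import PackageNotFoundError, version

def release_tuple(v: str) -> tuple:
    # "5.0.0rc0" -> (5, 0, 0); pre-release suffixes are ignored here.
    return tuple(int(x) for x in re.findall(r"\d+", v)[:3])

def supports_colministral3() -> bool:
    # True if an installed transformers reports release 5.0.0 or later.
    try:
        return release_tuple(version("transformers")) >= (5, 0, 0)
    except PackageNotFoundError:
        return False
```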

Quick Start

ColQwen3 Example

```python
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColQwen3, ColQwen3Processor

model_name = "VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1"

# Load model and processor
model = ColQwen3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColQwen3Processor.from_pretrained(model_name)

# Prepare inputs
images = [Image.open("document.png")]
queries = ["What is the main topic of this document?"]

# Process inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Get embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Calculate similarity scores
scores = processor.score(query_embeddings, image_embeddings)
print(f"Similarity scores: {scores}")
```
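
To turn the score matrix into a retrieval ranking, sort document indices per query. A plain-Python sketch, assuming `scores` has been converted to nested lists of shape `(n_queries, n_docs)` (e.g. via `scores.tolist()`):

```python
def rank_documents(scores, k=3):
    # scores: list of per-query score lists; returns top-k doc indices per query
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        for row in scores
    ]

# e.g. one query scored against three pages:
print(rank_documents([[0.2, 0.9, 0.5]], k=2))  # [[1, 2]]
```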

ColMinistral3 Example

```python
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColMinistral3, ColMinistral3Processor

model_name = "VAGOsolutions/SauerkrautLM-ColMinistral3-3b-v0.1"

model = ColMinistral3.from_pretrained(model_name)
model = model.to(dtype=torch.bfloat16, device="cuda:0").eval()

processor = ColMinistral3Processor.from_pretrained(model_name)

# Same usage pattern as ColQwen3...
```

ColLFM2 Example

```python
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColLFM2, ColLFM2Processor

model_name = "VAGOsolutions/SauerkrautLM-ColLFM2-450M-v0.1"

model = ColLFM2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColLFM2Processor.from_pretrained(model_name)

# Same usage pattern as ColQwen3...
```

MTEB Integration

This package includes MTEB (Massive Text Embedding Benchmark) integration for standardized evaluation on the ViDoRe benchmark. See the mteb_integration/ folder for details.

```python
# Example: Run MTEB evaluation
import mteb

model = mteb.get_model("VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1")
tasks = mteb.get_tasks(tasks=["VidoreArxivQARetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")
```

Architecture

All models in this package follow the ColPali architecture:

  1. Vision Encoder: Extracts patch embeddings from document images
  2. Language Model: Processes visual tokens alongside text tokens
  3. Projection Layer: Maps hidden states to 128-dimensional embedding space
  4. Late Interaction: MaxSim scoring between query and document embeddings
```
Document Image → Vision Encoder → Visual Tokens → LLM → Projection → Multi-Vector Embeddings
                                                                              ↓
Query Text → Tokenizer → LLM → Projection → Multi-Vector Embeddings → MaxSim Score
```
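
The MaxSim step in point 4 fits in a few lines. A deliberately unvectorized sketch (production code batches this as matrix multiplications on the GPU; vectors are assumed pre-normalized):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    # For each query token, keep its best dot product over all document
    # tokens, then sum those best matches over the query tokens.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

q = [[1.0, 0.0], [0.0, 1.0]]  # 2 query-token embeddings
d = [[1.0, 0.0], [0.7, 0.7]]  # 2 document-patch embeddings
print(maxsim(q, d))  # 1.0 + 0.7 = 1.7
```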

Original ColPali

This package is based on the excellent work by Illuin Technology. For the original ColPali models (ColPali, ColQwen2, ColQwen2.5, ColSmol), please use the original colpali-engine package.

Citation

If you use this package, please cite both the original ColPali paper and our work:

```bibtex
@misc{sauerkrautlm-colpali-2025,
  title={SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models},
  author={David Golchinfar},
  organization={VAGO Solutions},
  year={2025},
  url={https://github.com/VAGOsolutions/sauerkrautlm-colpali}
}

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}
}
```

📊 Benchmark Visualizations

Chart images accompany the benchmark results above:

  • ViDoRe v1 Benchmark (128-dim models)
  • MTEB v1+v2 Benchmark (128-dim models)
  • ViDoRe v3 Benchmark (128-dim models)
  • Our models vs. high-dim models (ViDoRe v1)

📋 Summary Tables

Summary-table images:

  • 128-dim models comparison
  • Comparison vs. high-dim models

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact
