SauerkrautLM-ColPali: Multi-vector vision retrieval models based on the ColPali architecture. Fork of colpali-engine with additional model architectures.
Project description
SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models
Fork Notice: This repository is a fork of colpali-engine by Illuin Technology. We extend the original codebase with additional model architectures for document retrieval using vision language models.
Overview
SauerkrautLM-ColPali provides model implementations and processors for multi-vector vision retrieval based on the ColPali architecture. This package includes support for several VLM backbones:
- ColQwen3 - Based on Qwen3-VL (2B, 4B, 8B)
- ColLFM2 - Based on LargeFlamingoModel 2 (~450M parameters)
- ColMinistral3 - Based on Ministral-3B-Instruct with Pixtral vision encoder
Models
| Model | Parameters | VRAM (bf16) | Max Tokens | Base Model | License |
|---|---|---|---|---|---|
| SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1 | 1.7B | ~3.4 GB | 262K | Qwen3-VL-1.7B | Apache 2.0 |
| SauerkrautLM-ColQwen3-2b-v0.1 | 2.2B | ~4.4 GB | 262K | Qwen3-VL-2B | Apache 2.0 |
| SauerkrautLM-ColQwen3-4b-v0.1 | 4B | ~8 GB | 262K | Qwen3-VL-4B | Apache 2.0 |
| SauerkrautLM-ColQwen3-8b-v0.1 | 8B | ~16 GB | 262K | Qwen3-VL-8B | Apache 2.0 |
| SauerkrautLM-ColLFM2-450M-v0.1 | 450M | ~0.9 GB | 32K | LFM2 | Apache 2.0 |
| SauerkrautLM-ColMinistral3-3b-v0.1 | 3B | ~6 GB | 262K | Ministral-3B | Apache 2.0 |
Supported Languages: English, German, French, Spanish, Italian, Portuguese
🎯 Why Visual Document Retrieval?
Traditional document retrieval relies on OCR + Text Search, which has significant limitations:
| Approach | Limitations |
|---|---|
| OCR-based | ❌ Loses layout information, tables, charts, images |
| OCR-based | ❌ OCR errors compound in downstream tasks |
| OCR-based | ❌ Struggles with handwriting, low-quality scans |
| OCR-based | ❌ Cannot understand visual elements (logos, diagrams) |
Visual Document Retrieval solves these problems by:
- ✅ Direct visual understanding - No OCR errors, preserves full document context
- ✅ Layout-aware - Understands tables, forms, multi-column layouts
- ✅ Multimodal - Combines text and visual elements naturally
- ✅ End-to-end - Single model for retrieval, no pipeline complexity
🏆 Benchmark Results
Our models achieve state-of-the-art performance on the ViDoRe (Visual Document Retrieval) benchmarks while maintaining a compact 128-dimensional embedding space for efficient retrieval.
Key Highlights
| Achievement | Model | Score | Comparison |
|---|---|---|---|
| 🥇 #1 ViDoRe v1 (128-dim) | ColQwen3-8b | 91.08 | Beats all 128-dim models |
| 🥇 #1 ViDoRe v3 (128-dim) | ColQwen3-8b | 58.55 | Best 128-dim model |
| 🥇 #1 Small Model (<1B) | ColLFM2-450M | 83.56 | Beats colSmol-500M with fewer params |
| 🥇 #1 Medium (1-3B, 128-dim) | ColQwen3-2b | 90.24 | Best 128-dim in 1-3B class |
| ⚡ Most Efficient | All models | 128 dim | Same dim as ColPali, 2.5-24x smaller than high-dim competitors |
128-dim Models Comparison (Fair Comparison)
When comparing only models with the same 128-dimensional embedding space:
| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| SauerkrautLM-ColQwen3-8b-v0.1 ⭐ | 8.0B | 91.08 (#1) | 82.91 (#2) | 58.55 (#1) |
| EvoQwen2.5-VL-Retriever-7B-v1 | 7.0B | 90.68 (#3) | 83.41 (#1) | - |
| SauerkrautLM-ColQwen3-4b-v0.1 | 4.0B | 90.80 (#2) | 81.97 (#4) | 56.03 (#4) |
| EvoQwen2.5-VL-Retriever-3B-v1 | 3.0B | 90.67 (#4) | 82.76 (#3) | - |
| SauerkrautLM-ColQwen3-2b-v0.1 | 2.2B | 90.24 (#5) | 81.02 (#6) | 54.32 (#5) |
| colnomic-embed-multimodal-7b | 7.0B | 89.72 (#7) | 81.30 (#5) | 57.64 (#2) |
| colnomic-embed-multimodal-3b | 3.0B | 89.86 (#6) | 80.09 (#7) | 56.40 (#3) |
| colqwen2-v1.0 | 2.2B | 89.23 (#8) | 79.74 (#8) | 44.18 (#7) |
| SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1 | 1.7B | 88.89 (#9) | 77.94 (#9) | 48.76 (#6) |
| colpali-v1.3 | 2.9B | 84.75 (#10) | 76.17 (#10) | 42.95 (#9) |
| SauerkrautLM-ColLFM2-450M-v0.1 | 450M | 83.56 | 74.33 | 43.32 (#8) |
Rankings among 128-dim models only. ⭐ = Best in category. Bold = our models.
Size Category Comparison (128-dim Models)
Small Models (<1B):
| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| SauerkrautLM-ColLFM2-450M-v0.1 ⭐ | 450M | 83.56 | 74.33 | 43.32 |
| colSmol-500M | 500M | 82.49 | 71.17 | - |
| colSmol-256M | 256M | 79.74 | 66.90 | 20.73 |
Medium Models (1-3B):
| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| SauerkrautLM-ColQwen3-2b-v0.1 ⭐ | 2.2B | 90.24 | 81.02 | 54.32 |
| colqwen2-v1.0 | 2.2B | 89.23 | 79.74 | 44.18 |
| SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1 | 1.7B | 88.89 | 77.94 | 48.76 |
Large Models (3-5B):
| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| SauerkrautLM-ColQwen3-4b-v0.1 ⭐ | 4.0B | 90.80 | 81.97 | 56.03 |
| EvoQwen2.5-VL-Retriever-3B-v1 | 3.0B | 90.67 | 82.76 | - |
| colnomic-embed-multimodal-3b | 3.0B | 89.86 | 80.09 | 56.40 |
XLarge Models (5-10B):
| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| SauerkrautLM-ColQwen3-8b-v0.1 ⭐ | 8.0B | 91.08 | 82.91 | 58.55 |
| EvoQwen2.5-VL-Retriever-7B-v1 | 7.0B | 90.68 | 83.41 | - |
| colnomic-embed-multimodal-7b | 7.0B | 89.72 | 81.30 | 57.64 |
⭐ = Best 128-dim model in category
Why Choose Our Models?
-
🏆 #1 in 128-dim Class: Our ColQwen3-8b beats ALL other 128-dim models on ViDoRe v1 and v3
-
⚡ Compact Embeddings: All our models use 128 dimensions - same as ColPali/ColQwen2/colSmol
- No storage overhead compared to standard ColPali models
- 2.5x smaller than tomoro (320 dim)
- 16-24x smaller than llama-nemoretriever (2048-3072 dim)
-
💰 Best-in-Class for Every Size:
- Small (<1B): ColLFM2-450M beats colSmol-500M with 10% fewer parameters
- Medium (1-3B): ColQwen3-2b beats colqwen2-v1.0 by +1.01 points
- Large (3-5B): ColQwen3-4b achieves 90.80, only -0.20 behind much larger llama-nemo
- XLarge (5-10B): ColQwen3-8b achieves the highest 128-dim score ever
-
🌍 Multilingual: Trained on 6 languages (EN, DE, FR, ES, IT, PT)
-
🔧 Easy Integration: MTEB-compatible for standardized evaluation
Training
Hardware
| Model Size | GPUs | Effective Batch Size |
|---|---|---|
| 450M - 4B | 4x NVIDIA RTX 6000 Ada (48GB) | 256 |
| 8B | 4x NVIDIA A100 SXM (80GB) | 256 |
Training Datasets
Our models were trained on a diverse mix of public and proprietary datasets:
| Dataset | Type | Description |
|---|---|---|
| vidore/colpali_train_set | Public | Original ColPali training data with document-query pairs |
| openbmb/VisRAG-Ret-Train-In-domain-data | Public | Visual RAG retrieval training data |
| llamaindex/vdr-multilingual-train | Public | Multilingual document retrieval data |
| unicamp-dl/mmarco | Public | Multilingual MS MARCO (used for recovery training) |
| VAGO Multilingual Dataset 1 | In-house | Proprietary multilingual document-query pairs |
| VAGO Multilingual Dataset 2 | In-house | Proprietary multilingual document-query pairs |
Special Training Techniques
| Model | Technique | Description |
|---|---|---|
| ColLFM2-450M | Curriculum Learning | Progressive difficulty training across 4 stages |
| ColLFM2-450M | Hierarchical Merge | Combined mMARCO specialist with retrieval model |
| ColQwen3-1.7b-Turbo | Structured Pruning | Layer + intermediate size pruning (-23% params) |
| ColQwen3-1.7b-Turbo | mMARCO Recovery | Pre-training to heal pruned model |
Installation
# From source (recommended)
pip install git+https://github.com/VAGOsolutions/sauerkrautlm-colpali
# For ColMinistral3 models (requires transformers 5.0.0rc0)
pip install "sauerkrautlm-colpali[ministral]"
Note: ColMinistral3 requires
transformers>=5.0.0rc0. Install withpip install "sauerkrautlm-colpali[ministral]"or manually install the RC version.
Quick Start
ColQwen3 Example
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColQwen3, ColQwen3Processor
model_name = "VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1"
# Load model and processor
model = ColQwen3.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="cuda:0",
).eval()
processor = ColQwen3Processor.from_pretrained(model_name)
# Prepare inputs
images = [Image.open("document.png")]
queries = ["What is the main topic of this document?"]
# Process inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)
# Get embeddings
with torch.no_grad():
image_embeddings = model(**batch_images)
query_embeddings = model(**batch_queries)
# Calculate similarity scores
scores = processor.score(query_embeddings, image_embeddings)
print(f"Similarity scores: {scores}")
ColMinistral3 Example
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColMinistral3, ColMinistral3Processor
model_name = "VAGOsolutions/SauerkrautLM-ColMinistral3-3b-v0.1"
model = ColMinistral3.from_pretrained(model_name)
model = model.to(dtype=torch.bfloat16, device="cuda:0").eval()
processor = ColMinistral3Processor.from_pretrained(model_name)
# Same usage pattern as ColQwen3...
ColLFM2 Example
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColLFM2, ColLFM2Processor
model_name = "VAGOsolutions/SauerkrautLM-ColLFM2-450M-v0.1"
model = ColLFM2.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="cuda:0",
).eval()
processor = ColLFM2Processor.from_pretrained(model_name)
# Same usage pattern as ColQwen3...
MTEB Integration
This package includes MTEB (Massive Text Embedding Benchmark) integration for standardized evaluation on the ViDoRe benchmark. See the mteb_integration/ folder for details.
# Example: Run MTEB evaluation
import mteb
model = mteb.get_model("VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1")
tasks = mteb.get_tasks(tasks=["VidoreArxivQARetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")
Architecture
All models in this package follow the ColPali architecture:
- Vision Encoder: Extracts patch embeddings from document images
- Language Model: Processes visual tokens alongside text tokens
- Projection Layer: Maps hidden states to 128-dimensional embedding space
- Late Interaction: MaxSim scoring between query and document embeddings
Document Image → Vision Encoder → Visual Tokens → LLM → Projection → Multi-Vector Embeddings
↓
Query Text → Tokenizer → LLM → Projection → Multi-Vector Embeddings → MaxSim Score
Original ColPali
This package is based on the excellent work by Illuin Technology. For the original ColPali models (ColPali, ColQwen2, ColQwen2.5, ColSmol), please use the original colpali-engine package.
Citation
If you use this package, please cite both the original ColPali paper and our work:
@misc{sauerkrautlm-colpali-2025,
title={SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models},
author={David Golchinfar},
organization={VAGO Solutions},
year={2025},
url={https://github.com/VAGOsolutions/sauerkrautlm-colpali}
}
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}
📊 Benchmark Visualizations
ViDoRe v1 Benchmark (128-dim Models)
MTEB v1+v2 Benchmark (128-dim Models)
ViDoRe v3 Benchmark (128-dim Models)
Our Models vs High-dim Models
📋 Summary Tables
128-dim Models Comparison
Comparison vs High-dim Models
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
- VAGO Solutions: https://vago-solutions.ai
- GitHub: https://github.com/VAGOsolutions
- Email: info@vago-solutions.ai
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sauerkrautlm_colpali-0.1.0.tar.gz.
File metadata
- Download URL: sauerkrautlm_colpali-0.1.0.tar.gz
- Upload date:
- Size: 16.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
51d0a0b0d432d92a51f10b78eb628f0a1736a9e5a2c43e5c37c608fdf1f2075c
|
|
| MD5 |
269442b03585abb1a83f84899aabf0fd
|
|
| BLAKE2b-256 |
ba4a7968f6575d98801d5861936ed2d4f639776626587c5931bd39d97b9fd05d
|
File details
Details for the file sauerkrautlm_colpali-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sauerkrautlm_colpali-0.1.0-py3-none-any.whl
- Upload date:
- Size: 72.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
680b091cca6b496570db690dbabbd31f90c11c37f8d6b3834dd9fa91ae237de8
|
|
| MD5 |
326524c19a83a5a46b26437d1e493158
|
|
| BLAKE2b-256 |
9202636f6039041cb1e8c9000bc1aa845865b20df9d869b01a2d3413a5f31fda
|