Vlense
Vision-language OCR and multimodal document QA for images and PDFs.
Vlense helps you do two things well:
- extract structured or free-form content from images and PDFs with vision models
- build a page-level retrieval index over documents and ask grounded questions with citations
It is designed for workflows where plain OCR is not enough and the model needs to reason over full document pages, scans, tables, forms, and mixed visual layouts.
What It Does
- OCR for images and PDFs with Markdown, HTML, or JSON output
- Pydantic schema support for structured extraction
- Page-image indexing for PDFs and image collections
- Multimodal retrieval with colpali-engine
- Grounded question answering over retrieved document pages
- Async Python API with a small surface area
Installation
Install the package:
uv add vlense
Or install from source in this repository:
uv sync
PDF rendering uses pdf2image, so Poppler must be available on your system.
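Because PDF rendering depends on Poppler, a quick pre-flight check can save a confusing failure later. The sketch below is not part of Vlense. It assumes only that pdf2image relies on Poppler's pdftoppm and pdfinfo executables being on PATH, and the helper name missing_poppler_tools is made up for illustration.

```python
import shutil

# pdf2image shells out to these Poppler command-line tools.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the Poppler executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks available.")
```

On Debian or Ubuntu the tools usually come from the poppler-utils package, and on macOS from Homebrew's poppler formula.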
Quick Start
OCR
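import asyncio
import os

from vlense import Vlense


async def main():
    # Replace with your real key, or export OPENAI_API_KEY before running.
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
    vlense = Vlense()

    # OCR an image and a PDF in one call, requesting Markdown output.
    result = await vlense.ocr(
        file_path=["./invoice.png", "./report.pdf"],
        model="openai/gpt-5-mini",
        format="markdown",
    )
    # Results are looked up by file name.
    print(result["invoice.png"].content)


if __name__ == "__main__":
    asyncio.run(main())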
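Since the API is async, you may sometimes want to issue separate calls per file while capping how many run at once, for example to stay under a provider's rate limit. The following is a minimal sketch using only asyncio. Both run_bounded and fake_ocr are hypothetical names, and fake_ocr is a stand-in coroutine rather than a real Vlense call. Whether Vlense already parallelizes work internally is not something this sketch assumes.

```python
import asyncio


async def run_bounded(coro_fn, items, limit=2):
    """Apply coro_fn to each item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await coro_fn(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_ocr(path):
    """Stand-in for a real call such as vlense.ocr(file_path=path, ...)."""
    await asyncio.sleep(0.01)
    return f"content of {path}"


if __name__ == "__main__":
    print(asyncio.run(run_bounded(fake_ocr, ["a.png", "b.pdf", "c.png"])))
```

In real use you would swap fake_ocr for a small coroutine that awaits vlense.ocr() with your chosen model and format.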
Document QA
import asyncio
import os

from vlense import Vlense


async def main():
    # Replace with your real key, or export OPENAI_API_KEY before running.
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
    vlense = Vlense()

    # Render the PDF pages and store their embeddings in a local collection.
    await vlense.index(
        data_dir="./handbook.pdf",
        collection_name="company-docs",
        index_dir="./.vlense",
        retriever_model="vidore/colSmol-500M",
    )

    # Retrieve the top 3 pages and answer from them.
    answer = await vlense.ask(
        query="What are the eligibility requirements?",
        collection_name="company-docs",
        index_dir="./.vlense",
        model="openai/gpt-5-mini",
        top_k=3,
    )
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
Vlense.ask() returns a grounded answer based on the retrieved page images, with cited page references.
Retrieval Model
Vlense uses colpali-engine for page-image retrieval and defaults to vidore/colSmol-500M.
This gives you:
- document-aware visual retrieval instead of plain text-only chunking
- a smaller default retriever than the heavier ColQwen variants
- a local collection format that stores rendered pages plus embeddings for reuse
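For intuition about what page-image retrieval looks like, ColPali-style models represent a query and each page as several vectors and compare them with late interaction (often called MaxSim). The sketch below illustrates that scoring idea in pure Python with toy two-dimensional vectors. It is not Vlense's or colpali-engine's actual code, and the names maxsim_score and top_k_pages are invented for this example.

```python
def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take its best dot
    product against any page vector, then sum those maxima."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def top_k_pages(query_vecs, pages, k=3):
    """Rank pages (name -> list of vectors) by MaxSim and keep the best k."""
    scored = [(maxsim_score(query_vecs, vecs), name) for name, vecs in pages.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy data: real embeddings have many more vectors and dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page-1": [[1.0, 0.0], [0.0, 1.0]],
    "page-2": [[0.5, 0.5]],
    "page-3": [[0.0, 0.8]],
}

if __name__ == "__main__":
    print(top_k_pages(query, pages, k=2))
```

The k argument plays the same role as top_k in Vlense.ask(): it bounds how many pages are handed to the answering model.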
Example CLI
The repository includes a runnable example for PDF question answering:
uv run python examples/pdf_qa.py ./document.pdf \
    --collection my-docs \
    --question "What does the report say about pricing?" \
    --vision-model openai/gpt-5-mini
API Overview
Vlense.ocr()
Runs OCR over one or more images or PDFs and returns generated content in Markdown, HTML, or JSON.
Key options:
- file_path: single path or list of paths
- model: vision-capable model name
- format: markdown, html, or json
- json_schema: optional Pydantic schema for structured extraction
- output_dir: optional directory for persisted outputs
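Since file_path accepts a list of paths, one convenient pattern is to gather the inputs from a folder first. The helper below is a small sketch using only pathlib. The name collect_documents and the set of file extensions are assumptions made for illustration, not something Vlense defines.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever your inputs actually use.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_documents(root, suffixes=SUPPORTED_SUFFIXES):
    """Return sorted path strings for PDFs and images found under root."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

The returned list can then be passed as file_path to Vlense.ocr().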
Vlense.index()
Builds a local multimodal retrieval collection from PDFs or images.
Key options:
- data_dir: file path, list of paths, or directory
- collection_name: logical name for the collection
- index_dir: storage root for page renders and embeddings
- retriever_model: colpali-engine checkpoint name
Vlense.ask()
Searches an indexed collection, retrieves the most relevant pages, and asks a vision model to answer using those pages as evidence.
Key options:
- query: user question
- collection_name: existing indexed collection
- model: answer model such as openai/gpt-5-mini
- top_k: number of retrieved pages to ground the answer
Release Workflow
GitHub Actions runs CI on pushes and pull requests. Tagged releases publish to PyPI and create a GitHub Release.
Repository setup:
- add a repository secret named PYPI_API_TOKEN
Release flow:
git tag v0.2.4
git push origin v0.2.4
Development
This repository uses uv, not pip.
Useful commands:
uv sync
uv run python -m unittest vlense.tests.test_vlense
uv build
Contributing
Issues and pull requests are welcome.
License
MIT
File details
Details for the file vlense-0.2.4.tar.gz.
File metadata
- Download URL: vlense-0.2.4.tar.gz
- Upload date:
- Size: 20.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c0e00fdd1a46d63ec4a76d0e2d4a041c9b81d8adccd010d68a0b5cfdaca9873c |
| MD5 | cf6d62309db879bd077e3cf93b75c542 |
| BLAKE2b-256 | 533ed13850d5207b25e23438775688a9a63da4ae9f44c05c3d2d1922c76a8ee9 |
File details
Details for the file vlense-0.2.4-py3-none-any.whl.
File metadata
- Download URL: vlense-0.2.4-py3-none-any.whl
- Upload date:
- Size: 24.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 06dff85826d327393cc34c823bd052ab44563e78a8bb4dd46320e1533237c018 |
| MD5 | fab21a21b76f495e10d043bc9cecaca0 |
| BLAKE2b-256 | dce1d011408a2c5d7edfd47c173ae38304427a9dccadbd03bb3307c9aaced066 |
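To check a downloaded file against the SHA256 digests listed above, a few lines of hashlib are enough. This is a generic sketch, not a tool shipped with the package, and the helper names sha256_of and matches_published_digest are made up for this example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a local file against a published SHA-256 digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, call matches_published_digest with the path to your local copy of vlense-0.2.4.tar.gz and the SHA256 value from its table.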