Vlense

Vision-language OCR and multimodal document QA for images and PDFs.

Vlense helps you do two things well:

extract structured or free-form content from images and PDFs with vision models
build a page-level retrieval index over documents and ask grounded questions with citations

It is designed for workflows where plain OCR is not enough and the model needs to reason over full document pages, scans, tables, forms, and mixed visual layouts.

What It Does

OCR for images and PDFs with Markdown, HTML, or JSON output
Pydantic schema support for structured extraction
Page-image indexing for PDFs and image collections
Text-layer BM25 retrieval for PDFs
Multimodal retrieval with colpali-engine
Grounded question answering over retrieved document pages
Async Python API with a small surface area

Installation

Install the package:

uv add vlense

Or install from source in this repository:

uv sync

PDF rendering uses pdf2image, so Poppler must be available on your system.
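
A quick way to confirm the Poppler requirement before rendering any PDF is to look for its command-line tools on PATH. The sketch below assumes the two executables pdf2image normally calls, pdftoppm and pdfinfo; the helper name missing_poppler_tools is illustrative and not part of Vlense.

```python
import shutil

# pdf2image shells out to these Poppler command-line tools.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools() -> list:
    """Return the Poppler executables that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks available.")
```

If anything is reported missing, install Poppler with your system package manager and run the check again.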

Quick Start

OCR

import asyncio
import os

from vlense import Vlense


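async def main():
    # Placeholder only; in practice export OPENAI_API_KEY in your shell.
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

    vlense = Vlense()
    # Images and PDFs can be mixed in a single call.
    result = await vlense.ocr(
        file_path=["./invoice.png", "./report.pdf"],
        model="gpt-5-mini",
        format="markdown",
    )

    # Results are looked up by input file name.
    print(result["invoice.png"].content)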


if __name__ == "__main__":
    asyncio.run(main())

Document QA

import asyncio
import os

from vlense import Vlense


async def main():
    # Placeholder only; in practice export OPENAI_API_KEY in your shell.
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

    vlense = Vlense()

    # Step 1: build a local collection from the PDF.
    await vlense.index(
        data_dir="./handbook.pdf",
        collection_name="company-docs",
        index_dir="./.vlense",
        retrieval="hybrid",
        retriever_model="vidore/colSmol-500M",
    )

    # Step 2: query the same collection_name and index_dir used for indexing.
    answer = await vlense.ask(
        query="What are the eligibility requirements?",
        collection_name="company-docs",
        index_dir="./.vlense",
        model="gpt-5-mini",
        top_k=3,
    )

    print(answer)


if __name__ == "__main__":
    asyncio.run(main())

Vlense.ask() returns a grounded answer based on the retrieved page images, with cited page references.

For OpenAI-compatible gateways, set OPENAI_BASE_URL or pass base_url= directly to Vlense.ocr() and Vlense.ask().
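
When both options are in play, it can help to make the choice explicit on the caller side. The sketch below is a hypothetical helper, not part of Vlense: it prefers an explicit URL and falls back to OPENAI_BASE_URL. How Vlense itself resolves the two when both are set is not documented here, so the precedence shown is an assumption.

```python
import os
from typing import Optional


def resolve_base_url(explicit: Optional[str] = None) -> Optional[str]:
    """Pick the gateway URL: explicit argument first, then OPENAI_BASE_URL."""
    if explicit:
        return explicit
    # Treat an empty environment variable as unset.
    return os.environ.get("OPENAI_BASE_URL") or None


def gateway_kwargs(explicit: Optional[str] = None) -> dict:
    """Build keyword arguments that can be passed on to ocr() or ask()."""
    base_url = resolve_base_url(explicit)
    return {"base_url": base_url} if base_url else {}
```

The result can then be unpacked into a call, for example `await vlense.ask(query=..., collection_name=..., model=..., **gateway_kwargs())`.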

Retrieval Model

Vlense uses colpali-engine for page-image retrieval and defaults to vidore/colSmol-500M.
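
ColPali-style retrievers embed a query as several token vectors and a page as several patch vectors, then score the pair with late interaction (often called MaxSim). The toy sketch below illustrates that scoring rule in plain Python with made-up two-dimensional vectors; it is not the colpali-engine implementation, which works on batched tensors.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """For each query token take its best-matching page patch, then sum."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy data: two query tokens scored against two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]
page_b = [[0.5, 0.5]]

if __name__ == "__main__":
    print("page_a:", maxsim_score(query, page_a))
    print("page_b:", maxsim_score(query, page_b))
```

In this toy case page_a scores higher because each query token finds a closely matching patch, which is the behavior that lets page-image retrieval pick up on localized regions such as a table cell or a form field.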

For PDFs with a usable text layer, Vlense also supports:

retrieval="bm25" for lexical text retrieval with page-grounded answer synthesis
retrieval="hybrid" to combine BM25 text retrieval with ColPali page-image retrieval

This gives you:

document-aware visual retrieval instead of plain text-only chunking
a smaller default retriever than the heavier ColQwen variants
a local collection format that stores rendered pages plus embeddings for reuse
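
To give a feel for what combining a BM25 ranking with a page-image ranking can look like, here is a minimal sketch of reciprocal rank fusion (RRF), a common rank-merging technique. This README does not state which fusion method retrieval="hybrid" uses, so RRF, the constant k=60, and the page numbers below are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda page: scores[page], reverse=True)


# Hypothetical page numbers returned by each retriever, best first.
bm25_pages = [3, 7, 1]
colpali_pages = [7, 2, 3]
fused = reciprocal_rank_fusion([bm25_pages, colpali_pages])

if __name__ == "__main__":
    print(fused[:3])
```

Pages that both retrievers rank highly rise to the top, while a page found by only one retriever still stays in the candidate list.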

Example CLI

The repository includes a runnable example for PDF question answering:

uv run python examples/pdf_qa.py ./document.pdf \
  --collection my-docs \
  --question "What does the report say about pricing?" \
  --vision-model gpt-5-mini

API Overview

`Vlense.ocr()`

Runs OCR over one or more images or PDFs and returns generated content in Markdown, HTML, or JSON.

Key options:

file_path: single path or list of paths
model: OpenAI-compatible vision-capable model name
format: markdown, html, or json
json_schema: optional Pydantic schema for structured extraction
output_dir: optional directory for persisted outputs
api_key: optional API key override
base_url: optional OpenAI-compatible base URL override
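
As an illustration of the json_schema option, the sketch below defines a small Pydantic schema for an invoice and passes it to Vlense.ocr(). The Invoice and LineItem models and their fields are hypothetical, and two details are assumptions rather than documented behavior: that json_schema accepts the model class itself, and that it is meant to be paired with format="json".

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str
    amount: float


class Invoice(BaseModel):
    vendor: str
    invoice_number: Optional[str] = None
    total: float
    line_items: List[LineItem] = Field(default_factory=list)


async def extract_invoice(path: str):
    """Run structured extraction on one file using the Invoice schema."""
    # Imported lazily so the schema can be reused without vlense installed.
    from vlense import Vlense

    # Assumption: json_schema takes the model class and pairs with format="json".
    return await Vlense().ocr(
        file_path=path,
        model="gpt-5-mini",
        format="json",
        json_schema=Invoice,
    )
```

Keeping optional fields optional in the schema leaves room for documents where a value, such as an invoice number, is genuinely absent from the page.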

`Vlense.index()`

Builds a local multimodal retrieval collection from PDFs or images.

Key options:

data_dir: file path, list of paths, or directory
collection_name: logical name for the collection
index_dir: storage root for page renders and embeddings
retrieval: colpali, bm25, or hybrid
retriever_model: colpali-engine checkpoint name

`Vlense.ask()`

Searches an indexed collection, retrieves the most relevant pages, and asks a vision model to answer using those pages as evidence.

Key options:

query: user question
collection_name: existing indexed collection
model: answer model such as gpt-5-mini
top_k: number of retrieved pages to ground the answer
retrieval: optional override for colpali, bm25, or hybrid
api_key: optional API key override
base_url: optional OpenAI-compatible base URL override
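
Because the API is async, several questions against the same collection can be issued with bounded concurrency. The helper below is a sketch, not part of Vlense: ask_many and fake_ask are hypothetical names, fake_ask stands in for Vlense.ask so the example runs without a model or an index, and whether Vlense.ask is safe to call concurrently is an assumption you should verify.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def ask_many(
    ask: Callable[..., Awaitable[Any]],
    questions: List[str],
    max_concurrency: int = 2,
    **ask_kwargs: Any,
) -> Dict[str, Any]:
    """Call ask(query=..., **ask_kwargs) for each question, a few at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(question: str) -> Any:
        async with semaphore:
            return await ask(query=question, **ask_kwargs)

    # gather preserves input order, so answers line up with questions.
    answers = await asyncio.gather(*(run_one(q) for q in questions))
    return dict(zip(questions, answers))


async def fake_ask(query: str, **kwargs: Any) -> str:
    """Stand-in for Vlense.ask; returns a canned string instead of a real answer."""
    await asyncio.sleep(0)
    return f"answer to: {query} (top_k={kwargs.get('top_k')})"


if __name__ == "__main__":
    results = asyncio.run(
        ask_many(fake_ask, ["What is covered?", "Who is eligible?"], top_k=3)
    )
    for question, answer in results.items():
        print(question, "->", answer)
```

With a real instance, the call would look like `await ask_many(vlense.ask, questions, collection_name="company-docs", model="gpt-5-mini", top_k=3)`; a small max_concurrency helps stay within provider rate limits.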

Release Workflow

GitHub Actions runs CI on pushes and pull requests. Tagged releases publish to PyPI and create a GitHub Release.

Repository setup:

add a repository secret named PYPI_API_TOKEN

Release flow:

git tag v0.2.5
git push origin v0.2.5

Development

This repository uses uv, not pip.

Useful commands:

uv sync
uv run python -m unittest vlense.tests.test_vlense
uv build

Contributing

Issues and pull requests are welcome.

License

MIT

Release history

0.2.6 (this version): Mar 27, 2026
0.2.4: Mar 24, 2026
0.2.3: Mar 13, 2026
0.2.1: Mar 13, 2026
0.1.4: Nov 6, 2024
0.1.3: Nov 6, 2024
0.1.2: Nov 6, 2024
0.1.1: Nov 2, 2024
0.1.0: Nov 2, 2024

Download files

Source Distribution

vlense-0.2.6.tar.gz (25.1 kB)

Uploaded Mar 27, 2026 Source

Built Distribution

vlense-0.2.6-py3-none-any.whl (28.4 kB)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file vlense-0.2.6.tar.gz.

File metadata

Download URL: vlense-0.2.6.tar.gz
Upload date: Mar 27, 2026
Size: 25.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vlense-0.2.6.tar.gz
Algorithm	Hash digest
SHA256	`03e35471220d6a848e01824a94333144fc6734e6712418cfc257842bc8c3a9fb`
MD5	`7cf742f6dd379c52c2bb94b7218b2ccd`
BLAKE2b-256	`9f422c5d65e3c99e37956d6ae0a8c27ad41b2e1bfbb9aadaaeda0c8477be68e5`

File details

Details for the file vlense-0.2.6-py3-none-any.whl.

File metadata

Download URL: vlense-0.2.6-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 28.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vlense-0.2.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f378c8caaa3d0ec67db275fb4af02f962bce9f539739dacccf134b0b192f7f2b`
MD5	`f6b965c7d7c5846380bc159a9b77ee34`
BLAKE2b-256	`789dbcd1d60091f3b9d46075e78814ddad1f821c03ec44c25355e149f0c0a89c`
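
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names sha256_of and matches_published_hash are illustrative, and the file is assumed to sit in the current directory.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with the digest copied from the table."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches_published_hash(Path("vlense-0.2.6-py3-none-any.whl"), "f378c8caaa3d0ec67db275fb4af02f962bce9f539739dacccf134b0b192f7f2b")` should return True for an intact copy of the wheel.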
