Vlense

Vision-language OCR and multimodal document QA for images and PDFs.

Vlense helps you do two things well:

  • extract structured or free-form content from images and PDFs with vision models
  • build a page-level retrieval index over documents and ask grounded questions with citations

It is designed for workflows where plain OCR is not enough and the model needs to reason over full document pages, scans, tables, forms, and mixed visual layouts.

What It Does

  • OCR for images and PDFs with Markdown, HTML, or JSON output
  • Pydantic schema support for structured extraction
  • Page-image indexing for PDFs and image collections
  • Text-layer BM25 retrieval for PDFs
  • Multimodal retrieval with colpali-engine
  • Grounded question answering over retrieved document pages
  • Async Python API with a small surface area

Installation

Install the package:

uv add vlense

Or install from source in this repository:

uv sync

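PDF rendering uses pdf2image, which requires Poppler to be installed with its command-line tools on your PATH (for example, the poppler-utils package on Debian/Ubuntu, or poppler via Homebrew on macOS).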
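
If PDF rendering fails, a quick check that Poppler is visible to Python can save some debugging. The sketch below is not part of Vlense; `missing_poppler_tools` is a hypothetical helper that only looks for the `pdftoppm` and `pdfinfo` executables that pdf2image calls.

```python
import shutil

# pdf2image shells out to Poppler's command-line tools, so they must be on PATH.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the Poppler executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks available.")
```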

Quick Start

OCR

import asyncio
import os

from vlense import Vlense


async def main():
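    # Placeholder: replace with a real key, or export OPENAI_API_KEY in your
    # shell and delete this line.
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"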

    vlense = Vlense()
    result = await vlense.ocr(
        file_path=["./invoice.png", "./report.pdf"],
        model="gpt-5-mini",
        format="markdown",
    )

    print(result["invoice.png"].content)


if __name__ == "__main__":
    asyncio.run(main())

Document QA

import asyncio
import os

from vlense import Vlense


async def main():
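    # Placeholder: replace with a real key, or export OPENAI_API_KEY in your
    # shell and delete this line.
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"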

    vlense = Vlense()

    await vlense.index(
        data_dir="./handbook.pdf",
        collection_name="company-docs",
        index_dir="./.vlense",
        retrieval="hybrid",
        retriever_model="vidore/colSmol-500M",
    )

    answer = await vlense.ask(
        query="What are the eligibility requirements?",
        collection_name="company-docs",
        index_dir="./.vlense",
        model="gpt-5-mini",
        top_k=3,
    )

    print(answer)


if __name__ == "__main__":
    asyncio.run(main())

Vlense.ask() returns a grounded answer based on the retrieved page images, with cited page references.

For OpenAI-compatible gateways, set OPENAI_BASE_URL or pass base_url= directly to Vlense.ocr() and Vlense.ask().
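
One way to keep gateway settings in a single place is to collect them into keyword arguments and pass them to both calls. This is a sketch under assumptions: `gateway_kwargs` is a hypothetical helper, and the precedence it shows (explicit argument first, then environment variable) is an illustration rather than a description of how Vlense resolves these values internally.

```python
import os


def gateway_kwargs(base_url=None, api_key=None):
    """Collect only the overrides that are actually set.

    Explicit arguments win over OPENAI_BASE_URL / OPENAI_API_KEY.
    """
    resolved_url = base_url or os.environ.get("OPENAI_BASE_URL")
    resolved_key = api_key or os.environ.get("OPENAI_API_KEY")

    kwargs = {}
    if resolved_url:
        kwargs["base_url"] = resolved_url
    if resolved_key:
        kwargs["api_key"] = resolved_key
    return kwargs
```

The result can then be unpacked into a call, for example `await vlense.ocr(file_path="./invoice.png", model="gpt-5-mini", **gateway_kwargs())`.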

Retrieval Model

Vlense uses colpali-engine for page-image retrieval and defaults to vidore/colSmol-500M.
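
ColPali-style retrievers generally score a page with ColBERT-style late interaction: each query token embedding is matched against its most similar page patch embedding, and those maxima are summed. The toy sketch below illustrates that scoring rule in plain Python with made-up 2-dimensional vectors; it does not call colpali-engine, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Late-interaction score: sum over query vectors of the best page match."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(query_vectors, pages):
    """Return (page_id, score) pairs sorted from most to least relevant."""
    scored = [
        (page_id, maxsim_score(query_vectors, vectors))
        for page_id, vectors in pages.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings: two query token vectors, two patch vectors per page.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page-1": [[1.0, 0.0], [0.5, 0.5]],
    "page-2": [[0.0, 0.2], [0.1, 0.1]],
}
ranking = rank_pages(query, pages)
print(ranking[0][0])  # page-1 scores highest for this toy data
```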

For PDFs with a usable text layer, Vlense also supports:

  • retrieval="bm25" for lexical text retrieval with page-grounded answer synthesis
  • retrieval="hybrid" to combine BM25 text retrieval with ColPali page-image retrieval

This gives you:

  • document-aware visual retrieval instead of plain text-only chunking
  • a smaller default retriever than the heavier ColQwen variants
  • a local collection format that stores rendered pages plus embeddings for reuse
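
To make the lexical and hybrid modes more concrete, here is a minimal sketch of Okapi BM25 scoring over page text, followed by reciprocal rank fusion (RRF) as one common way to merge a text ranking with a visual ranking. The README does not say how Vlense combines the two signals, so RRF is an assumption used for illustration, and all function names and sample pages are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, pages, k1=1.5, b=0.75):
    """Score each page's text against the query with Okapi BM25."""
    if not pages:
        return {}
    tokenized = {pid: text.lower().split() for pid, text in pages.items()}
    n_pages = len(tokenized)
    # Fall back to 1 so empty collections of text do not divide by zero.
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_pages or 1
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for pid, toks in tokenized.items():
        counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_pages - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[pid] = score
    return scores


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings; earlier ranks contribute more."""
    fused = Counter()
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] += 1.0 / (k + rank)
    return sorted(fused, key=lambda pid: fused[pid], reverse=True)


pages = {
    "p1": "pricing table for enterprise plans",
    "p2": "eligibility requirements for employees",
    "p3": "contact and support",
}
text_scores = bm25_scores("eligibility requirements", pages)
text_ranking = sorted(text_scores, key=text_scores.get, reverse=True)
visual_ranking = ["p2", "p3", "p1"]  # stand-in for a page-image retriever's output
fused = reciprocal_rank_fusion([text_ranking, visual_ranking])
print(fused[0])  # p2 is ranked first by both signals
```

RRF only needs ranks, not comparable scores, which is convenient when the two retrievers produce scores on different scales.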

Example CLI

The repository includes a runnable example for PDF question answering:

uv run python examples/pdf_qa.py ./document.pdf \
  --collection my-docs \
  --question "What does the report say about pricing?" \
  --vision-model gpt-5-mini

API Overview

Vlense.ocr()

Runs OCR over one or more images or PDFs and returns generated content in Markdown, HTML, or JSON.

Key options:

  • file_path: single path or list of paths
  • model: OpenAI-compatible vision-capable model name
  • format: markdown, html, or json
  • json_schema: optional Pydantic schema for structured extraction
  • output_dir: optional directory for persisted outputs
  • api_key: optional API key override
  • base_url: optional OpenAI-compatible base URL override
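
For structured extraction, the schema is an ordinary Pydantic model. The sketch below assumes Pydantic v2 and uses hypothetical `Invoice` and `LineItem` models; it validates a hand-written JSON string of the shape such a schema describes rather than calling Vlense.

```python
from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float


class Invoice(BaseModel):
    invoice_number: str
    total: float
    items: list[LineItem] = Field(default_factory=list)


# Hand-written sample standing in for JSON produced by a vision model.
raw = (
    '{"invoice_number": "INV-001", "total": 30.0, '
    '"items": [{"description": "Widget", "quantity": 3, "unit_price": 10.0}]}'
)

invoice = Invoice.model_validate_json(raw)
print(invoice.items[0].description)

# The JSON Schema view of the model, useful for inspecting what is requested.
schema = Invoice.model_json_schema()
```

A model like this would be passed as `json_schema=Invoice` together with `format="json"`; whether the parameter expects the model class or something else is an assumption here.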

Vlense.index()

Builds a local multimodal retrieval collection from PDFs or images.

Key options:

  • data_dir: file path, list of paths, or directory
  • collection_name: logical name for the collection
  • index_dir: storage root for page renders and embeddings
  • retrieval: colpali, bm25, or hybrid
  • retriever_model: colpali-engine checkpoint name
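
Because `data_dir` accepts a file path, a list of paths, or a directory, it can help to preview which files a given value would cover before indexing. The helper below is a hypothetical sketch, not Vlense's own logic, and the set of file extensions is an assumption (the README only says "PDFs or images").

```python
from pathlib import Path

# Assumed extensions; adjust to whatever your installation actually supports.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_documents(data_dir):
    """Expand a path, list of paths, or directory into a list of files."""
    entries = [data_dir] if isinstance(data_dir, (str, Path)) else list(data_dir)

    files = []
    for entry in entries:
        path = Path(entry)
        if path.is_dir():
            # Walk directories recursively, in a stable order.
            candidates = sorted(p for p in path.rglob("*") if p.is_file())
        else:
            candidates = [path]
        files.extend(p for p in candidates if p.suffix.lower() in SUPPORTED_SUFFIXES)
    return files
```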

Vlense.ask()

Searches an indexed collection, retrieves the most relevant pages, and asks a vision model to answer using those pages as evidence.

Key options:

  • query: user question
  • collection_name: existing indexed collection
  • model: answer model such as gpt-5-mini
  • top_k: number of retrieved pages to ground the answer
  • retrieval: optional override for colpali, bm25, or hybrid
  • api_key: optional API key override
  • base_url: optional OpenAI-compatible base URL override
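
The effect of `top_k` can be pictured as keeping only the highest-scoring pages from the retriever before they are sent to the answer model. The sketch below uses made-up scores and a hypothetical `select_top_pages` helper; it is an illustration of the idea, not Vlense's implementation.

```python
import heapq


def select_top_pages(page_scores, top_k=3):
    """Return the top_k page keys, best first."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    best = heapq.nlargest(top_k, page_scores.items(), key=lambda item: item[1])
    return [page for page, _ in best]


# Keys are (document, page number); scores are invented for the example.
scores = {
    ("handbook.pdf", 4): 0.91,
    ("handbook.pdf", 12): 0.40,
    ("handbook.pdf", 7): 0.85,
    ("handbook.pdf", 1): 0.10,
}
evidence = select_top_pages(scores, top_k=2)
print(evidence)
```

A larger `top_k` gives the model more context at the cost of more image tokens per question.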

Release Workflow

GitHub Actions runs CI on pushes and pull requests. Tagged releases publish to PyPI and create a GitHub Release.

Repository setup:

  • add a repository secret named PYPI_API_TOKEN

Release flow:

git tag v0.2.5
git push origin v0.2.5

Development

This repository uses uv, not pip.

Useful commands:

uv sync
uv run python -m unittest vlense.tests.test_vlense
uv build

Contributing

Issues and pull requests are welcome.

License

MIT
