A Python package to extract text from images and PDFs using Vision Language Model (VLM).
Project description
Vlense
A Python package to extract text from images and PDFs using Vision Language Models (VLM).
Features
- Extract text from images and PDFs
- Supports JSON, HTML, and Markdown formats
- Easy integration with Vision Language Models
- Asynchronous processing with batch support
- Custom JSON schema for structured output
- Build page-image embedding collections for multimodal RAG
- Ask grounded questions over indexed PDFs and images
Installation
uv sync
Usage
import os
import asyncio
from vlense import Vlense
from pydantic import BaseModel
path = ["./images/image1.jpg", "test.pdf"]
output_dir = "./output"
model = "openai/gpt-5-mini"
temp_dir = "./temp_images"
os.environ["GEMINI_API_KEY"] = "YOUR_API_KEY"
async def main():
vlense = Vlense()
responses = await vlense.ocr(
file_path=path,
model=model,
output_dir=output_dir,
temp_dir=temp_dir,
batch_size=3,
clean_temp_files=False,
)
if __name__ == "__main__":
asyncio.run(main())
Multimodal RAG
Vlense.index() builds a colpali-engine retrieval index for PDFs and images. Vlense.ask() searches that index for the most relevant document pages and sends the retrieved page images to the vision model to answer with citations.
import asyncio
from vlense import Vlense
async def main():
vlense = Vlense()
await vlense.index(
data_dir=["./handbook.pdf", "./diagram.png"],
collection_name="company-docs",
index_dir="./.vlense",
retriever_model="vidore/colSmol-500M",
)
answer = await vlense.ask(
query="What are the eligibility requirements?",
collection_name="company-docs",
index_dir="./.vlense",
model="openai/gpt-5-mini",
top_k=3,
)
print(answer)
if __name__ == "__main__":
asyncio.run(main())
Retrieval uses colpali-engine directly and defaults to vidore/colSmol-500M, which is smaller than full ColQwen2-family checkpoints while staying document-focused.
API
Vlense.ocr()
Performs OCR on the provided files.
Parameters:
-
file_path : (Union[str, List[str]]): Path or list of paths to PDF/image files.
-
model : (str, optional): Model name for generating completions. Defaults to
"gemini-flash-latest". -
output_dir : (Optional[str], optional): Directory to save output. Defaults to
None. -
temp_dir : (Optional[str], optional): Directory for temporary files. Defaults to system temp.
-
batch_size : (int, optional): Number of concurrent processes. Defaults to
3. -
format : (str, optional): Output format (
'markdown','html','json'). Defaults to'markdown'. -
json_schema : (Optional[Type[BaseModel]], optional): Pydantic model for JSON output. Required if format is
'json'. -
clean_temp_files : (Optional[bool], optional): Cleanup temporary files after processing. Defaults to
True.
Returns:
- Dict[str, VlenseResponse] : Generated content.
Vlense.index()
Indexes one or more PDFs or images into a local page-image retrieval collection.
Parameters:
- data_dir : (Union[str, List[str]]): File path, list of file paths, or a directory containing supported PDF/image files.
- collection_name : (str): Name of the collection to create or replace.
- index_dir : (str, optional): Root directory used to store indexed collections. Defaults to
".vlense". - retriever_model : (str, optional):
colpali-enginemodel name. Defaults to"vidore/colSmol-500M". - embedding_batch_size : (int, optional): Batch size used while embedding page images. Defaults to
2. - temp_dir : (Optional[str], optional): Temporary directory for rendered PDF pages.
Returns:
- str : Path to the collection manifest.
Vlense.ask()
Answers a question using retrieved page images from a previously indexed collection.
Parameters:
- query : (str): User question.
- collection_name : (str): Indexed collection name.
- model : (str, optional): Vision model used for grounded answering. Defaults to
"gemini-flash-latest". - index_dir : (str, optional): Root directory where collections are stored. Defaults to
".vlense". - top_k : (int, optional): Number of retrieved pages to send to the model. Defaults to
3.
Returns:
- str : Grounded answer with cited pages.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
Author: Aditya Miskin
Email: adityamiskin98@gmail.com
Repository: https://github.com/adityamiskin/vlense
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vlense-0.2.1.tar.gz.
File metadata
- Download URL: vlense-0.2.1.tar.gz
- Upload date:
- Size: 19.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e30c84aa3f766c8baf86a00ce42cb2abfe1c86dbb82630192d3c60783dc255fc
|
|
| MD5 |
da45f83e0463c533ecc127836b346240
|
|
| BLAKE2b-256 |
011927ba4326afef0f36c6583312568eecbd3dd0d25b512f8f4599f6602e06d3
|
File details
Details for the file vlense-0.2.1-py3-none-any.whl.
File metadata
- Download URL: vlense-0.2.1-py3-none-any.whl
- Upload date:
- Size: 23.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
376b03d9b6be08e4eb57fb6373cbb14874024b08a56ccb8fba1cbbad456d2533
|
|
| MD5 |
fa9a79bd19a6cd5d5f296b47aeea3586
|
|
| BLAKE2b-256 |
cd5d713cf372a7cd4a746d4e894c253d3c5f9179e2361df4c9eeb92ab83ec84c
|