A Python package for OCR using Vision LLMs
Project description
bOCR: OCR Framework with Vision LLMs
bOCR is an Optical Character Recognition (OCR) framework that uses Vision Large Language Models (VLLMs) for text extraction and document processing.
Features
- Minimal Setup: Requires just a single backbone file (e.g., `qwen.py` or `ollamas.py`) for OCR execution, making it lightweight and easy to use.
- Broad Vision LLM Support: Integrates with vision LLMs such as `Qwen`, `Llama`, and `Phi`, as well as the VLLMs included in the `Ollama` package.
- Customizable Prompts: Fine-tune OCR output using either a custom or the default prompt.
- Automated Preprocessing: Image denoising, resizing, and PDF-to-image conversion.
- Postprocessing & Export: Supports merging pages and multiple export formats (`plain`, `markdown`, `docx`, `pdf`).
- Configurable Pipeline: A single `Config` object centralizes OCR settings.
- Detailed Logging: Integrated verbose logging for insights and debugging.
Installation
Install from PyPI (Recommended)
```shell
pip install bocr
```
Install from Source (Development Version)
```shell
git clone https://github.com/adrianphoulady/bocr.git
cd bocr
pip install .
```
Required Dependencies
For PDF and document processing, `poppler`, `pandoc`, and a LaTeX distribution are also required. You can install them as follows:
Linux (Debian/Ubuntu)
```shell
sudo apt install poppler-utils pandoc texlive-xetex texlive-fonts-recommended lmodern
```
macOS (using Homebrew)
```shell
brew install poppler pandoc
brew install --cask mactex-no-gui
```
Windows (using Chocolatey)
```shell
choco install poppler pandoc miktex
```
Quick Start
Simple Example (Single File OCR)
Any backbone file in the `backbones` module, such as `qwen.py`, is all you need to run OCR on an image:

```python
from bocr.backbones.qwen import extract_text

result = extract_text("sample1.png")
print(result)
```
Advanced Usage
```python
from bocr import Config, ocr

config = Config(
    model_id="Qwen/Qwen2-VL-7B-Instruct",
    export_results=True,
    export_format="pdf",
    verbose=True,
)
files = ["sample2.pdf"]
results = ocr(files, config)
print(results)
```
Command Line Example
```shell
bocr sample1.jpg --export-results --export-format docx --verbose
```
Configuration
The Config class centralizes OCR settings. Key parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str`/`None` | Custom OCR prompt, or `None` for the default. | `None` |
| `model_id` | `str` | Vision LLM model identifier. | `Qwen/Qwen2.5-VL-3B-Instruct` |
| `max_new_tokens` | `int` | Max tokens generated by the model. | `1024` |
| `preprocess` | `bool` | Enable preprocessing of input files. | `False` |
| `resolution` | `int` | DPI for PDF-to-image conversion. | `150` |
| `max_image_size` | `int`/`None` | Resize images to a max size; no resizing if `None`. | `1920` |
| `result_format` | `str` | Output format (`plain`, `markdown`). | `md` |
| `merge_text` | `bool` | Merge extracted text. | `False` |
| `export_results` | `bool` | Save results to files. | `False` |
| `export_format` | `str` | File output format (`txt`, `md`, `docx`, `pdf`). | `md` |
| `export_dir` | `str`/`None` | Directory for output files; `./ocr_exports` if `None`. | `None` |
| `verbose` | `bool` | Enable detailed logging for debugging. | `False` |
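As an illustrative sketch only (not bocr's actual source), the defaults in the table above can be mirrored as a dataclass:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConfigSketch:
    # Illustrative mirror of the documented Config defaults; field names and
    # defaults follow the table above, but this is not bocr's real class.
    prompt: Optional[str] = None
    model_id: str = "Qwen/Qwen2.5-VL-3B-Instruct"
    max_new_tokens: int = 1024
    preprocess: bool = False
    resolution: int = 150
    max_image_size: Optional[int] = 1920
    result_format: str = "md"
    merge_text: bool = False
    export_results: bool = False
    export_format: str = "md"
    export_dir: Optional[str] = None
    verbose: bool = False

# Override only the settings you need; everything else keeps its default.
cfg = ConfigSketch(model_id="Qwen/Qwen2-VL-7B-Instruct", verbose=True)
print(cfg.model_id)  # → Qwen/Qwen2-VL-7B-Instruct
```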
OCR Pipeline
1. Preprocessing
- URL Handling: Downloads remote files if input is a URL.
- PDF Conversion: Converts PDFs into image format (requires `poppler` installed and in `PATH`).
- Image Enhancement: Applies denoising and contrast adjustment.
- Resizing: Optimizes images for Vision LLMs.
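For illustration, the resizing step can be sketched in plain Python. The cap of 1920 pixels follows the `max_image_size` default; the exact scaling logic is an assumption, not bocr's actual code:

```python
def fit_within(width: int, height: int, max_size: int = 1920) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_size,
    preserving the aspect ratio. Returns the size unchanged if it already fits."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height
    scale = max_size / longest
    return round(width * scale), round(height * scale)

print(fit_within(3840, 2160))  # → (1920, 1080)
```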
2. Text Extraction
- Extracts text using Vision LLMs, with support for custom prompts for tailored OCR instructions.
3. Postprocessing
- Formats and, if configured, merges the extracted text in the specified result format.
- Converts it into the specified export format (e.g., Markdown, PDF).
- Saves results if configured.
Logging
Enable logging by setting verbose=True in the Config object. Logs provide insights into preprocessing, extraction, and postprocessing steps.
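Internally this likely maps onto Python's standard `logging` module; the sketch below shows what `verbose=True` might correspond to (the `"bocr"` logger name and level choices are assumptions):

```python
import logging

def configure_logging(verbose: bool) -> logging.Logger:
    # DEBUG when verbose, WARNING otherwise; the "bocr" logger name is assumed.
    logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")
    logger = logging.getLogger("bocr")
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    return logger

logger = configure_logging(verbose=True)
logger.debug("preprocessing sample1.png")  # printed only when verbose=True
```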
Supported Models
bOCR supports Vision LLMs such as:
- `Qwen/Qwen2.5-VL-3B-Instruct`
- `Qwen/Qwen2.5-VL-7B-Instruct`
- `Qwen/Qwen2.5-VL-72B-Instruct`
- `Qwen/Qwen2-VL-2B-Instruct`
- `Qwen/Qwen2-VL-7B-Instruct`
- `Qwen/Qwen2-VL-72B-Instruct`
- `Qwen/QVQ-72B-Preview`
- `meta-llama/Llama-3.2-11B-Vision-Instruct`
- `meta-llama/Llama-3.2-90B-Vision-Instruct`
- `microsoft/Phi-3.5-vision-instruct`
- `llama3.2-vision:11b` from Ollama
- `llama3.2-vision:90b` from Ollama
Additional models can be supported by implementing a new backbone in bocr/backbones/ and updating mappings.yaml.
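As an illustrative sketch of how a model-to-backbone mapping might be resolved (the prefixes and backbone names below are hypothetical, not the contents of `mappings.yaml`):

```python
# Hypothetical mapping from model-id prefixes to backbone module names.
BACKBONES = {
    "Qwen/": "qwen",
    "meta-llama/": "llama",
    "microsoft/Phi": "phi",
    "llama3.2-vision": "ollama",
}

def pick_backbone(model_id: str) -> str:
    """Return the backbone module name whose prefix matches model_id."""
    for prefix, backbone in BACKBONES.items():
        if model_id.startswith(prefix):
            return backbone
    raise ValueError(f"No backbone registered for {model_id!r}")

print(pick_backbone("Qwen/Qwen2.5-VL-3B-Instruct"))  # → qwen
```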
License
This project is licensed under the MIT License.
File details
Details for the file bocr-0.2.0.tar.gz.
File metadata
- Download URL: bocr-0.2.0.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `031a8fe427e5cb1adf0671914ab21f3dd11ab249e1e529d9d49672b79df13b48` |
| MD5 | `a81bfa7185a9d18ed139cdbcc28f493c` |
| BLAKE2b-256 | `602fa7ecad814ecf6cc96c9af87e5e8e1344aac7df407827521c2016706ae7b1` |
File details
Details for the file bocr-0.2.0-py3-none-any.whl.
File metadata
- Download URL: bocr-0.2.0-py3-none-any.whl
- Upload date:
- Size: 19.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `280f29fce59314f8832a157f6f952f8aaefa20d2287a9d37cb0cc0448aff8941` |
| MD5 | `0e6c99a3b86426755f7c825bcdff288d` |
| BLAKE2b-256 | `7593c592b15c0384a4bbe7b7597f2d8736b55ac29128903b42042285d3a6acbe` |