A Python package for OCR using Vision LLMs

bOCR: OCR Framework with Vision LLMs

bOCR is an Optical Character Recognition (OCR) framework that uses Vision Large Language Models (VLLMs) for text extraction and document processing.

Features

  • Minimal Setup: Requires just a single backbone file (e.g., qwen.py or ollamas.py) for OCR execution, making it lightweight and easy to use.
  • Broad Vision LLM Support: Integrates with vision LLMs like Qwen, Llama, Phi, and various VLLMs included in the Ollama package.
  • Customizable Prompts: Fine-tune OCR output using either a custom or default prompt.
  • Automated Preprocessing: Image denoising, resizing, and PDF-to-image conversion.
  • Postprocessing & Export: Supports merging pages and multiple export formats (plain, markdown, docx, pdf).
  • Configurable Pipeline: A single Config object centralizes OCR settings.
  • Detailed Logging: Integrated verbose logging for insights and debugging.

Installation

Install from PyPI (Recommended)

pip install bocr

Install from Source (Development Version)

git clone https://github.com/adrianphoulady/bocr.git
cd bocr
pip install .

Required Dependencies

For PDF and document processing, poppler, pandoc, and LaTeX are also required. You can install them as follows:

Linux (Debian/Ubuntu)

sudo apt install poppler-utils pandoc texlive-xetex texlive-fonts-recommended lmodern

macOS (using Homebrew)

brew install poppler pandoc
brew install --cask mactex-no-gui

Windows (using Chocolatey)

choco install poppler pandoc miktex

Quick Start

Simple Example (Single File OCR)

Any backbone file in the backbones module, like qwen.py, is all you need to run OCR on an image:

from bocr.backbones.qwen import extract_text

result = extract_text("sample1.png")
print(result)

Advanced Usage

from bocr import Config, ocr

config = Config(model_id="Qwen/Qwen2-VL-7B-Instruct", export_results=True, export_format="pdf", verbose=True)
files = ["sample2.pdf"]
results = ocr(files, config)
print(results)

Command Line Example

bocr sample1.jpg --export-results --export-format docx --verbose

Configuration

The Config class centralizes OCR settings. Key parameters:

| Parameter | Type | Description | Default |
|---|---|---|---|
| prompt | str/None | Custom OCR prompt, or None for the default. | None |
| model_id | str | Vision LLM model identifier. | Qwen/Qwen2.5-VL-3B-Instruct |
| max_new_tokens | int | Maximum tokens generated by the model. | 1024 |
| preprocess | bool | Enable preprocessing of input files. | False |
| resolution | int | DPI for PDF-to-image conversion. | 150 |
| max_image_size | int/None | Resize images to a maximum size; no resizing if None. | 1920 |
| result_format | str | Output format (plain, markdown). | md |
| merge_text | bool | Merge extracted text across pages. | False |
| export_results | bool | Save results to files. | False |
| export_format | str | File output format (txt, md, docx, pdf). | md |
| export_dir | str/None | Directory for output files; ./ocr_exports if None. | None |
| verbose | bool | Enable detailed logging for debugging. | False |
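To see how these parameters fit together, the table can be mirrored as a small dataclass sketch. This is illustrative only (the names and defaults come from the table above; the real Config class in bOCR may differ internally):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConfigSketch:
    # Illustrative mirror of the Config parameters and defaults listed above.
    prompt: Optional[str] = None
    model_id: str = "Qwen/Qwen2.5-VL-3B-Instruct"
    max_new_tokens: int = 1024
    preprocess: bool = False
    resolution: int = 150
    max_image_size: Optional[int] = 1920
    result_format: str = "md"
    merge_text: bool = False
    export_results: bool = False
    export_format: str = "md"
    export_dir: Optional[str] = None
    verbose: bool = False

# Override only what you need; everything else keeps its default.
config = ConfigSketch(model_id="Qwen/Qwen2-VL-7B-Instruct", export_results=True)
```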

OCR Pipeline

1. Preprocessing

  • URL Handling: Downloads remote files if input is a URL.
  • PDF Conversion: Converts PDFs into image format (requires poppler installed and in PATH).
  • Image Enhancement: Applies denoising and contrast adjustment.
  • Resizing: Optimizes images for Vision LLMs.

2. Text Extraction

  • Extracts text using Vision LLMs, with support for custom prompts for tailored OCR instructions.

3. Postprocessing

  • Formats the extracted text in the specified result format and, if enabled, merges pages.
  • Converts the result into the specified export format (e.g., Markdown, PDF).
  • Saves the results to disk if export is enabled.
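The merge step amounts to joining per-page results into one document; a minimal sketch (function name and separator are assumptions, not bOCR's API):

```python
def merge_pages(pages: list[str], separator: str = "\n\n") -> str:
    # Join per-page OCR results into a single document, skipping empty pages.
    return separator.join(p.strip() for p in pages if p.strip())
```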

Logging

Enable logging by setting verbose=True in the Config object. Logs provide insights into preprocessing, extraction, and postprocessing steps.


Supported Models

bOCR supports Vision LLMs such as:

  • Qwen/Qwen2.5-VL-3B-Instruct
  • Qwen/Qwen2.5-VL-7B-Instruct
  • Qwen/Qwen2.5-VL-72B-Instruct
  • Qwen/Qwen2-VL-2B-Instruct
  • Qwen/Qwen2-VL-7B-Instruct
  • Qwen/Qwen2-VL-72B-Instruct
  • Qwen/QVQ-72B-Preview
  • meta-llama/Llama-3.2-11B-Vision-Instruct
  • meta-llama/Llama-3.2-90B-Vision-Instruct
  • microsoft/Phi-3.5-vision-instruct
  • llama3.2-vision:11b from Ollama
  • llama3.2-vision:90b from Ollama

Additional models can be supported by implementing a new backbone in bocr/backbones/ and updating mappings.yaml.


License

This project is licensed under the MIT License.
