Python package for combining .hocr files and images into searchable PDFs
hOCkeR
What is hOCkeR?
hOCkeR is a Python package for combining hOCR files and images into searchable PDFs. The package lays the text from the hOCR file on top of the image and then writes both to a PDF. The code is based on HOCRConverter by jbrinley. That code was designed for Python 2 and does not work with newer versions of Python, so I created this package as an update to the original code.
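To illustrate what laying text on top of an image involves, here is a minimal sketch of the coordinate conversion such a tool typically needs. It is not taken from hOCkeR's source: the function name `bbox_to_pdf_rect`, its parameters, and the 300 DPI default are assumptions made for this example. hOCR bounding boxes are given in image pixels with the origin at the top-left corner, while PDF coordinates are in points (1/72 inch) with the origin at the bottom-left.

```python
def bbox_to_pdf_rect(bbox, image_height_px, dpi=300):
    """Convert an hOCR bbox (x0, y0, x1, y1) in pixels to a PDF rectangle.

    Hypothetical helper for illustration only; the DPI default is an assumption.
    Returns (x, y, width, height) in PDF points, measured from the bottom-left.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # PDF points per image pixel
    x = x0 * scale
    # Flip the vertical axis: hOCR measures down from the top edge,
    # PDF measures up from the bottom edge.
    y = (image_height_px - y1) * scale
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    return (x, y, width, height)


if __name__ == "__main__":
    # A word box on a 1000 px tall page image scanned at 300 DPI.
    print(bbox_to_pdf_rect((100, 200, 400, 260), image_height_px=1000, dpi=300))
```

The scanning resolution matters here: if the assumed DPI does not match the image, the text layer will be scaled differently from the picture and text selection in the PDF will not line up with the visible words.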
How to install
To install the package, run the following command within a Python environment:
pip install hocker
If any errors occur during installation, try installing from the .whl file instead, linked here.
How to use hOCkeR
Below is an example of how to use hOCkeR to combine a PNG image and a .hocr file into a PDF:
import hOCkeR as hkr
image_path = 'path/to/image.png'
hocr_path = 'path/to/image.hocr'
# Specify the element in the hocr file to use as the text
hocr = hkr.hOCR('ocrx_word')  # For Tesseract output, use 'ocrx_word'
# Specify the hocr and image path
hocr.locate_image(image_path)
hocr.locate_hocr(hocr_path)
# Output the PDF
hocr.to_pdf('path/to/output.pdf')
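For context on what the 'ocrx_word' argument refers to: in an hOCR file, each recognized word is usually a span whose class names the element type and whose title attribute carries a bounding box in pixels. The sketch below reads those spans using only the Python standard library. It is an illustration of the file format, not hOCkeR's own parser; the `WordExtractor` class and the sample markup are made up for this example, and it assumes word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordExtractor(HTMLParser):
    """Collect (text, bbox) pairs from spans of a given hOCR class.

    Hypothetical helper for illustration; assumes no spans nested in a word.
    """

    def __init__(self, element_class="ocrx_word"):
        super().__init__()
        self.element_class = element_class
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and self.element_class in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


# Made-up sample in the style of Tesseract's hOCR output.
sample = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 109 92 215 116; x_wconf 93'>world</span>
</span>
"""

parser = WordExtractor("ocrx_word")
parser.feed(sample)
words = parser.words
for text, bbox in words:
    print(text, bbox)
```

Other OCR engines may label word-level elements with a different class, which is why the element name is passed in rather than hard-coded.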
Credits & links
- hOCkeR by Lucas Warwick
- HOCRConverter by jbrinley