Package containing utility functions for hOCR and Tesseract

Project description

hocr_utils

hocr_utils is a package to transform, plot and simplify the use of hOCR files.

Installation

Dependencies

hocr_utils requires:

  • Python (>= 3.7)

Optional Dependencies

The functions that plot hOCR or transform a PDF into hOCR require the following additional dependencies:

  • pytesseract
  • pdf2image
  • opencv-python

Additionally, a Tesseract language pack needs to be installed for non-English OCR.

Example: install the French language pack on Ubuntu with:

apt-get install tesseract-ocr-fra
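
To check that a language pack such as fra is visible to Tesseract after installation, the output of tesseract --list-langs can be inspected. The following is a minimal stdlib-only sketch and is not part of hocr_utils: the helper names parse_lang_listing and installed_tesseract_languages are hypothetical, and the code assumes the tesseract binary is on PATH (it returns an empty list otherwise).

```python
import shutil
import subprocess


def parse_lang_listing(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in listing.splitlines():
        line = line.strip()
        # Language codes (eng, fra, chi_sim, ...) never contain whitespace,
        # while the "List of available languages" header does, so skip
        # blank lines and any line containing a space.
        if not line or any(ch.isspace() for ch in line):
            continue
        langs.append(line)
    return langs


def installed_tesseract_languages():
    """Return installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some tesseract versions print the listing to stderr, so read both streams.
    return parse_lang_listing(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = installed_tesseract_languages()
    print("fra installed:", "fra" in langs)
```

If fra is missing from the result, the language pack is not installed or Tesseract cannot find its data directory.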

User Installation

The easiest way to install hocr_utils is using pip:

pip install -U hocr_utils

Use cases

Transform PIL Images to hOCR

Requires the pytesseract dependency and the Tesseract language pack for the requested language.

from hocr_utils import utils
from PIL import Image

image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])

Transform pdf to hOCR

Requires the pytesseract and pdf2image dependencies as well as the Tesseract language pack for the requested language.

from hocr_utils import utils

hocr = utils.pdf_to_hocr('./data/sample.pdf')

Transform hOCR to a list of dictionaries

from hocr_utils import utils

hocr_dict = utils.hocr_to_dict(hocr)

This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).

By default there is one entry per line in the hOCR. However, it is possible to group the list by ['paragraph', 'line', 'word'] using the by argument.
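
To make the grouping levels concrete, the sketch below parses a tiny hand-written hOCR fragment with the standard library and emits one record per paragraph, line, or word. It illustrates the idea only and is not the hocr_utils implementation: the function hocr_records, the class HocrRecords, and the record keys id, bbox and text are hypothetical, and the keys returned by utils.hocr_to_dict may differ. The class names ocr_par, ocr_line and ocrx_word and the bbox entry in the title attribute come from the hOCR format itself.

```python
import re
from html.parser import HTMLParser

# Hand-written hOCR fragment: one page, one paragraph, one line, two words.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 200 100'>
 <p class='ocr_par' id='par_1_1' title='bbox 10 10 190 40'>
  <span class='ocr_line' id='line_1_1' title='bbox 10 10 190 40'>
   <span class='ocrx_word' id='word_1_1' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
   <span class='ocrx_word' id='word_1_2' title='bbox 100 10 190 40; x_wconf 93'>world</span>
  </span>
 </p>
</div>
"""

# hOCR class name used for each grouping level.
LEVEL_CLASS = {"paragraph": "ocr_par", "line": "ocr_line", "word": "ocrx_word"}


class HocrRecords(HTMLParser):
    """Collect one record per element of the requested grouping level."""

    def __init__(self, by="line"):
        super().__init__()
        self.target = LEVEL_CLASS[by]
        self.records = []
        self.stack = []  # one entry per open tag: the record it opened, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        record = None
        if self.target in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            bbox = tuple(int(v) for v in match.groups()) if match else None
            record = {"id": attrs.get("id"), "bbox": bbox, "text": []}
            self.records.append(record)
        self.stack.append(record)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attach the text to whichever target-level element is currently open.
        words = data.split()
        for record in self.stack:
            if record is not None:
                record["text"].extend(words)


def hocr_records(hocr, by="line"):
    """Return a list of dicts, one per paragraph, line, or word."""
    parser = HocrRecords(by)
    parser.feed(hocr)
    parser.close()
    return [dict(r, text=" ".join(r["text"])) for r in parser.records]


if __name__ == "__main__":
    for level in ("paragraph", "line", "word"):
        print(level, hocr_records(SAMPLE_HOCR, by=level))
```

Grouping by word yields two records for the sample above, while grouping by line or paragraph yields a single record whose text joins both words.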

Get a single page from hOCR

from hocr_utils import utils

hocr_1 = utils.get_page(hocr, 1)
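
Before requesting a page it can help to know how many pages the hOCR contains. The sketch below counts elements carrying the ocr_page class, which is how the hOCR format marks a page. It is an illustration only, not part of hocr_utils: count_pages and PageCounter are hypothetical names, and it assumes the hOCR is available as a str or UTF-8 bytes. This description does not state whether the page index passed to get_page starts at 0 or 1, so that should be checked against the library.

```python
from html.parser import HTMLParser

# Hand-written two-page hOCR fragment used only for this illustration.
TWO_PAGE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 200 100'></div>
<div class='ocr_page' id='page_2' title='bbox 0 0 200 100'></div>
"""


class PageCounter(HTMLParser):
    """Count elements whose class attribute contains ocr_page."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.count += 1


def count_pages(hocr):
    """Return the number of ocr_page elements in an hOCR document."""
    if isinstance(hocr, bytes):
        # Assumption: byte input is UTF-8 encoded.
        hocr = hocr.decode("utf-8")
    counter = PageCounter()
    counter.feed(hocr)
    counter.close()
    return counter.count


if __name__ == "__main__":
    print(count_pages(TWO_PAGE_HOCR))
```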

Download files

Source Distribution

hocr_utils-0.0.3.tar.gz (7.5 kB)

Uploaded Source

Built Distribution

hocr_utils-0.0.3-py3-none-any.whl (7.7 kB)

Uploaded Python 3
