Package containing utility functions for hOCR and Tesseract

hocr_utils

hocr_utils is a package to transform, plot and simplify the use of hOCR files.

Installation

Dependencies

hocr_utils requires:

  • Python (>= 3.7)

Optional Dependencies

The functions that plot hOCR or transform PDF files into hOCR require the following additional dependencies:

  • pytesseract
  • pdf2image
  • opencv-python

Additionally, a Tesseract language pack needs to be installed for non-English OCR.

Example: install the French language pack on Ubuntu with:

apt-get install tesseract-ocr-fra
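
Before running OCR in another language, it can help to confirm that Tesseract actually sees the pack. The following is a minimal sketch and not part of hocr_utils: it assumes the tesseract binary is on PATH and relies on its --list-langs flag. The helper names parse_langs and installed_langs are hypothetical.

```python
import shutil
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Language codes contain no spaces, so skip blank lines, the
        # "List of available languages ..." header, and any warning text.
        if not line or " " in line:
            continue
        langs.append(line)
    return langs


def installed_langs() -> List[str]:
    """Return the packs Tesseract reports, or [] if the binary is missing."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the Tesseract version the list goes to stdout or stderr,
    # so both streams are parsed.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("French pack installed:", "fra" in installed_langs())
```

If the check prints False after installing the package, the tessdata directory Tesseract reads from may differ from the one the package wrote to.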

User Installation

The easiest way to install hocr_utils is using pip:

pip install -U hocr_utils

Use cases

Transform PIL Images to hOCR

Requires the pytesseract dependency and the requested Tesseract language pack.

from hocr_utils import utils
from PIL import Image

image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])

Transform PDF to hOCR

Requires the pytesseract and pdf2image dependencies, as well as the requested Tesseract language pack.

from hocr_utils import utils

hocr = utils.pdf_to_hocr('./data/sample.pdf')

Transform hOCR to a list of dictionaries

from hocr_utils import utils

# hocr is the hOCR output produced in the previous examples
hocr_dict = utils.hocr_to_dict(hocr)

This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).

By default there is one entry per line in the hOCR. However, it is possible to group the list by one of ['paragraph', 'line', 'word'] using the by argument.
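
To make the grouping levels concrete: hOCR marks paragraphs, lines, and words with the classes ocr_par, ocr_line, and ocrx_word, and stores each bounding box in the element's title attribute. The sketch below is not the hocr_utils implementation. It uses only the standard library, and the names WordCollector and group_by_line as well as the record keys (text, bbox, line) are hypothetical. It contrasts one record per word with one entry per line.

```python
from html.parser import HTMLParser
from typing import Dict, List


class WordCollector(HTMLParser):
    """Collect one record per ocrx_word element, tagged with its line index."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Dict] = []
        self._line = -1        # index of the current ocr_line
        self._bbox = None      # bbox of the word being read, if any
        self._in_word = False  # True between a word's start tag and its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self._line += 1
        elif "ocrx_word" in classes:
            # The title looks like "bbox 10 20 50 40; x_wconf 95".
            self._bbox = None
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = [int(v) for v in parts[1:5]]
            self._in_word = True

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.words.append(
                {"text": data.strip(), "bbox": self._bbox, "line": self._line}
            )
            self._in_word = False


def group_by_line(words: List[Dict]) -> List[str]:
    """Join word records that share a line index into one string per line."""
    lines: Dict[int, List[str]] = {}
    for word in words:
        lines.setdefault(word["line"], []).append(word["text"])
    return [" ".join(texts) for _, texts in sorted(lines.items())]


# Hand-written hOCR fragment used only for illustration.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 200 100'>
 <p class='ocr_par'>
  <span class='ocr_line' title='bbox 10 10 120 30'>
   <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Hello</span>
   <span class='ocrx_word' title='bbox 70 10 120 30; x_wconf 94'>world</span>
  </span>
  <span class='ocr_line' title='bbox 10 40 90 60'>
   <span class='ocrx_word' title='bbox 10 40 90 60; x_wconf 91'>again</span>
  </span>
 </p>
</div>
"""

collector = WordCollector()
collector.feed(SAMPLE)
print(collector.words)                 # one record per word
print(group_by_line(collector.words))  # one string per line
```

The dictionaries returned by hocr_to_dict may use different keys, so inspect one entry before relying on a specific field name.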

Get a single page from hOCR

from hocr_utils import utils

hocr_1 = utils.get_page(hocr, 1)
