Package containing utility functions for hOCR and Tesseract

hocr_utils

hocr_utils is a package that simplifies working with hOCR files, including transforming and plotting them.

Installation

Dependencies

hocr-utils requires:

  • Python (>= 3.7)

Optional Dependencies

The functions that plot hOCR or transform PDFs into hOCR require the following additional dependencies:

  • pytesseract
  • pdf2image
  • opencv-python

Additionally, a Tesseract language pack needs to be installed for non-English OCR.

Example: install the French language pack on Ubuntu with:

apt-get install tesseract-ocr-fra
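To confirm that a language pack is visible to Tesseract after installation, the output of `tesseract --list-langs` can be inspected. The following is a minimal sketch, not part of hocr_utils; the helper names `parse_list_langs` and `installed_languages` are hypothetical, and it assumes the `tesseract` executable is on the PATH.

```python
import re
import shutil
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Language entries are single tokens such as "eng", "fra" or "chi_sim".
        # The header line contains spaces, so it is skipped by this check.
        if line and re.fullmatch(r"[\w/-]+", line):
            codes.append(line)
    return codes


def installed_languages() -> List[str]:
    """Return installed language codes, or an empty list if tesseract is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions write the list to stderr, so both streams are read.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("fra installed:", "fra" in installed_languages())
```

If `fra` is missing from the list, the French language pack is not installed where Tesseract looks for its data files.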

User Installation

The easiest way to install hocr_utils is using pip:

pip install -U hocr_utils

Use cases

Transform PIL Images to hOCR

Requires the pytesseract dependency and the requested Tesseract language pack.

from hocr_utils import utils
from PIL import Image

image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])

Transform PDF to hOCR

Requires the pytesseract and pdf2image dependencies as well as the requested Tesseract language pack.

from hocr_utils import utils

hocr = utils.pdf_to_hocr('./data/sample.pdf')

Transform hOCR to a list of dictionaries

from hocr_utils import utils

hocr_dict = utils.hocr_to_dict(hocr)

This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).

By default there is one entry per line in the hOCR. However, the list can be grouped by 'paragraph', 'line', or 'word' using the by argument.
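To illustrate the kind of transformation involved, here is a minimal standard-library sketch that turns the words of a small hOCR snippet into a list of dictionaries and then groups them by line. It is not the implementation of hocr_to_dict; the function names, the record keys (`line`, `text`, `x0`, ...) and the sample document are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the hOCR style produced by Tesseract (assumption).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" id="page_1" title="bbox 0 0 600 800">
 <p class="ocr_par" id="par_1_1" title="bbox 10 10 300 60">
  <span class="ocr_line" id="line_1_1" title="bbox 10 10 300 30">
   <span class="ocrx_word" id="word_1_1" title="bbox 10 10 60 30; x_wconf 96">Hello</span>
   <span class="ocrx_word" id="word_1_2" title="bbox 70 10 130 30; x_wconf 91">world</span>
  </span>
  <span class="ocr_line" id="line_1_2" title="bbox 10 40 300 60">
   <span class="ocrx_word" id="word_1_3" title="bbox 10 40 80 60; x_wconf 88">again</span>
  </span>
 </p>
</div>
</body></html>"""

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def words_to_records(hocr):
    """Return one dictionary per ocrx_word element, tagged with its line id."""
    root = ET.fromstring(hocr)
    records = []
    for line in root.iter():
        if line.get("class") != "ocr_line":
            continue
        for word in line.iter():
            if word.get("class") != "ocrx_word":
                continue
            match = BBOX.search(word.get("title", ""))
            if match is None:
                continue  # skip words without a bounding box
            x0, y0, x1, y1 = (int(v) for v in match.groups())
            records.append({
                "line": line.get("id"),
                "text": "".join(word.itertext()).strip(),
                "x0": x0, "y0": y0, "x1": x1, "y1": y1,
            })
    return records


def group_by_line(records):
    """Join the word texts that share the same line id, keeping document order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["line"]].append(record["text"])
    return [{"line": key, "text": " ".join(words)} for key, words in grouped.items()]


records = words_to_records(SAMPLE_HOCR)
lines = group_by_line(records)
```

A list of flat dictionaries like `records` is the shape that pd.DataFrame.from_records accepts, with one row per dictionary and one column per key.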

Get a single page from hOCR

from hocr_utils import utils

hocr_1 = utils.get_page(hocr, 1)
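In hOCR, each page is an element with the class `ocr_page`, so selecting a page amounts to picking one of those elements. The sketch below shows the idea with the standard library only. It is not how get_page is implemented; `select_page` is a hypothetical helper, and its zero-based index is an assumption that may differ from the indexing get_page uses.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hand-written two-page sample (assumption), not real OCR output.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" id="page_1" title="bbox 0 0 600 800">
 <span class="ocrx_word" title="bbox 1 1 5 5">first</span>
</div>
<div class="ocr_page" id="page_2" title="bbox 0 0 600 800">
 <span class="ocrx_word" title="bbox 1 1 5 5">second</span>
</div>
</body></html>"""


def select_page(hocr, index):
    """Return the markup of the ocr_page element at the given zero-based index."""
    # Serialize with XHTML as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(hocr)
    pages = [el for el in root.iter() if el.get("class") == "ocr_page"]
    return ET.tostring(pages[index], encoding="unicode")


second_page = select_page(SAMPLE_HOCR, 1)
```

An out-of-range index raises IndexError here; a production helper would probably report that more explicitly.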

Download files

Source Distribution: hocr_utils-0.0.1.tar.gz (7.5 kB)

Built Distribution: hocr_utils-0.0.1-py3-none-any.whl (7.7 kB, Python 3)
