Package containing utility functions for hOCR and Tesseract
hocr_utils
hocr_utils is a package to transform, plot and simplify the use of hOCR files.
Installation
Dependencies
hocr-utils requires:
- Python (>= 3.7)
Optional Dependencies
The functions that plot or that transform PDF into hOCR require the following additional dependencies:
- pytesseract
- pdf2image
- opencv-python
Additionally, a tesseract language pack needs to be installed for non-English OCR.
Example: install the French language pack on Ubuntu with:
apt-get install tesseract-ocr-fra
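To confirm that a language pack was picked up, the languages known to the local tesseract binary can be listed. The sketch below is a minimal illustration and not part of hocr_utils: it assumes the tesseract executable is on the PATH and reads the output of `tesseract --list-langs`. The helper names parse_langs and installed_langs are hypothetical.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, e.g. "List of available languages (2):".
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs():
    """Return the language codes known to the local tesseract, or [] if it is missing."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Older tesseract releases print the list to stderr, so read both streams.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    # True once tesseract-ocr-fra is installed.
    print("fra" in installed_langs())
```

Any warnings tesseract prints alongside the list would be returned as if they were language codes, so treat the result as a quick check and not a strict validation.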
User Installation
The easiest way to install hocr_utils is using pip:
pip install -U hocr_utils
Use cases
Transform PIL Images to hOCR
Requires the pytesseract dependency and the requested tesseract language pack.
from hocr_utils import utils
from PIL import Image
image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])
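hOCR is HTML in which each recognised element carries a class such as ocr_page, ocr_line or ocrx_word, and a title attribute holding its bounding box. The following sketch uses only the standard library and is not part of hocr_utils; it shows how words and boxes could be read from such markup. It assumes the hOCR is available as a string. The SAMPLE fragment and the WordCollector class are made up for illustration.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment in the style tesseract produces (illustrative only).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 200 100'>
 <span class='ocr_line' title='bbox 10 10 190 40'>
  <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 100 10 190 40; x_wconf 91'>world</span>
 </span>
</div>
"""


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # The title holds ';'-separated properties; take the one named "bbox".
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(v) for v in parts[1:5])

    def handle_data(self, data):
        # Attach the first non-blank text after a word tag to that word's bbox.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


collector = WordCollector()
collector.feed(SAMPLE)
print(collector.words)
```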
Transform pdf to hOCR
Requires the pytesseract and pdf2image dependencies as well as the requested tesseract language pack.
from hocr_utils import utils
hocr = utils.pdf_to_hocr('./data/sample.pdf')
Transform hOCR to a list of dictionaries
from hocr_utils import utils
hocr_dict = utils.hocr_to_dict(hocr)
This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).
By default there will be one entry per line in the hOCR. However, it is possible to group the list by ['paragraph', 'line', 'word'] using the by argument.
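The effect of grouping can be pictured with plain Python. The sketch below does not call hocr_utils and does not reflect the actual keys returned by hocr_to_dict: the records and their 'line' and 'text' keys are hypothetical, chosen only to show how word-level entries collapse into line-level ones.

```python
from itertools import groupby

# Hypothetical word-level records; real hocr_to_dict output may use other keys.
words = [
    {"line": 1, "text": "Hello"},
    {"line": 1, "text": "world"},
    {"line": 2, "text": "Second"},
    {"line": 2, "text": "line"},
]


def group_by_line(records):
    """Merge consecutive records sharing the same 'line' value into one entry."""
    grouped = []
    for line_id, items in groupby(records, key=lambda r: r["line"]):
        grouped.append({"line": line_id, "text": " ".join(r["text"] for r in items)})
    return grouped


lines = group_by_line(words)
print(lines)
```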
Get a single page from hOCR
from hocr_utils import utils
hocr_1 = utils.get_page(hocr, 1)