Package containing utility functions for hOCR and tesseract
Project description
hocr_utils
hocr_utils is a package to transform, plot and simplify the use of hOCR files.
Installation
Dependencies
hocr-utils requires:
- Python (>= 3.7)
Optional Dependencies
The plotting functions and the functions that convert PDFs into hOCR require the following additional dependencies:
- pytesseract
- pdf2image
- opencv-python
Additionally, a tesseract language pack needs to be installed for non-English OCR.
Example: install the French language pack on Ubuntu with:
apt-get install tesseract-ocr-fra
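To confirm that a language pack is visible to tesseract after installing it, the installed languages can be listed from Python. The following is a minimal sketch that shells out to the `tesseract --list-langs` command; the helper name `installed_tesseract_languages` is hypothetical and not part of hocr_utils, and the parsing assumes the first output line is a header.

```python
import shutil
import subprocess


def installed_tesseract_languages():
    """Return the language codes reported by `tesseract --list-langs`.

    Returns an empty list when the tesseract binary is not on PATH.
    Hypothetical helper, not part of hocr_utils.
    """
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some tesseract versions print the list on stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    # Assumption: the first line is a header such as
    # "List of available languages (2):", so it is skipped.
    return [line.strip() for line in lines[1:] if line.strip()]


languages = installed_tesseract_languages()
print("fra installed:", "fra" in languages)
```

If `fra` is missing from the list, the language pack was probably not installed where tesseract looks for its data files.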
User Installation
The easiest way to install hocr_utils is using pip:
pip install -U hocr_utils
Use cases
Transform PIL Images to hOCR
Requires the pytesseract dependency and the requested tesseract language pack.
from hocr_utils import utils
from PIL import Image
image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])
Transform pdf to hOCR
Requires the pytesseract and pdf2image dependencies as well as the requested tesseract language pack.
from hocr_utils import utils
hocr = utils.pdf_to_hocr('./data/sample.pdf')
Transform hOCR to a list of dictionaries
from hocr_utils import utils
hocr_dict = utils.hocr_to_dict(hocr)
This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).
By default there will be one entry per line in the hOCR. However, it is possible to group the list by ['paragraph', 'line', 'word'] using the by argument.
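As an illustration of the pandas step, here is a minimal sketch that builds a DataFrame from a list of dictionaries with `pd.DataFrame.from_records`. The records and their keys (`page`, `line`, `text`) are hypothetical stand-ins; the keys actually returned by `hocr_to_dict` may differ.

```python
import pandas as pd

# Hypothetical records standing in for the output of utils.hocr_to_dict;
# the real keys may differ.
records = [
    {"page": 1, "line": 1, "text": "Hello world"},
    {"page": 1, "line": 2, "text": "Second line"},
    {"page": 2, "line": 1, "text": "Again"},
]

# One row per dictionary, one column per key.
df = pd.DataFrame.from_records(records)

# Example follow-up: count how many entries each page contributes.
lines_per_page = df.groupby("page").size().to_dict()
print(df)
print(lines_per_page)
```

Once the data is in a DataFrame, the usual pandas filtering and grouping operations apply to the OCR output.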
Get a single page from hOCR
from hocr_utils import utils
hocr_1 = utils.get_page(hocr, 1)
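For context, hOCR marks each page with an element of class `ocr_page`, and words with class `ocrx_word`. The sketch below uses only the standard library to list the pages in a small hand-written hOCR snippet and collect the words of one page. It does not show how `get_page` is implemented; the sample document and the helpers `page_ids` and `words_on_page` are hypothetical. Whether `get_page` counts pages from 0 or 1 is not stated here, so the sketch selects pages by their `id` attribute instead.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real tesseract output carries more metadata.
SAMPLE_HOCR = """<html>
  <body>
    <div class="ocr_page" id="page_1" title="bbox 0 0 800 600">
      <span class="ocr_line" title="bbox 10 10 200 30">
        <span class="ocrx_word" title="bbox 10 10 90 30">Hello</span>
        <span class="ocrx_word" title="bbox 100 10 200 30">world</span>
      </span>
    </div>
    <div class="ocr_page" id="page_2" title="bbox 0 0 800 600">
      <span class="ocr_line" title="bbox 10 10 120 30">
        <span class="ocrx_word" title="bbox 10 10 120 30">Again</span>
      </span>
    </div>
  </body>
</html>"""


def page_ids(hocr_text):
    """Return the id attribute of every ocr_page element, in document order."""
    root = ET.fromstring(hocr_text)
    return [el.get("id") for el in root.iter() if el.get("class") == "ocr_page"]


def words_on_page(hocr_text, page_id):
    """Return the text of every ocrx_word inside the page with the given id."""
    root = ET.fromstring(hocr_text)
    for el in root.iter():
        if el.get("class") == "ocr_page" and el.get("id") == page_id:
            return [w.text for w in el.iter() if w.get("class") == "ocrx_word"]
    return []


print(page_ids(SAMPLE_HOCR))
print(words_on_page(SAMPLE_HOCR, "page_2"))
```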
Hashes for hocr_utils-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d3cd81426808a854ad1150b8a3aeb1035bd85fac74c88ddabb5d2f9b5463ebfa
MD5 | 7baa79f28f46aac82b21bf6d282c34d4
BLAKE2b-256 | dc08ba9321f7870e4ea190b323e824a0540f56d3a2f8c476e7f0fdbeac785a9f