Package containing utility functions for hOCR and Tesseract
hocr_utils
hocr_utils is a package to transform, plot and simplify the use of hOCR files.
Installation
Dependencies
hocr-utils requires:
- Python (>= 3.7)
Optional Dependencies
The functions that plot hOCR or transform PDFs into hOCR require the following additional dependencies:
- pytesseract
- pdf2image
- opencv-python
Additionally, a Tesseract language pack needs to be installed for non-English OCR.
Example: install the French language pack on Ubuntu with:
apt-get install tesseract-ocr-fra
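Because the three dependencies listed above are optional, it can help to check which of them are importable before calling the functions that need them. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of hocr_utils.

```python
from importlib.util import find_spec

# Import names of the optional dependencies listed above.
# Note that the opencv-python distribution is imported as `cv2`.
OPTIONAL_MODULES = ["pytesseract", "pdf2image", "cv2"]


def missing_modules(names):
    """Return the entries of `names` that cannot be found as importable modules."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(OPTIONAL_MODULES)
    if missing:
        print("Missing optional modules:", ", ".join(missing))
    else:
        print("All optional modules are importable.")
```

This only checks the Python modules. It does not verify that the Tesseract binary or a given language pack is installed on the system.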
User Installation
The easiest way to install hocr_utils is using pip:
pip install -U hocr_utils
Use cases
Transform PIL Images to hOCR
Requires the pytesseract dependency and the requested Tesseract language pack.
from hocr_utils import utils
from PIL import Image
image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])
Transform PDF to hOCR
Requires the pytesseract and pdf2image dependencies, as well as the requested Tesseract language pack.
from hocr_utils import utils
hocr = utils.pdf_to_hocr('./data/sample.pdf')
Transform hOCR to a list of dictionaries
from hocr_utils import utils
hocr_dict = utils.hocr_to_dict(hocr)
This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).
By default there will be one entry per line in the hOCR. However, it is possible to group the list by ['paragraph', 'line', 'word'] using the by argument.
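To make the idea of one record per paragraph, line or word concrete, here is a standard-library sketch that extracts such records from a small hand-written hOCR string. It is not the implementation of hocr_to_dict: the function name `hocr_records`, the record keys (`id`, `bbox`, `text`) and the assumption that `by` takes a single string are all illustrative. Only the class names (`ocr_par`, `ocr_line`, `ocrx_word`) and the `bbox` property come from the hOCR format itself.

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-written hOCR document: one page, one paragraph, two lines.
SAMPLE_HOCR = """<html>
<body>
<div class='ocr_page' id='page_1' title='bbox 0 0 200 100'>
<p class='ocr_par' id='par_1_1' title='bbox 10 10 190 60'>
<span class='ocr_line' id='line_1_1' title='bbox 10 10 190 30'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 60 30; x_wconf 96'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 70 10 130 30; x_wconf 95'>hOCR</span>
</span>
<span class='ocr_line' id='line_1_2' title='bbox 10 40 190 60'>
<span class='ocrx_word' id='word_1_3' title='bbox 10 40 80 60; x_wconf 93'>world</span>
</span>
</p>
</div>
</body>
</html>"""

# Map the grouping level to the hOCR class that marks it.
HOCR_CLASS = {"paragraph": "ocr_par", "line": "ocr_line", "word": "ocrx_word"}


def hocr_records(hocr, by="line"):
    """Return one dict per element of the requested level, in document order."""
    root = ET.fromstring(hocr)
    records = []
    for element in root.iter():
        if element.get("class") != HOCR_CLASS[by]:
            continue
        # The bounding box is stored in the title attribute as "bbox x0 y0 x1 y1".
        match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", element.get("title", ""))
        bbox = tuple(int(v) for v in match.groups()) if match else None
        # Join all nested text and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        records.append({"id": element.get("id"), "bbox": bbox, "text": text})
    return records


if __name__ == "__main__":
    for level in ("paragraph", "line", "word"):
        print(level, [r["text"] for r in hocr_records(SAMPLE_HOCR, by=level)])
```

With the sample above, the line level yields two records ("Hello hOCR" and "world"), the word level three, and the paragraph level one. A list of flat dictionaries like this is the shape that pd.DataFrame.from_records accepts.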
Get a single page from hOCR
from hocr_utils import utils
hocr_1 = utils.get_page(hocr, 1)
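In an hOCR document each page is an element with the class `ocr_page`, so selecting a page amounts to picking one of those elements. The sketch below shows this with the standard library on a hand-written two-page string. It is not how get_page is implemented, the helper names are hypothetical, and the zero-based index is this sketch's own convention; whether get_page counts from 0 or 1 is not stated here.

```python
import xml.etree.ElementTree as ET

# Hand-written hOCR with two pages, one word each.
TWO_PAGES = """<html>
<body>
<div class='ocr_page' id='page_1' title='bbox 0 0 200 100'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 60 30'>first</span>
</div>
<div class='ocr_page' id='page_2' title='bbox 0 0 200 100'>
<span class='ocrx_word' id='word_2_1' title='bbox 10 10 70 30'>second</span>
</div>
</body>
</html>"""


def page_elements(hocr):
    """Return all ocr_page elements of the document, in document order."""
    root = ET.fromstring(hocr)
    return [el for el in root.iter() if el.get("class") == "ocr_page"]


def page_as_string(hocr, index):
    """Serialize the page at zero-based `index` (sketch convention) back to markup."""
    return ET.tostring(page_elements(hocr)[index], encoding="unicode")


if __name__ == "__main__":
    print(len(page_elements(TWO_PAGES)), "pages found")
    print(page_as_string(TWO_PAGES, 1))
```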