Package containing utility functions for hOCR and Tesseract
hocr_utils
hocr_utils is a package to transform, plot and simplify the use of hOCR files.
Installation
Dependencies
hocr-utils requires:
- Python (>= 3.7)
Optional Dependencies
The functions that plot hOCR or transform PDF files into hOCR require the following additional dependencies:
- pytesseract
- pdf2image
- opencv-python
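Because these dependencies are optional, it can help to check which of them are available before calling the functions that need them. The following is a minimal sketch using only the standard library; the helper name check_optional_dependencies is hypothetical and not part of hocr_utils. Note that opencv-python is imported under the module name cv2.

```python
from importlib.util import find_spec

# Distribution name on PyPI -> module name used in an import statement.
OPTIONAL_MODULES = {
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
    "opencv-python": "cv2",
}


def check_optional_dependencies():
    """Return {distribution name: bool} without importing the modules.

    Hypothetical helper for illustration; hocr_utils does not provide it.
    """
    return {
        dist: find_spec(module) is not None
        for dist, module in OPTIONAL_MODULES.items()
    }


if __name__ == "__main__":
    for dist, available in check_optional_dependencies().items():
        print(f"{dist}: {'installed' if available else 'missing'}")
```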
Additionally, a Tesseract language pack needs to be installed for non-English OCR.
Example: install the French language pack on Ubuntu with:
apt-get install tesseract-ocr-fra
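To confirm that a language pack is visible to Tesseract, the output of the tesseract --list-langs command can be inspected. The sketch below is an illustration rather than part of hocr_utils: the function names are hypothetical, and it assumes the output consists of one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes the first non-empty line is a header such as
    "List of available languages (2):" and every later line is one code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_languages():
    """Return the installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some releases print the list on stderr, so fall back to it when stdout is empty.
    return parse_language_list(result.stdout or result.stderr)


if __name__ == "__main__":
    languages = installed_languages()
    print("fra installed:", "fra" in languages)
```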
User Installation
The easiest way to install hocr_utils is using pip:
pip install -U hocr_utils
Use cases
Transform PIL Images to hOCR
Requires the pytesseract dependency and the requested Tesseract language pack.
from hocr_utils import utils
from PIL import Image
image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])
Transform PDF to hOCR
Requires the pytesseract and pdf2image dependencies as well as the requested Tesseract language pack.
from hocr_utils import utils
hocr = utils.pdf_to_hocr('./data/sample.pdf')
Transform hOCR to a list of dictionaries
from hocr_utils import utils
hocr_dict = utils.hocr_to_dict(hocr)
This can then be transformed into a pandas DataFrame using pd.DataFrame.from_records(hocr_dict).
By default there is one entry per line in the hOCR. However, it is possible to group the list by 'paragraph', 'line' or 'word' using the by argument.
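To illustrate what the different grouping levels mean, here is a minimal standard-library sketch that collects one record per paragraph, line or word from a small hOCR fragment. It is not the implementation used by hocr_utils: the function hocr_records, the record keys bbox and text, and the sample markup are assumptions made for this example, and the dictionaries returned by utils.hocr_to_dict may be shaped differently.

```python
from html.parser import HTMLParser

# Tiny hand-written hOCR fragment (assumed sample, not real Tesseract output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 200 100'>
 <p class='ocr_par' title='bbox 10 10 120 30'>
  <span class='ocr_line' title='bbox 10 10 120 30'>
   <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Hello</span>
   <span class='ocrx_word' title='bbox 70 10 120 30; x_wconf 93'>world</span>
  </span>
 </p>
</div>
"""

# Grouping level -> hOCR class name of the element that opens a record.
LEVELS = {"paragraph": "ocr_par", "line": "ocr_line", "word": "ocrx_word"}
TRACKED_TAGS = {"div", "p", "span"}  # non-void tags, so start/end stay balanced


def parse_bbox(title):
    """Return [x0, y0, x1, y1] from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return [int(value) for value in parts[1:5]]
    return None


class HocrCollector(HTMLParser):
    def __init__(self, level):
        super().__init__()
        self.target = LEVELS[level]
        self.records = []
        self.stack = []  # one entry per open tag: its record, or None

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED_TAGS:
            return
        attrs = dict(attrs)
        record = None
        if self.target in (attrs.get("class") or "").split():
            record = {"bbox": parse_bbox(attrs.get("title") or ""), "text": ""}
            self.records.append(record)
        self.stack.append(record)

    def handle_endtag(self, tag):
        if tag in TRACKED_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to every record that is currently open.
        for record in self.stack:
            if record is not None:
                record["text"] += data


def hocr_records(hocr, by="line"):
    """Hypothetical helper: one dict per element at the requested level."""
    collector = HocrCollector(by)
    collector.feed(hocr)
    collector.close()
    return [
        {"bbox": r["bbox"], "text": " ".join(r["text"].split())}
        for r in collector.records
    ]


if __name__ == "__main__":
    print(hocr_records(SAMPLE_HOCR, by="line"))
    print(hocr_records(SAMPLE_HOCR, by="word"))
```

With this fragment, grouping by line yields a single record covering both words, while grouping by word yields two records. A list of dictionaries of this kind is the input shape that pd.DataFrame.from_records accepts.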
Get a single page from hOCR
from hocr_utils import utils
hocr_1 = utils.get_page(hocr, 1)
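This description does not state whether the page index passed to get_page starts at 0 or 1, so it can be useful to know how many pages an hOCR document contains before picking an index. The following standard-library sketch lists the ocr_page elements of an hOCR string; list_pages and the sample markup are hypothetical and not part of hocr_utils.

```python
from html.parser import HTMLParser

# Assumed two-page sample; real hOCR produced by Tesseract carries more properties.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 200 100'></div>
<div class='ocr_page' id='page_2' title='bbox 0 0 200 100'></div>
"""


class PageCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.page_titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # hOCR marks each page with the class name ocr_page.
        if "ocr_page" in (attrs.get("class") or "").split():
            self.page_titles.append(attrs.get("title") or "")


def list_pages(hocr):
    """Return the title attribute of every ocr_page element, in document order."""
    counter = PageCounter()
    counter.feed(hocr)
    counter.close()
    return counter.page_titles


if __name__ == "__main__":
    print("number of pages:", len(list_pages(SAMPLE_HOCR)))
```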