Performs a very fast OCR on a list of images (file path, url, base64, bytes, numpy, PIL ...) using Tesseract and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.
Tested against Windows 10 / Python 3.11 / Anaconda
pip install multitessiocr
from multitessiocr import tesser_ocr

piclist = [
    r"C:\screeeni\35.png",
    r"C:\Users\hansc\Downloads\2023-11-12 00_48_43-.png",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 12.03.41 PM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.46 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.33 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.22 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 12.03.42 PM.jpeg",
]
df = tesser_ocr(
    piclist=piclist,
    tesser_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    add_after_tesseract_path="",
    add_at_the_end="-l eng+por --psm 3",
    processes=5,
    chunks=5,
    print_stdout=False,
    print_stderr=True,
)
# aa_text aa_start_x aa_start_y aa_end_x aa_end_y aa_object aa_type aa_element_index aa_page aa_index aa_language aa_parents aa_all_children aa_direct_children aa_tag aa_x_size aa_x_descenders aa_x_ascenders aa_x_wconf aa_baseline_1 aa_baseline_2 aa_document_index aa_width aa_height aa_area aa_center_x aa_center_y
# 7 Ameta-markup 802 318 922 335 word word 1 1 3 <NA> (4, 5) () () span <NA> <NA> <NA> 77.0 <NA> <NA> 0 120 17 2040 862 326
# 8 language, 933 321 1001 335 word word 1 1 4 <NA> (4, 5) () () span <NA> <NA> <NA> 96.0 <NA> <NA> 0 68 14 952 967 328
# 9 used 1014 318 1050 331 word word 1 1 5 <NA> (4, 5) () () span <NA> <NA> <NA> 96.0 <NA> <NA> 0 36 13 468 1032 324
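The size and center columns in the sample rows above look like they follow directly from the corner coordinates (aa_start_x, aa_start_y, aa_end_x, aa_end_y). The sketch below recomputes them with plain Python so the relationship is easy to check. The names Box and derived_geometry are hypothetical helpers for illustration and are not part of multitessiocr; the integer division for the centers is an assumption that happens to match the three rows shown.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Corner coordinates of one recognized word (hypothetical helper type)."""
    start_x: int
    start_y: int
    end_x: int
    end_y: int


def derived_geometry(box: Box) -> dict:
    """Recompute width, height, area and center from the corner coordinates."""
    width = box.end_x - box.start_x
    height = box.end_y - box.start_y
    return {
        "width": width,
        "height": height,
        "area": width * height,
        # Assumption: centers use integer division, which matches the sample rows.
        "center_x": box.start_x + width // 2,
        "center_y": box.start_y + height // 2,
    }


# Row 7 of the sample output ("Ameta-markup").
row7 = derived_geometry(Box(802, 318, 922, 335))
print(row7)
# {'width': 120, 'height': 17, 'area': 2040, 'center_x': 862, 'center_y': 326}
```

The same arithmetic reproduces rows 8 and 9, so these columns can probably be treated as conveniences rather than independent measurements.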
Perform OCR on a list of images using Tesseract.
Parameters:
- piclist (list): List of images (file paths, URLs, base64 strings, bytes, NumPy arrays, PIL images ...).
- tesser_path (str): Path to the Tesseract executable.
- add_after_tesseract_path (str): Additional parameters to add after the Tesseract path.
- add_at_the_end (str): Additional parameters to add at the end of Tesseract command.
- processes (int): Number of parallel processes for image processing.
- chunks (int): Number of chunks to divide the image list for parallel processing.
- print_stdout (bool): Whether to print standard output during execution.
- print_stderr (bool): Whether to print standard error during execution.
Returns:
- pd.DataFrame: A DataFrame containing parsed OCR results.
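To make the parameters above more concrete, here is a rough sketch of how they might fit together. It is not the library's implementation, which is not documented here. The functions build_command and split_into_chunks are hypothetical, and the position of the image path and the use of `stdout` as Tesseract's output base are assumptions. Only the placement of add_after_tesseract_path (right after the executable) and add_at_the_end (at the end of the command) comes from the parameter descriptions.

```python
import shlex


def build_command(tesser_path, image_path,
                  add_after_tesseract_path="", add_at_the_end=""):
    """Assemble a Tesseract argument list (hypothetical helper, not library code)."""
    return [
        tesser_path,                              # kept whole, so spaces in the path are safe
        *shlex.split(add_after_tesseract_path),   # extra flags right after the executable
        image_path,                               # assumption: image comes next
        "stdout",                                 # assumption: results are written to stdout
        *shlex.split(add_at_the_end),             # e.g. "-l eng+por --psm 3"
    ]


def split_into_chunks(items, chunks):
    """Split a list into at most `chunks` contiguous, nearly equal parts."""
    if not items:
        return []
    chunks = max(1, min(chunks, len(items)))
    size, extra = divmod(len(items), chunks)
    result, start = [], 0
    for i in range(chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        result.append(items[start:end])
        start = end
    return result


cmd = build_command(
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "page.png",  # placeholder image name
    add_at_the_end="-l eng+por --psm 3",
)
print(cmd)

# Seven images split into five chunks, as in the example call (chunks=5).
print([len(part) for part in split_into_chunks(list(range(7)), 5)])
# [2, 2, 1, 1, 1]
```

Passing the command as a list rather than a single string avoids quoting problems with paths such as `C:\Program Files\...`. Each chunk could then be handed to one of the `processes` workers, although how the library actually schedules the work is not stated.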