Runs fast OCR on a list of images (file paths, URLs, base64 strings, bytes, numpy arrays, PIL images, ...) using Tesseract and returns the recognized text, its coordinates, and a line-based word grouping in a pandas DataFrame.
Tested against Windows 10 / Python 3.11 / Anaconda
pip install multitessiocr
The tesser_ocr function takes the path to the Tesseract OCR executable, a list of images
(file paths, URLs, base64 strings, numpy arrays, bytes or PIL images),
and optional Tesseract command line arguments. It runs Tesseract on
the provided images and returns the results as a pandas DataFrame.
Args:
    tesseract_path (str): The path to the Tesseract OCR executable.
    allpics (list, tuple): A list of images (image paths, URLs, base64 strings, numpy arrays,
        bytes or PIL images) to be processed.
    add_after_tesseract_path (str, optional): Additional arguments to pass to Tesseract
        after the tesseract executable file path. Defaults to an empty string.
    add_at_the_end (str, optional): Additional arguments to append at the end of the
        Tesseract command. Defaults to '-l eng --psm 3'.
    **kwargs: Additional keyword arguments to control the subprocess execution,
        such as 'stdout', 'stderr', 'timeout', etc. See the 'subprocess.run'
        documentation for more details.
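Returns:
    pandas.DataFrame: A DataFrame containing the OCR results with columns:
        - 'id_img': Image ID (integer)
        - 'id_word': Word ID within the image (integer)
        - 'ocr_result': Recognized text (string)
        - 'start_x': Starting X-coordinate of the bounding box (integer)
        - 'end_x': Ending X-coordinate of the bounding box (integer)
        - 'start_y': Starting Y-coordinate of the bounding box (integer)
        - 'end_y': Ending Y-coordinate of the bounding box (integer)
        - 'conf': Confidence score (integer)
        - 'text_group': Identifier of the text line a word belongs to; words on the
          same line share the same value (integer)
    The example output below shows five further numeric columns between 'conf' and
    'text_group'. Judging by their values they appear to hold the box area, center X,
    center Y, width and height, but their names are not documented here.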
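As a rough illustration of how the documented columns fit together, the sketch below builds a small stand-in DataFrame by hand (two rows copied from the example output in this README, not produced by running Tesseract), drops low-confidence words, and derives the box size from the corner coordinates. The threshold of 60 and the column names box_width and box_height are assumptions made up for this example; they are not part of the package.

```python
import pandas as pd

# Hand-built stand-in for the DataFrame returned by tesser_ocr; the values
# come from two rows of the example output shown in this README.
df = pd.DataFrame(
    {
        "id_img": [1, 2],
        "id_word": [12, 5],
        "ocr_result": ["today.", "TAR"],
        "start_x": [402, 319],
        "end_x": [498, 379],
        "start_y": [460, 428],
        "end_y": [492, 444],
        "conf": [96, 31],
        "text_group": [4, 9],
    }
)

MIN_CONF = 60  # assumed threshold; tune it for your own images

# Keep confident words and derive the box size from the corner coordinates.
confident = df.loc[df["conf"] >= MIN_CONF].assign(
    box_width=lambda d: d["end_x"] - d["start_x"],
    box_height=lambda d: d["end_y"] - d["start_y"],
)
print(confident[["ocr_result", "box_width", "box_height"]].to_string(index=False))
```

Filtering on 'conf' before any further processing is one simple way to discard likely misreads such as the low-scoring "TAR" row.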
Example:
from multitessiocr import tesser_ocr

df = tesser_ocr(
    tesseract_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    allpics=[
        "https://m.media-amazon.com/images/I/711y6oE2JrL._SL1500_.jpg",
        "https://m.media-amazon.com/images/I/61g+KBpG20L._SL1500_.jpg",
    ],
    add_after_tesseract_path="",
    add_at_the_end="-l eng --psm 3",
)
print(df.to_string())
# ...
# 11 1 12 today. 402 498 460 492 96 3072 450 476 96 32 4
# 12 1 13 Wait 551 635 525 556 95 2604 593 540 84 31 5
# 13 1 14 till 645 695 525 556 96 1550 670 540 50 31 5
# 14 1 15 you 705 773 533 565 96 2176 739 549 68 32 5
# 15 1 16 hear 562 645 579 610 95 2573 603 594 83 31 6
# 16 1 17 about 663 767 579 610 96 3224 715 594 104 31 6
# 17 2 1 ART 94 246 125 207 95 12464 170 166 152 82 7
# 18 2 2 OF 275 376 125 207 95 8282 325 166 101 82 7
# 19 2 3 NONVIOLENT 407 907 125 206 96 40500 657 165 500 81 7
# 20 2 4 COMMUNICATION 167 832 296 377 96 53865 499 336 665 81 8
# 21 2 5 TAR 319 379 428 444 31 960 349 436 60 16 9
# ...
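The rows above can be regrouped into lines of text through the 'text_group' column. The following is a minimal sketch, assuming a DataFrame with the documented columns; it uses a hand-built stand-in copied from five of the rows above rather than a real tesser_ocr result, and the output column name line_text is made up for the example.

```python
import pandas as pd

# Stand-in rows copied from the example output (image 1, text groups 5 and 6).
df = pd.DataFrame(
    {
        "id_img": [1, 1, 1, 1, 1],
        "id_word": [13, 14, 15, 16, 17],
        "ocr_result": ["Wait", "till", "you", "hear", "about"],
        "start_x": [551, 645, 705, 562, 663],
        "text_group": [5, 5, 5, 6, 6],
    }
)

# Order words left to right within each line, then join them per line.
lines = (
    df.sort_values(["id_img", "text_group", "start_x"])
    .groupby(["id_img", "text_group"], sort=True)["ocr_result"]
    .agg(" ".join)
    .reset_index(name="line_text")
)
print(lines.to_string(index=False))
```

Grouping on both 'id_img' and 'text_group' keeps the sketch correct whether group identifiers restart per image or continue across images.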
Note:
    - Images are first loaded, processed, and written to temporary files before OCR.
    - OCR results are extracted from the hOCR output generated by Tesseract.
    - The resulting DataFrame contains one row per recognized word together with its position.
    - The 'text_group' column enumerates groups of related words (words on the same line) within an image.
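To make the hOCR step more concrete, here is a minimal sketch of how word boxes and confidences can be pulled out of hOCR markup with the standard library. It is not the package's own parser: the helper name parse_hocr_words, the regular expression, and the sample markup (hand-written in the style Tesseract emits) are all assumptions for illustration.

```python
import re
from html import unescape

# Hand-written sample in the style of Tesseract's hOCR output (an assumption).
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_4' title="bbox 402 460 498 492">
 <span class='ocrx_word' id='word_1_12' title='bbox 402 460 498 492; x_wconf 96'>today.</span>
</span>
"""

# Matches a word span and captures x0, y0, x1, y1, confidence and inner markup.
WORD_RE = re.compile(
    r"<span class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)


def parse_hocr_words(hocr):
    """Return one dict per word, using the column names documented above."""
    rows = []
    for match in WORD_RE.finditer(hocr):
        x0, y0, x1, y1, conf = (int(v) for v in match.groups()[:5])
        # Drop nested tags such as <strong> and decode HTML entities.
        text = unescape(re.sub(r"<[^>]+>", "", match.group(6))).strip()
        rows.append(
            {
                "ocr_result": text,
                "start_x": x0,
                "end_x": x1,
                "start_y": y0,
                "end_y": y1,
                "conf": conf,
            }
        )
    return rows


words = parse_hocr_words(SAMPLE_HOCR)
print(words)
```

A regular expression is enough for a sketch like this; for arbitrary hOCR files a real HTML or XML parser would be the more robust choice.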