Performs very fast OCR on a list of images (file paths, URLs, base64 strings, bytes, numpy arrays, PIL images ...) using Tesseract and returns the recognized text, its coordinates, and a line-based word grouping in a pandas DataFrame.

Tested against Windows 10 / Python 3.11 / Anaconda

pip install multitessiocr

The tesser_ocr function takes the path to the Tesseract OCR executable, a list of images
(image paths, URLs, base64 strings, numpy arrays, bytes or PIL images),
and optional Tesseract command-line arguments. It runs Tesseract on the provided images
and returns the results as a pandas DataFrame.

Args:
	tesseract_path (str): The path to the Tesseract OCR executable.
	allpics (list, tuple): A list of images (image paths, URLs, base64 strings, numpy arrays,
							bytes or PIL images) to be processed.
	add_after_tesseract_path (str, optional): Additional arguments to pass to Tesseract
		after the tesseract executable file path. Defaults to an empty string.
	add_at_the_end (str, optional): Additional arguments to append at the end of the
		Tesseract command. Defaults to '-l eng --psm 3'.
	**kwargs: Additional keyword arguments to control the subprocess execution,
		such as 'stdout', 'stderr', 'timeout', etc. See the 'subprocess.run'
		documentation for more details.
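
The package source is not shown here, so the following is only a minimal sketch of how the two argument strings could be placed around the image and output names on a Tesseract command line. The helper `build_tesseract_command`, the file names, and the trailing `hocr` config name are assumptions made for illustration, not the library's actual code.

```python
import shlex


def build_tesseract_command(
    tesseract_path,
    image_file,
    output_base,
    add_after_tesseract_path="",
    add_at_the_end="-l eng --psm 3",
):
    """Hypothetical helper: assemble an argument list for one image.

    Assumed layout: executable, early options, input, output base,
    late options, then the 'hocr' config so Tesseract writes hOCR.
    """
    return [
        tesseract_path,
        *shlex.split(add_after_tesseract_path),  # empty string -> no items
        image_file,
        output_base,
        *shlex.split(add_at_the_end),
        "hocr",
    ]


cmd = build_tesseract_command("tesseract", "page.png", "page")
print(cmd)
# A caller could then run it with subprocess keyword arguments, e.g.
# subprocess.run(cmd, capture_output=True, timeout=60)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces, such as `C:\Program Files\Tesseract-OCR\tesseract.exe`.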

Returns:
	pandas.DataFrame: A DataFrame containing the OCR results with columns:
		- 'id_img': Image ID (integer)
		- 'id_word': Word ID within the image (integer)
		- 'ocr_result': Recognized text (string)
		- 'start_x': Starting X-coordinate of the bounding box (integer)
		- 'end_x': Ending X-coordinate of the bounding box (integer)
		- 'start_y': Starting Y-coordinate of the bounding box (integer)
		- 'end_y': Ending Y-coordinate of the bounding box (integer)
		- 'conf': Tesseract's confidence score for the word, from 0 to 100 (integer)
		- 'text_group': Identifier of the line group the word belongs to; words on the same line share a value (integer)
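
As an illustration of how these columns fit together, the sketch below rebuilds text lines from word rows by grouping on 'id_img' and 'text_group' and ordering words by 'start_x'. It uses plain dictionaries of the kind `df.to_dict("records")` would return, limited to the documented columns, with values copied from the sample output in the Example section. The function name `lines_from_records` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Word rows restricted to the columns needed here; values are taken
# from the sample output in the Example section.
records = [
    {"id_img": 1, "ocr_result": "till", "start_x": 645, "text_group": 5},
    {"id_img": 1, "ocr_result": "Wait", "start_x": 551, "text_group": 5},
    {"id_img": 1, "ocr_result": "you", "start_x": 705, "text_group": 5},
    {"id_img": 1, "ocr_result": "hear", "start_x": 562, "text_group": 6},
    {"id_img": 1, "ocr_result": "about", "start_x": 663, "text_group": 6},
    {"id_img": 2, "ocr_result": "ART", "start_x": 94, "text_group": 7},
    {"id_img": 2, "ocr_result": "OF", "start_x": 275, "text_group": 7},
    {"id_img": 2, "ocr_result": "NONVIOLENT", "start_x": 407, "text_group": 7},
    {"id_img": 2, "ocr_result": "COMMUNICATION", "start_x": 167, "text_group": 8},
]


def lines_from_records(rows):
    """Join the words of each (id_img, text_group) pair, left to right."""
    line_key = itemgetter("id_img", "text_group")
    ordered = sorted(rows, key=itemgetter("id_img", "text_group", "start_x"))
    return {
        key: " ".join(word["ocr_result"] for word in words)
        for key, words in groupby(ordered, key=line_key)
    }


lines = lines_from_records(records)
for (id_img, text_group), text in lines.items():
    print(id_img, text_group, text)
```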

Example:
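	from multitessiocr import tesser_ocr

	# OCR two remote images; the results for all images come back in one DataFrame.
	df = tesser_ocr(
		tesseract_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
		allpics=[
			"https://m.media-amazon.com/images/I/711y6oE2JrL._SL1500_.jpg",
			"https://m.media-amazon.com/images/I/61g+KBpG20L._SL1500_.jpg",
		],
		add_after_tesseract_path="",  # default: nothing extra after the executable path
		add_at_the_end="-l eng --psm 3",  # default: English, fully automatic page segmentation
	)
	print(df.to_string())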
	# ...
	# 11       1       12         today.      402    498      460    492    96       3072       450       476     96      32           4
	# 12       1       13           Wait      551    635      525    556    95       2604       593       540     84      31           5
	# 13       1       14           till      645    695      525    556    96       1550       670       540     50      31           5
	# 14       1       15            you      705    773      533    565    96       2176       739       549     68      32           5
	# 15       1       16           hear      562    645      579    610    95       2573       603       594     83      31           6
	# 16       1       17          about      663    767      579    610    96       3224       715       594    104      31           6
	# 17       2        1            ART       94    246      125    207    95      12464       170       166    152      82           7
	# 18       2        2             OF      275    376      125    207    95       8282       325       166    101      82           7
	# 19       2        3     NONVIOLENT      407    907      125    206    96      40500       657       165    500      81           7
	# 20       2        4  COMMUNICATION      167    832      296    377    96      53865       499       336    665      81           8
	# 21       2        5            TAR      319    379      428    444    31        960       349       436     60      16           9
	# ...
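
The printed rows contain more numeric columns than the Returns section lists, and their header is elided above. The extra values are consistent with a box area, center point, width and height derived from the documented coordinates. The sketch below recomputes them for one sample row; the key names (`width`, `height`, `area`, `center_x`, `center_y`) and the function `box_geometry` are assumptions, since the real column names are not shown.

```python
def box_geometry(start_x, end_x, start_y, end_y):
    """Derive size and center of a bounding box (hypothetical key names)."""
    width = end_x - start_x
    height = end_y - start_y
    return {
        "width": width,
        "height": height,
        "area": width * height,
        # Integer division matches the integer values in the sample rows.
        "center_x": start_x + width // 2,
        "center_y": start_y + height // 2,
    }


# Coordinates of the word "Wait" from the sample output.
geo = box_geometry(start_x=551, end_x=635, start_y=525, end_y=556)
print(geo)
```

For the "Wait" row this yields 2604, 593, 540, 84 and 31, the same numbers that follow the confidence value in the sample output.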


Note:
	- Images are first loaded, processed, and written to temporary files before OCR.
	- OCR results are extracted from the hOCR output generated by Tesseract.
	- The resulting DataFrame contains information about recognized words and their positions.
	- The 'text_group' column is used to enumerate groups of related words (same line) within an image.
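
To make the hOCR step more concrete, here is a minimal sketch of pulling words, bounding boxes and confidences out of an hOCR fragment with the standard library. The fragment is hand-written in the style of Tesseract's hOCR output, and the class `HocrWordParser` is hypothetical; the package's own parsing code is not shown in this description and may work differently.

```python
import re
from html.parser import HTMLParser

# Hand-written sample in the style of Tesseract hOCR output (an assumption).
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_5' title='bbox 551 525 773 565'>
  <span class='ocrx_word' id='word_1_13' title='bbox 551 525 635 556; x_wconf 95'>Wait</span>
  <span class='ocrx_word' id='word_1_14' title='bbox 645 525 695 556; x_wconf 96'>till</span>
</span>
"""


class HocrWordParser(HTMLParser):
    """Collect one dict per 'ocrx_word' span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "span" or "ocrx_word" not in (attrs.get("class") or "").split():
            return
        title = attrs.get("title") or ""
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        conf = re.search(r"x_wconf (\d+)", title)
        if bbox:
            x0, y0, x1, y1 = map(int, bbox.groups())
            self._current = {
                "ocr_result": "",
                "start_x": x0, "end_x": x1,
                "start_y": y0, "end_y": y1,
                "conf": int(conf.group(1)) if conf else -1,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["ocr_result"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["ocr_result"] = self._current["ocr_result"].strip()
            self.words.append(self._current)
            self._current = None


parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)
for word in parser.words:
    print(word)
```

In hOCR the `bbox` property lists left, top, right and bottom in that order, which is why the second and third numbers map to 'start_y' and 'end_x'.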
