Performs fast OCR on a list of images (file paths, URLs, base64 strings, bytes, NumPy arrays, PIL images, ...) using Tesseract and returns the recognized text, its coordinates, and line-based word groupings in a pandas DataFrame.

Tested against Windows 10 / Python 3.11 / Anaconda

pip install multitessiocr

The tesser_ocr function takes the path to the Tesseract OCR executable, a list of images
(file paths, URLs, base64 strings, numpy arrays, bytes, or PIL images), and optional
Tesseract command-line arguments. It runs Tesseract on the provided images and returns
the recognized words and their positions as a pandas DataFrame.

Args:
	tesseract_path (str): The path to the Tesseract OCR executable.
	allpics (list, tuple): A list of images (image paths, URLs, base64 strings, numpy arrays,
							bytes or PIL images) to be processed.
	add_after_tesseract_path (str, optional): Additional arguments to pass to Tesseract
		after the tesseract executable file path. Defaults to an empty string.
	add_at_the_end (str, optional): Additional arguments to append at the end of the
		Tesseract command. Defaults to '-l eng --psm 3'.
	**kwargs: Additional keyword arguments to control the subprocess execution,
		such as 'stdout', 'stderr', 'timeout', etc. See the 'subprocess.run'
		documentation for more details.
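
To make the roles of the two argument strings concrete, here is a minimal sketch of how such a command line might be assembled and run. The helpers `build_tesseract_cmd` and `run_tesseract`, the argument order, and the use of `stdout` together with the `hocr` config are assumptions for illustration; the package may build its command differently.

```python
import shlex
import subprocess


def build_tesseract_cmd(tesseract_path, image_path,
                        add_after_tesseract_path="",
                        add_at_the_end="-l eng --psm 3"):
    """Hypothetical helper: assemble one Tesseract call as an argument list."""
    cmd = [tesseract_path]
    # Extra arguments placed directly after the executable path.
    cmd += shlex.split(add_after_tesseract_path)
    # Input image, then "stdout" as output base so the result is printed.
    cmd += [image_path, "stdout"]
    # Extra arguments appended at the end (language, page segmentation mode).
    cmd += shlex.split(add_at_the_end)
    # The "hocr" config asks Tesseract for hOCR (HTML) output.
    cmd.append("hocr")
    return cmd


def run_tesseract(cmd, **kwargs):
    """Hypothetical helper: forward extra kwargs (e.g. timeout) to subprocess.run."""
    # Capture output by default, unless the caller redirects the streams.
    if "stdout" not in kwargs and "stderr" not in kwargs:
        kwargs.setdefault("capture_output", True)
    return subprocess.run(cmd, **kwargs)


cmd = build_tesseract_cmd("tesseract", "page.png")
print(cmd)
# ['tesseract', 'page.png', 'stdout', '-l', 'eng', '--psm', '3', 'hocr']
```

With Tesseract installed, `run_tesseract(cmd, timeout=30)` would execute the call and raise `subprocess.TimeoutExpired` if it runs longer than 30 seconds.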

Returns:
	pandas.DataFrame: A DataFrame containing the OCR results with columns:
		- 'id_img': Image ID (integer)
		- 'id_word': Word ID within the image (integer)
		- 'ocr_result': Recognized text (string)
		- 'start_x': Starting X-coordinate of the bounding box (integer)
		- 'end_x': Ending X-coordinate of the bounding box (integer)
		- 'start_y': Starting Y-coordinate of the bounding box (integer)
		- 'end_y': Ending Y-coordinate of the bounding box (integer)
		- 'conf': Confidence score (integer)
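		- 'text_group': Group identifier shared by words on the same line (integer)
		The example output below also shows five further numeric columns between 'conf' and
		'text_group'. Their values are consistent with the bounding-box area, center X,
		center Y, width, and height, but their names are not listed here.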
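
As an illustration of how the 'text_group' column can be used, the following sketch joins words back into lines of text. It works on plain dictionaries in the shape that `df.to_dict("records")` would produce, limited to a few of the documented columns; the rows are hand-written from the example output in this description, and `words_to_lines` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# Hand-written rows using documented column names only.
rows = [
    {"id_img": 1, "ocr_result": "till", "start_x": 645, "text_group": 5},
    {"id_img": 1, "ocr_result": "Wait", "start_x": 551, "text_group": 5},
    {"id_img": 1, "ocr_result": "you", "start_x": 705, "text_group": 5},
    {"id_img": 1, "ocr_result": "hear", "start_x": 562, "text_group": 6},
    {"id_img": 1, "ocr_result": "about", "start_x": 663, "text_group": 6},
    {"id_img": 2, "ocr_result": "ART", "start_x": 94, "text_group": 7},
    {"id_img": 2, "ocr_result": "OF", "start_x": 275, "text_group": 7},
]


def words_to_lines(rows):
    """Hypothetical helper: join words per (image, line group), left to right."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["id_img"], row["text_group"])].append(row)
    lines = {}
    for key in sorted(grouped):
        # Order the words of one line by their left edge.
        words = sorted(grouped[key], key=lambda r: r["start_x"])
        lines[key] = " ".join(w["ocr_result"] for w in words)
    return lines


lines = words_to_lines(rows)
for (id_img, group), text in lines.items():
    print(id_img, group, text)
```

Sorting by 'start_x' inside each group is a design choice of this sketch; it keeps the word order stable even if the rows arrive shuffled.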

Example:
	from multitessiocr import tesser_ocr
	df = tesser_ocr(
		tesseract_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
		allpics=[
			"https://m.media-amazon.com/images/I/711y6oE2JrL._SL1500_.jpg",
			"https://m.media-amazon.com/images/I/61g+KBpG20L._SL1500_.jpg",
		],
		add_after_tesseract_path="",
		add_at_the_end="-l eng --psm 3",
	)
	print(df.to_string())
	# ...
	# 11       1       12         today.      402    498      460    492    96       3072       450       476     96      32           4
	# 12       1       13           Wait      551    635      525    556    95       2604       593       540     84      31           5
	# 13       1       14           till      645    695      525    556    96       1550       670       540     50      31           5
	# 14       1       15            you      705    773      533    565    96       2176       739       549     68      32           5
	# 15       1       16           hear      562    645      579    610    95       2573       603       594     83      31           6
	# 16       1       17          about      663    767      579    610    96       3224       715       594    104      31           6
	# 17       2        1            ART       94    246      125    207    95      12464       170       166    152      82           7
	# 18       2        2             OF      275    376      125    207    95       8282       325       166    101      82           7
	# 19       2        3     NONVIOLENT      407    907      125    206    96      40500       657       165    500      81           7
	# 20       2        4  COMMUNICATION      167    832      296    377    96      53865       499       336    665      81           8
	# 21       2        5            TAR      319    379      428    444    31        960       349       436     60      16           9
	# ...


Note:
	- Images are first loaded, processed, and written to temporary files before OCR.
	- OCR results are extracted from the hOCR output generated by Tesseract.
	- The resulting DataFrame contains information about recognized words and their positions.
	- The 'text_group' column is used to enumerate groups of related words (same line) within an image.
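
The hOCR step in the notes above can be sketched with the standard library alone. The snippet below parses a small, hand-written and simplified hOCR fragment (real Tesseract output carries more attributes and wrapper elements) and maps each word to the documented column names. The class `HocrWords` and the rule "one 'text_group' per ocr_line element" are assumptions for illustration, not the package's actual parser.

```python
import re
from html.parser import HTMLParser

# Simplified hOCR fragment; word boxes follow "bbox x0 y0 x1 y1".
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_1' title='bbox 551 525 773 565'>
  <span class='ocrx_word' id='word_1_1' title='bbox 551 525 635 556; x_wconf 95'>Wait</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 645 525 695 556; x_wconf 96'>till</span>
</span>
<span class='ocr_line' id='line_1_2' title='bbox 562 579 767 610'>
  <span class='ocrx_word' id='word_1_3' title='bbox 562 579 645 610; x_wconf 95'>hear</span>
</span>
"""


class HocrWords(HTMLParser):
    """Hypothetical parser: collect words, boxes, confidence, and line index."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.line_no = 0
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        if cls == "ocr_line":
            self.line_no += 1  # every line element opens a new group
        elif cls == "ocrx_word":
            title = attrs.get("title", "")
            box = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            if box is None:
                return  # skip words without a bounding box
            x0, y0, x1, y1 = map(int, box.groups())
            self._current = {
                "ocr_result": "",
                "start_x": x0, "end_x": x1,
                "start_y": y0, "end_y": y1,
                "conf": int(conf.group(1)) if conf else -1,
                "text_group": self.line_no,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["ocr_result"] += data

    def handle_endtag(self, tag):
        # The first closing span after a word start ends that word.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


parser = HocrWords()
parser.feed(SAMPLE_HOCR)
parser.close()
for word in parser.words:
    print(word)
```

A list of dictionaries like `parser.words` can be passed directly to `pandas.DataFrame(...)` to obtain a table with one row per word.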

Release history

0.13 (Nov 14, 2023)
0.12 (Nov 14, 2023)
0.11 (Sep 17, 2023)
0.10 (Sep 17, 2023), this version

Download files

Source Distribution

multitessiocr-0.10.tar.gz (60.0 kB), uploaded Sep 17, 2023

Built Distribution

multitessiocr-0.10-py3-none-any.whl (61.3 kB), uploaded Sep 17, 2023, Python 3

Hashes for multitessiocr-0.10.tar.gz
Algorithm	Hash digest
SHA256	`1aa4bdd0b59df67d3b62cc5d2fd2fadf7c715c847f08b454b1210c53f111ee76`
MD5	`16d880756ffa4163db41f55c11577b90`
BLAKE2b-256	`467702ffcc5647dbcf0d2195ffc065a973b0e6bb5bc1f010fba147e1c1a8fccd`

Hashes for multitessiocr-0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1438f5154b2cca4f9fd6a0d87959bcc2c6b605bd0c2d7768590ce1d20ecac0df`
MD5	`969ba92ba176fefe76b66fbd8986a01c`
BLAKE2b-256	`449e4284058f724d759f1051d60773583e6bc9b5f6a3e1848002844470eadb57`