A solution to extract text from an image and get word-level output as a dataframe
Project description
TESSERACT2DICT
Input an image and get the extracted text as a dataframe containing the content, coordinates (x, y, w, h) and confidence of each word. Essentially, it is a wrapper around pytesseract that outputs a dataframe.
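To make the idea of a word-level table concrete, here is a minimal sketch that turns Tesseract's TSV output into one row per word with text, x, y, w, h and confidence. It is only an illustration and not necessarily how tesseract2dict works internally; the sample TSV values, the function name tsv_to_word_rows and the output key names are assumptions made for this example.

```python
import csv
import io

# Sample text in the column layout of Tesseract's TSV output.
# The values are made up for illustration.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t60\t20\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t80\t12\t70\t20\t91.0\tworld\n"
)


def tsv_to_word_rows(tsv_text):
    """Return one dict per recognised word (hypothetical helper).

    Rows with level 5 are words in Tesseract's TSV output; other
    levels describe pages, blocks, paragraphs and lines.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for rec in reader:
        text = (rec["text"] or "").strip()
        if rec["level"] != "5" or not text:
            continue  # skip non-word rows and empty words
        rows.append(
            {
                "text": text,
                "x": int(rec["left"]),
                "y": int(rec["top"]),
                "w": int(rec["width"]),
                "h": int(rec["height"]),
                "conf": float(rec["conf"]),
            }
        )
    return rows


words = tsv_to_word_rows(SAMPLE_TSV)
for word in words:
    print(word)
```

A list of dicts like this can be passed to pandas.DataFrame(words) to obtain a dataframe with one row per word.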
Prerequisites
- beautifulsoup4
- MakeTreeDir
- numpy
- opencv-python
- pandas
- pytesseract
Tesseract Installation
(the solution currently works with Tesseract 5.0.0 only)
For Windows
Add the Tesseract installation path to the PATH environment variable.
For Linux
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
Installation
pip install tesseract2dict
Usage
A sample usage is shown below. Pass an image as a numpy.ndarray; the extracted text is returned as a word-level dataframe.
Example:
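import cv2
from tesseract2dict import TessToDict

td = TessToDict()
# cv2.imread returns the image as a numpy.ndarray (or None if the file cannot be read)
inputImage = cv2.imread('path/to/image.jpg')
# the word-level dataframe is returned
word_dict = td.tess2dict(inputImage, 'out', 'outfolder')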
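Once the word-level result is available, a common next step is to drop low-confidence words before using the text. The sketch below works on plain word records; the key names "text" and "conf", the threshold of 60 and the helper keep_confident_words are assumptions for illustration, since the exact column names of the returned dataframe are not documented here.

```python
def keep_confident_words(records, min_conf=60.0, conf_key="conf", text_key="text"):
    """Filter word records by confidence and join the remaining text.

    `records` is a list of dicts, one per word. The key names are
    hypothetical and should be adapted to the actual column names.
    """
    kept = [r for r in records if float(r[conf_key]) >= min_conf]
    joined = " ".join(str(r[text_key]) for r in kept)
    return kept, joined


# Made-up sample records standing in for word-level OCR output.
sample = [
    {"text": "Hello", "conf": 96.5},
    {"text": "w0rld", "conf": 41.0},
    {"text": "again", "conf": 88.0},
]

kept, text = keep_confident_words(sample)
print(text)
```

With a pandas dataframe, records of this shape can be obtained with df.to_dict("records"), after checking which columns hold the word text and the confidence.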
Authors
-
Sreekiran A R - Analytics Consultant, AI Labs, Bridgei2i Analytics Solutions - Github , Stackoverflow
-
Anil Prasad M N - Project Manager, AI Labs, Bridgei2i Analytics Solutions - Github
License
This project is licensed under the MIT License; see the LICENSE.md file for details.
NOTE: This software depends on other packages that may be licensed under different open source licenses.