A solution to extract the text from an image and return word-level output as a dataframe

Project description

TESSERACT2DICT

Input an image and get the extracted text as a dataframe containing the content, coordinates (x, y, w, h) and confidence of each word. Essentially, it is a wrapper around pytesseract that outputs a dataframe.
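To make the shape of the output concrete, here is a minimal sketch of how word-level records with content, coordinates and confidence can be built from pytesseract's own `image_to_data` call. This is not the package's implementation; the function names `words_from_tesseract_data` and `ocr_words` and the record keys are assumptions chosen to mirror the description above.

```python
from typing import Dict, List


def words_from_tesseract_data(data: Dict[str, list], min_conf: float = 0.0) -> List[dict]:
    """Turn pytesseract's dict-of-lists output into one record per word."""
    words = []
    for i, text in enumerate(data["text"]):
        # Older pytesseract versions return confidences as strings.
        conf = float(data["conf"][i])
        # Non-word rows (page, block, line) have conf == -1 and empty text.
        if conf < min_conf or not str(text).strip():
            continue
        words.append({
            "content": str(text),
            "x": int(data["left"][i]),
            "y": int(data["top"][i]),
            "w": int(data["width"][i]),
            "h": int(data["height"][i]),
            "conf": conf,
        })
    return words


def ocr_words(image) -> List[dict]:
    """Run Tesseract on an image (numpy.ndarray or PIL image) via pytesseract."""
    import pytesseract  # imported lazily so the helper above works without it

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return words_from_tesseract_data(data)
```

Passing the resulting list to `pandas.DataFrame(...)` gives one row per word, which is roughly the kind of table this package is described as returning.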

Prerequisites

  • beautifulsoup4
  • MakeTreeDir
  • numpy
  • opencv-python
  • pandas
  • pytesseract

Tesseract Installation

(The solution currently works with Tesseract 5.0.0 only.)

What is Tesseract?

For Windows

Add the Tesseract installation directory to the PATH environment variable.

For Linux

  • sudo apt install tesseract-ocr
  • sudo apt install libtesseract-dev
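Since the solution is stated to work with Tesseract 5.0.0 only, it can help to confirm which version is actually reachable on PATH after installation. The following is a small sketch using only the Python standard library; the helper names are hypothetical, and the exact banner printed by `tesseract --version` may vary between builds.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_tesseract_version(output: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the text printed by `tesseract --version`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version() -> Optional[Tuple[int, int, int]]:
    """Return the version of the `tesseract` found on PATH, or None if absent."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    result = subprocess.run([executable, "--version"], capture_output=True, text=True)
    # Some older builds print the version banner to stderr instead of stdout.
    return parse_tesseract_version(result.stdout + result.stderr)
```

If `installed_tesseract_version()` returns `None`, the executable is probably not on PATH. On Windows, pytesseract can alternatively be pointed at the executable by setting `pytesseract.pytesseract.tesseract_cmd` to its full path.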

Installation

pip install tesseract2dict

Usage

A sample usage of our solution is shown below. Pass the image as a numpy.ndarray; the extracted text is returned as a word-level dataframe.

Example:

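import cv2
from tesseract2dict import TessToDict

# Create the converter and load the image as a numpy.ndarray
td = TessToDict()
inputImage = cv2.imread('path/to/image.jpg')

# Extract the text; the word-level dataframe is returned
word_dict = td.tess2dict(inputImage, 'out', 'outfolder')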
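Once the dataframe is available, a common next step is to drop low-confidence words and convert each (x, y, w, h) box into corner coordinates. The sketch below works on plain records and assumes the columns are named `content`, `x`, `y`, `w`, `h` and `conf`; these names are a guess based on the description, so check `word_dict.columns` before relying on them. The helper `confident_boxes` is hypothetical and not part of the package.

```python
from typing import Iterable, List


def confident_boxes(records: Iterable[dict], min_conf: float = 60.0) -> List[dict]:
    """Keep words at or above min_conf and add corner coordinates."""
    kept = []
    for row in records:
        if float(row["conf"]) < min_conf:
            continue
        kept.append({
            "content": row["content"],
            # Top-left corner of the word box.
            "x1": row["x"], "y1": row["y"],
            # Bottom-right corner, derived from width and height.
            "x2": row["x"] + row["w"], "y2": row["y"] + row["h"],
        })
    return kept
```

With the assumed column names, it could be called as `confident_boxes(word_dict.to_dict("records"))`. The threshold of 60 is an arbitrary starting point and should be tuned per document type.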

Authors

  • Sreekiran A R - Analytics Consultant, AI Labs, Bridgei2i Analytics Solutions - GitHub, Stack Overflow

  • Anil Prasad M N - Project Manager, AI Labs, Bridgei2i Analytics Solutions - GitHub

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Useful links

  1. http://gwang-cv.github.io/2017/08/25/ubuntu16.04+Tesseract4.0/

Download files

Files for tesseract2dict, version 1.0

  • tesseract2dict-1.0.tar.gz (3.2 kB), source distribution
