Extract the text from an image as a word-level dataframe, and extract the text inside a given bounding box.

Project description

TESSERACT2DICT

This class contains two main functions:

tess2dict: Takes an image and returns the extracted text as a dataframe giving the content, coordinates (x, y, w, h), and confidence of each word. Essentially, it is a wrapper around pytesseract that outputs a dataframe.
word2text: Takes the dataframe from tess2dict together with a bounding box and returns the text inside that box with proper formatting.
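
To make the shape of the word-level output concrete, here is a minimal sketch that turns Tesseract's TSV output into one record per word. It is not the package's implementation: the function name tsv_to_words, the output keys (text, x, y, w, h, conf), and the sample rows are assumptions made for illustration. Only the TSV column names (level, left, top, width, height, conf, text) come from Tesseract itself, where level 5 marks a word.

```python
import csv
import io

# Column order of Tesseract's TSV output.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]

# Made-up sample rows for illustration: one page-level row and two word rows.
ROWS = [
    ["1", "1", "0", "0", "0", "0", "0", "0", "200", "50", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "10", "50", "20", "96", "Hello"],
    ["5", "1", "1", "1", "1", "2", "70", "10", "60", "20", "91", "world"],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def tsv_to_words(tsv_text):
    """Return one dict per recognised word (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Keep only word-level rows (level 5) that carry some text.
        if row["level"] != "5" or not text:
            continue
        words.append({
            "text": text,
            "x": int(row["left"]),
            "y": int(row["top"]),
            "w": int(row["width"]),
            "h": int(row["height"]),
            "conf": float(row["conf"]),
        })
    return words


words = tsv_to_words(SAMPLE_TSV)
```

A list of records like this can be passed to pandas.DataFrame(words) to obtain a dataframe with one row per word.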

Prerequisites

beautifulsoup4
MakeTreeDir
numpy
opencv-python
pandas
pytesseract

Tesseract Installation

(The solution currently works with Tesseract 5.0.0 only.)

What is Tesseract?

For Windows

installation link

Adding Tesseract to the PATH variable

For Linux

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
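
Since the package is stated to work with Tesseract 5.0.0 only, it can help to check which version is installed before using it. The sketch below is one way to do that with the standard library. The helper names parse_tesseract_version and installed_tesseract_version are hypothetical and not part of tesseract2dict.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(banner):
    """Return (major, minor, patch) from `tesseract --version` output, or None."""
    # Some builds print "tesseract 5.0.0", others "tesseract v5.0.0-alpha...".
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version():
    """Return the installed Tesseract version, or None if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    # Merge stderr into stdout because older releases print the banner to stderr.
    result = subprocess.run([exe, "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return parse_tesseract_version(result.stdout)
```

Calling installed_tesseract_version() returns a tuple such as (5, 0, 0) when Tesseract is found on the PATH, and None otherwise.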

Installation

pip install tesseract2dict

Usage

A sample usage is shown below. Pass an image as a numpy.ndarray and a word-level dataframe is returned. The second function returns the plain text inside a given bounding box, with proper formatting. For example:

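import cv2
from tesseract2dict import TessToDict

td = TessToDict()

# cv2.imread returns the image as a numpy.ndarray (None if the file cannot be read)
inputImage = cv2.imread('path/to/image.jpg')

# Function 1: extract the word-level dataframe from the image
word_dict = td.tess2dict(inputImage, 'out', 'outfolder')

# Function 2: get the plain text inside a bounding box (here, the whole image)
text_plain = td.word2text(word_dict, (0, 0, inputImage.shape[1], inputImage.shape[0]))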
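
The idea behind extracting text from a bounding box can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's word2text: it assumes the box is given as (x, y, w, h), counts a word as inside when its centre falls within the box, and reads "proper formatting" as grouping words into lines in reading order. The function name text_in_box, the line_tolerance parameter, and the sample words are all hypothetical.

```python
# Made-up word records for illustration; each has a top-left corner and a size.
WORDS = [
    {"text": "Hello", "x": 10, "y": 10, "w": 50, "h": 20},
    {"text": "world", "x": 70, "y": 12, "w": 60, "h": 20},
    {"text": "Second", "x": 10, "y": 40, "w": 70, "h": 20},
    {"text": "line", "x": 90, "y": 41, "w": 40, "h": 20},
]


def text_in_box(words, box, line_tolerance=10):
    """Join the words whose centre lies inside box = (x, y, w, h)."""
    bx, by, bw, bh = box
    inside = []
    for word in words:
        cx = word["x"] + word["w"] / 2
        cy = word["y"] + word["h"] / 2
        if bx <= cx <= bx + bw and by <= cy <= by + bh:
            inside.append(word)

    # Sort top to bottom, then start a new line whenever the vertical
    # offset from the first word of the current line exceeds the tolerance.
    inside.sort(key=lambda w: (w["y"], w["x"]))
    lines = []
    for word in inside:
        if lines and abs(word["y"] - lines[-1][0]["y"]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    # Within each line, order the words left to right.
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    )


top_line = text_in_box(WORDS, (0, 0, 200, 35))
everything = text_in_box(WORDS, (0, 0, 200, 100))
```

Using the word centre rather than requiring full containment keeps words that slightly overlap the box edge; a stricter containment test is an equally reasonable choice.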

Authors

Sreekiran A R - Analytics Consultant, AI Labs, Bridgei2i Analytics Solutions - GitHub, Stack Overflow
Anil Prasad M N - Project Manager, AI Labs, Bridgei2i Analytics Solutions - GitHub

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Useful links

http://gwang-cv.github.io/2017/08/25/ubuntu16.04+Tesseract4.0/

Project details

Release history

1.3 - Jan 17, 2020 (this version)
1.2 - Jan 7, 2020
1.1 - Jan 6, 2020
1.0 - Dec 31, 2019
