A helper class to get more meaningful text out of common OCR outputs
OcrLayout Library
Provides the ability to get more meaningful text out of common OCR outputs. It manipulates the bounding boxes of lines to rebuild a page layout that approximates the human reading experience.
Problem Statement
When OCR is run on images containing a lot of textual information, it becomes important to assemble the generated text into meaningful lines of text, combining related paragraphs or sentences.
Another way to see it is as clustering the lines of text based on their positions/coordinates in the original content.
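To make the clustering idea concrete, here is a minimal sketch in Python. It is not the library's actual algorithm: the `Line` class, the `gap_ratio` threshold and the `merge_char` parameter are hypothetical names chosen for illustration. Lines are grouped when they overlap horizontally and sit close together vertically, then each group is joined into one string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical OCR line: text plus an axis-aligned bounding box."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def height(self) -> float:
        return self.y_max - self.y_min


def overlaps_horizontally(a: Line, b: Line) -> bool:
    """True when the x-ranges of the two boxes intersect."""
    return a.x_min <= b.x_max and b.x_min <= a.x_max


def cluster_lines(lines: List[Line], gap_ratio: float = 0.6) -> List[List[Line]]:
    """Group lines top to bottom into blocks of vertically adjacent text.

    A line joins an existing cluster when it overlaps the cluster's last line
    horizontally and the vertical gap is at most gap_ratio * line height.
    The threshold is an assumption for this sketch, not a library default.
    """
    clusters: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: (l.y_min, l.x_min)):
        for cluster in clusters:
            last = cluster[-1]
            gap = line.y_min - last.y_max
            if overlaps_horizontally(last, line) and gap <= gap_ratio * last.height:
                cluster.append(line)
                break
        else:
            # No existing cluster is close enough: start a new block.
            clusters.append([line])
    return clusters


def clusters_to_text(clusters: List[List[Line]], merge_char: str = " ") -> List[str]:
    """Join the lines of each cluster with a configurable separator."""
    return [merge_char.join(l.text for l in cluster) for cluster in clusters]


# Two columns plus a footer, as a typical OCR engine might return them.
sample_lines = [
    Line("Left column,", 10, 10, 200, 30),
    Line("Right column,", 300, 12, 500, 32),
    Line("first paragraph.", 10, 35, 200, 55),
    Line("second paragraph.", 300, 37, 500, 57),
    Line("Footer", 10, 200, 200, 220),
]
blocks = clusters_to_text(cluster_lines(sample_lines))
print(blocks)
```

Read in raw OCR order, the two columns would interleave; grouping by position keeps each column's sentence together before the text is passed downstream.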
More meaningful output for what?
- Text Analytics: you can apply any Text Analytics capability, such as Key Phrases or Entities Extraction, with more confidence in its outcome.
- Accessibility: any infographic comes alive, going beyond the alt text feature.
- Modern browser Read Aloud feature: it becomes easier to build solutions that read an image aloud, increasing the verbal narrative of visual information.
- Machine Translation: get more accurate MT output, as you can retain more context.
- Sentence/Paragraph Classification: for scanned images such as contracts, a more meaningful textual output allows you to classify content at a granular level in terms of risk, personal clauses or conditions.
OCR Output Support
Today bboxhelper supports the output of:
AZURE
- Azure Batch Read API response. https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text#read-api
- Azure Computer Vision SDK Python Sample https://github.com/Azure/azure-sdk-for-python/tree/76a0d91c32a79561a7d5666e421908e7c4cffc6a/sdk/cognitiveservices/azure-cognitiveservices-vision-computervision
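As an illustration of the kind of input involved, the sketch below pulls lines and their bounding boxes out of a Read API style JSON payload. It assumes the v3.x response shape (`analyzeResult.readResults[].lines[]` with an 8-number `boundingBox`); older API versions use different field names. The sample payload is hand-written and `azure_lines` is a hypothetical helper, not part of this library.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def azure_lines(response: Dict[str, Any]) -> List[Tuple[int, str, Box]]:
    """Return (page number, text, axis-aligned box) for every line.

    Assumes the v3.x JSON layout; boundingBox is a flat list of four
    (x, y) corner points: [x1, y1, x2, y2, x3, y3, x4, y4].
    """
    results: List[Tuple[int, str, Box]] = []
    read_results = response.get("analyzeResult", {}).get("readResults", [])
    for page in read_results:
        for line in page.get("lines", []):
            flat = line["boundingBox"]
            xs, ys = flat[0::2], flat[1::2]
            box = (min(xs), min(ys), max(xs), max(ys))
            results.append((page.get("page"), line["text"], box))
    return results


# Hand-written sample in the assumed v3.x shape (not real service output).
sample_azure_response = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {
                "page": 1,
                "unit": "pixel",
                "lines": [
                    {"boundingBox": [10, 20, 110, 22, 110, 42, 10, 40], "text": "Hello"},
                    {"boundingBox": [10, 50, 150, 50, 150, 70, 10, 70], "text": "OCR world"},
                ],
            }
        ]
    },
}

azure_result = azure_lines(sample_azure_response)
print(azure_result)
```

Reducing the four corner points to a min/max box loses any rotation, which is acceptable for a quick look but not for skewed scans.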
GOOGLE
- Google Vision API Detect Text https://cloud.google.com/vision/docs/ocr https://cloud.google.com/vision/docs/ocr#vision_text_detection-python
- Google Vision Python Sample https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/vision/cloud-client/document_text/doctext.py
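For the Google Vision output, a comparable sketch reads the `textAnnotations` array of a single JSON image response. It assumes the usual layout in which the first annotation holds the full text and the remaining ones hold individual words, each with `boundingPoly.vertices`. The sample payload is hand-written and `google_words` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def google_words(image_response: Dict[str, Any]) -> List[Tuple[str, Box]]:
    """Return (text, axis-aligned box) for each word annotation.

    The first textAnnotations entry is skipped because it carries the
    whole detected text. Vertices serialized to JSON may omit x or y
    when the value is 0, so missing keys default to 0 here.
    """
    words: List[Tuple[str, Box]] = []
    for annotation in image_response.get("textAnnotations", [])[1:]:
        vertices = annotation["boundingPoly"]["vertices"]
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        words.append((annotation["description"], (min(xs), min(ys), max(xs), max(ys))))
    return words


# Hand-written sample in the assumed shape (not real service output).
sample_google_response = {
    "textAnnotations": [
        {
            "description": "Hello world\n",
            "boundingPoly": {"vertices": [{"y": 5}, {"x": 120, "y": 5},
                                          {"x": 120, "y": 25}, {"y": 25}]},
        },
        {
            "description": "Hello",
            "boundingPoly": {"vertices": [{"y": 5}, {"x": 50, "y": 5},
                                          {"x": 50, "y": 25}, {"y": 25}]},
        },
        {
            "description": "world",
            "boundingPoly": {"vertices": [{"x": 60, "y": 5}, {"x": 120, "y": 5},
                                          {"x": 120, "y": 25}, {"x": 60, "y": 25}]},
        },
    ]
}

google_result = google_words(sample_google_response)
print(google_result)
```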
AWS - Detect Document Text
- AWS Textract detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format. https://aws.amazon.com/textract/features/ https://docs.aws.amazon.com/textract/latest/dg/how-it-works-detecting.html https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
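Textract reports geometry as ratios of the page size rather than pixels, so a consumer typically scales the boxes first. The sketch below assumes the DetectDocumentText response shape (`Blocks` entries with `BlockType`, `Text` and `Geometry.BoundingBox`). The sample payload is hand-written, and `textract_lines` and the image dimensions are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def textract_lines(response: Dict[str, Any], image_width: int,
                   image_height: int) -> List[Tuple[str, Box]]:
    """Return (text, pixel box) for every LINE block.

    BoundingBox values (Left, Top, Width, Height) are ratios of the page
    dimensions, so they are multiplied by the image size here.
    """
    lines: List[Tuple[str, Box]] = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        left = box["Left"] * image_width
        top = box["Top"] * image_height
        right = left + box["Width"] * image_width
        bottom = top + box["Height"] * image_height
        lines.append((block["Text"],
                      (round(left), round(top), round(right), round(bottom))))
    return lines


# Hand-written sample in the assumed shape (not real service output).
sample_textract_response = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "Hello world",
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}}},
        {"BlockType": "WORD", "Text": "Hello",
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.2, "Height": 0.125}}},
    ]
}

textract_result = textract_lines(sample_textract_response, image_width=1000, image_height=800)
print(textract_result)
```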
BBoxHelper - Get Started
More information on getting started can be found in the documentation of this repository.
Known Limitations
More information on known limitations.
Upcoming improvements
- hOCR support https://en.wikipedia.org/wiki/HOCR tools
- asyncio support for page processing
- Google OCR for documents (PDF)
- AWS OCR for documents (PDF)
Release History
0.8 (2020-08-24)
- Support for AWS Detect Document Text
- Google support refactored for consistency
- Simplify the bboxtester script
0.7 (2020-07-29)
- Configurable merge line character (default is a single space)
0.6 (2020-07-09)
- Support for Azure OCR API
0.5 (2020-06-07)
- Fix line/word X alignment
- Improved sorting with support for clusters within clusters
- Added words_count to each line
- Removed dependency on OpenCV and Pillow
0.4.2 (2020-06-06)
- Remove file logging as default
0.4.1 (2020-06-01)
- Comment the determine_ppi method as unstable
0.4 (2020-05-31)
- Bounding boxes rotation improvements
- Fix issues with inch unit support
0.3 (2020-05-23)
- Refactored variable names
- Improved end-of-block handling when generating the final text attribute
0.2 (2020-05-22 Afternoon)
- Changes to accommodate the breaking changes in Azure Computer Vision SDK 0.6.0.
0.1 (2020-05-22 Morning)
- Initial release
Disclaimer
THIS CODE IS PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR NON-INFRINGEMENT.