
A helper class to get more meaningful text out of common OCR outputs

Project description

OcrLayout Library

Provides the ability to get more meaningful text out of common OCR outputs. It manipulates the bounding boxes of lines to rebuild a page layout that approximates the human reading experience.

Problem Statement

When running OCR on images that contain a lot of textual information, it becomes important to assemble the generated text into meaningful lines, combining related paragraphs or sentences.

Another way to see it is as clustering the lines of text based on their positions/coordinates in the original content.
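
To make the clustering idea concrete, here is a minimal sketch in Python. It is not the library's actual algorithm: the `Line` class, the `cluster_lines` function, and the `gap_ratio` threshold are hypothetical names introduced only for illustration. It assumes each OCR line comes with an axis-aligned bounding box and groups lines into blocks whenever the vertical gap to the previous line is small relative to the line height.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One OCR line with an axis-aligned bounding box (hypothetical type)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def cluster_lines(lines: List[Line], gap_ratio: float = 0.6,
                  merge_char: str = " ") -> List[str]:
    """Group lines into blocks by vertical proximity, then join each block.

    A line joins the current block when the gap between its top edge and
    the bottom edge of the previous line is at most gap_ratio times the
    previous line's height. This ignores columns and rotation entirely.
    """
    ordered = sorted(lines, key=lambda ln: (ln.y, ln.x))
    blocks: List[List[Line]] = []
    for line in ordered:
        if blocks:
            prev = blocks[-1][-1]
            gap = line.y - (prev.y + prev.height)
            if gap <= gap_ratio * prev.height:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return [merge_char.join(ln.text for ln in block) for block in blocks]


sample = [
    Line("A new paragraph", 10, 100, 200, 20),
    Line("OCR engines return", 10, 10, 200, 20),
    Line("starts here.", 10, 124, 200, 20),
    Line("one line at a time.", 10, 34, 200, 20),
]
result = cluster_lines(sample)
```

With the sample above, the four lines collapse into two blocks of text, one per paragraph. A real layout engine would also need to handle multiple columns, rotated boxes, and nested clusters, which this sketch deliberately leaves out.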

More meaningful output for what?

  • Text Analytics: you can run any text analytics, such as key phrase or entity extraction, with more confidence in the outcome.
  • Accessibility: any infographic comes alive, going beyond what alt text alone offers.
  • Modern browser Read Aloud feature: it becomes easier to build solutions that read an image aloud, adding verbal narrative to visual information.
  • Machine Translation: get more accurate MT output because more context is retained.
  • Sentence/Paragraph Classification: for scan-based images such as contracts, more meaningful textual output lets you classify content at a granular level, for example by risk, personal clauses, or conditions.

OCR Output Support

Today bboxhelper supports the output of:

AZURE

GOOGLE

AWS - Detect Document Text
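
Each of these services reports line geometry in a different shape, so a common first step is to normalize them. The sketch below is an assumption-laden illustration, not the library's code: the function names are hypothetical, and the input shapes assumed are a flat 8-number `boundingBox` list for Azure Read-style output, a `boundingPoly.vertices` list for Google Vision-style output, and a ratio-based `Geometry.BoundingBox` for AWS Detect Document Text-style output.

```python
from typing import Dict, Sequence, Tuple

# Normalized box: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_from_points(xs: Sequence[float], ys: Sequence[float]) -> Box:
    """Return the axis-aligned box enclosing the given corner points."""
    return (min(xs), min(ys), max(xs), max(ys))


def azure_box(line: Dict) -> Box:
    # Assumed shape: "boundingBox" is a flat list [x1, y1, ..., x4, y4].
    flat = line["boundingBox"]
    return box_from_points(flat[0::2], flat[1::2])


def google_box(annotation: Dict) -> Box:
    # Assumed shape: "boundingPoly" -> "vertices" -> [{"x": .., "y": ..}].
    # A coordinate equal to zero may be omitted in the JSON, hence .get().
    vertices = annotation["boundingPoly"]["vertices"]
    return box_from_points([v.get("x", 0) for v in vertices],
                           [v.get("y", 0) for v in vertices])


def aws_box(block: Dict, page_width: float, page_height: float) -> Box:
    # Assumed shape: "Geometry" -> "BoundingBox" with Left/Top/Width/Height
    # expressed as ratios of the page size, so they are scaled here.
    bb = block["Geometry"]["BoundingBox"]
    left = bb["Left"] * page_width
    top = bb["Top"] * page_height
    return (left, top,
            left + bb["Width"] * page_width,
            top + bb["Height"] * page_height)


azure = azure_box({"boundingBox": [10, 20, 110, 20, 110, 40, 10, 40]})
google = google_box({"boundingPoly": {"vertices": [
    {"y": 20}, {"x": 100, "y": 20}, {"x": 100, "y": 40}, {"y": 40}]}})
aws = aws_box({"Geometry": {"BoundingBox": {
    "Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}}}, 1000, 800)
```

Once every provider's output is reduced to the same box tuple, the same layout logic can run regardless of which OCR service produced the lines.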

BBoxHelper - Get Started

More information on getting started can be found in the documentation of this repository.

Known Limitations

More information on known limitations.

Upcoming improvements

Release History

0.8 (2020-08-24)

  • Support for AWS Detect Document Text
  • Google support refactored for consistency
  • Simplify the bboxtester script

0.7 (2020-07-29)

  • Configurable merge line character (default is a single space)

0.6 (2020-07-09)

  • Support for Azure OCR API

0.5 (2020-06-07)

  • Fix line/word X alignment
  • Improved sorting with support for clusters within clusters
  • Added words_count to each line
  • Removed dependency on OpenCV and Pillow

0.4.2 (2020-06-06)

  • Remove file logging as default

0.4.1 (2020-06-01)

  • Comment the determine_ppi method as unstable

0.4 (2020-05-31)

  • Bounding boxes rotation improvements
  • Fix issues with inch unit support

0.3 (2020-05-23)

  • Refactored variable names
  • Improved end-of-block handling when generating the final text attribute

0.2 (2020-05-22 Afternoon)

0.1 (2020-05-22 Morning)

  • Initial release

Disclaimer

THIS CODE IS PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR NON-INFRINGEMENT.

Download files

Source Distribution

ocrlayout-0.8.tar.gz (17.5 kB)

Uploaded Source

Built Distribution

ocrlayout-0.8-py3-none-any.whl (17.7 kB)

Uploaded Python 3

File details

Details for the file ocrlayout-0.8.tar.gz.

File metadata

  • Download URL: ocrlayout-0.8.tar.gz
  • Upload date:
  • Size: 17.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.7.7

File hashes

Hashes for ocrlayout-0.8.tar.gz
Algorithm Hash digest
SHA256 e1fe70411832eaa3f2533cd1ba076cbf9559a6b48f1584d3ec9a2ce15375a70a
MD5 d2ef4034b116bc4ea58d525bfb132394
BLAKE2b-256 5553e250bb0d5064e1d7913cc6b59d56aade47d95e8fd41fbb7a537c90cf0962


File details

Details for the file ocrlayout-0.8-py3-none-any.whl.

File metadata

  • Download URL: ocrlayout-0.8-py3-none-any.whl
  • Upload date:
  • Size: 17.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.7.7

File hashes

Hashes for ocrlayout-0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 f19d6d753b27b3a88bc1a4156b0bdd2a1a6c93bd59b95abdf48d456cf3d136b3
MD5 7566b2b8d2ed48117ab6d8ab4cfcaee5
BLAKE2b-256 783917a41023a2bde7d9738506df2f3a3ce2604b5a0dec10384c4d618db02adb
