OCRticle - Structured OCR for articles

OCRticle

OCRticle is a GUI application that extracts text from an image while preserving the text's original structure. It works best with articles (hence the name), but it should work with any kind of text.

Installation

OCRticle requires Python >= 3.10 and the Tesseract OCR engine. Instructions for installing Tesseract can be found here. Currently, OCRticle supports OCR for text in English and in Portuguese. For detecting text in Portuguese, the corresponding Tesseract language pack must be installed.
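Before installing, it can help to confirm that Tesseract and the needed language packs are present. The following is a minimal sketch, not part of OCRticle: it assumes the `tesseract` executable is on the PATH and relies on Tesseract's `--list-langs` option and its standard language codes (`eng` for English, `por` for Portuguese). The helper names `parse_langs` and `installed_langs` are hypothetical.

```python
import shutil
import subprocess


def parse_langs(listing: str) -> set[str]:
    """Parse the output of `tesseract --list-langs` into a set of language codes."""
    lines = [line.strip() for line in listing.splitlines()]
    # The first line is a header such as 'List of available languages (2):'.
    return {line for line in lines if line and not line.lower().startswith("list of")}


def installed_langs() -> set[str]:
    """Return installed Tesseract languages, or an empty set if Tesseract is missing."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the listing to stderr, so read both streams.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    print("English available:", "eng" in langs)
    print("Portuguese available:", "por" in langs)
```

If Portuguese is reported as unavailable, the corresponding Tesseract language pack still needs to be installed.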

Linux users should be able to install OCRticle by just running:

pip install ocrticle

For Windows users, a pre-compiled binary can be downloaded from here.

Usage

When invoked from the command line, OCRticle can be given an image path as an optional argument. Otherwise, the application will ask the user to select an image from the computer.

After an image has been selected, the next window of the application allows the user to draw rectangles to select the articles present in the image. Each rectangle or group of intersecting rectangles should correspond to one article. Alternatively, the user can draw rectangles to exclude certain parts of the image from being scanned by selecting the corresponding option. There are also options to change the brightness, contrast and saturation of the source image. For example, to convert an image to black and white, the saturation can be set to 0.
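The rule that a group of intersecting rectangles forms one article can be pictured as finding connected components among the rectangles. The following sketch uses a small union-find for that purpose; it is an illustration under stated assumptions, not OCRticle's code. The `Rect` type, the function names, and the choice to treat rectangles that merely touch at an edge as intersecting are all hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def intersects(a: Rect, b: Rect) -> bool:
    """True when the rectangles overlap (touching edges count, by assumption)."""
    return (
        a.left <= b.right
        and b.left <= a.right
        and a.top <= b.bottom
        and b.top <= a.bottom
    )


def group_rects(rects: list[Rect]) -> list[list[Rect]]:
    """Group rectangles so that each group is connected through intersections."""
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if intersects(rects[i], rects[j]):
                parent[find(i)] = find(j)  # merge the two groups

    groups: dict[int, list[Rect]] = {}
    for i, rect in enumerate(rects):
        groups.setdefault(find(i), []).append(rect)
    return list(groups.values())
```

Note that two rectangles that do not overlap directly still end up in the same group when a third rectangle intersects both of them.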

Once this step is concluded, OCRticle will use Tesseract to scan the image or the selected rectangles and extract the text found in each. Then, OCRticle will try to automatically join the text in different blocks. This behavior depends largely on Tesseract's detection, which is not perfect, so the blocks may not exactly match the original text.

The user can tag each block according to its contents. The available tags are Title, Text, Quote, and Code. OCRticle will try to automatically "guess" which block is the article's title, but if it fails, the user can fix the mislabeling. OCRticle will also remove some line breaks when it believes that two lines are part of the same paragraph. This behavior can be toggled during this step.
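The source does not describe how OCRticle decides that two lines belong to the same paragraph. As a rough illustration only, the sketch below uses a simple assumed heuristic: a line is merged into the previous one unless the previous line is blank or ends with sentence-closing punctuation. The function name `join_lines` and the punctuation set are hypothetical.

```python
# Characters that, by assumption, mark the end of a paragraph line.
SENTENCE_END = (".", "!", "?", ":", '"')


def join_lines(text: str) -> str:
    """Merge lines that look like continuations of the previous line."""
    out: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            out.append("")  # keep blank lines as paragraph breaks
        elif out and out[-1] and not out[-1].endswith(SENTENCE_END):
            out[-1] = out[-1] + " " + line  # continuation of the previous line
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(join_lines("The quick brown\nfox jumps.\nNext line."))
```

A heuristic like this will misjudge some cases, such as a sentence that happens to end exactly at a line break inside a paragraph, which is one reason a toggle for the behavior is useful.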

Finally, the user can save the scanned articles in a Markdown file, where the blocks will be identified according to their tags.
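The exact Markdown that OCRticle writes for each tag is not specified here. One plausible mapping, shown purely as an assumption, is a level-one heading for Title, a blockquote for Quote, a fenced block for Code, and plain paragraphs for Text. The `Block` class and both functions below are hypothetical.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence, built here to keep this example self-contained


@dataclass
class Block:
    tag: str  # one of "Title", "Text", "Quote", "Code"
    text: str


def block_to_markdown(block: Block) -> str:
    """Render one tagged block using the assumed tag-to-Markdown mapping."""
    if block.tag == "Title":
        return "# " + block.text
    if block.tag == "Quote":
        return "\n".join("> " + line for line in block.text.splitlines())
    if block.tag == "Code":
        return f"{FENCE}\n{block.text}\n{FENCE}"
    return block.text  # "Text" and any unknown tag are written as plain paragraphs


def article_to_markdown(blocks: list[Block]) -> str:
    """Join rendered blocks with blank lines, as Markdown paragraphs require."""
    return "\n\n".join(block_to_markdown(b) for b in blocks) + "\n"
```

For example, `article_to_markdown([Block("Title", "Hello"), Block("Text", "Body")])` produces a heading line, a blank line, and the body paragraph.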
