Project description

OCR_with_format

A simple wrapper that postprocesses pytesseract's hOCR output to preserve formatting and spacing.
An alternative implementation can be found on Stack Overflow.
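
To illustrate the general idea of spacing-preserving postprocessing, here is a minimal sketch. It is not the project's actual code. It assumes the words have already been extracted from the hOCR output as (text, x0, y0, x1, y1) bounding boxes in pixels, and that a fixed character width and line height are good enough to map pixels onto a text grid. The function name layout_words and the parameters char_width and line_height are hypothetical.

```python
from collections import defaultdict


def layout_words(words, char_width=10, line_height=20):
    """Place OCR words on a character grid according to their bounding boxes.

    words: iterable of (text, x0, y0, x1, y1) tuples, in pixels.
    char_width, line_height: assumed pixel size of one character cell
    (hypothetical defaults, to be tuned per image).
    """
    rows = defaultdict(list)
    for text, x0, y0, x1, y1 in words:
        # Words whose vertical centre falls in the same band share a row.
        row = int(((y0 + y1) / 2) // line_height)
        rows[row].append((x0, text))

    if not rows:
        return ""

    lines = []
    # Iterate over every row index so that empty bands become blank lines.
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for x0, text in sorted(rows.get(row, [])):
            col = int(x0 // char_width)
            if line:
                # Pad up to the word's column, keeping at least one space
                # between neighbouring words.
                pad = max(col - len(line), 1)
            else:
                pad = col
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("Hello", 0, 0, 50, 20),
        ("World", 100, 0, 150, 20),
        ("Foo", 40, 40, 70, 60),
    ]
    print(layout_words(sample))
```

With a fixed cell size, words that sit far apart in the image stay far apart in the text, and empty vertical bands become blank lines. A real implementation would likely need to estimate the cell size from the image instead of hard-coding it.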

How to

Install: python -m pip install OCR_with_format
See usage: OCR_with_format --help (running it with python -m is not supported)
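
Since the tool is only exposed as a command, one way to call it from a Python script is through subprocess. The sketch below is an assumption about how you might wrap it. The helpers build_command and run_ocr are hypothetical and not part of the package; only the command name and the flags (--thresholding_method, --language, --output_path, --quiet) come from the tool's help text.

```python
import subprocess


def build_command(img_path, thresholding_method="all", language="eng",
                  output_path=None, quiet=True):
    """Build the argument list for the OCR_with_format command-line tool."""
    cmd = [
        "OCR_with_format",
        img_path,
        f"--thresholding_method={thresholding_method}",
        f"--language={language}",
    ]
    if output_path is not None:
        cmd.append(f"--output_path={output_path}")
    if quiet:
        # With --quiet the tool is documented to print only the output, no logs.
        cmd.append("--quiet")
    return cmd


def run_ocr(img_path, **options):
    """Run the tool and return its standard output.

    Requires OCR_with_format to be installed and available on PATH.
    """
    result = subprocess.run(
        build_command(img_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(build_command("./screenshot.png")))
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.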

Usage

NAME
    OCR_with_format

SYNOPSIS
    OCR_with_format IMG_PATH THRESHOLDING_METHOD <flags>

POSITIONAL ARGUMENTS
    IMG_PATH
        Type: str
        path to the image you want to do OCR on
    THRESHOLDING_METHOD
        Type: str
        any from "otsu", "otsu_gaussian", "adaptative_gaussian", "all"

        If "all", all three methods are tried and the final output is the one that maximizes the mean and median confidence over all parsed words.

FLAGS
    -l, --language=LANGUAGE
        Type: str
        Default: 'eng'
        language to look for in the image
    -o, --output_path=OUTPUT_PATH
        Type: Optional[str]
        Default: None
        if not None, will output to this path and erase its previous content.
    -t, --tesseract_args=TESSERACT_ARGS
        Type: str
        Default: '-...
        default arguments for tesseract
    -q, --quiet=QUIET
        Default: False
        if True, will only print the output and no logs
    -c, --comparison_run=COMPARISON_RUN
        Default: False
        if True, will just output the raw output from pytesseract. This can be used to convince yourself of the usefulness of this project.

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS
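
The "all" option of THRESHOLDING_METHOD picks the run with the best word confidences. A minimal sketch of such a selection is shown here. It assumes the per-word confidences of each run are already available as lists of numbers, and that the mean and median are simply added into one score; the help text does not say how the two are combined, so that part is a guess. The function name pick_best_method is hypothetical.

```python
from statistics import mean, median


def pick_best_method(confidences_by_method):
    """Return the method name whose word confidences score highest.

    confidences_by_method: dict mapping a method name to a list of
    per-word confidences. Methods that produced no words are skipped.
    """
    def score(confs):
        # Assumption: mean and median are added into a single score.
        return mean(confs) + median(confs)

    candidates = {m: c for m, c in confidences_by_method.items() if c}
    if not candidates:
        raise ValueError("no method produced any words")
    return max(candidates, key=lambda m: score(candidates[m]))


if __name__ == "__main__":
    runs = {
        "otsu": [90, 80, 70],
        "otsu_gaussian": [95, 92, 60],
        "adaptative_gaussian": [50, 40],
    }
    print(pick_best_method(runs))
```

Using the median alongside the mean makes the score less sensitive to a few words with very low confidence.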

Example

Output from OCR_with_format ./screenshot.png --thresholding_method="all" --quiet:

                                                                    @Unwateh (1) ~        Fork (3)                  (©
    OCR_with_format                          [     Pir ][                   | [  &            <) [     s        -]
                                                                                          About
                                                                                                                              &
                                                                                          Wrapper around pytesseract to
¥ Branches  © Tags                                                                     postprocess in a way that preserves
                                                                                          spacing and formattings.
  i  thiswillbeyourgithub addded license                        10 minutes ago  O 4
                                                                                          &5 GPL-3.0 license
  @  LICENSE            addded license                             10 minutes ago    - Activity
  [u]  __init__.py           minor                                      11 minutes ago     ¢ Ostars
  [u]  requirements.txt       added empty requirements                  11 minutes ago    <& 1 watching
                                                                                          Y  Oforks
  Help people interested in this repository understand your project by
  adding a README.                                                                      Releases
                                                                                          Create No releases a new published release
                                                                                          Packages
                                                                                          No packages published
                                                                                          Publish your first package
                                                                                          Languages
                                                                                          ———
                                                                                           ® Python 100.0%

Output from OCR_with_format ./screenshot.png --quiet --comparison_run:

OCR_with_format

[ pin | [ @unwateh (@) ~ | [ & Fork (O)

-] [ ¢ s (0

¥ main ~

¥ Branches © Tags

Wrapper around pytesseract to
postprocess in a way that preserves

. . . spacing and formattings.
- thiswillbeyourgithub addded license 10 minutes ago 'O 4
&5 GPL-3.0 license
@ LICENSE addded license 10 minutes ago A Activity
0O _init__py minor 11 minutes ago ¢ Ostars
O requirements.txt added empty requirements 11 minutes ago | @ 1watching
% 0forks
Help people interested in this repository understand your project by
adding a README. Releases

No releases published
Create a new release

Packages

No packages published
Publish your first package

Languages

————
@ Python 100.0%

Release history

0.12 (Jul 6, 2024)
0.9 (Jul 26, 2023)
0.8 (Jul 25, 2023)
0.7 (Jul 25, 2023, this version)
0.6 (Jul 25, 2023)
0.5 (Jul 25, 2023)
0.4 (Jul 25, 2023)
0.3 (Jul 25, 2023)
0.2 (Jul 25, 2023)

Download files

Source Distribution

OCR_with_format-0.7.tar.gz (44.7 kB)

Uploaded Jul 25, 2023 Source

Built Distribution

OCR_with_format-0.7-py3-none-any.whl (31.8 kB)

Uploaded Jul 25, 2023 Python 3

Hashes for OCR_with_format-0.7.tar.gz
Algorithm	Hash digest
SHA256	`61a268c3e288ef8c6c3adba8b1ceec0fbd5379592ef5f895326ea65047fa738c`
MD5	`b834a04828b8fd890a4410582d22239f`
BLAKE2b-256	`276578e9e256596db36b3f8ed79533b87e19596aa743fc779ee14f35193e23bd`

Hashes for OCR_with_format-0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8db91fdb5175fd615c191b9c9796c09a9d28e8e9f9a044165e0bdf40c3160cf1`
MD5	`ecbb8b6dce64e8334f8807777d1bbe54`
BLAKE2b-256	`d86822ff8887ba2f87c81885bdc0a043c2c41fa9874162c9b7c5a8a282a6b7ca`