
A tool for running different OCR engines and processing their results using common data structures.


mim_ocr

Project goal

The goal of this project is to create a robust and reliable Python library that can handle OCR tasks. Several capabilities and features are envisioned:

  • Running OCR with different tools such as
    • Tesseract (local)
    • Google Cloud Vision (cloud)
    • AWS Textract (cloud)
    • EasyOCR (local)
  • Image Preprocessing, such as:
    • rotation
    • reorientation
  • Return OCR results in common data structures
  • Finding features in OCR-ed images using
    • regular expressions
    • keyword lists
    • NLP models
  • OCR result visualization
  • Running OCR on large data
    • parallelization
    • usage of GPU
  • Detecting various features in OCR results
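
To make the idea of a common data structure more concrete, here is a minimal sketch. It is not mim_ocr's actual API: the Word and OcrPage classes and the find_words helper are hypothetical names chosen for illustration. It shows how results from any engine could be held in one shape and then searched with a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word plus its bounding box in pixels (hypothetical)."""
    text: str
    left: int
    top: int
    width: int
    height: int
    confidence: float  # assumed to be normalized to the 0.0-1.0 range


@dataclass
class OcrPage:
    """Engine-independent container; every backend would fill the same fields."""
    engine: str
    words: list[Word]

    def full_text(self) -> str:
        # Join the words in reading order, separated by single spaces.
        return " ".join(w.text for w in self.words)


def find_words(page: OcrPage, pattern: re.Pattern) -> list[Word]:
    """Return the words whose whole text matches the compiled pattern."""
    return [w for w in page.words if pattern.fullmatch(w.text)]


# Hand-written sample data standing in for a real OCR result.
page = OcrPage(
    engine="tesseract",
    words=[
        Word("Date:", 10, 10, 40, 12, 0.98),
        Word("2021-03-15", 55, 10, 80, 12, 0.95),
        Word("Patient", 10, 30, 50, 12, 0.91),
    ],
)
dates = find_words(page, re.compile(r"\d{4}-\d{2}-\d{2}"))
print([w.text for w in dates])  # ['2021-03-15']
```

Because each match keeps its bounding box, the same result could later feed a visualization step without another OCR pass.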

The project was started in the context of manipulating medical data, but is planned to be used in other fields as well.

Rules for developers

  • Create automated tests for your features.
  • When providing example images for your tests, please strip them of personal data. Please also verify that you have the permission of the image owner.

Usage

Requirements

Python 3.9 or 3.10 is required. Additional required system packages (tested on Ubuntu 20.04):

  • libgl1
  • libglib2.0-0
  • tesseract-ocr
  • poppler-utils
  • protobuf-compiler

Useful requirements:

  • tesseract-ocr-pol (Tesseract is set to the Polish language by default)

The complete setup pipeline, starting from a raw Ubuntu Docker image, is described in build/distribution/test_distribution.Dockerfile.
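
As a quick sanity check before installing, the requirements above could be verified with a short script. This is a sketch and not part of mim_ocr. It assumes the usual executable names shipped by the listed packages (tesseract from tesseract-ocr, pdftoppm from poppler-utils, protoc from protobuf-compiler). The shared libraries libgl1 and libglib2.0-0 provide no executables and are not checked here.

```python
import shutil
import sys

# Assumed executable names for the required system packages.
REQUIRED_BINARIES = ("tesseract", "pdftoppm", "protoc")
SUPPORTED_PYTHONS = ((3, 9), (3, 10))


def python_supported(version=None) -> bool:
    """True if the (major, minor) version is one the README lists."""
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor in SUPPORTED_PYTHONS


def missing_binaries(names=REQUIRED_BINARIES, which=shutil.which) -> list:
    """Return the executables that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Missing executables:", missing_binaries() or "none")
```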

Installation

python -m pip install mim-ocr

Running

To run Google OCR locally (both for regular use and for tests), you need a Google service account key in JSON format, stored in a local file that is not committed to git. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of this file.
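
A small pre-flight check can catch a missing or malformed key file before any request is sent. The function below is a hypothetical helper, not something mim_ocr provides. It relies only on the fact that Google service account key files are JSON objects with "type" set to "service_account".

```python
import json
import os
from pathlib import Path


def google_credentials_problem(environ=os.environ):
    """Describe what is wrong with the key file, or return None if it looks usable."""
    raw = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    path = Path(raw)
    if not path.is_file():
        return f"{path} does not exist or is not a file"
    try:
        key = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
        return f"{path} is not readable JSON: {exc}"
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return f"{path} does not look like a service account key"
    return None


if __name__ == "__main__":
    print(google_credentials_problem() or "Google credentials look usable")
```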

To run AWS Textract locally (both for regular use and for tests), you need properly configured AWS credentials, preferably supplied through environment variables.
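
The README does not name the variables. The sketch below assumes the standard names read by the AWS SDKs (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION) and reports which of them are missing. The helper itself is hypothetical and not part of mim_ocr.

```python
import os

# Standard AWS SDK variable names; assumed here, not confirmed by the README.
AWS_VARIABLES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_variables(environ=os.environ) -> list:
    """Return the AWS variables that are unset or empty."""
    return [name for name in AWS_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_variables()
    print("Missing AWS variables:", ", ".join(missing) if missing else "none")
```

Temporary credentials additionally need AWS_SESSION_TOKEN, which this check deliberately treats as optional.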

Some features might require creation of a config YAML file with local paths or parameters. Examples of such features:

  • Keyword Features finder (requires the path to the directory where local hyperscan databases are or should be located).

If you want to use such a config file, set the MIM_OCR_CONFIG_PATH environment variable to its path.

Example working values can be found at config/test_mim_ocr_conf.yaml.
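
How the library itself reads this variable is not shown here. As an illustration only, a caller could resolve and validate the path as follows; resolve_config_path is a hypothetical helper, and the contents of the YAML file are deliberately not parsed because its keys are not documented in this README.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_config_path(environ=os.environ) -> Optional[Path]:
    """Return the config file path, or None when no config file is requested."""
    raw = environ.get("MIM_OCR_CONFIG_PATH")
    if not raw:
        return None  # the config file is optional
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"MIM_OCR_CONFIG_PATH points to a missing file: {path}")
    return path


if __name__ == "__main__":
    print("Config file:", resolve_config_path({}) or "not configured")
```

Failing early on a wrong path gives a clearer error than a later failure inside a feature that needs the config.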

Additional Features

Additional features are tested only on Python 3.9.

NER_FEATURE

See the NER Feature Readme.

More Information

Licence

MIT License. See LICENSE.txt for details.

For Maintainers

Building a new version of the package: run python3.9 -m build from the main folder.

Uploading a new version to PyPI: run twine upload dist/* (the two newly created files).
