A tool for running different OCR engines and processing their results using common data structures.
mim_ocr
The goal of this project is to create a robust and reliable Python library that can handle OCR tasks. Several capabilities and features are envisioned:
- Running OCR with different tools, such as:
  - Tesseract (local)
  - Google Cloud Vision (cloud)
  - AWS Textract (cloud)
  - EasyOCR (local)
- Image preprocessing, such as:
  - rotation
  - reorientation
- Returning OCR results in common data structures
- Finding features in OCR-ed images using:
  - regular expressions
  - keyword lists
  - NLP models
- OCR result visualization
- Running OCR on large data:
  - parallelization
  - GPU usage
- Detecting various features in OCR results
The project was started in the context of manipulating medical data, but is planned to be used in other fields as well.
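To make the idea of a common data structure more concrete, here is a minimal sketch of how results from any engine could be normalized and then searched with a regular expression. The names `OcrWord` and `find_by_regex` are hypothetical and chosen for illustration only; they are not the mim_ocr API.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OcrWord:
    """One recognized word (hypothetical structure, not the mim_ocr API)."""

    text: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    confidence: float  # engine confidence in the range 0.0-1.0


def find_by_regex(words: List[OcrWord], pattern: str) -> List[OcrWord]:
    """Return the words whose whole text matches the regular expression."""
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word.text)]


# Made-up sample data standing in for the normalized output of any engine.
words = [
    OcrWord("Date:", (10, 10, 60, 30), 0.98),
    OcrWord("2021-03-15", (70, 10, 170, 30), 0.95),
    OcrWord("Patient", (10, 40, 80, 60), 0.91),
]
dates = find_by_regex(words, r"\d{4}-\d{2}-\d{2}")
print([word.text for word in dates])
```

Because the feature finder only depends on the shared structure, the same search would work regardless of which engine produced the words.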
Rules for developers
- Create automated tests for your features.
- When providing example images for your tests, strip them of personal data and verify that you have the image owner's permission.
Usage
Requirements
Python 3.9 or 3.10 is required. Additional required system packages (tested on Ubuntu 20.04):
- libgl1
- libglib2.0-0
- tesseract-ocr
- poppler-utils
- protobuf-compiler
Useful requirements:
- tesseract-ocr-pol (As Tesseract is set by default to Polish language)
The complete setup pipeline, starting from a raw Ubuntu Docker image, is described in build/distribution/test_distribution.Dockerfile.
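Before installing, it can help to check the environment against the requirements listed above. The following is a rough sketch using only the standard library. It assumes that the tesseract-ocr, poppler-utils, and protobuf-compiler packages provide the `tesseract`, `pdftoppm`, and `protoc` executables respectively. The shared libraries (libgl1, libglib2.0-0) are not checked here, and the function names are illustrative.

```python
import shutil
import sys
from typing import List, Sequence, Tuple

# Python versions named in the requirements above.
SUPPORTED_PYTHON = ((3, 9), (3, 10))

# Assumed executables: tesseract-ocr -> tesseract,
# poppler-utils -> pdftoppm, protobuf-compiler -> protoc.
REQUIRED_EXECUTABLES = ("tesseract", "pdftoppm", "protoc")


def python_is_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Check a (major, minor) pair against the supported versions."""
    return tuple(version) in SUPPORTED_PYTHON


def missing_executables(names: Sequence[str] = REQUIRED_EXECUTABLES) -> List[str]:
    """Return the executables that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


print("Python supported:", python_is_supported())
print("Missing executables:", missing_executables())
```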
Installation
python -m pip install mim-ocr
Running
To run Google OCR locally (both for regular use and for tests), you need a Google service account key in JSON format, stored in a local file that is not committed to git. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of this file.
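A quick way to verify this setup before running anything is sketched below. It only checks that the variable is set and that the file looks like a service account key (such files carry a `"type": "service_account"` field). It does not contact Google, and the function name `google_key_path` is made up for this example.

```python
import json
import os
from pathlib import Path
from typing import Mapping


def google_key_path(env: Mapping[str, str] = os.environ) -> Path:
    """Return the service account key path, or raise if it is unusable."""
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(raw)
    if not path.is_file():
        raise RuntimeError(f"Key file not found: {path}")
    with path.open(encoding="utf-8") as handle:
        key = json.load(handle)
    # Service account key files declare their type in a "type" field.
    if not isinstance(key, dict) or key.get("type") != "service_account":
        raise RuntimeError(f"{path} does not look like a service account key")
    return path
```

Calling `google_key_path()` at the start of a test session gives a clear error message instead of a failure deep inside the client library.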
To run AWS Textract locally (both for regular use and for tests), you need properly configured AWS credentials, preferably set through environment variables:
- AWS_ACCESS_KEY_ID: The access key for your AWS account.
- AWS_SECRET_ACCESS_KEY: The secret key for your AWS account.
- AWS_DEFAULT_REGION: Specifies the AWS Region to send the request to.
For more information see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html.
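As a small illustration, the snippet below reports which of the three variables are missing from the environment. It is only a sketch: boto3 can also read credentials from other sources (for example a shared credentials file), so a missing variable is not necessarily an error. The helper name `missing_aws_variables` is hypothetical.

```python
import os
from typing import List, Mapping

AWS_VARIABLES = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_variables(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the AWS variables that are unset or empty in env."""
    return [name for name in AWS_VARIABLES if not env.get(name)]


missing = missing_aws_variables()
if missing:
    print("Not set in the environment:", ", ".join(missing))
```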
Some features might require a YAML config file with local paths or parameters. Examples of such features:
- Keyword Features finder (requires the path to the directory where local hyperscan databases are or should be located).
If you want to use such a config file, set the MIM_OCR_CONFIG_PATH environment variable to its path.
Example working values can be found at config/test_mim_ocr_conf.yaml.
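Since the config file is optional, code that depends on it has to handle both the "not configured" case and the "configured but missing" case. The sketch below shows one way to do that. The helper `config_path` is hypothetical and does not reflect how mim_ocr itself resolves the file.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def config_path(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the configured YAML path, or None when no config is set."""
    raw = env.get("MIM_OCR_CONFIG_PATH")
    if not raw:
        return None  # the config file is optional
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"MIM_OCR_CONFIG_PATH points to a missing file: {path}")
    return path
```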
Additional Features
Additional features are tested only on Python 3.9.
NER_FEATURE
More Information
Licence
MIT License. See LICENSE.txt for details.
For Maintainers
Building a new version of the package: run python3.9 -m build from the main folder.
Uploading a new version to PyPI: run twine upload dist/* (the two newly created files).