Index documents using OCR

Development Status
- 3 - Alpha
Intended Audience
- End Users/Desktop
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.8
Topic
- Education

StudiOCR

StudiOCR is an application to index notes and make them searchable by using OCR.

Select .JPG or .PNG files to create a document
Search through a document to see if it has any matching text
Any matching text will be highlighted with a colored box based on confidence level

Installation Instructions

Prerequisites

You must have Qt and Tesseract OCR installed.
Ubuntu or Debian
- sudo apt install qt5-default tesseract-ocr
Arch or Manjaro
- sudo pacman -S qt5-base tesseract
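Before installing the package, it can help to confirm that the Tesseract binary is reachable. The following is a minimal sketch, not part of StudiOCR: the helper name find_missing_tools is hypothetical, and it only checks for the tesseract executable on PATH. Qt is a library rather than a single command, so it is not checked here.

```python
import shutil


def find_missing_tools(tools=("tesseract",)):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("Tesseract found on PATH.")
```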

Install through PyPI

Optionally, create a fresh venv to install the package in
- python3 -m venv venv_StudiOCR
- source venv_StudiOCR/bin/activate
  - To deactivate the venv, run deactivate
pip install StudiOCR
Once installed, run StudiOCR to launch the application

Install from Source

Optionally, create a fresh venv to install the package dependencies in
- python3 -m venv venv_StudiOCR
- source venv_StudiOCR/bin/activate
  - To deactivate the venv, run deactivate
git clone https://github.com/BSpwr/StudiOCR
cd StudiOCR
pip install -r requirements.txt
Once installed, run python3 StudiOCR/main.py to launch the application

Usage

Main Window

Image of MainWindow

Click the Add New Document button to open the Add New Document window
Click a document thumbnail (generated from the document's first page) to open the Document window
Toggle remove mode to remove existing documents
Search for a document based on document name by typing in the search bar with the DOC bullet selected
Search for a document based on matching OCR text by typing in the search bar with the OCR bullet selected
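The two search modes can be pictured as two filters over the same document list. The sketch below is illustrative only: the Document class, the search_documents function, and the choice of case-insensitive substring matching are assumptions, not StudiOCR's actual implementation or database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical in-memory stand-in for a stored document."""
    name: str
    pages: list = field(default_factory=list)  # OCR text, one string per page


def search_documents(documents, query, mode="DOC"):
    """Filter documents by name (mode "DOC") or by OCR text (mode "OCR")."""
    needle = query.casefold()
    if not needle:
        # An empty search bar is assumed to show every document.
        return list(documents)
    if mode == "DOC":
        return [d for d in documents if needle in d.name.casefold()]
    if mode == "OCR":
        return [
            d for d in documents
            if any(needle in page.casefold() for page in d.pages)
        ]
    raise ValueError(f"unknown search mode: {mode!r}")
```

For example, with a document named "Calculus notes", search_documents(docs, "calc") would match it by name, while search_documents(docs, "derivative", mode="OCR") would match it only if that word appears in its recognized text.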

Add New Document Window

Image of AddDocument

Add or remove .JPG or .PNG files to be processed by OCR into a document in the database
Input the document name
Click on the info icon to display a window explaining document options
Select the processing model you wish to use: Best (for accuracy) or Fast (for speed)
Choose whether to apply image preprocessing (converts to grayscale and increases text contrast)
Select the PSM number (Tesseract page segmentation mode):

PSM Number	Description
3	Fully automatic page segmentation, but no OSD. (Default)
4	Assume a single column of text of variable sizes.
5	Assume a single uniform block of vertically aligned text.
6	Assume a single uniform block of text.
7	Treat the image as a single text line.
8	Treat the image as a single word.
9	Treat the image as a single word in a circle.
10	Treat the image as a single character.
11	Sparse text. Find as much text as possible in no particular order.
12	Sparse text with OSD.
13	Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
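The PSM numbers in the table above correspond to the --psm option of the tesseract command-line tool. As a rough sketch of how a chosen PSM could be passed on, the hypothetical helper below builds an argument list for the CLI. How StudiOCR itself invokes Tesseract is not stated here, so treat this as an assumption rather than the application's actual code path.

```python
# PSM values listed in the table above (a subset of Tesseract's modes).
SUPPORTED_PSM = range(3, 14)
DEFAULT_PSM = 3


def build_tesseract_command(image_path, psm=DEFAULT_PSM):
    """Build an argument list for the tesseract CLI that prints text to stdout."""
    if psm not in SUPPORTED_PSM:
        raise ValueError(f"PSM must be between 3 and 13, got {psm}")
    # "stdout" as the output base tells tesseract to print the recognized text.
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm)]
```

The resulting list can be handed to subprocess.run(..., capture_output=True, text=True) on a machine where Tesseract is installed. For a photo of a single handwritten line, for instance, build_tesseract_command("line.png", psm=7) selects the single-text-line mode.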

Document Window

Image of DocWindow

Enter text in the search bar to search for matching text in the document
Click the Next/Previous Page buttons to cycle through the pages in the document
The current page number is shown at the bottom of the window
Toggle show matching pages to display and cycle through only the pages that contain matching text
Box Color:

Box Color	Confidence Value
Green	Greater than or equal to 80
Blue	Less than 80 and greater than or equal to 40
Red	Less than 40
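The thresholds in the table above, together with the show matching pages toggle, can be expressed in a few lines. This is a sketch under assumptions: the function names are hypothetical, each page is modeled as a list of (word, confidence) pairs, and matching is taken to be a case-insensitive substring test, which may differ from what the application does.

```python
def box_color(confidence):
    """Map an OCR confidence value to the box color from the table above."""
    if confidence >= 80:
        return "green"
    if confidence >= 40:
        return "blue"
    return "red"


def matching_pages(pages, query):
    """Return 1-based numbers of the pages that contain `query`.

    `pages` is a list of pages; each page is a list of (word, confidence) pairs.
    """
    needle = query.casefold()
    if not needle:
        return []
    return [
        number
        for number, words in enumerate(pages, start=1)
        if any(needle in word.casefold() for word, _ in words)
    ]
```

With show matching pages enabled, the Next/Previous Page buttons would then step through the list returned by matching_pages instead of through every page.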

Release history

Version	Release date
0.2.0	Aug 1, 2020
0.1.2	Jul 1, 2020
0.1.1 (this version)	Jul 1, 2020
0.1.0	Jul 1, 2020

Download files

Source Distribution

StudiOCR-0.1.1.tar.gz (15.9 MB)

Uploaded Jul 1, 2020 Source

Hashes for StudiOCR-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d21700a558356f3b82acc3f19a26ad56a70a5fd9cf5c8d298b7f6da478e44d2c`
MD5	`25704d8525e193b7cb4b8e539dfd4e67`
BLAKE2b-256	`f71891c0d0dffb4ae8f35a95d4f6d439dfe1c93187eefc28666d163c674b043f`
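A downloaded source archive can be checked against the SHA256 digest listed above. The sketch below uses only Python's standard hashlib module. The function names and the local file path in the usage note are hypothetical.

```python
import hashlib

# SHA256 digest listed above for StudiOCR-0.1.1.tar.gz.
EXPECTED_SHA256 = "d21700a558356f3b82acc3f19a26ad56a70a5fd9cf5c8d298b7f6da478e44d2c"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling verify_download("StudiOCR-0.1.1.tar.gz") on the downloaded archive should return True. A False result suggests the file is incomplete or is not the 0.1.1 source distribution.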