Skip to main content

Interactive extraction of selected text from images and batch processing of stored image files.

Project description

pictureTextCrop

This application can be run by installing the python package dependencies using pip and then simply unzipping the source archive, pictureTextCrop.tar.gz into a new directory that you create to contain the application and its files. The python packages required, available at PyPI.org, include one rather large one, PyQt5. The complete list of packages required is: PyQt5, pytesseract, and Pillow, which is included as a library module in current releases of Python.

Installation Steps:

  1. Download the source code archive from this account: PictureTextCrop-main.zip.

  2. Put it in a folder you make to host this application.

  3. Extract the source code archive in its application folder.

  4. Go into the extrated folder, then into the src/ folder. You will find the main python file: pictureTextCrop.py. This is the file you will run in a console. Open a console / terminal in this folder. This is the working folder for the application.

  5. Check your python version with python -V or python3 -V. If these do not work try the -h option for a list of options available. The version must be 3.6 or higher. (Installing python is beyond the scope of this Quick Start guide.)

  6. If the python package installer, pip, is not installed, execute the following command in the console:

    sudo apt install python3-pip
    
  7. To install the python GUI package required, execute:

    pip install PyQt5
    
  8. Then to install the text extraction package, run:

    pip install pytesseract
    
  then
  
      sudo apt install tesseract_ocr
  1. Although sqlite3 is provided in python as a library, you will want to use it outside of this context and will want a GUI viewer for the database file:

    sudo apt install sqlite3
    
    sudo apt install squlitebrowser
    

Once these steps succeed, go into the PictureTextCrop-main folder, then into its src/ folder, and locate the pictureTextCrop.py source file. To start the application, execute the following on the command line:

$ python3 pictureTextCrop.py

Introduction:

This application is a simple tool which allows you to interactively extract text from image files by dragging from the top-left corner of a crop rectangle to the bottom-right corner. When you release the mouse button the text inside the rectangle is printed to the console and a record of the conversion and its result is made in the SQLite3 database, TextExtraction.db, in the CropLog table. Each record of this table records the date and time of the extraction in its timeStamp field, the full path to the file on your disk of the image file in the filePath field, the coordinates of the crop in the coordinates field, and the text of the crop in the text field. The value of storing all extractions in a database table is simply that you can then search large numbers of images containing particular text using SQL.

The image formats recognized are those recognized by the PyQy5 QImageReader class, which includes those associated with the following file extensions:

'bmp', 'cur', 'gif', 'icns', 'ico', 'jpeg', 'jpg', 'mng', 'pbm', 'pgm', 'png', 'ppm', 'svg', 'svgz', 'tga', 'tif', 'tiff', 'wbmp', 'webp', 'xbm', 'xpm'

The application main window displays a list on the left which will contain the paths of the image files identified when you select folders to scan. In the right half of the dialog will be a space which shows the image selected when you click on a file path in the list. The dialog can be resized to accommodate your needs.

To extract text from an image, select the checkbox located to the left of it and click on the "Select Image Text" button in the button-bar at the top. You can select multiple files and a text selection dialog will be stacked up for each. Press the left mouse button at the top left of a rectangle containing the text you want to extract and then drag to the lower right corner. When you release the button the extracted text is printed to the console and a record of the extraction and its result is made in the SQLite3 database. You can do this as many times as you like for each image. A separate CropLog record will be made for each.

Batch Mode:

If you click on the "Batch Process" button in the tool bar the application will extract all of the text from each of the files in the folder scan list. This process can take some time, my own trials on my ordinary laptop resulting in a pace of more than 5 seconds per image, sometimes 15 or so. With only 50 images, the process could take up to 10 minutes. The resulting extracted text will be placed into the BatchMaster table in the SQLite3 database. The fields written in each record in include the extraction TimeStamp, the FolderPath, the FileName, the Text, and the file metadata in the Info field. Views for reviewing the file metadata and other components saved in convenient formats are planned.

MIME Information Collection:

Coming Soon.

When processing digital evidence you must be careful to preserve the file metadata, especially the time stamps. Know how operating systems treat timestamps and permissions when you copy files and how your archiving applications do as well.

If you would like to support this project, you can contribute here: https://www.buymeacoffee.com/keithmichah

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pictureTextCrop-0.6.0.tar.gz (199.9 kB view details)

Uploaded Source

Built Distribution

pictureTextCrop-0.6.0-py3-none-any.whl (208.5 kB view details)

Uploaded Python 3

File details

Details for the file pictureTextCrop-0.6.0.tar.gz.

File metadata

  • Download URL: pictureTextCrop-0.6.0.tar.gz
  • Upload date:
  • Size: 199.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for pictureTextCrop-0.6.0.tar.gz
Algorithm Hash digest
SHA256 8f24dd89524abbd86f6db2cc86a86c1fd83b7fd397963f5f40c8a21bf4c6527d
MD5 a318dd62affadfa4ea0bcb12367f4620
BLAKE2b-256 e8da90db3e3294affa6faebf34e926703b48e0c39afaf67dc9157332bfae26f8

See more details on using hashes here.

File details

Details for the file pictureTextCrop-0.6.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pictureTextCrop-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f30c18849b91f64f966a90f402ff5f5d157c25711736563d99c2effdfe40ed9b
MD5 bc828533b6e0ed159fcb2b02684042e2
BLAKE2b-256 8e484fa618b5fa024336de5d273b05852d0f1c7e4df3579a2512d97a28aab8ee

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page