s3-ocr

Tools for running OCR against files stored in S3

Installation

Install this tool using pip:

pip install s3-ocr

Usage

The start command loops through every PDF file in a bucket (every key ending in .pdf) and submits each one to AWS Textract for OCR processing.

You need AWS credentials configured, either as environment variables or in a credentials file in your home directory (~/.aws/credentials).

You can start the process running like this:

s3-ocr start name-of-your-bucket

Textract runs OCR asynchronously, so processing can take some time. The results will be written to the textract-output/ prefix in your bucket.
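The behavior described above can be sketched in Python. This is an illustrative sketch, not the tool's actual source: the helper names are my own, and it assumes the standard boto3 clients for S3 and Textract.

```python
def pdf_keys_without_tracking(keys):
    """Return the .pdf keys that have no matching .s3-ocr.json sidecar yet."""
    existing = set(keys)
    return [
        k for k in keys
        if k.endswith(".pdf") and (k + ".s3-ocr.json") not in existing
    ]

def submit_bucket(bucket):
    """List every object in the bucket, then submit each not-yet-tracked PDF
    to Textract, directing results to the textract-output/ prefix."""
    import boto3  # imported lazily so the pure helper above needs no dependencies
    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        keys += [obj["Key"] for obj in page.get("Contents", [])]
    for key in pdf_keys_without_tracking(keys):
        textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            OutputConfig={"S3Bucket": bucket, "S3Prefix": "textract-output"},
        )
```

Filtering out keys that already have a sidecar file is what makes re-running the command safe, as described below.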

Changes made to your bucket

To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.

This file will be called:

path-to-file/name-of-file.pdf.s3-ocr.json

Each of these JSON files contains data that looks like this:

{
  "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
  "etag": "\"b0c77472e15500347ebf46032a454e8e\""
}

The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.

The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.

Because the .s3-ocr.json files track which jobs have already been submitted, it is safe to run s3-ocr start against the same bucket multiple times without starting duplicate OCR jobs.
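The etag comparison described above could be sketched like this. The function name and the job_id value are hypothetical, chosen for illustration; the JSON shape matches the tracking files shown earlier, including the quote characters S3 embeds in ETag values.

```python
import json

def needs_reprocessing(tracking_json, current_etag):
    """Compare the ETag recorded at submission time against the object's
    current ETag; a mismatch means the file changed since OCR last ran."""
    recorded = json.loads(tracking_json)["etag"]
    return recorded != current_etag

# Hypothetical sidecar content; note the escaped quotes around the ETag
tracking = '{"job_id": "abc123", "etag": "\\"b0c774\\""}'
needs_reprocessing(tracking, '"b0c774"')  # unchanged file -> False
needs_reprocessing(tracking, '"ffffff"')  # changed file -> True
```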

Checking status

The s3-ocr status <bucket-name> command shows a rough indication of progress through the tasks:

% s3-ocr status sfms-history
153 complete out of 532 jobs

It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have their results written to the textract-output/ folder.
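That comparison could be sketched as a pure function over S3 key listings. This is a sketch under two assumptions not stated in the original: the function name is my own, and Textract is assumed to write results under textract-output/<job_id>/.

```python
def status_counts(tracking_job_ids, output_keys):
    """Given the job_ids recorded in .s3-ocr.json files and the keys under
    textract-output/, return (completed, total) job counts."""
    completed_ids = {
        key.split("/")[1]
        for key in output_keys
        if key.startswith("textract-output/") and key.count("/") >= 2
    }
    done = sum(1 for job_id in tracking_job_ids if job_id in completed_ids)
    return done, len(tracking_job_ids)
```

A job counts as complete once at least one result object for its job_id appears under the output prefix, which is why the README describes this as a rough indication of progress.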

Not yet implemented

  • A command to retrieve the OCR results and load them into a searchable SQLite database table.

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd s3-ocr
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
