s3-ocr
Tools for running OCR against files stored in S3
Installation
Install this tool using pip:
pip install s3-ocr
Usage
The start command loops through every PDF file in a bucket (every file ending in .pdf) and submits it to Textract for OCR processing.
You need to have AWS configured using environment variables or a credentials file in your home directory.
You can start the process running like this:
s3-ocr start name-of-your-bucket
OCR can take some time. The results of the OCR will be stored in textract-output in your bucket.
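The loop described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the tool's actual source: the function names pdf_keys and submit_pdfs are hypothetical, and the Textract client is passed in as an argument so that any object with a start_document_text_detection method will work. The call shape follows the boto3 Textract client, and the textract-output prefix is taken from the description above.

```python
# Hypothetical sketch; function names are illustrative, not part of s3-ocr.

def pdf_keys(keys):
    """Return only the keys that end in .pdf, preserving their order."""
    return [key for key in keys if key.endswith(".pdf")]


def submit_pdfs(textract, bucket, keys, output_prefix="textract-output"):
    """Submit each PDF key to Textract and return a {key: job_id} mapping.

    `textract` is assumed to behave like boto3.client("textract").
    """
    job_ids = {}
    for key in pdf_keys(keys):
        response = textract.start_document_text_detection(
            # The document to OCR, read directly from S3.
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            # Ask Textract to write its results back into the same bucket.
            OutputConfig={"S3Bucket": bucket, "S3Prefix": output_prefix},
        )
        job_ids[key] = response["JobId"]
    return job_ids
```

With boto3 installed, textract would typically be boto3.client("textract"), and the list of keys would come from listing the bucket, for example with the S3 list_objects_v2 paginator.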
Changes made to your bucket
To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.
This file will be called:
path-to-file/name-of-file.pdf.s3-ocr.json
Each of these JSON files contains data that looks like this:
{
"job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
"etag": "\"b0c77472e15500347ebf46032a454e8e\""
}
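As a minimal sketch of the naming scheme and file contents shown above, the following Python builds the tracking key and the JSON body. The helper names tracking_key and tracking_body are hypothetical and are not taken from the tool's source. S3 reports ETags wrapped in double quotes, which is why the stored value contains escaped quotes.

```python
import json

# Suffix appended to the PDF's key, as described above.
TRACKING_SUFFIX = ".s3-ocr.json"


def tracking_key(pdf_key):
    """Return the key of the tracking file for a given PDF key (hypothetical helper)."""
    return pdf_key + TRACKING_SUFFIX


def tracking_body(job_id, etag):
    """Serialize the job ID and ETag in the shape shown above (hypothetical helper)."""
    return json.dumps({"job_id": job_id, "etag": etag}, indent=2)
```

For example, tracking_key("path-to-file/name-of-file.pdf") gives "path-to-file/name-of-file.pdf.s3-ocr.json".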
The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.
The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine whether a file has changed since it last had OCR run against it.
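One way the recorded ETag could be used for change detection is sketched below. This is an assumption about how such a check might look, not a feature the tool is documented to provide; has_changed is a hypothetical name, and the current ETag is assumed to come from something like an S3 head_object call.

```python
import json


def has_changed(tracking_json, current_etag):
    """Return True if the object's current ETag differs from the recorded one.

    `tracking_json` is the text of a .s3-ocr.json file; `current_etag` is the
    ETag string S3 reports for the object now (including its double quotes).
    """
    recorded_etag = json.loads(tracking_json)["etag"]
    return recorded_etag != current_etag
```

Comparing the strings exactly, quotes included, avoids any need to normalize the ETag before storing or checking it.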
Because the .s3-ocr.json files track which jobs have already been submitted, it is safe to run s3-ocr start against the same bucket multiple times without the risk of starting duplicate OCR jobs.
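The duplicate-avoidance logic can be illustrated with a short Python sketch. It assumes the full list of keys in the bucket is already available, and keys_needing_ocr is a hypothetical name rather than a function from the tool.

```python
def keys_needing_ocr(all_keys):
    """Return PDF keys that do not yet have a matching .s3-ocr.json file.

    Hypothetical helper: a PDF is skipped when its tracking file is present
    in the same listing, so repeated runs submit each file only once.
    """
    suffix = ".s3-ocr.json"
    existing = set(all_keys)
    return [
        key
        for key in all_keys
        if key.endswith(".pdf") and key + suffix not in existing
    ]
```

Given a listing that contains a.pdf, a.pdf.s3-ocr.json and b.pdf, only b.pdf would be selected for submission.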
Checking status
The s3-ocr status <bucket-name> command shows a rough indication of progress through the tasks:
% s3-ocr status sfms-history
153 complete out of 532 jobs
It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have their results written to the textract-output/ folder.
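The comparison described above might be sketched like this in Python. The layout of the output folder is an assumption here: the sketch supposes that results for each job are written under textract-output/<job_id>/, which this document does not spell out. The name count_progress is hypothetical.

```python
def count_progress(submitted_job_ids, output_keys, output_prefix="textract-output/"):
    """Return (complete, total) for the submitted jobs.

    Assumes each output key looks like textract-output/<job_id>/..., so the
    first path segment after the prefix is treated as the job ID.
    """
    completed = set()
    for key in output_keys:
        if key.startswith(output_prefix):
            job_id = key[len(output_prefix):].split("/", 1)[0]
            if job_id:
                completed.add(job_id)
    submitted = set(submitted_job_ids)
    # Only count output folders that belong to jobs we actually submitted.
    return len(submitted & completed), len(submitted)
```

The two numbers can then be formatted in the style shown above, for example with "%d complete out of %d jobs" % count_progress(job_ids, keys).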
Not yet implemented
- A command to retrieve the OCR results and load them into a searchable SQLite database table.
Development
To contribute to this tool, first check out the code. Then create a new virtual environment:
cd s3-ocr
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest