Tools for running OCR against files stored in S3
s3-ocr
Installation
Install this tool using pip:
pip install s3-ocr
Starting OCR against PDFs in a bucket
The start command takes a list of keys and submits them to Textract for OCR processing.
You need to have AWS configured using environment variables or a credentials file in your home directory.
You can start the process running like this:
s3-ocr start name-of-your-bucket my-pdf-file.pdf
The paths you specify should be paths within the bucket. If you stored your PDF files in folders inside the bucket it should look like this:
s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf
OCR can take some time. The results of the OCR will be stored in textract-output in your bucket.
To process every file in the bucket with a .pdf extension use --all:
s3-ocr start name-of-bucket --all
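As a rough illustration of what --all selects, the sketch below filters a list of bucket keys down to those with a .pdf extension. The function name select_pdf_keys is hypothetical, and the case-sensitive match is an assumption; the tool's own matching may differ.

```python
def select_pdf_keys(keys):
    """Return the keys that end in .pdf, preserving their order.

    Assumption: the match is case-sensitive. s3-ocr itself may
    treat extensions differently.
    """
    return [key for key in keys if key.endswith(".pdf")]


if __name__ == "__main__":
    example_keys = ["path/to/one.pdf", "notes.txt", "path/to/two.pdf"]
    for key in select_pdf_keys(example_keys):
        print(key)
```

Note that a tracking file such as path/to/one.pdf.s3-ocr.json does not end in .pdf, so a filter like this would skip it.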
s3-ocr start --help
Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]...
Start OCR tasks for PDF files in an S3 bucket
s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf
To process every file with a .pdf extension:
s3-ocr start name-of-bucket --all
Options:
--all Process all PDF files in the bucket
--access-key TEXT AWS access key ID
--secret-key TEXT AWS secret access key
--session-token TEXT AWS session token
--endpoint-url TEXT Custom endpoint URL
-a, --auth FILENAME Path to JSON/INI file containing credentials
--help Show this message and exit.
Checking status
The s3-ocr status <bucket-name> command shows a rough indication of progress through the tasks:
% s3-ocr status sfms-history
153 complete out of 532 jobs
It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have their results written to the textract-output/ folder.
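The comparison described above can be sketched as a small pure function. This is not the tool's implementation: count_progress, the submitted mapping, and the completed_job_ids set are hypothetical names, and the sketch assumes you have already listed the bucket and worked out which job IDs have output.

```python
def count_progress(submitted, completed_job_ids):
    """Return (complete, total) for the submitted jobs.

    submitted: dict mapping an S3 key to the job_id recorded in its
        .s3-ocr.json file (hypothetical input shape).
    completed_job_ids: set of job IDs that already have results
        under textract-output/ (hypothetical input shape).
    """
    total = len(submitted)
    complete = sum(
        1 for job_id in submitted.values() if job_id in completed_job_ids
    )
    return complete, total


if __name__ == "__main__":
    submitted = {"one.pdf": "job-a", "two.pdf": "job-b", "three.pdf": "job-c"}
    complete, total = count_progress(submitted, {"job-a", "job-c"})
    print(f"{complete} complete out of {total} jobs")
```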
s3-ocr status --help
Usage: s3-ocr status [OPTIONS] BUCKET
Show status of OCR jobs for a bucket
Options:
--access-key ...
Fetching the results
Once an OCR job has completed you can download the resulting JSON using the fetch command:
s3-ocr fetch name-of-bucket path/to/file.pdf
This will save files in the current directory with names like this:
4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json
4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json
The number of files will vary depending on the length of the document.
If you don't want separate files you can combine them together using the -c/--combine option:
s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json
The output.json file will then contain data that looks something like this:
{
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {...},
      "Page": 1,
      ...
    },
    {
      "BlockType": "LINE",
      "Page": 1,
      ...
      "Text": "Barry",
    },
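If you want to work with the combined JSON yourself, a minimal sketch along these lines collects the text of the LINE blocks for each page. It relies only on the Blocks, BlockType, Page and Text keys shown above; the function name lines_by_page is hypothetical, and defaulting a missing Page to 1 is an assumption.

```python
from collections import defaultdict


def lines_by_page(textract_json):
    """Map each page number to the text of its LINE blocks, one per line."""
    pages = defaultdict(list)
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Assumption: treat a block without a Page key as page 1.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


if __name__ == "__main__":
    sample = {
        "Blocks": [
            {"BlockType": "PAGE", "Page": 1},
            {"BlockType": "LINE", "Page": 1, "Text": "Barry"},
        ]
    }
    print(lines_by_page(sample))
```

To run it against real output, load the file first with json.load(open("output.json")) and pass the result in.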
s3-ocr fetch --help
Usage: s3-ocr fetch [OPTIONS] BUCKET KEY
Fetch the OCR results for a specified file
s3-ocr fetch name-of-bucket path/to/key.pdf
This will save files in the current directory called things like
a806e67e504fc15f...48314e-1.json a806e67e504fc15f...48314e-2.json
To combine these together into a single JSON file with a specified name, use:
s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json
Use "--combine -" to print the combined JSON to standard output instead.
Options:
-c, --combine FILENAME Write combined JSON to file
--access-key ...
Fetching just the text of a page
If you don't want to deal with the JSON directly, you can use the text command to retrieve just the text extracted from a PDF:
s3-ocr text name-of-bucket path/to/file.pdf
This will output plain text to standard output.
To save that to a file, use this:
s3-ocr text name-of-bucket path/to/file.pdf > text.txt
Pages will be separated by three newlines. To separate them using a ---- horizontal divider instead, add --divider:
s3-ocr text name-of-bucket path/to/file.pdf --divider
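The two separator styles can be pictured with a short sketch. The helper join_pages is hypothetical, and the exact whitespace placed around the ---- divider is an assumption rather than something the documentation specifies.

```python
def join_pages(pages, divider=False):
    """Join per-page text with three newlines, or with a ---- divider line.

    Assumption: the divider sits on its own line with a single newline
    on each side. The real tool's spacing may differ.
    """
    separator = "\n----\n" if divider else "\n\n\n"
    return separator.join(pages)


if __name__ == "__main__":
    pages = ["Text of page one", "Text of page two"]
    print(join_pages(pages))
    print(join_pages(pages, divider=True))
```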
s3-ocr text --help
Usage: s3-ocr text [OPTIONS] BUCKET KEY
Retrieve the text from an OCRd PDF file
s3-ocr text name-of-bucket path/to/key.pdf
Options:
--divider Add ---- between pages
--access-key ...
Changes made to your bucket
To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.
This file will be called:
path-to-file/name-of-file.pdf.s3-ocr.json
Each of these JSON files contains data that looks like this:
{
  "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
  "etag": "\"b0c77472e15500347ebf46032a454e8e\""
}
The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.
The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.
This design for the tool, with the .s3-ocr.json files tracking jobs that have been submitted, means that it is safe to run s3-ocr start against the same bucket multiple times without the risk of starting duplicate OCR jobs.
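One way the tracking file could be used to decide whether a key needs submitting is sketched below. This is an illustration, not the tool's own logic: needs_ocr is a hypothetical helper, and resubmitting a file whose ETag has changed is an assumption about how you might choose to use the recorded etag.

```python
import json


def needs_ocr(tracking_json, current_etag):
    """Decide whether a PDF should be submitted for OCR.

    tracking_json: contents of the key's .s3-ocr.json file as a string,
        or None if no tracking file exists yet.
    current_etag: the ETag S3 currently reports for the PDF.
    """
    if tracking_json is None:
        # Never submitted before.
        return True
    record = json.loads(tracking_json)
    # Assumption: a different ETag means the file changed and is worth redoing.
    return record.get("etag") != current_etag


if __name__ == "__main__":
    tracking = json.dumps(
        {"job_id": "example-job-id", "etag": '"b0c77472e15500347ebf46032a454e8e"'}
    )
    print(needs_ocr(tracking, '"b0c77472e15500347ebf46032a454e8e"'))
    print(needs_ocr(None, '"b0c77472e15500347ebf46032a454e8e"'))
```

The ETag value stored in the tracking file includes its surrounding double quotes, so the comparison has to use the quoted form as well.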
Creating a SQLite index of your OCR results
The s3-ocr index <bucket> <database_file> command creates a SQLite database containing the results of the OCR and configures SQLite full-text search for the text:
% s3-ocr index sfms-history index.db
Fetching job details [####################################] 100%
Populating pages table [####################----------------] 55% 00:03:18
The schema of the resulting database looks like this (excluding the FTS tables):
CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
CREATE TABLE [ocr_jobs] (
   [key] TEXT PRIMARY KEY,
   [job_id] TEXT,
   [etag] TEXT,
   [s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
   [job_id] TEXT PRIMARY KEY
);
The database is designed to be used with Datasette.
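You can also query the pages table directly from Python's sqlite3 module. The sketch below builds an in-memory database from the pages schema shown above, fills it with made-up rows, and searches it with a plain LIKE match. The helpers demo_connection and search_pages and the sample data are hypothetical. LIKE is used because the names of the full-text search tables are not listed here; against a real index.db you would likely prefer the FTS tables.

```python
import sqlite3

PAGES_SCHEMA = """
CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
"""


def demo_connection():
    """Create an in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PAGES_SCHEMA)
    conn.executemany(
        "INSERT INTO pages (path, page, folder, text) VALUES (?, ?, ?, ?)",
        [
            # Sample values are made up for illustration only.
            ("path/to/one.pdf", 1, "path/to", "Minutes recorded by Barry"),
            ("path/to/one.pdf", 2, "path/to", "Treasurer's report"),
        ],
    )
    return conn


def search_pages(conn, term):
    """Return (path, page) pairs whose text contains term (substring match)."""
    cursor = conn.execute(
        "SELECT path, page FROM pages WHERE text LIKE ? ORDER BY path, page",
        (f"%{term}%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for path, page in search_pages(demo_connection(), "Barry"):
        print(path, page)
```

To try this against a real index, replace demo_connection() with sqlite3.connect("index.db").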
s3-ocr index --help
Usage: s3-ocr index [OPTIONS] BUCKET DATABASE
Create a SQLite database with OCR results for files in a bucket
Options:
--access-key ...
Development
To contribute to this tool, first check out the code. Then create a new virtual environment:
cd s3-ocr
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest
To regenerate the README file with the latest --help:
cog -r README.md