
s3-ocr


Tools for running OCR against files stored in S3

Project status

This is an alpha tool: it has only been used for a single project, and does not yet have automated tests.

Installation

Install this tool using pip:

pip install s3-ocr

Starting OCR against every PDF in a bucket

The start command loops through every PDF file in a bucket (every file ending in .pdf) and submits it to Textract for OCR processing.

You need to have AWS credentials configured, either as environment variables (such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) or in a credentials file in your home directory.

You can start the process running like this:

s3-ocr start name-of-your-bucket

OCR can take some time. The results will be stored in the textract-output/ folder in your bucket.
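
Each PDF is submitted to Amazon Textract as an asynchronous text detection job. The sketch below shows roughly what that kind of call looks like using boto3; it is not the tool's exact implementation, and the bucket and key names are placeholders:

import boto3

# Rough sketch of the kind of Textract call made for each PDF.
# Placeholder bucket and key names, not the exact code used by s3-ocr.
textract = boto3.client("textract")
response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {
            "Bucket": "name-of-your-bucket",
            "Name": "path-to-file/name-of-file.pdf",
        }
    },
    OutputConfig={
        "S3Bucket": "name-of-your-bucket",
        "S3Prefix": "textract-output",
    },
)
print(response["JobId"])  # the job ID recorded in the .s3-ocr.json file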

Usage: s3-ocr start [OPTIONS] BUCKET

  Start OCR tasks for all files in this bucket

Options:
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

Changes made to your bucket

To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.

This file will be called:

path-to-file/name-of-file.pdf.s3-ocr.json

Each of these JSON files contains data that looks like this:

{
  "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
  "etag": "\"b0c77472e15500347ebf46032a454e8e\""
}

The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.

The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.

Because the .s3-ocr.json files track which jobs have already been submitted, it is safe to run s3-ocr start against the same bucket multiple times without starting duplicate OCR jobs.
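
The recorded etag also makes it possible to spot PDFs that have changed since they were submitted. Here is a hedged boto3 sketch of that check; it is not part of s3-ocr itself, and the bucket and key names are placeholders:

import json
import boto3

# Sketch: compare the ETag recorded in a .s3-ocr.json file with the current
# ETag of the PDF, to see whether the file changed after OCR was started.
# Placeholder bucket and key names.
s3 = boto3.client("s3")
bucket = "name-of-your-bucket"
key = "path-to-file/name-of-file.pdf"

record = json.loads(
    s3.get_object(Bucket=bucket, Key=key + ".s3-ocr.json")["Body"].read()
)
current_etag = s3.head_object(Bucket=bucket, Key=key)["ETag"]

if current_etag != record["etag"]:
    print(f"{key} has changed since job {record['job_id']} was submitted")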

Checking status

The s3-ocr status <bucket-name> command shows a rough indication of progress through the tasks:

% s3-ocr status sfms-history
153 complete out of 532 jobs

It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have their results written to the textract-output/ folder.
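
If you want to reproduce that comparison yourself, the boto3 sketch below counts submitted jobs against completed ones. It assumes completed jobs show up as textract-output/<job-id>/ prefixes, which is an assumption rather than documented behaviour, and it is not how the command itself is implemented:

import boto3

# Sketch: count submitted jobs (.s3-ocr.json files) against completed jobs,
# assumed here to appear as textract-output/<job-id>/ prefixes.
# Placeholder bucket name.
s3 = boto3.client("s3")
bucket = "name-of-your-bucket"
paginator = s3.get_paginator("list_objects_v2")

submitted = 0
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".s3-ocr.json"):
            submitted += 1

completed = 0
for page in paginator.paginate(Bucket=bucket, Prefix="textract-output/", Delimiter="/"):
    completed += len(page.get("CommonPrefixes", []))

print(f"{completed} complete out of {submitted} jobs")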

Usage: s3-ocr status [OPTIONS] BUCKET

  Show status of OCR jobs for a bucket

Options:
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

Creating a SQLite index of your OCR results

The s3-ocr index <database_file> <bucket> command creates a SQLite database containing the results of the OCR, and configures SQLite full-text search for the extracted text:

% s3-ocr index index.db sfms-history
Fetching job details  [####################################]  100%
Populating pages table  [####################----------------]   55%  00:03:18

The schema of the resulting database looks like this (excluding the FTS tables):

CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
CREATE TABLE [ocr_jobs] (
   [key] TEXT PRIMARY KEY,
   [job_id] TEXT,
   [etag] TEXT,
   [s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
   [job_id] TEXT PRIMARY KEY
);

The database is designed to be used with Datasette.
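
You can explore the results by running datasette index.db against the file. Plain SQLite works too; the sketch below assumes the full-text search table is called pages_fts (the sqlite-utils naming convention), which is not shown in the schema above, so adjust that name if your database differs:

import sqlite3

# Sketch: full-text search against the database built by `s3-ocr index`.
# The pages_fts table name is an assumption, not part of the documented schema.
conn = sqlite3.connect("index.db")
rows = conn.execute(
    """
    select pages.path, pages.page, pages.text
    from pages
    join pages_fts on pages.rowid = pages_fts.rowid
    where pages_fts match ?
    limit 10
    """,
    ["your search terms"],
).fetchall()
for path, page, text in rows:
    print(path, page, text[:100])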

Usage: s3-ocr index [OPTIONS] DATABASE BUCKET

  Create a SQLite database with the OCR results for files in a bucket

Options:
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd s3-ocr
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

To regenerate the README file with the latest --help:

cog -r README.md

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide on installing packages.

Source Distribution

s3-ocr-0.2a0.tar.gz (10.5 kB)

Uploaded Source

Built Distribution

s3_ocr-0.2a0-py3-none-any.whl (11.0 kB)

Uploaded Python 3

File details

Details for the file s3-ocr-0.2a0.tar.gz.

File metadata

  • Download URL: s3-ocr-0.2a0.tar.gz
  • Upload date:
  • Size: 10.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.5

File hashes

Hashes for s3-ocr-0.2a0.tar.gz:

  • SHA256: ea77b425faee5f3ffb01389969e0dd26c073f8995dea7de613f50583cb76d2a0
  • MD5: fc3067c37ff5d77c32bbf2f437b46e4c
  • BLAKE2b-256: 7fdf03803749519f85203e410737b46ce9ac359ace4848776c9c7693b4c764ea


File details

Details for the file s3_ocr-0.2a0-py3-none-any.whl.

File metadata

  • Download URL: s3_ocr-0.2a0-py3-none-any.whl
  • Upload date:
  • Size: 11.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.5

File hashes

Hashes for s3_ocr-0.2a0-py3-none-any.whl:

  • SHA256: c6c5ba48a87068ac844ff8f5761be6269809a41cf3f65e7111effe313933e186
  • MD5: 22d901e00f64e4d16d5508acdbaff790
  • BLAKE2b-256: 27caad426d13b0063d348abdec5f3fd822e47063e530d235119ba7a287d98f3f

