Tools for running OCR against files stored in S3

s3-ocr

Installation

Install this tool using pip:

pip install s3-ocr

Starting OCR against PDFs in a bucket

The start command takes a list of keys and submits them to Textract for OCR processing.

You need to have AWS credentials configured, using either environment variables or a credentials file in your home directory.

You can start the process running like this:

s3-ocr start name-of-your-bucket my-pdf-file.pdf

The paths you specify should be paths within the bucket. If your PDF files are stored in folders inside the bucket, include the folder path in each key:

s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf

OCR can take some time. The results will be stored in the textract-output/ folder in your bucket.

To process every file in the bucket with a .pdf extension use --all:

s3-ocr start name-of-bucket --all
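
If you want to launch these commands from a Python script, a thin wrapper around subprocess is one option. The sketch below only assembles the arguments documented above (the bucket, a list of keys, and --all). The helper names build_start_command and start_ocr are hypothetical, and treating keys and --all as mutually exclusive is a choice made for this sketch, not something the CLI is known to enforce.

```python
import subprocess


def build_start_command(bucket, keys=(), process_all=False):
    """Build the argument list for `s3-ocr start` (hypothetical helper)."""
    # Assumption of this sketch: keys and --all are treated as alternatives.
    if process_all and keys:
        raise ValueError("pass either keys or process_all=True, not both")
    if not process_all and not keys:
        raise ValueError("nothing to submit: pass keys or process_all=True")
    command = ["s3-ocr", "start", bucket]
    if process_all:
        command.append("--all")
    else:
        command.extend(keys)
    return command


def start_ocr(bucket, keys=(), process_all=False):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_start_command(bucket, keys, process_all), check=True)
```

Passing the arguments as a list avoids shell quoting problems for keys that contain spaces.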

s3-ocr start --help

Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]...

  Start OCR tasks for PDF files in an S3 bucket

      s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf

  To process every file with a .pdf extension:

      s3-ocr start name-of-bucket --all

Options:
  --all                 Process all PDF files in the bucket
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

Checking status

The s3-ocr status <bucket-name> command shows a rough indication of progress through the tasks:

% s3-ocr status sfms-history
153 complete out of 532 jobs

It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have their results written to the textract-output/ folder.
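
The comparison described above can be illustrated with a small sketch. It is not the tool's actual implementation: count_progress is a hypothetical function, and it assumes that the results for a job are stored under textract-output/<job_id>/, which this README does not spell out.

```python
def count_progress(submitted_job_ids, output_keys, prefix="textract-output/"):
    """Return (complete, total) for the submitted jobs.

    Assumption: results for a job live under "<prefix><job_id>/...".
    """
    finished = set()
    for key in output_keys:
        if key.startswith(prefix):
            # The first path component after the prefix is taken as the job ID.
            job_id = key[len(prefix):].split("/", 1)[0]
            if job_id:
                finished.add(job_id)
    submitted = set(submitted_job_ids)
    return len(submitted & finished), len(submitted)


complete, total = count_progress(
    ["job-a", "job-b", "job-c"],
    [
        "textract-output/job-a/1",
        "textract-output/job-c/1",
        "textract-output/job-c/2",
    ],
)
print(f"{complete} complete out of {total} jobs")  # 2 complete out of 3 jobs
```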

s3-ocr status --help

Usage: s3-ocr status [OPTIONS] BUCKET

  Show status of OCR jobs for a bucket

Options:
  --access-key ...

Fetching the results

Once an OCR job has completed you can download the resulting JSON using the fetch command:

s3-ocr fetch name-of-bucket path/to/file.pdf

This will save files in the current directory with names like this:

  • 4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json
  • 4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json

The number of files will vary depending on the length of the document.

If you don't want separate files you can combine them together using the -c/--combine option:

s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json

The output.json file will then contain data that looks something like this:

{
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {...},
      "Page": 1,
      ...
    },
    {
      "BlockType": "LINE",
      "Page": 1,
      ...
      "Text": "Barry"
    },
    ...
  ]
}
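
As a rough illustration of working with the combined file, the sketch below groups the Text of each LINE block by its Page number, using only the fields shown in the sample above. The function names are hypothetical, and treating a block with no Page field as page 1 is an assumption.

```python
import json
from collections import defaultdict


def lines_by_page(combined):
    """Group the Text of LINE blocks by their Page number."""
    pages = defaultdict(list)
    for block in combined.get("Blocks", []):
        if block.get("BlockType") == "LINE" and "Text" in block:
            # Assumption: a block without a "Page" field belongs to page 1.
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


def load_lines(path="output.json"):
    """Read a combined JSON file and return {page_number: [line, ...]}."""
    with open(path, encoding="utf-8") as f:
        return lines_by_page(json.load(f))
```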

s3-ocr fetch --help

Usage: s3-ocr fetch [OPTIONS] BUCKET KEY

  Fetch the OCR results for a specified file

      s3-ocr fetch name-of-bucket path/to/key.pdf

  This will save files in the current directory called things like

      a806e67e504fc15f...48314e-1.json     a806e67e504fc15f...48314e-2.json

  To combine these together into a single JSON file with a specified name, use:

      s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json

  Use "--output -" to print the combined JSON to standard output instead.

Options:
  -c, --combine FILENAME  Write combined JSON to file
  --access-key ...

Fetching just the text of a page

If you don't want to deal with the JSON directly, you can use the text command to retrieve just the text extracted from a PDF:

s3-ocr text name-of-bucket path/to/file.pdf

This will output plain text to standard output.

To save that to a file, use this:

s3-ocr text name-of-bucket path/to/file.pdf > text.txt

Pages are separated by three newlines. To separate them with a ---- horizontal divider instead, add --divider:

s3-ocr text name-of-bucket path/to/file.pdf --divider
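
If you have saved the default output to a file and want it back as a list of pages, splitting on the three-newline separator is one approach. This is a minimal sketch: split_pages is a hypothetical helper, and it assumes that no page contains a run of three newlines of its own.

```python
def split_pages(text):
    """Split default `s3-ocr text` output on the three-newline page separator."""
    # Strip leftover newlines so each page starts and ends with real content.
    return [page.strip("\n") for page in text.split("\n\n\n")]
```

For example, split_pages(open("text.txt", encoding="utf-8").read()) would return one string per page for the text.txt file created earlier.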

s3-ocr text --help

Usage: s3-ocr text [OPTIONS] BUCKET KEY

  Retrieve the text from an OCRd PDF file

      s3-ocr text name-of-bucket path/to/key.pdf

Options:
  --divider             Add ---- between pages
  --access-key ...

Changes made to your bucket

To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.

This file will be called:

path-to-file/name-of-file.pdf.s3-ocr.json

Each of these JSON files contains data that looks like this:

{
  "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
  "etag": "\"b0c77472e15500347ebf46032a454e8e\""
}

The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.

The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.

Because the .s3-ocr.json files track which jobs have been submitted, it is safe to run s3-ocr start against the same bucket multiple times without the risk of starting duplicate OCR jobs.
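
To make the tracking scheme concrete, here is a sketch of how a script might use these files to decide whether a PDF still needs OCR. It does not describe what s3-ocr start does internally. The names tracking_key and needs_ocr are hypothetical, and bucket_objects is a stand-in dictionary mapping object keys to their contents, used here in place of real S3 calls.

```python
import json

TRACKING_SUFFIX = ".s3-ocr.json"


def tracking_key(pdf_key):
    """Key of the tracking file that sits next to a PDF."""
    return pdf_key + TRACKING_SUFFIX


def needs_ocr(pdf_key, current_etag, bucket_objects):
    """Return True if the PDF was never submitted or has changed since.

    bucket_objects is a hypothetical dict of {key: file contents as str}.
    """
    record = bucket_objects.get(tracking_key(pdf_key))
    if record is None:
        return True  # no tracking file, so the PDF was never submitted
    # The stored ETag includes its surrounding double quotes, as in the
    # example above, so current_etag should include them too.
    return json.loads(record)["etag"] != current_etag
```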

Creating a SQLite index of your OCR results

The s3-ocr index <bucket> <database_file> command creates a SQLite database containing the results of the OCR, and configures SQLite full-text search for the text:

% s3-ocr index sfms-history index.db
Fetching job details  [####################################]  100%
Populating pages table  [####################----------------]   55%  00:03:18

The schema of the resulting database looks like this (excluding the FTS tables):

CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
CREATE TABLE [ocr_jobs] (
   [key] TEXT PRIMARY KEY,
   [job_id] TEXT,
   [etag] TEXT,
   [s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
   [job_id] TEXT PRIMARY KEY
);

The database is designed to be used with Datasette.
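
The database can also be queried directly from Python using the standard sqlite3 module. The sketch below relies only on the pages table shown above. It uses a plain LIKE match rather than the full-text search tables, because their names are not listed here. search_pages is a hypothetical helper.

```python
import sqlite3


def search_pages(conn, term, limit=10):
    """Return (path, page) pairs whose text contains `term`."""
    return conn.execute(
        "SELECT path, page FROM pages "
        "WHERE text LIKE ? ORDER BY path, page LIMIT ?",
        (f"%{term}%", limit),
    ).fetchall()


if __name__ == "__main__":
    # Assumes index.db was created by `s3-ocr index <bucket> index.db`.
    conn = sqlite3.connect("index.db")
    try:
        for path, page in search_pages(conn, "Barry"):
            print(path, page)
    finally:
        conn.close()
```

A LIKE query scans every row, so for large collections the full-text search tables are likely to be the better choice.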

s3-ocr index --help

Usage: s3-ocr index [OPTIONS] BUCKET DATABASE

  Create a SQLite database with OCR results for files in a bucket

Options:
  --access-key ...

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd s3-ocr
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

To regenerate the README file with the latest --help:

cog -r README.md
