
Docket Analyzer OCR

Installation

pip install 'docketanalyzer[ocr]'

Local Usage

Process a document:

from docketanalyzer.ocr import pdf_document

path = 'path/to/doc.pdf'
doc = pdf_document(path) # the input can also be raw bytes
doc.process()

for page in doc:
    for block in page:
        for line in block:
            pass
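The nested loop above can be used, for example, to pull out a document's full text. A minimal sketch (`document_text` is our own helper name, not part of the library; it relies only on the iteration order and `.text` attribute described in this README):

```python
def document_text(doc):
    """Join the text of every line in reading order: pages -> blocks -> lines."""
    return "\n".join(
        line.text
        for page in doc
        for block in page
        for line in block
    )
```

Because it only iterates and reads `.text`, it works on anything shaped like a processed document.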

You can also stream pages as they are processed:

doc = pdf_document(path)

for page in doc.stream():
    print(page.text)

Pages, blocks, and lines have common attributes:

# where item is a page, block, or line

item.data # A dictionary representation of the item and its children
item.text # The item's text content
item.page_num # The page the item appears on
item.i # The item-level index
item.id # A unique id built from the item's index and its parents' indices (e.g. 3-2-1 for the first line in the second block on the third page)
item.bbox # Bounding box (blocks and lines only)
item.clip() # Extract the item as an image from the original PDF
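The id format can be illustrated in plain Python. This is a sketch, not library code: `item_id` is a hypothetical helper, assuming ids are dash-joined indices with the outermost parent first, as in the `3-2-1` example above:

```python
def item_id(*indices):
    # Dash-join an item's index chain, outermost parent first,
    # e.g. page 3, block 2, line 1 -> "3-2-1".
    return "-".join(str(i) for i in indices)
```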

Blocks also have a block type attribute:

print(block.block_type) # 'title', 'text', 'figure', etc.
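A common use of `block_type` is filtering. A hedged sketch (`blocks_of_type` is our own helper, built only from the iteration pattern and attributes documented above):

```python
def blocks_of_type(doc, block_type):
    """Collect the text of every block whose block_type matches, e.g. 'title'."""
    return [
        block.text
        for page in doc
        for block in page
        if block.block_type == block_type
    ]
```

For example, `blocks_of_type(doc, 'title')` would gather all title text across pages.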

Save and load data:

# Saving a document
doc.save('doc.json')

# Loading a document
doc = pdf_document(path, load='doc.json')

Remote Usage

You can also serve this tool with Docker.

docker pull nadahlberg/docketanalyzer-ocr:latest
docker run --gpus all -p 8000:8000 nadahlberg/docketanalyzer-ocr:latest

Then process the document in remote mode:

doc = pdf_document(path, remote=True) # pass endpoint_url if not using localhost

for page in doc.stream():
    print(page.text)

S3 Support

When using the remote service, you can avoid sending the file in a POST request by configuring your S3 credentials. Your document will be temporarily pushed to your bucket and retrieved by the service.

To configure your S3 credentials run:

da configure s3

Or set the following in your env:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_S3_BUCKET_NAME
AWS_S3_ENDPOINT_URL

Usage is identical. We default to using S3 if credentials are available. You can control this explicitly by passing use_s3=False to pdf_document.

Serverless Support

For serverless usage you can deploy this to RunPod. To get set up:

  1. Create a serverless worker on RunPod using the Docker container:
     nadahlberg/docketanalyzer-ocr:latest
  2. Add the following custom run command:
     python -u handler.py
  3. Add your S3 credentials to the RunPod worker:
     AWS_ACCESS_KEY_ID
     AWS_SECRET_ACCESS_KEY
     AWS_S3_BUCKET_NAME
     AWS_S3_ENDPOINT_URL
  4. On your local machine, configure your RunPod key and the worker id.

You can run:

da configure runpod

Or set the following in your env:

RUNPOD_API_KEY
RUNPOD_OCR_ENDPOINT_ID

Usage is otherwise identical; just pass remote=True to pdf_document.
