Docket Analyzer OCR Utility
Docket Analyzer OCR
Installation
pip install 'docketanalyzer[ocr]'
Local Usage
Process a document:
from docketanalyzer.ocr import pdf_document
path = 'path/to/doc.pdf'
doc = pdf_document(path)  # the input can also be raw bytes
doc.process()
for page in doc:
    for block in page:
        for line in block:
            pass  # work with each line here
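The nested loop can be wrapped in a small helper that flattens a document into one record per line. This is a minimal sketch: `iter_line_records` is a hypothetical helper, not part of docketanalyzer, and it relies only on the iteration order and the `id`, `page_num`, and `text` attributes described in this README. `FakeItem` and its sample ids are stand-ins so the snippet runs without the library installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class FakeItem:
    """Stand-in for a page, block, or line (NOT the library's class)."""

    id: str
    page_num: int
    text: str
    children: List["FakeItem"] = field(default_factory=list)

    def __iter__(self) -> Iterator["FakeItem"]:
        # Iterating a page yields blocks; iterating a block yields lines.
        return iter(self.children)


def iter_line_records(doc: Iterable[Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per line, in reading order."""
    for page in doc:
        for block in page:
            for line in block:
                yield {"id": line.id, "page_num": line.page_num, "text": line.text}


# Illustrative data only; real ids and text come from a processed document.
line_a = FakeItem("1-1-1", 1, "UNITED STATES DISTRICT COURT")
line_b = FakeItem("1-1-2", 1, "NORTHERN DISTRICT OF ILLINOIS")
block = FakeItem("1-1", 1, "UNITED STATES DISTRICT COURT NORTHERN DISTRICT OF ILLINOIS", [line_a, line_b])
page = FakeItem("1", 1, block.text, [block])

records = list(iter_line_records([page]))
```

With the library installed, the same helper should accept a processed document directly, for example `list(iter_line_records(doc))` after `doc.process()`.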
You can also stream pages as they are processed:
doc = pdf_document(path)
for page in doc.stream():
    print(page.text)
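Streaming is useful when you want to persist results before the whole document finishes. The sketch below writes each page's text to a file as it arrives. `write_pages` is a hypothetical helper, and `fake_stream` stands in for `doc.stream()` so the example runs without the library; the only assumption about pages is the `text` attribute described in this README.

```python
import os
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def write_pages(pages: Iterable[Any], out_path: str) -> int:
    """Write each page's text to out_path as it arrives; return the page count."""
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for page in pages:
            fh.write(page.text)
            fh.write("\n\n")  # blank line between pages
            fh.flush()  # make partial results visible while later pages are pending
            count += 1
    return count


def fake_stream() -> Iterator[SimpleNamespace]:
    """Stand-in for doc.stream(): yields objects with a .text attribute."""
    for text in ("first page", "second page"):
        yield SimpleNamespace(text=text)


out_file = os.path.join(tempfile.mkdtemp(), "doc.txt")
n_pages = write_pages(fake_stream(), out_file)
```

With a real document the call would presumably be `write_pages(doc.stream(), 'doc.txt')`.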
Pages, blocks, and lines have common attributes:
# where item is a page, block, or line
item.data # A dictionary representation of the item and its children
item.text # The item's text content
item.page_num # The page the item appears on
item.i # The item-level index
item.id # A unique id built from the item's index and its parents' indices (e.g. 3-2-1 for the first line in the second block on the third page)
item.bbox # Bounding box (blocks and lines only)
item.clip() # Extract the item as an image from the original PDF
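Because ids encode position, they can be split back into their components. The following is a sketch under stated assumptions: `parse_item_id` and `ItemRef` are hypothetical names, and the format is assumed to be hyphen-separated integers with one component for a page, two for a block, and three for a line, as the `3-2-1` example suggests.

```python
from typing import NamedTuple, Optional


class ItemRef(NamedTuple):
    """Position of an item; block and line are None for coarser items."""

    page: int
    block: Optional[int] = None
    line: Optional[int] = None


def parse_item_id(item_id: str) -> ItemRef:
    """Split an id such as '3-2-1' into (page, block, line) indices."""
    parts = [int(part) for part in item_id.split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 components, got {len(parts)}: {item_id!r}")
    return ItemRef(*parts)


line_ref = parse_item_id("3-2-1")
page_ref = parse_item_id("3")
```

Here `line_ref` is `ItemRef(page=3, block=2, line=1)`, and `page_ref` has `block` and `line` set to `None`.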
Blocks also have a block type attribute:
print(block.block_type) # 'title', 'text', 'figure', etc.
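The block type makes it easy to separate body text from titles or figures. As a rough sketch, the hypothetical helper `text_by_block_type` below groups block text by type. It assumes only the `block_type` and `text` attributes described here, and the `SimpleNamespace` objects are stand-ins for real blocks.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def text_by_block_type(blocks: Iterable[Any]) -> Dict[str, List[str]]:
    """Group block text by block_type, preserving document order within each type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for block in blocks:
        grouped[block.block_type].append(block.text)
    return dict(grouped)


# Stand-in blocks; with the library these would come from iterating a page.
sample_blocks = [
    SimpleNamespace(block_type="title", text="ORDER"),
    SimpleNamespace(block_type="text", text="The motion is granted."),
    SimpleNamespace(block_type="text", text="It is so ordered."),
]

grouped = text_by_block_type(sample_blocks)
```

For a whole document, one option is to pass a generator such as `(block for page in doc for block in page)`.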
Save and load data:
# Saving a document
doc.save('doc.json')
# Loading a document
doc = pdf_document(path, load='doc.json')
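Saving and loading combine naturally into a cache: process a PDF once, then reload the JSON on later runs. The sketch below assumes only the `process()`, `save()`, and `load=` behavior shown in this README. `load_or_process` is a hypothetical helper that takes the document factory as an argument, and `FakeDoc` is a stand-in so the example runs without the library.

```python
import os
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_process(pdf_path: str, cache_path: str, factory: Callable[..., Any]) -> Any:
    """Reload a saved document if cache_path exists, else process and save it."""
    if Path(cache_path).exists():
        return factory(pdf_path, load=cache_path)
    doc = factory(pdf_path)
    doc.process()
    doc.save(cache_path)
    return doc


class FakeDoc:
    """Stand-in exposing the same process/save surface (NOT the library's class)."""

    def __init__(self, path: str, load: Any = None) -> None:
        self.path, self.loaded_from, self.processed = path, load, False

    def process(self) -> None:
        self.processed = True

    def save(self, out_path: str) -> None:
        Path(out_path).write_text("{}", encoding="utf-8")


cache_file = os.path.join(tempfile.mkdtemp(), "doc.json")
first = load_or_process("doc.pdf", cache_file, FakeDoc)   # processes and saves
second = load_or_process("doc.pdf", cache_file, FakeDoc)  # reloads from the cache
```

With the library installed, the call would presumably be `load_or_process(path, 'doc.json', pdf_document)`.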
Remote Usage
You can also serve this tool with Docker.
docker pull nadahlberg/docketanalyzer-ocr:latest
docker run --gpus all -p 8000:8000 nadahlberg/docketanalyzer-ocr:latest
Then process the document in remote mode:
doc = pdf_document(path, remote=True) # pass endpoint_url if not using localhost
for page in doc.stream():
    print(page.text)
S3 Support
When using the remote service, if you want to avoid sending the file in a POST request, configure your S3 credentials. Your document will be temporarily pushed to your bucket to be retrieved by the service.
To configure your S3 credentials run:
da configure s3
Or set the following in your env:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_S3_BUCKET_NAME
AWS_S3_ENDPOINT_URL
Usage is identical. We default to using S3 if credentials are available. You can control this explicitly by passing use_s3=False to pdf_document.
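Since S3 is used whenever credentials are available, it can help to check which variables are actually set before a remote run. This is a small sketch, not part of docketanalyzer: `missing_s3_vars` is a hypothetical helper, and it assumes all four variables listed above are required, which may not hold for every setup.

```python
import os
from typing import List, Mapping, Optional

# Variable names as listed in this README.
S3_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET_NAME",
    "AWS_S3_ENDPOINT_URL",
)


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the S3 variables that are unset or empty in env (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in S3_ENV_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"AWS_ACCESS_KEY_ID": "example-key", "AWS_S3_BUCKET_NAME": "example-bucket"}
missing = missing_s3_vars(example_env)
```

If the list is non-empty, one option is to pass `use_s3=False` to `pdf_document` so the file is sent in the request instead.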
Serverless Support
For serverless usage you can deploy this to RunPod. To get set up:
- Create a serverless worker on RunPod using the Docker image:
  nadahlberg/docketanalyzer-ocr:latest
- Add the following custom run command:
  python -u handler.py
- Add your S3 credentials to the RunPod worker:
  AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY
  AWS_S3_BUCKET_NAME
  AWS_S3_ENDPOINT_URL
- On your local machine, configure your RunPod key and the worker id. You can run:
  da configure runpod
  Or set the following in your env:
  RUNPOD_API_KEY
  RUNPOD_OCR_ENDPOINT_ID
Usage is otherwise identical: pass remote=True to pdf_document.
File details
Details for the file docketanalyzer_ocr-0.1.1.tar.gz.
File metadata
- Download URL: docketanalyzer_ocr-0.1.1.tar.gz
- Upload date:
- Size: 36.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e3e6c55a47023b0cd93eb46dc162be736953a1e57c87feec0e20db9c17c5e3f |
| MD5 | ef4c0a1c3cb505cb5f3e9b081c52e891 |
| BLAKE2b-256 | 24e7f28f0a9031a4abb72c146f8aa7ce65c454ab62954b6e9d56ceb58b3a5964 |
File details
Details for the file docketanalyzer_ocr-0.1.1-py3-none-any.whl.
File metadata
- Download URL: docketanalyzer_ocr-0.1.1-py3-none-any.whl
- Upload date:
- Size: 36.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d3e3180c53d12d2d6594cd4ca859b0afbf6301fc125417e236a649bb03617538 |
| MD5 | 49d95756874fa7099a8277329d7b3136 |
| BLAKE2b-256 | a2b7dcf32078232156464218dbb587227a50b788c0ee043be93a26094b10e0e2 |