Documents and large language models.

Docprompt

Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy by building on the zero-shot capabilities of large language models, and it also provides a toolset for working with various document formats.

Supercharged Document Analysis

  • Common utilities for interacting with PDFs
    • PDF loading and serialization
    • PDF byte compression using Ghostscript 👻
    • Fast rasterization 🔥 🚀
    • Page splitting, re-export with PikePDF
  • Support for most OCR providers with batched inference
    • Google ✅
    • Azure Document Intelligence 🔴
    • Amazon Textract 🔴
    • Tesseract 🔴
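Batched inference generally means grouping pages into fixed-size chunks before they are sent to a provider. The following is a minimal sketch of that idea only; it makes no claim about how Docprompt batches internally, and `chunk_pages` and the batch size are hypothetical names chosen for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk_pages(pages: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `batch_size` pages.

    Hypothetical helper for illustration; not part of Docprompt's API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield list(pages[start:start + batch_size])


# Seven page numbers grouped into batches of three
batches = list(chunk_pages([1, 2, 3, 4, 5, 6, 7], batch_size=3))
print(batches)
```

The last batch may be shorter than the others, so any code consuming the batches should not assume a fixed length.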

Features

  • Representations for common document layout types: TextBlock, BoundingBox, etc.
  • Generic implementations of OCR providers
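To give a feel for what layout representations of this kind usually look like, here is a small sketch that assumes boxes are stored in normalized page coordinates (0.0 to 1.0). `BoxSketch` and `TextBlockSketch` are hypothetical stand-ins written for this example; they are not Docprompt's actual `BoundingBox` or `TextBlock` classes, whose fields may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoxSketch:
    """Axis-aligned box in normalized page coordinates (hypothetical)."""
    x0: float
    top: float
    x1: float
    bottom: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def to_pixels(self, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
        """Scale normalized coordinates to integer pixel coordinates."""
        return (
            round(self.x0 * page_width),
            round(self.top * page_height),
            round(self.x1 * page_width),
            round(self.bottom * page_height),
        )


@dataclass(frozen=True)
class TextBlockSketch:
    """A run of text together with the region of the page it occupies."""
    text: str
    box: BoxSketch


block = TextBlockSketch("Invoice total", BoxSketch(0.25, 0.5, 0.75, 0.625))
pixels = block.box.to_pixels(page_width=800, page_height=1000)
print(pixels)
```

Normalized coordinates keep a block's position independent of the DPI a page was rasterized at, which is why the conversion to pixels takes the page size as an argument.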

Installation

Use the package manager pip to install Docprompt.

pip install docprompt

With an OCR provider

pip install "docprompt[google]"
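After installing, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its version. This is a generic sketch using `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name, not something Docprompt provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if Docprompt is installed, otherwise None
print(installed_version("docprompt"))
```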

Usage

Simple Operations

from docprompt import load_document

# Load a document
document = load_document("path/to/my.pdf")

# Rasterize a single page using Ghostscript
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)

# Split a PDF based on a page range
document_2 = document.split(start=125, stop=130)

Performing OCR

from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider

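# my_project_id, my_processor_id and path_to_service_file are placeholders:
# define them with your own Google Cloud values before running this
provider = GoogleOcrProvider.from_service_account_file(
  project_id=my_project_id,
  processor_id=my_processor_id,
  service_account_file=path_to_service_file,
)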

document = load_document("path/to/my.pdf")

# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)

provider.process_document_node(document_node) # Caches results on the document_node

document_node[0].ocr_result # Access OCR results
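The pattern above (a node that wraps a document, is filled in by a provider, and is then indexed by page) can be sketched in a few lines of plain Python. Everything below is hypothetical and exists only to illustrate the idea: `PageNodeSketch`, `DocumentNodeSketch` and `fake_ocr` are invented for this example and do not reflect Docprompt's real classes or any real OCR call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageNodeSketch:
    """Holds derived data for one page (hypothetical, for illustration)."""
    page_number: int
    ocr_result: Optional[str] = None


@dataclass
class DocumentNodeSketch:
    """Holds one page node per page and supports integer indexing."""
    page_nodes: List[PageNodeSketch]

    @classmethod
    def from_page_count(cls, num_pages: int) -> "DocumentNodeSketch":
        # Page numbers are 1-based here; list indexes stay 0-based
        return cls([PageNodeSketch(page_number=i + 1) for i in range(num_pages)])

    def __getitem__(self, index: int) -> PageNodeSketch:
        return self.page_nodes[index]

    def __len__(self) -> int:
        return len(self.page_nodes)


def fake_ocr(node: DocumentNodeSketch) -> None:
    """Stand-in for a provider: caches a placeholder result on every page."""
    for page in node.page_nodes:
        page.ocr_result = f"text of page {page.page_number}"


node = DocumentNodeSketch.from_page_count(3)
fake_ocr(node)  # mutates the node in place, like a cache
first_result = node[0].ocr_result
print(first_result)
```

Keeping derived results on a separate node rather than on the document itself means the original file data stays untouched while several kinds of results can be attached to it.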
