
Documents and large language models.


Docprompt

Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy thanks to the zero-shot capability of large language models while also providing a toolset for working with various document formats.

Supercharged Document Analysis

  • Common utilities for interacting with PDFs
    • PDF loading and serialization
    • PDF byte compression using Ghostscript 👻
    • Fast rasterization 🔥 🚀
    • Page splitting, re-export with PDFium
  • Support for most OCR providers with batched inference
    • Google ✅
    • Azure Document Intelligence 🔴
    • Amazon Textract 🔴
    • Tesseract 🔴
  • Prompt Garden for common zero-shot document analysis tasks, including:
    • Table Extraction
    • Page Classification
    • Segmentation
    • Key-value extraction



Features

  • Representations for common document layout types (TextBlock, BoundingBox, etc.)
  • Generic implementations of OCR providers
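
To make the idea of layout representations concrete, here is a minimal, standalone sketch of a normalized bounding box and a text block. The names `NormBox` and `Block`, and all of their fields and methods, are hypothetical and chosen for illustration only. They are not Docprompt's actual `BoundingBox` or `TextBlock` classes, whose fields may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormBox:
    """Hypothetical box with coordinates normalized to the 0..1 range."""

    x0: float
    top: float
    x1: float
    bottom: float

    def union(self, other: "NormBox") -> "NormBox":
        # Smallest box that covers both boxes.
        return NormBox(
            min(self.x0, other.x0),
            min(self.top, other.top),
            max(self.x1, other.x1),
            max(self.bottom, other.bottom),
        )

    def to_pixels(self, width: int, height: int) -> Tuple[int, int, int, int]:
        # Scale normalized coordinates to a rasterized page of the given size.
        return (
            round(self.x0 * width),
            round(self.top * height),
            round(self.x1 * width),
            round(self.bottom * height),
        )


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: a piece of text plus its position on the page."""

    text: str
    box: NormBox


# Two adjacent words on the same line, merged into one highlight region.
first = Block("John", NormBox(0.10, 0.20, 0.18, 0.23))
second = Block("Doe", NormBox(0.19, 0.20, 0.25, 0.23))
merged = first.box.union(second.box)
print(merged.to_pixels(1000, 2000))
```

Keeping coordinates normalized means the same box can be drawn on a page rasterized at any DPI; only `to_pixels` needs to know the image size.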

Installation

Use the package manager pip to install Docprompt.

pip install docprompt

With an OCR provider

pip install "docprompt[google]"

With search support

pip install "docprompt[search]"

Usage

Simple Operations

from docprompt import load_document

# Load a document
document = load_document("path/to/my.pdf")

# Rasterize a single page using Ghostscript
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)

# Split a pdf based on a page range
document_2 = document.split(start=125, stop=130)

Performing OCR

from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider

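# Placeholder values: replace these with your own Google Cloud settings
my_project_id = "my-gcp-project"
my_processor_id = "my-processor-id"
path_to_service_file = "path/to/service_account.json"

provider = GoogleOcrProvider.from_service_account_file(
  project_id=my_project_id,
  processor_id=my_processor_id,
  service_account_file=path_to_service_file,
)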

document = load_document("path/to/my.pdf")

# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)

provider.process_document_node(document_node) # Caches results on the document_node

document_node[0].ocr_result # Access OCR results

Document Search

When a large language model returns a result, we might want to highlight that result for our users. However, language models return results as text, while what we need to show our users requires a page number and a bounding box.

After extracting text from a PDF, we can support this pattern using DocumentProvenanceLocator, which lives on a DocumentNode.

from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider

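# Placeholder values: replace these with your own Google Cloud settings
my_project_id = "my-gcp-project"
my_processor_id = "my-processor-id"
path_to_service_file = "path/to/service_account.json"

provider = GoogleOcrProvider.from_service_account_file(
  project_id=my_project_id,
  processor_id=my_processor_id,
  service_account_file=path_to_service_file,
)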

document = load_document("path/to/my.pdf")

# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)

provider.process_document_node(document_node) # Caches results on the document_node

# With OCR results available, we can now instantiate a locator and search through documents.

document_node.locator.search("John Doe") # Returns a list of all matches for "John Doe" across the document
document_node.locator.search("Jane Doe", page_number=4) # Returns only the matches found on page 4

This functionality uses a combination of rtree and the Rust library tantivy, allowing you to perform thousands of searches in seconds 🔥 🚀
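
The underlying problem, mapping a text answer back to a page number and a bounding box, can be illustrated without any dependencies. The sketch below is a naive linear scan over hypothetical OCR word records. It is not how DocumentProvenanceLocator is implemented (which, as noted above, relies on rtree and tantivy), and the `Word` type and the `merge` and `find_phrase` functions are assumptions made purely for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, top, x1, bottom), normalized


class Word(NamedTuple):
    """Hypothetical OCR word record: text, page, and position."""

    text: str
    page_number: int
    box: Box


def merge(boxes: List[Box]) -> Box:
    # Smallest box that covers every box in the list.
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def find_phrase(
    words: List[Word], phrase: str, page_number: Optional[int] = None
) -> List[Tuple[int, Box]]:
    """Return (page_number, box) for each exact, case-insensitive phrase match."""
    tokens = phrase.lower().split()
    hits: List[Tuple[int, Box]] = []
    if not tokens:
        return hits
    for start in range(len(words) - len(tokens) + 1):
        window = words[start:start + len(tokens)]
        # A match must not straddle a page boundary.
        if len({w.page_number for w in window}) != 1:
            continue
        if page_number is not None and window[0].page_number != page_number:
            continue
        if [w.text.lower() for w in window] == tokens:
            hits.append((window[0].page_number, merge([w.box for w in window])))
    return hits


# Made-up OCR output for two pages, in reading order.
words = [
    Word("Name:", 1, (0.10, 0.10, 0.18, 0.12)),
    Word("John", 1, (0.20, 0.10, 0.26, 0.12)),
    Word("Doe", 1, (0.27, 0.10, 0.32, 0.12)),
    Word("Signed", 4, (0.10, 0.80, 0.20, 0.82)),
    Word("John", 4, (0.22, 0.80, 0.28, 0.82)),
    Word("Doe", 4, (0.29, 0.80, 0.34, 0.82)),
]

all_hits = find_phrase(words, "John Doe")
page_4_hits = find_phrase(words, "John Doe", page_number=4)
print(all_hits)
print(page_4_hits)
```

A scan like this visits every word on every query, which is why an index-backed approach is preferable when running many searches over large documents.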
