
Documents and large language models.

Project description




Docprompt

Document AI, powered by LLMs

About

Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy thanks to the zero-shot capability of large language models.

Supercharged Document Analysis

  • Common utilities for interacting with PDFs
    • PDF loading and serialization
    • PDF byte compression using Ghostscript :ghost:
    • Fast rasterization :fire: :rocket:
    • Page splitting, re-export with PDFium
    • Document Search, powered by Rust :fire:
  • Support for most OCR providers with batched inference
    • Google :white_check_mark:
    • Amazon Textract :white_check_mark:
    • Tesseract :white_check_mark:
    • Azure Document Intelligence :red_circle:
  • Layout Aware Page Representation
    • Run Document Layout Analysis with text-only LLMs!
  • Prompt Garden of zero-shot prompts for common document analysis tasks, including:
    • Markerization (Pdf2Markdown)
    • Table Extraction
    • Page Classification
    • Key-value extraction (Coming soon)
    • Segmentation (Coming soon)


Features

  • Representations for common document layout types - TextBlock, BoundingBox, etc.
  • Generic implementations of OCR providers
  • Document Search powered by Rust and R-trees :fire:
  • Table Extraction, Page Classification, PDF2Markdown

Installation

Use the package manager pip to install Docprompt.

pip install docprompt

With an OCR provider

pip install "docprompt[google]

With search support

pip install "docprompt[search]"

Usage

Simple Operations

from docprompt import load_document

# Load a document
document = load_document("path/to/my.pdf")

# Rasterize a single page using Ghostscript
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)

# Split a pdf based on a page range
document_2 = document.split(start=125, stop=130)
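
The rasterized page can be persisted like any other bytes. A minimal sketch, assuming rasterize_page returns encoded image bytes (the README does not state the return type):

from docprompt import load_document

document = load_document("path/to/my.pdf")

# Assumption: rasterize_page returns encoded image bytes (e.g. PNG)
image_bytes = document.rasterize_page(5, dpi=120)

with open("page_5.png", "wb") as f:
    f.write(image_bytes)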

Converting a PDF to markdown

Converting documents to markdown is a great way to prepare them for downstream chunking or ingestion into a RAG system.

from docprompt import load_document_node
from docprompt.tasks.markerize import AnthropicMarkerizeProvider

document_node = load_document_node("path/to/my.pdf")
markerize_provider = AnthropicMarkerizeProvider()

markerized_document = markerize_provider.process_document_node(document_node)
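
The README does not document the shape of the returned object. As a hedged sketch, continuing from the example above and assuming the result maps page numbers to markdown strings (an assumption, not the documented API), the output could be written to disk like this:

# Hypothetical: treats markerized_document as a mapping of
# page number -> markdown string; adjust to the real return type.
with open("my.md", "w", encoding="utf-8") as f:
    for page_number in sorted(markerized_document):
        f.write(f"<!-- page {page_number} -->\n")
        f.write(markerized_document[page_number] + "\n\n")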

Extracting Tables

Extract tables with state-of-the-art speed and accuracy.

from docprompt import load_document_node
from docprompt.tasks.table_extraction import AnthropicTableExtractionProvider

document_node = load_document_node("path/to/my.pdf")
table_extraction_provider = AnthropicTableExtractionProvider()

extracted_tables = table_extraction_provider.process_document_node(document_node)
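
As with markerization, the structure of the result isn't documented here. Continuing from the example above, a hedged sketch (the attribute names are illustrative assumptions) might iterate the extracted tables like this:

# Hypothetical: assumes a mapping of page number -> extraction result,
# where each result exposes a list of tables with rows.
for page_number, result in sorted(extracted_tables.items()):
    for table in result.tables:  # assumed attribute
        print(f"Page {page_number}: table with {len(table.rows)} rows")  # assumed attribute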

Performing OCR

from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider

provider = GoogleOcrProvider.from_service_account_file(
  project_id=my_project_id,
  processor_id=my_processor_id,
  service_account_file=path_to_service_file
)

document = load_document("path/to/my.pdf")

# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)

provider.process_document_node(document_node) # Caches results on the document_node

document_node[0].ocr_result # Access OCR results
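
Because results are cached per page on the node, every page can be inspected once processing completes. A minimal sketch, continuing from the example above and assuming DocumentNode is indexable by page (as the line above suggests) and supports len() (an assumption):

# Iterate cached OCR results for every page
for page_index in range(len(document_node)):  # len() support is an assumption
    ocr_result = document_node[page_index].ocr_result
    print(f"Page {page_index + 1}: {type(ocr_result).__name__}")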

Document Search

When a large language model returns a result, we often want to highlight that result for our users. However, language models return plain text, while highlighting a result on the page requires a page number and a bounding box.

After extracting text from a PDF, we can support this pattern with DocumentProvenanceLocator, which lives on a DocumentNode.

from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider

provider = GoogleOcrProvider.from_service_account_file(
  project_id=my_project_id,
  processor_id=my_processor_id,
  service_account_file=path_to_service_file
)

document = load_document("path/to/my.pdf")

# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)

provider.process_document_node(document_node) # Caches results on the document_node

# With OCR results available, we can now instantiate a locator and search through documents.

document_node.locator.search("John Doe") # This will return a list of all terms across the document that contain "John Doe"
document_node.locator.search("Jane Doe", page_number=4) # Just return results a list of matching results from page 4

This functionality uses a combination of rtree and the Rust library tantivy, allowing you to perform thousands of searches in seconds :fire: :rocket:
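
Tying this back to the highlighting workflow described above, each hit should carry enough provenance to draw a box on the correct page. A hedged sketch, continuing from the example above; the attribute names below are assumptions rather than the documented API:

# Hypothetical: assumes each search result exposes a page number and a
# bounding box suitable for rendering a highlight
for hit in document_node.locator.search("John Doe"):
    print(hit.page_number, hit.text_block.bounding_box)  # assumed attributes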


