Documents and large language models.
Docprompt
Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy by building on the zero-shot capabilities of large language models, and it also provides a toolset for working with various document formats.
Supercharged Document Analysis
- Common utilities for interacting with PDFs
  - PDF loading and serialization
  - PDF byte compression using Ghostscript :ghost:
  - Fast rasterization :fire: :rocket:
  - Page splitting, re-export with PikePDF
- Support for most OCR providers with batched inference
  - Google :white_check_mark:
  - Azure Document Intelligence :red_circle:
  - Amazon Textract :red_circle:
  - Tesseract :red_circle:
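The README does not show how the Ghostscript compression step works, so what follows is only a sketch of how such a step is commonly run from Python, not Docprompt's implementation. The function names `build_gs_compress_command` and `compress_pdf`, the `/ebook` preset default, and the assumption that the Ghostscript binary is called `gs` and is on `PATH` are all illustrative choices.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_compress_command(src: Path, dst: Path, preset: str = "/ebook") -> list[str]:
    """Build a Ghostscript command line that rewrites `src` into `dst`."""
    return [
        "gs",  # assumption: the Ghostscript binary is on PATH under this name
        "-sDEVICE=pdfwrite",         # write a PDF rather than an image
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",   # quality preset, e.g. /screen, /ebook, /printer
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        str(src),
    ]


def compress_pdf(src: Path, dst: Path, preset: str = "/ebook") -> None:
    """Run Ghostscript; raises if the binary is missing or exits non-zero."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript ('gs') was not found on PATH")
    subprocess.run(build_gs_compress_command(src, dst, preset), check=True)
```

Keeping the command construction separate from the subprocess call makes the arguments easy to inspect or log without Ghostscript being installed.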
- Documentation: https://docprompt.dev
- GitHub: https://github.com/Page-Leaf/docprompt
- PyPI: https://pypi.org/project/docprompt/
- Free software: Apache-2.0
Features
- Representations for common document layout types: TextBlock, BoundingBox, etc.
- Generic implementations of OCR providers
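To make the idea of layout representations concrete, here is a minimal sketch of what a bounding box and a text block type might look like. These are not Docprompt's actual classes: the names `SketchBoundingBox` and `SketchTextBlock`, their fields, and the choice of page-relative coordinates in the range 0 to 1 are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBoundingBox:
    """Hypothetical box in page-relative coordinates, each value in [0, 1]."""

    x0: float
    top: float
    x1: float
    bottom: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def to_pixels(self, page_width: int, page_height: int) -> tuple[int, int, int, int]:
        """Scale the relative box to pixel coordinates for a rendered page."""
        return (
            round(self.x0 * page_width),
            round(self.top * page_height),
            round(self.x1 * page_width),
            round(self.bottom * page_height),
        )


@dataclass(frozen=True)
class SketchTextBlock:
    """Hypothetical unit of recognized text together with its location."""

    text: str
    box: SketchBoundingBox
```

Storing coordinates relative to the page, rather than in pixels, would keep a box valid regardless of the DPI at which the page is later rasterized.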
Installation
Use the package manager pip to install Docprompt.
pip install docprompt
With an OCR provider
pip install "docprompt[google]"
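After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative choice and not part of Docprompt.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if Docprompt is installed, otherwise None.
print(installed_version("docprompt"))
```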
Usage
Simple Operations
from docprompt import load_document
# Load a document
document = load_document("path/to/my.pdf")
# Rasterize a single page using Ghostscript
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)
# Split a pdf based on a page range
document_2 = document.split(start=125, stop=130)
Performing OCR
from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider
provider = GoogleOcrProvider.from_service_account_file(
    project_id=my_project_id,
    processor_id=my_processor_id,
    service_account_file=path_to_service_file
)
document = load_document("path/to/my.pdf")
# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)
provider.process_document_node(document_node) # Caches results on the document_node
document_node[0].ocr_result # Access OCR results
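The OCR snippet leaves `my_project_id`, `my_processor_id`, and `path_to_service_file` undefined. One possible way to supply them, sketched below, is to read them from environment variables. The variable names (`DOCPROMPT_GCP_PROJECT_ID` and so on), the `GcpOcrSettings` class, and the `load_gcp_ocr_settings` helper are all hypothetical; Docprompt itself does not define them.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GcpOcrSettings:
    """Hypothetical holder for the three values the provider needs."""

    project_id: str
    processor_id: str
    service_account_file: str


# Hypothetical environment variable names, chosen for this sketch only.
_ENV_NAMES = (
    "DOCPROMPT_GCP_PROJECT_ID",
    "DOCPROMPT_GCP_PROCESSOR_ID",
    "DOCPROMPT_GCP_SERVICE_ACCOUNT_FILE",
)


def load_gcp_ocr_settings(env: Mapping[str, str] = os.environ) -> GcpOcrSettings:
    """Read the settings from `env`, failing loudly if any are missing or empty."""
    missing = [name for name in _ENV_NAMES if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return GcpOcrSettings(*(env[name] for name in _ENV_NAMES))
```

The resulting fields can then be passed as `project_id`, `processor_id`, and `service_account_file` to `GoogleOcrProvider.from_service_account_file`, which keeps credentials out of the source file.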
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
docprompt-0.1.4.tar.gz (21.2 kB)
Built Distribution
docprompt-0.1.4-py3-none-any.whl (24.9 kB)
Hashes for docprompt-0.1.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 29e4c7840bd844a6618bfae4282cc8a53cf6d5014c4be31703bd1f936e4c2d73
MD5 | ff9127d22734bbc3ddc98716aa008edd
BLAKE2b-256 | d1f57fd6a38403ff05b67aa667bb587e57add88acce91f7f05bb1eb51cab00c5
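A downloaded wheel can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of_file` and `matches_expected` are illustrative, and the file path passed in is whatever location the wheel was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for docprompt-0.1.4-py3-none-any.whl
EXPECTED_SHA256 = "29e4c7840bd844a6618bfae4282cc8a53cf6d5014c4be31703bd1f936e4c2d73"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large files are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex digest, ignoring case."""
    return sha256_of_file(path) == expected.lower()
```

pip can perform the same check itself when a requirements file pins hashes and `--require-hashes` is used.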