Documents and large language models.
About
Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy thanks to the zero-shot capability of large language models.
Supercharged Document Analysis
- Common utilities for interacting with PDFs
  - PDF loading and serialization
  - PDF byte compression using Ghostscript :ghost:
  - Fast rasterization :fire: :rocket:
  - Page splitting, re-export with PDFium
- Document Search, powered by Rust :fire:
- Support for most OCR providers with batched inference
  - Google :white_check_mark:
  - Azure Document Intelligence :red_circle:
  - Amazon Textract :red_circle:
  - Tesseract :red_circle:
- Prompt Garden for common zero-shot document analysis tasks, including:
  - Table Extraction
  - Page Classification
  - Segmentation
  - Key-value extraction
- Documentation: https://docprompt.dev
- GitHub: https://github.com/Page-Leaf/docprompt
- PyPI: https://pypi.org/project/docprompt/
- Free software: Apache-2.0
Features
- Representations for common document layout types: TextBlock, BoundingBox, etc.
- Generic implementations of OCR providers
- Document Search powered by Rust and R-trees :fire:
Installation
Use the package manager pip to install Docprompt.
pip install docprompt
With an OCR provider
pip install "docprompt[google]"
With search support
pip install "docprompt[search]"
Usage
Simple Operations
from docprompt import load_document
# Load a document
document = load_document("path/to/my.pdf")
# Rasterize a single page using Ghostscript
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)
# Split a pdf based on a page range
document_2 = document.split(start=125, stop=130)
Performing OCR
from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider
provider = GoogleOcrProvider.from_service_account_file(
project_id=my_project_id,
processor_id=my_processor_id,
service_account_file=path_to_service_file
)
document = load_document("path/to/my.pdf")
# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)
provider.process_document_node(document_node) # Caches results on the document_node
document_node[0].ocr_result # Access OCR results
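The container idea above can be illustrated with a minimal, library-free sketch. It is not Docprompt's implementation: `SketchPageNode`, `SketchDocumentNode`, `fake_ocr`, and the 1-based page numbering are all hypothetical names and assumptions chosen for illustration. It only shows the pattern of storing derived results on per-page nodes so that a second pass does no extra work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchPageNode:
    """Hypothetical per-page holder for derived data (not Docprompt's class)."""
    page_number: int
    ocr_result: Optional[str] = None


class SketchDocumentNode:
    """Hypothetical container that is indexable by page, like a list."""

    def __init__(self, page_count: int) -> None:
        # Assumption: pages are numbered from 1, while indexing starts at 0.
        self.pages: List[SketchPageNode] = [
            SketchPageNode(page_number=i + 1) for i in range(page_count)
        ]

    def __getitem__(self, index: int) -> SketchPageNode:
        return self.pages[index]

    def __len__(self) -> int:
        return len(self.pages)


def fake_ocr(page_number: int) -> str:
    """Stand-in for a real OCR call; returns placeholder text."""
    return f"text of page {page_number}"


def process_document_node(node: SketchDocumentNode) -> int:
    """Fill in missing OCR results and return how many pages were processed."""
    calls = 0
    for page in node.pages:
        if page.ocr_result is None:  # skip pages that already have a result
            page.ocr_result = fake_ocr(page.page_number)
            calls += 1
    return calls


node = SketchDocumentNode(page_count=3)
first_run = process_document_node(node)   # processes every page
second_run = process_document_node(node)  # finds results already cached
```

In this sketch the results live on the node itself, so any later step (search, classification) can read `node[0].ocr_result` without repeating the OCR call.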
Document Search
When a large language model returns a result, we might want to highlight that result for our users. However, language models return results as text, while what we need to show our users requires a page number and a bounding box.
After extracting text from a PDF, we can support this pattern using DocumentProvenanceLocator, which lives on a DocumentNode:
from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider
provider = GoogleOcrProvider.from_service_account_file(
project_id=my_project_id,
processor_id=my_processor_id,
service_account_file=path_to_service_file
)
document = load_document("path/to/my.pdf")
# A container holds derived data for a document, like OCR or classification results
document_node = DocumentNode.from_document(document)
provider.process_document_node(document_node) # Caches results on the document_node
# With OCR results available, we can now instantiate a locator and search through documents.
document_node.locator.search("John Doe") # Returns a list of all results across the document that contain "John Doe"
document_node.locator.search("Jane Doe", page_number=4) # Returns only the matching results from page 4
This functionality uses a combination of rtree and the Rust library tantivy, allowing you to perform thousands of searches in seconds :fire: :rocket:
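To make the text-to-location problem concrete, here is a minimal pure-Python sketch of mapping a phrase returned by a language model back to a page number and a bounding box. It does not use Docprompt, rtree, or tantivy; it is a naive linear scan, and `Word`, `merge_boxes`, `locate`, and the `(x0, y0, x1, y1)` coordinate convention are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


@dataclass(frozen=True)
class Word:
    """Hypothetical OCR word: its text, its page, and its bounding box."""
    text: str
    page: int
    box: Box


def merge_boxes(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def locate(words: List[Word], query: str, page: Optional[int] = None) -> List[Tuple[int, Box]]:
    """Find consecutive words matching the query; return (page, box) pairs."""
    tokens = query.lower().split()
    if not tokens:
        return []
    # Optionally restrict the search to a single page.
    candidates = [w for w in words if page is None or w.page == page]
    hits = []
    for i in range(len(candidates) - len(tokens) + 1):
        window = candidates[i:i + len(tokens)]
        same_page = len({w.page for w in window}) == 1
        if same_page and [w.text.lower() for w in window] == tokens:
            hits.append((window[0].page, merge_boxes([w.box for w in window])))
    return hits


words = [
    Word("Signed", 1, (0.10, 0.10, 0.20, 0.12)),
    Word("by", 1, (0.21, 0.10, 0.24, 0.12)),
    Word("John", 1, (0.25, 0.10, 0.31, 0.12)),
    Word("Doe", 1, (0.32, 0.10, 0.37, 0.12)),
    Word("John", 4, (0.50, 0.40, 0.56, 0.42)),
    Word("Doe", 4, (0.57, 0.40, 0.62, 0.42)),
]

hits = locate(words, "John Doe")
page_4_hits = locate(words, "John Doe", page=4)
```

Each hit pairs a page number with one box covering the whole matched phrase, which is the shape of result a highlighting UI needs. A linear scan like this is enough for a single lookup; the indexed approach described above matters when running many searches over large documents.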