Extract and parse any PDF document within minutes
# 📄 Document Extraction Module
A general-purpose document extraction module designed for developers. This toolkit lets you build customizable extraction pipelines for your documents using the provided APIs and SDK functions.
## 🚀 Features

- Define your own extraction schema as a class or JSON
- Supports single and batch PDF processing
- Built-in parallelization: multiple PDFs, and multiple pages within each PDF, are processed concurrently
- Supports English PDFs, PDFs in Indic languages, scanned PDFs, and PDFs with tables and complicated layouts
- Easy LLM integration via prompt schemas
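The page- and document-level concurrency described above is a standard asyncio fan-out pattern. A minimal, self-contained sketch of the idea (the function names here are illustrative, not part of the package API):

```python
import asyncio

async def process_page(pdf_name: str, page_number: int) -> str:
    # Placeholder for per-page extraction work (e.g. an LLM call).
    await asyncio.sleep(0)  # simulate I/O-bound work
    return f"{pdf_name}:page{page_number}"

async def process_pdf(pdf_name: str, page_count: int) -> list[str]:
    # All pages of one PDF are extracted concurrently.
    return await asyncio.gather(
        *(process_page(pdf_name, p) for p in range(page_count))
    )

async def process_all(pdfs: dict[str, int]) -> dict[str, list[str]]:
    # All PDFs are also processed concurrently.
    results = await asyncio.gather(
        *(process_pdf(name, pages) for name, pages in pdfs.items())
    )
    return dict(zip(pdfs, results))

results = asyncio.run(process_all({"a.pdf": 2, "b.pdf": 3}))
```

`asyncio.gather` preserves input order, so each PDF's pages come back in page order even though they ran concurrently.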
## 🔐 API Key Setup

If you are using GPT models (which give good extraction quality), set:

```shell
export LLM_PROVIDER=open-ai
export OPENAI_API_KEY=<<your_key_here>>
```

Specify the models you wish to use in two keys:

```shell
export OPEN_AI_EXTRACTION_MODEL=<<your_extraction_model>>
export OPEN_AI_PARSE_FORMATING_MODEL=<<your_structured_output_formatting_model>>
```

If you are using Landing AI for parsing:

```shell
export VISION_AGENT_API_KEY=<<your_key_here>>
```
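It can help to fail fast if a key is missing before kicking off an extraction run. A small sanity check using only the standard library (the variable names mirror the exports above; the helper itself is not part of the package):

```python
import os

REQUIRED_VARS = [
    "LLM_PROVIDER",
    "OPENAI_API_KEY",
    "OPEN_AI_EXTRACTION_MODEL",
    "OPEN_AI_PARSE_FORMATING_MODEL",
]

def missing_env_vars(required=REQUIRED_VARS):
    # Return the names of required variables that are unset or empty.
    return [name for name in required if not os.environ.get(name)]

os.environ["LLM_PROVIDER"] = "open-ai"  # example value for demonstration
print(missing_env_vars(["LLM_PROVIDER"]))  # -> []
```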
## 📦 Installation

First, install the required packages using pip:

```shell
pip install -r requirements.txt
```
## 🧠 Define Your Extraction Schema

Edit `extraction_class_type.py` to define your custom schema. You'll use three core classes:

- `ExtractionClass`: fields to extract from documents
- `AIAgentClass`: prompts/instructions for the AI agent
- `OutputExampleClass`: (optional) output examples for better LLM performance
Example class definitions:

```python
from typing import List

from pydantic import BaseModel, Field
from prompt_processing import generate_prompt_template, generate_prompt_from_schema, PromptSchema


class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")
    hobbies: List[str] = Field(..., description="A list of hobbies.")
    address: dict = Field(..., description="The address of the person.")


class AIAgentClass(BaseModel):
    information: str = Field(..., description="Information for the AI agent.")
    instruction: str = Field(..., description="Instructions for the AI agent.")
    condition: str = Field(..., description="Conditions for the AI agent.")


class OutputExampleClass(BaseModel):
    """Optional: provide a sample output format."""


extraction_template = generate_prompt_template(ExtractionClass)
ai_agent_template = generate_prompt_template(AIAgentClass)
output_example_template = generate_prompt_template(OutputExampleClass)

schema = PromptSchema(
    ai_agent_information=ai_agent_template,
    extract_fields=extraction_template,
    output_example=output_example_template,
)
prompt = generate_prompt_from_schema(schema.model_dump())
```
The `description` in each field definition acts as the prompt for that field during extraction.
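Those descriptions are plain Pydantic field metadata, so you can inspect exactly what text the prompt generator will see. A quick check using Pydantic's standard schema export (independent of this package's helpers):

```python
from pydantic import BaseModel, Field

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")

# model_json_schema() exposes each field's description,
# which is the text the per-field prompt is built from.
schema = ExtractionClass.model_json_schema()
print(schema["properties"]["name"]["description"])  # -> The name of the person.
```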
## 📂 Usage

Extracting from multiple PDFs:

- `document_paths`: an array of paths to the input PDFs
- `output_paths`: where you would like the resulting JSON files saved
- `parser`: set to `True` to digitise the entire document; otherwise only the schema fields are extracted from the PDF
- `combine_pages`: use this for smaller PDFs (2-3 pages) whose pages you want to extract together
```python
from extraction import extract_multiple_pdfs

document_paths = ["./152_double.pdf", "./152.pdf"]
output_paths = ["./152_extracted.json", "./152_double_extracted.json"]

multi_pdf_extraction_result = await extract_multiple_pdfs(
    input_paths=document_paths,
    output_paths=output_paths,
    parser=False,
    combine_page=False,
    prompt=prompt,
    extraction_schema_class=ExtractionClass,
)
```
Extracting from a single PDF:

```python
from extraction import extract_multiple_pages_async
from file_process import base_64_conversion

pdf_path = "./example.pdf"
base64_images = base_64_conversion(input_type="PDF", file_path=pdf_path)

single_pdf_extraction_result = await extract_multiple_pages_async(
    input_type="PDF",
    base64_images=base64_images,
    text_inputs=[],
    combine_pages=False,
    prompt=prompt,
    extraction_schema_class=ExtractionClass,
)
```
As of now, `combine_pages` is disabled due to context-window length limitations.
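The `base_64_conversion` helper presumably returns base64-encoded page images for the vision model. For reference, base64 encoding of raw bytes with the standard library works like this (the PDF bytes here are a stand-in):

```python
import base64

# Stand-in for raw page-image bytes produced from a PDF page.
page_bytes = b"%PDF-1.4 example page content"

# Encode to a base64 ASCII string, the form typically sent to vision LLM APIs.
b64_string = base64.b64encode(page_bytes).decode("ascii")

# Decoding restores the original bytes exactly.
assert base64.b64decode(b64_string) == page_bytes
```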
## 📁 Output

The extracted information is saved as JSON at the given output path. The format exactly matches the fields defined in your `ExtractionClass`.
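Because the output mirrors `ExtractionClass`, you can validate a saved JSON file by loading it back through the same model. A sketch using Pydantic directly (the JSON string stands in for a saved file's contents):

```python
from typing import List

from pydantic import BaseModel, Field

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    hobbies: List[str] = Field(..., description="A list of hobbies.")

# e.g. the contents of ./152_extracted.json
raw = '{"name": "Ada", "hobbies": ["chess"]}'

# Raises a ValidationError if the JSON drifts from the schema.
record = ExtractionClass.model_validate_json(raw)
print(record.name)  # -> Ada
```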
## 💡 Tips

- Keep the prompts in `AIAgentClass` clear and detailed for better LLM results.
- Test your `ExtractionClass` on a few sample documents before scaling.
- Make use of `OutputExampleClass` to improve LLM consistency.
## File details

Details for the file `document_extraction_module-0.1.1.tar.gz`.

File metadata:

- Download URL: document_extraction_module-0.1.1.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `52d5c538ba83d877f46fb8a26f22d2a65219a93620676b629a643249d31b17db` |
| MD5 | `48282dc92fcdf293613ee39427652dbc` |
| BLAKE2b-256 | `e4eb91e54d851cbbe1bbff1f6d276fcf9b2bb261e49c10e7f7036785f76f6848` |
## File details

Details for the file `document_extraction_module-0.1.1-py3-none-any.whl`.

File metadata:

- Download URL: document_extraction_module-0.1.1-py3-none-any.whl
- Upload date:
- Size: 19.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e0eb8b606802200e73c8574d32bf8d985a7b333f446b68d096269a6d27050cc9` |
| MD5 | `1c659d0564df7986b72c9d6dc5ab2b2e` |
| BLAKE2b-256 | `13f94fbcfa6f2a7b204070429d0eccb990e4ae2c4daa5bf6fdfeba79f1981c8f` |