
Extract and parse any PDF document within minutes


TODO: Update the README, add docstrings for the module, and add instructions for building the package.

📄 Document Extraction Module

A general-purpose document extraction module designed for developers. This toolkit lets you build customizable extraction pipelines for your documents using the provided APIs and SDK functions.


🚀 Features

  • Define your own extraction schema as a class or JSON
  • Supports single and batch PDF processing
  • Built-in parallelization: processes multiple PDFs, and multiple pages of each PDF, concurrently
  • Supports English PDFs, PDFs in Indic languages, scanned PDFs, and PDFs with tables and complex layouts
  • Easy LLM integration via prompt schemas

🔐 API Key Setup

If you are using GPT models (which support good extraction):

export LLM_PROVIDER=open-ai
export OPENAI_API_KEY=<<your_key_here>>

Specify the models you wish to use in two keys:

export OPEN_AI_EXTRACTION_MODEL=<<your_extraction_model>>
export OPEN_AI_PARSE_FORMATING_MODEL=<<your_structured_output_formatting_model>>

If you are using Landing AI for parsing:

export VISION_AGENT_API_KEY=<<your_key_here>>
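Before running an extraction, it can help to verify that the environment is configured. This is a minimal, stand-alone sketch (the variable names follow the exports above; the `check_env` helper is not part of the module):

```python
import os

# Variables the exports above set for the OpenAI provider.
REQUIRED_OPENAI_VARS = [
    "OPENAI_API_KEY",
    "OPEN_AI_EXTRACTION_MODEL",
    "OPEN_AI_PARSE_FORMATING_MODEL",
]

def check_env(provider: str = "open-ai") -> list:
    """Return the names of any required variables that are missing."""
    required = REQUIRED_OPENAI_VARS if provider == "open-ai" else ["VISION_AGENT_API_KEY"]
    return [name for name in required if not os.environ.get(name)]

missing = check_env(os.environ.get("LLM_PROVIDER", "open-ai"))
if missing:
    print("Missing environment variables:", ", ".join(missing))
```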

📦 Installation

First, install all the required packages using pip:

pip install -r requirements.txt

🧠 Define Your Extraction Schema

Edit extraction_class_type.py to define your custom schema. You'll use three core classes:

  • ExtractionClass: Fields to extract from documents
  • AIAgentClass: Prompts/instructions for the AI Agent
  • OutputExampleClass: (Optional) Output examples for better LLM performance

Example class definitions:

from pydantic import BaseModel, Field
from typing import List
from prompt_processing import generate_prompt_template, generate_prompt_from_schema, PromptSchema

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")
    hobbies: List[str] = Field(..., description="A list of hobbies.")
    address: dict = Field(..., description="The address of the person.")

class AIAgentClass(BaseModel):
    information: str = Field(..., description="Information for the AI agent.")
    instruction: str = Field(..., description="Instructions for the AI agent.")
    condition: str = Field(..., description="Conditions for the AI agent.")

class OutputExampleClass(BaseModel):
    """Optional: Provide a sample output format"""


extraction_template = generate_prompt_template(ExtractionClass)
ai_agent_template = generate_prompt_template(AIAgentClass)
output_example_template = generate_prompt_template(OutputExampleClass)

schema = PromptSchema(
    ai_agent_information=ai_agent_template,
    extract_fields=extraction_template,
    output_example=output_example_template,
)

prompt = generate_prompt_from_schema(schema.model_dump())

The description in each field definition acts as the prompt for that field during extraction.
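To see why the descriptions matter, here is a small illustration of the underlying idea: collecting each field's description into a prompt via Pydantic v2's model_fields. This is not the module's actual generate_prompt_template, just a sketch of the mechanism:

```python
from pydantic import BaseModel, Field

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")

def fields_to_prompt(model_cls) -> str:
    # Each field's description becomes the extraction instruction for that field.
    lines = [f"- {name}: {info.description}" for name, info in model_cls.model_fields.items()]
    return "Extract the following fields:\n" + "\n".join(lines)

print(fields_to_prompt(ExtractionClass))
```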

📂 Usage

Extracting from multiple PDFs:

  • document_paths: an array of paths to the input PDFs
  • output_paths: the paths where the extracted JSON files should be saved
  • parser: set to True to digitise the entire document; otherwise only the schema fields are extracted from the PDF
  • combine_pages: use this when extracting from small PDFs (2-3 pages) whose pages you want to extract together

from extraction import extract_multiple_pdfs

document_paths = ["./152_double.pdf", "./152.pdf"]
output_paths = ["./152_extracted.json", "./152_double_extracted.json"]

multi_pdf_extraction_result = await extract_multiple_pdfs(
   input_paths=document_paths,
   output_paths=output_paths,
   parser=False,
   combine_pages=False,
   prompt=prompt,
   extraction_schema_class=ExtractionClass
)
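extract_multiple_pdfs is a coroutine, so in a plain script (outside an already-running event loop) it must be driven with asyncio. A minimal sketch, with a stub coroutine standing in for the real call:

```python
import asyncio

async def run_extraction() -> list:
    # Stand-in for: await extract_multiple_pdfs(...)
    # Replace the body with the real call from the example above.
    await asyncio.sleep(0)  # placeholder for the awaited extraction
    return [{"status": "ok"}]

results = asyncio.run(run_extraction())
print(results)
```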

Extraction from single PDF:

from extraction import extract_multiple_pages_async
from file_process import base_64_conversion

pdf_path = "./example.pdf"
base64_images = base_64_conversion(input_type="PDF", file_path=pdf_path)

single_pdf_extraction_result = await extract_multiple_pages_async(
    input_type="PDF",
    base64_images=base64_images,
    text_inputs=[],
    combine_pages=False,
    prompt=prompt,
    extraction_schema_class=ExtractionClass
)

As of now, combine_pages is disabled due to context-window length challenges.
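The base_64_conversion helper presumably renders each PDF page to an image and base64-encodes it; the encoding step itself is just the standard library's base64 module. A stand-alone illustration of that step only (page rendering omitted; the function name here is illustrative, not the module's):

```python
import base64

def encode_bytes_for_llm(data: bytes) -> str:
    # Vision-model APIs typically expect base64-encoded image bytes as ASCII text.
    return base64.b64encode(data).decode("ascii")

fake_page_image = b"\x89PNG...page bytes..."
encoded = encode_bytes_for_llm(fake_page_image)
```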

📁 Output

The extracted information is saved as JSON at the corresponding path in output_paths. The format exactly matches the fields defined in your ExtractionClass.
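Because the output mirrors ExtractionClass, the saved JSON can be round-tripped through the same model for validation. A sketch, assuming Pydantic v2 and an inline stand-in for the extracted file:

```python
import json
from pydantic import BaseModel, Field

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")

# In practice this string would be read from the JSON file written to output_paths.
raw = '{"name": "Jane Doe", "age": 42}'
record = ExtractionClass.model_validate(json.loads(raw))
print(record.name, record.age)
```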

💡 Tips

  • Keep prompts in AIAgentClass clear and detailed for better LLM results.
  • Test your ExtractionClass with a few sample documents before scaling.
  • Make use of OutputExampleClass to improve LLM consistency.
