
Extract and parse any PDF document within minutes


TODO: Update the README, add docstrings for the module, and add instructions for building the package.

📄 Document Extraction Module

A general-purpose document extraction module designed for developers. This toolkit lets you build customizable extraction pipelines for your documents using the provided APIs and SDK functions.


🚀 Features

  • Define your own extraction schema as a class or JSON
  • Supports single and batch PDF processing
  • Built-in parallelization: processes multiple PDFs, and multiple pages of each PDF, concurrently
  • Supports English PDFs, PDFs in Indic languages, scanned PDFs, and PDFs with tables and complex layouts
  • Easy LLM integration via prompt schemas

🔐 API Key Setup

If you are using GPT models (which support good extraction):

export LLM_PROVIDER=open-ai
export OPENAI_API_KEY=<<your_key_here>>

Specify the models you wish to use in two keys:

export OPEN_AI_EXTRACTION_MODEL=<<your_extraction_model>>
export OPEN_AI_PARSE_FORMATING_MODEL=<<your_structured_output_formatting_model>>

If you are using Landing AI for parsing:

export VISION_AGENT_API_KEY=<<your_key_here>>
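Before running an extraction, it can help to verify that the environment is configured. This is a minimal, stand-alone sketch (the variable names follow the exports above; the `check_env` helper is not part of the module):

```python
import os

# Variables the exports above set for the OpenAI provider.
REQUIRED_OPENAI_VARS = [
    "OPENAI_API_KEY",
    "OPEN_AI_EXTRACTION_MODEL",
    "OPEN_AI_PARSE_FORMATING_MODEL",
]

def check_env(provider: str = "open-ai") -> list:
    """Return the names of any required variables that are missing."""
    required = REQUIRED_OPENAI_VARS if provider == "open-ai" else ["VISION_AGENT_API_KEY"]
    return [name for name in required if not os.environ.get(name)]

missing = check_env(os.environ.get("LLM_PROVIDER", "open-ai"))
if missing:
    print("Missing environment variables:", ", ".join(missing))
```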

📦 Installation

First, install all the required packages using pip:

pip install -r requirements.txt

🧠 Define Your Extraction Schema

Edit extraction_class_type.py to define your custom schema. You'll use three core classes:

  • ExtractionClass: Fields to extract from documents
  • AIAgentClass: Prompts/instructions for the AI Agent
  • OutputExampleClass: (Optional) Output examples for better LLM performance

Example class definitions:

from pydantic import BaseModel, Field
from typing import List
from prompt_processing import generate_prompt_template, generate_prompt_from_schema, PromptSchema

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")
    hobbies: List[str] = Field(..., description="A list of hobbies.")
    address: dict = Field(..., description="The address of the person.")

class AIAgentClass(BaseModel):
    information: str = Field(..., description="Information for the AI agent.")
    instruction: str = Field(..., description="Instructions for the AI agent.")
    condition: str = Field(..., description="Conditions for the AI agent.")

class OutputExampleClass(BaseModel):
    """Optional: Provide a sample output format"""


extraction_template = generate_prompt_template(ExtractionClass)
ai_agent_template = generate_prompt_template(AIAgentClass)
output_example_template = generate_prompt_template(OutputExampleClass)

schema = PromptSchema(
    ai_agent_information=ai_agent_template,
    extract_fields=extraction_template,
    output_example=output_example_template,
)

prompt = generate_prompt_from_schema(schema.model_dump())

The description in each field definition acts as the prompt for that field during extraction.
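To see why the descriptions matter, here is a small illustration of the underlying idea: collecting each field's description into a prompt via Pydantic v2's model_fields. This is not the module's actual generate_prompt_template, just a sketch of the mechanism:

```python
from pydantic import BaseModel, Field

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")

def fields_to_prompt(model_cls) -> str:
    # Each field's description becomes the extraction instruction for that field.
    lines = [f"- {name}: {info.description}" for name, info in model_cls.model_fields.items()]
    return "Extract the following fields:\n" + "\n".join(lines)

print(fields_to_prompt(ExtractionClass))
```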

📂 Usage

Extracting from multiple PDFs:

  • document_paths: an array of paths to the input PDFs
  • output_paths: the paths where the extracted JSON files should be saved
  • parser: set to True to digitise the entire document; otherwise only the schema fields are extracted from the PDF
  • combine_pages: use this when extracting from small PDFs (2-3 pages) whose pages you want to extract together

from extraction import extract_multiple_pdfs

document_paths = ["./152_double.pdf", "./152.pdf"]
output_paths = ["./152_extracted.json", "./152_double_extracted.json"]

multi_pdf_extraction_result = await extract_multiple_pdfs(
   input_paths=document_paths,
   output_paths=output_paths,
   parser=False,
   combine_pages=False,
   prompt=prompt,
   extraction_schema_class=ExtractionClass
)
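extract_multiple_pdfs is a coroutine, so in a plain script (outside an already-running event loop) it must be driven with asyncio. A minimal sketch, with a stub coroutine standing in for the real call:

```python
import asyncio

async def run_extraction() -> list:
    # Stand-in for: await extract_multiple_pdfs(...)
    # Replace the body with the real call from the example above.
    await asyncio.sleep(0)  # placeholder for the awaited extraction
    return [{"status": "ok"}]

results = asyncio.run(run_extraction())
print(results)
```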

Extraction from single PDF:

from extraction import extract_multiple_pages_async
from file_process import base_64_conversion

pdf_path = "./example.pdf"
base64_images = base_64_conversion(input_type="PDF", file_path=pdf_path)

single_pdf_extraction_result = await extract_multiple_pages_async(
    input_type="PDF",
    base64_images=base64_images,
    text_inputs=[],
    combine_pages=False,
    prompt=prompt,
    extraction_schema_class=ExtractionClass
)

As of now, combine_pages is disabled due to context-window length challenges.
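The base_64_conversion helper presumably renders each PDF page to an image and base64-encodes it; the encoding step itself is just the standard library's base64 module. A stand-alone illustration of that step only (page rendering omitted; the function name here is illustrative, not the module's):

```python
import base64

def encode_bytes_for_llm(data: bytes) -> str:
    # Vision-model APIs typically expect base64-encoded image bytes as ASCII text.
    return base64.b64encode(data).decode("ascii")

fake_page_image = b"\x89PNG...page bytes..."
encoded = encode_bytes_for_llm(fake_page_image)
```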

📁 Output

The extracted information is saved as JSON at the corresponding path in output_paths. The format exactly matches the fields defined in your ExtractionClass.
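Because the output mirrors ExtractionClass, the saved JSON can be round-tripped through the same model for validation. A sketch, assuming Pydantic v2 and an inline stand-in for the extracted file:

```python
import json
from pydantic import BaseModel, Field

class ExtractionClass(BaseModel):
    name: str = Field(..., description="The name of the person.")
    age: int = Field(..., description="The age of the person.")

# In practice this string would be read from the JSON file written to output_paths.
raw = '{"name": "Jane Doe", "age": 42}'
record = ExtractionClass.model_validate(json.loads(raw))
print(record.name, record.age)
```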

💡 Tips

  • Keep prompts in AIAgentClass clear and detailed for better LLM results.
  • Test your ExtractionClass with a few sample documents before scaling.
  • Make use of OutputExampleClass to improve LLM consistency.
