
Use LLMs to parse PDFs and produce better chunks for retrieval

LLMDocParser

A package for parsing PDFs and analyzing their content using LLMs.

This package builds on the approach introduced by gptpdf.

Method

gptpdf uses PyMuPDF to parse PDFs, identifying both text and non-text regions. It then merges or filters the text regions based on certain rules, and inputs the final results into a multimodal model for parsing. This method is particularly effective.
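
To make the merge-and-filter step concrete, here is a minimal sketch of one possible rule set: rectangles that overlap (or sit within a small gap of each other) are merged into their union, and very small regions are dropped. The function names, the gap parameter, and the area threshold are all hypothetical and chosen for illustration; they are not gptpdf's actual rules.

```python
# Hypothetical helpers: not part of gptpdf or llmdocparser.
# Rectangles are assumed to be (x0, y0, x1, y1) tuples.

def overlaps(a, b, gap=0):
    """True if a and b intersect or lie within `gap` units of each other."""
    return not (
        a[2] + gap < b[0] or b[2] + gap < a[0]
        or a[3] + gap < b[1] or b[3] + gap < a[1]
    )

def union(a, b):
    """Smallest rectangle that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_rects(rects, gap=0, min_area=0):
    """Merge overlapping rectangles, then drop those smaller than min_area."""
    merged = []
    for rect in rects:
        rect = tuple(rect)
        absorbed = True
        while absorbed:
            absorbed = False
            for other in merged:
                if overlaps(rect, other, gap):
                    # Absorb the neighbour and re-check against the rest.
                    rect = union(rect, other)
                    merged.remove(other)
                    absorbed = True
                    break
        merged.append(rect)
    return [r for r in merged if (r[2] - r[0]) * (r[3] - r[1]) >= min_area]

rects = [(0, 0, 10, 10), (5, 5, 20, 20), (100, 100, 110, 110), (200, 200, 201, 201)]
result = merge_rects(rects, min_area=50)
```

In this sketch the first two rectangles collapse into one, the third is kept unchanged, and the tiny fourth one is filtered out by the area threshold.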

Based on this concept, I made some minor improvements.

Main Process

A layout analysis model parses each page of the PDF and identifies the type of each region: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, or Equation. It also returns the coordinates of each region.

Layout Analysis Result Example:

[{'header': ((101, 66, 436, 102), 0)},
 {'header': ((1038, 81, 1088, 95), 1)},
 {'title': ((106, 215, 947, 284), 2)},
 {'text': ((101, 319, 835, 390), 3)},
 {'text': ((100, 565, 579, 933), 4)},
 {'text': ((100, 967, 573, 1025), 5)},
 {'text': ((121, 1055, 276, 1091), 6)},
 {'reference': ((101, 1124, 562, 1429), 7)},
 {'text': ((610, 565, 1089, 930), 8)},
 {'text': ((613, 976, 1006, 1045), 9)},
 {'title': ((612, 1114, 726, 1129), 10)},
 {'text': ((611, 1165, 1089, 1431), 11)},
 {'title': ((1011, 1471, 1084, 1492), 12)}]

Each entry gives the type, coordinates, and reading order of one region. With this information, more precise rules can be defined for parsing the PDF.
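
As an illustration of such a rule, the sketch below takes the layout result shown above, drops page furniture, and returns the remaining regions in reading order. The helper names and the choice of which types to skip are assumptions made for this example, not the package's own logic.

```python
from collections import Counter

# Layout result copied from the example above:
# each item is {region_type: (coordinates, reading_order)}.
layout = [
    {'header': ((101, 66, 436, 102), 0)},
    {'header': ((1038, 81, 1088, 95), 1)},
    {'title': ((106, 215, 947, 284), 2)},
    {'text': ((101, 319, 835, 390), 3)},
    {'text': ((100, 565, 579, 933), 4)},
    {'text': ((100, 967, 573, 1025), 5)},
    {'text': ((121, 1055, 276, 1091), 6)},
    {'reference': ((101, 1124, 562, 1429), 7)},
    {'text': ((610, 565, 1089, 930), 8)},
    {'text': ((613, 976, 1006, 1045), 9)},
    {'title': ((612, 1114, 726, 1129), 10)},
    {'text': ((611, 1165, 1089, 1431), 11)},
    {'title': ((1011, 1471, 1084, 1492), 12)},
]

def flatten(layout):
    """Turn each single-key dict into a (region_type, box, order) tuple."""
    regions = []
    for item in layout:
        for region_type, (box, order) in item.items():
            regions.append((region_type, box, order))
    return regions

def body_regions(layout, skip=("header", "footer", "reference")):
    """Hypothetical rule: keep content regions only, sorted by reading order."""
    kept = [r for r in flatten(layout) if r[0] not in skip]
    return sorted(kept, key=lambda r: r[2])

regions = body_regions(layout)
counts = Counter(region_type for region_type, _, _ in regions)
```

Keeping the reading order as an explicit sort key means the same rule still works if the layout model returns regions in a different order.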

Finally, the image of each region is passed to a multimodal model, such as GPT-4o or Qwen-VL, which returns text blocks that are well suited to RAG solutions.

filepath                          type            page_no  filename                   content
output/page_1_title.png           Title           1        attention is all you need  [Text Block 1]
output/page_1_text.png            Text            1        attention is all you need  [Text Block 2]
output/page_2_figure.png          Figure          2        attention is all you need  [Text Block 3]
output/page_2_figure_caption.png  Figure caption  2        attention is all you need  [Text Block 4]
output/page_3_table.png           Table           3        attention is all you need  [Text Block 5]
output/page_3_table_caption.png   Table caption   3        attention is all you need  [Text Block 6]
output/page_1_header.png          Header          1        attention is all you need  [Text Block 7]
output/page_2_footer.png          Footer          2        attention is all you need  [Text Block 8]
output/page_3_reference.png       Reference       3        attention is all you need  [Text Block 9]
output/page_1_equation.png        Equation        1        attention is all you need  [Text Block 10]
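
Rows of this shape can be turned into retrieval chunks with a small post-processing step. The sketch below assumes the rows are available as dictionaries keyed by the column names above; the actual return type of the package is not described here. The to_chunks helper and the decision to drop headers and footers are hypothetical.

```python
# A few rows from the table above, written as plain dictionaries.
rows = [
    {"filepath": "output/page_1_title.png", "type": "Title", "page_no": 1,
     "filename": "attention is all you need", "content": "[Text Block 1]"},
    {"filepath": "output/page_1_text.png", "type": "Text", "page_no": 1,
     "filename": "attention is all you need", "content": "[Text Block 2]"},
    {"filepath": "output/page_1_header.png", "type": "Header", "page_no": 1,
     "filename": "attention is all you need", "content": "[Text Block 7]"},
    {"filepath": "output/page_2_footer.png", "type": "Footer", "page_no": 2,
     "filename": "attention is all you need", "content": "[Text Block 8]"},
]

def to_chunks(rows, drop_types=("Header", "Footer")):
    """Hypothetical helper: build chunk dicts with text and metadata."""
    chunks = []
    for row in rows:
        if row["type"] in drop_types:
            continue  # page furniture rarely helps retrieval
        chunks.append({
            "text": row["content"],
            "metadata": {
                "source": row["filename"],
                "page": row["page_no"],
                "type": row["type"],
                "image": row["filepath"],
            },
        })
    return chunks

chunks = to_chunks(rows)
```

Keeping the region type and source image path in the metadata lets a retriever filter by type or link back to the original region.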

For more details, see the main function in llm_parser.py.

Installation

pip install llmdocparser

Usage

from llmdocparser.llm_parser import get_image_content

content = get_image_content(
    llm_type="azure",
    pdf_path="path/to/your/pdf",
    output_dir="path/to/output/directory",
    max_concurrency=5,
    azure_deployment="azure-gpt-4o",
    azure_endpoint="your_azure_endpoint",
    api_key="your_api_key",
    api_version="your_api_version"
)
print(content)

Parameters

  • llm_type: str

    The options are azure, openai, dashscope.

  • pdf_path: str

    Path to the PDF file.

  • output_dir: str

    Output directory to store all parsed images.

  • max_concurrency: int

    Number of worker threads used for concurrent LLM parsing calls. For details on batch calling, see Batch Support.

When llm_type is azure, the azure_deployment and azure_endpoint parameters must also be passed; for the other options, only the API key is needed.
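
The rule above can be checked before any request is made. The sketch below is a hypothetical helper (build_parser_kwargs is not part of llmdocparser) that assembles keyword arguments using only the parameter names shown in the Usage example and raises early if the Azure-specific values are missing.

```python
ALLOWED_LLM_TYPES = ("azure", "openai", "dashscope")

def build_parser_kwargs(llm_type, pdf_path, output_dir, api_key,
                        max_concurrency=5, azure_deployment=None,
                        azure_endpoint=None, api_version=None):
    """Hypothetical helper: validate and assemble keyword arguments."""
    if llm_type not in ALLOWED_LLM_TYPES:
        raise ValueError(f"llm_type must be one of {ALLOWED_LLM_TYPES}")
    kwargs = {
        "llm_type": llm_type,
        "pdf_path": pdf_path,
        "output_dir": output_dir,
        "max_concurrency": max_concurrency,
        "api_key": api_key,
    }
    if llm_type == "azure":
        required = {
            "azure_deployment": azure_deployment,
            "azure_endpoint": azure_endpoint,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"Azure requires: {', '.join(missing)}")
        kwargs.update(required)
        if api_version:
            # Passed in the Usage example; treated as optional in this sketch.
            kwargs["api_version"] = api_version
    return kwargs

kwargs = build_parser_kwargs(
    llm_type="openai",
    pdf_path="path/to/your/pdf",
    output_dir="path/to/output/directory",
    api_key="your_api_key",
)
# The result could then be unpacked into the call from the Usage section:
# content = get_image_content(**kwargs)
```

Failing before the call avoids starting the layout analysis and worker threads with an incomplete configuration.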
