
Using GPT to parse PDF


gptpdf


Uses a vision-capable large language model (such as GPT-4o) to parse PDFs into Markdown.

Our approach is very simple (only 293 lines of code), yet it parses typography, math formulas, tables, pictures, charts, and more almost perfectly.

Average cost per page: $0.013
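As a rough budgeting aid, the per-page figure above can be scaled to a whole document. This is only a minimal sketch: estimate_cost is a hypothetical helper, not part of gptpdf, and the real cost will vary with the model, the page content, and current API pricing.

```python
def estimate_cost(num_pages: int, cost_per_page: float = 0.013) -> float:
    """Estimate the parsing cost in USD from the average per-page cost quoted above."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * cost_per_page


# Example: a 15-page paper at the quoted average.
print(f"${estimate_cost(15):.3f}")
```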

This package uses the GeneralAgent library to interact with the OpenAI API.

pdfgpt-ui is a visual tool based on gptpdf.

Process steps

  1. Use the PyMuPDF library to parse the PDF, find all non-text areas, and mark them.

  2. Use a large visual model (such as GPT-4o) to parse the marked pages and produce a Markdown file.
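The first step might look roughly like the sketch below. It is not gptpdf's actual code: the function names are hypothetical, and merging overlapping regions before drawing is an assumption about how the marking could be done. It collects image and vector-drawing bounding boxes with PyMuPDF, merges the ones that overlap, outlines them in red, and saves the page as a PNG that can be sent to the model.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Box, b: Box) -> bool:
    """Return True if two boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_boxes(boxes: List[Box]) -> List[Box]:
    """Repeatedly union overlapping boxes so each region is marked once."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def mark_non_text_areas(pdf_path: str, page_number: int, image_path: str) -> List[Box]:
    """Outline images and vector drawings on one page, then save the page as a PNG."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    # Bounding boxes of embedded images and of vector drawings (charts, table rules, ...).
    boxes = [tuple(info["bbox"]) for info in page.get_image_info()]
    boxes += [tuple(d["rect"]) for d in page.get_drawings()]
    regions = merge_boxes(boxes)
    for box in regions:
        page.draw_rect(fitz.Rect(box), color=(1, 0, 0), width=1)
    page.get_pixmap(dpi=150).save(image_path)
    doc.close()
    return regions
```

Only mark_non_text_areas needs PyMuPDF installed; the box helpers are plain Python and can be tried on their own.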

DEMO

  1. examples/attention_is_all_you_need/output.md for PDF examples/attention_is_all_you_need.pdf.

  2. examples/rh/output.md for PDF examples/rh.pdf.

Installation

pip install gptpdf

Usage

Local Usage

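from gptpdf import parse_pdf
api_key = 'Your OpenAI API Key'
# Path to the PDF to parse (here, the example file from the DEMO section)
pdf_path = 'examples/attention_is_all_you_need.pdf'
# Returns the Markdown text and the paths of the extracted images
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)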

See more in test/test.py
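If you would rather not hard-code the key, the API section notes that it can come from the OPENAI_API_KEY environment variable. The sketch below shows one way to make that fallback explicit; resolve_api_key and parse_with_env_key are hypothetical helpers, not part of gptpdf.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise fall back to OPENAI_API_KEY."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Pass api_key or set the OPENAI_API_KEY environment variable")
    return key


def parse_with_env_key(pdf_path: str, output_dir: str = "./"):
    """Call parse_pdf with a key taken from the environment."""
    from gptpdf import parse_pdf  # imported lazily; requires `pip install gptpdf`

    return parse_pdf(pdf_path, output_dir=output_dir, api_key=resolve_api_key())
```

Failing early with a clear message avoids starting a parse run that would only fail on the first API call.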

Google Colab

See examples/gptpdf_Quick_Tour.ipynb

API

parse_pdf

Function:

def parse_pdf(
        pdf_path: str,
        output_dir: str = './',
        api_key = None,
        base_url = None,
        model = 'gpt-4o',
        gpt_worker: int = 1,
        prompt = DEFAULT_PROMPT,
        rect_prompt = DEFAULT_RECT_PROMPT,
        role_prompt = DEFAULT_ROLE_PROMPT,
) -> Tuple[str, List[str]]:

Parses a PDF file into a Markdown file and returns the Markdown content along with all image paths.

Parameters:

  • pdf_path: str
    Path to the PDF file

  • output_dir: str, default: './'
    Output directory to store all images and the Markdown file

  • api_key: str
    OpenAI API key. If not provided through this parameter, it must be set via the OPENAI_API_KEY environment variable.

  • base_url: str, optional
    OpenAI base URL. If not provided through this parameter, it can be set via the OPENAI_BASE_URL environment variable. Use it to point to a custom OpenAI-compatible API endpoint.

  • model: str, default: 'gpt-4o'
    Multimodal large model served through an OpenAI-compatible API. Set this parameter if you need to use a model other than the default.

  • gpt_worker: int, default: 1
    Number of GPT parsing worker threads. On a more capable machine, you can increase this value to speed up parsing.

  • prompt: str, default: uses built-in prompt
    Custom main prompt used to guide the model on how to process and convert text content in images.

  • rect_prompt: str, default: uses built-in prompt
    Custom rectangle area prompt used to handle cases where specific areas (such as tables or images) are marked in the image.

  • role_prompt: str, default: uses built-in prompt
    Custom role prompt that defines the role of the model to ensure it understands it is performing a PDF document parsing task.

    You can customize these prompts to adapt to different models or specific needs, for example:

    content, image_paths = parse_pdf(
        pdf_path=pdf_path,
        output_dir='./output',
        model="gpt-4o",
        prompt="Custom main prompt",
        rect_prompt="Custom rectangle area prompt",
        role_prompt="Custom role prompt",
    )
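To illustrate what the gpt_worker parameter controls, here is a minimal sketch of parsing pages with a pool of worker threads. It is an assumption about the general technique, not gptpdf's implementation: parse_pages and fake_parse_page are hypothetical names, and fake_parse_page stands in for the real model call so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parse_pages(page_images: List[str],
                parse_page: Callable[[str], str],
                gpt_worker: int = 1) -> str:
    """Parse page images concurrently and join the results in page order."""
    with ThreadPoolExecutor(max_workers=gpt_worker) as pool:
        # Executor.map yields results in input order, even if later pages finish first.
        pages = list(pool.map(parse_page, page_images))
    return "\n\n".join(pages)


# Stand-in for the model call (hypothetical).
def fake_parse_page(image_path: str) -> str:
    return f"# Content of {image_path}"


print(parse_pages(["0.png", "1.png", "2.png"], fake_parse_page, gpt_worker=2))
```

Because each page is an independent API request, threads spend most of their time waiting on the network, which is why adding workers can shorten the total run.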
    

Join Us 👏🏻

Scan the QR code below with WeChat to join our group chat or contribute.



