Using GPT to parse PDF

gptpdf

Uses a VLLM (a visual large language model such as GPT-4o) to parse PDFs into Markdown.

The approach is very simple (only 293 lines of code), yet it parses typography, math formulas, tables, pictures, charts, and more almost perfectly.

Average cost per page: $0.013
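As a rough illustration of what that per-page figure implies, the sketch below multiplies it by a page count. The function name `estimate_cost` is hypothetical and not part of gptpdf, and real costs will likely vary with page content and model pricing.

```python
from decimal import Decimal

# Average per-page cost quoted above, in US dollars.
COST_PER_PAGE = Decimal("0.013")


def estimate_cost(pages: int) -> Decimal:
    """Return a rough cost estimate in dollars for parsing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Decimal avoids binary floating-point rounding in the product.
    return COST_PER_PAGE * pages


if __name__ == "__main__":
    print(f"100 pages: about ${estimate_cost(100):.2f}")
```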

This package uses the GeneralAgent library to interact with the OpenAI API.

pdfgpt-ui is a visual tool based on gptpdf.

Process steps

1. Use the PyMuPDF library to parse the PDF, find all non-text areas, and mark them.

2. Use a large visual model (such as GPT-4o) to parse the marked pages and produce a Markdown file.
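The first step can be sketched roughly as follows. This is a minimal illustration, not gptpdf's actual implementation: it assumes that "non-text areas" means embedded images only, and the helper names `label_for` and `mark_images`, the red outline, and the 2x render scale are all choices made for this example. It relies on PyMuPDF's `get_image_info`, `draw_rect`, `insert_text`, and `get_pixmap`.

```python
from typing import List


def label_for(page_index: int, rect_index: int) -> str:
    """Hypothetical naming scheme for a marked area, e.g. '0_1.png'."""
    return f"{page_index}_{rect_index}.png"


def mark_images(pdf_path: str, page_index: int, out_png: str) -> List[str]:
    """Outline every embedded image on one page and save the page as a PNG.

    Returns the labels drawn on the page, in the order they were marked.
    """
    import fitz  # PyMuPDF; imported here so label_for works without it

    labels = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        for i, info in enumerate(page.get_image_info()):
            rect = fitz.Rect(info["bbox"])
            label = label_for(page_index, i)
            # Red outline around the image area.
            page.draw_rect(rect, color=(1, 0, 0), width=1)
            # Label just inside the top-left corner of the area.
            page.insert_text((rect.x0 + 2, rect.y0 + 10), label,
                             fontsize=9, color=(1, 0, 0))
            labels.append(label)
        # Render the marked page at 2x scale and write it to disk.
        page.get_pixmap(matrix=fitz.Matrix(2, 2)).save(out_png)
    return labels
```

The saved PNG is the kind of marked page image that the second step would hand to the visual model.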

DEMO

See examples/attention_is_all_you_need/output.md for the parsed output of examples/attention_is_all_you_need.pdf.
See examples/rh/output.md for the parsed output of examples/rh.pdf.

Installation

pip install gptpdf

Usage

Local Usage

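from gptpdf import parse_pdf

api_key = 'Your OpenAI API Key'
pdf_path = 'examples/attention_is_all_you_need.pdf'  # path to the PDF to parse
# Returns the Markdown text and the paths of the extracted images
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)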

See more in test/test.py
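When several PDFs need parsing, one possible pattern is to give each its own output directory so the generated Markdown and images do not overwrite each other. The sketch below assumes only the `parse_pdf` parameters documented in the API section; `output_dir_for` and `parse_many` are hypothetical helper names, and the `./output` root is an arbitrary choice.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, List


def output_dir_for(pdf_path: str, root: str = "./output") -> str:
    """Map 'docs/paper.pdf' to '<root>/paper' so outputs do not collide."""
    return str(Path(root) / Path(pdf_path).stem)


def parse_many(pdf_paths: Iterable[str], api_key: str) -> Dict[str, List[str]]:
    """Parse each PDF into its own directory; return image paths per PDF."""
    from gptpdf import parse_pdf  # imported lazily; requires `pip install gptpdf`

    results = {}
    for pdf_path in pdf_paths:
        out_dir = output_dir_for(pdf_path)
        # Create the directory up front in case parse_pdf does not.
        os.makedirs(out_dir, exist_ok=True)
        content, image_paths = parse_pdf(
            pdf_path, output_dir=out_dir, api_key=api_key
        )
        results[pdf_path] = image_paths
    return results
```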

Google Colab

See examples/gptpdf_Quick_Tour.ipynb

API

parse_pdf

Function:

def parse_pdf(
        pdf_path: str,
        output_dir: str = './',
        api_key = None,
        base_url = None,
        model = 'gpt-4o',
        gpt_worker: int = 1,
        prompt = DEFAULT_PROMPT,
        rect_prompt = DEFAULT_RECT_PROMPT,
        role_prompt = DEFAULT_ROLE_PROMPT,
) -> Tuple[str, List[str]]:

Parses a PDF file into a Markdown file and returns the Markdown content along with all image paths.

Parameters:

pdf_path: str
Path to the PDF file
output_dir: str, default: './'
Output directory to store all images and the Markdown file
api_key: str
OpenAI API key. If not provided through this parameter, it must be set via the OPENAI_API_KEY environment variable.
base_url: str, optional
OpenAI base URL. If not provided through this parameter, it can be set via the OPENAI_BASE_URL environment variable. Use it to point at a custom OpenAI-compatible API endpoint.
model: str, default: 'gpt-4o'
Multimodal large model served through an OpenAI-format API. To use a different model, pass its name here.
gpt_worker: int, default: 1
Number of GPT parsing worker threads. On a more capable machine, you can increase this value to speed up parsing.
prompt: str, default: uses built-in prompt
Custom main prompt used to guide the model on how to process and convert text content in images.
rect_prompt: str, default: uses built-in prompt
Custom rectangle area prompt used to handle cases where specific areas (such as tables or images) are marked in the image.
role_prompt: str, default: uses built-in prompt
Custom role prompt that defines the role of the model to ensure it understands it is performing a PDF document parsing task.

You can customize these prompts to adapt to different models or specific needs, for example:

content, image_paths = parse_pdf(
    pdf_path=pdf_path,
    output_dir='./output',
    model="gpt-4o",
    prompt="Custom main prompt",
    rect_prompt="Custom rectangle area prompt",
    role_prompt="Custom role prompt",
)
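The endpoint, model, and worker parameters can be collected in a similar way. The sketch below is one possible approach, not part of gptpdf: `build_parse_kwargs` is a hypothetical helper that reads the OPENAI_API_KEY and OPENAI_BASE_URL environment variables described above and fails early if no key is available. The `env` argument exists only to make the helper easy to test.

```python
import os
from typing import Any, Dict, Optional


def build_parse_kwargs(model: str = "gpt-4o",
                       gpt_worker: int = 1,
                       base_url: Optional[str] = None,
                       env: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Collect keyword arguments for parse_pdf, failing early without a key."""
    env = dict(os.environ) if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before calling parse_pdf")
    if gpt_worker < 1:
        raise ValueError("gpt_worker must be at least 1")
    kwargs: Dict[str, Any] = {
        "api_key": api_key,
        "model": model,
        "gpt_worker": gpt_worker,
    }
    # Pass base_url only when one is configured; otherwise leave it unset.
    url = base_url or env.get("OPENAI_BASE_URL")
    if url:
        kwargs["base_url"] = url
    return kwargs


# Example use (requires `pip install gptpdf` and a valid key):
#   from gptpdf import parse_pdf
#   kwargs = build_parse_kwargs(gpt_worker=4)
#   content, image_paths = parse_pdf("examples/rh.pdf", output_dir="./output", **kwargs)
```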

Join Us 👏🏻

Scan the QR code below with WeChat to join our group chat or contribute.

Project details

Release history

0.1.1 (this version): Apr 18, 2025
0.1.0: Apr 18, 2025
0.0.15: Jul 24, 2024
0.0.14: Jul 17, 2024
0.0.13: Jul 10, 2024
0.0.12: Jul 9, 2024
0.0.11: Jul 9, 2024
0.0.10: Jul 8, 2024
0.0.9: Jul 5, 2024
0.0.8: Jul 3, 2024
0.0.7: Jul 2, 2024
0.0.6: Jul 2, 2024
0.0.5: Jul 1, 2024
0.0.4: Jul 1, 2024
0.0.3: Jul 1, 2024
0.0.2: Jun 29, 2024
0.0.1: Jun 28, 2024

Download files

Source Distribution

gptpdf-0.1.1.tar.gz (7.3 kB)

Uploaded Apr 18, 2025 Source

File details

Details for the file gptpdf-0.1.1.tar.gz.

File metadata

Download URL: gptpdf-0.1.1.tar.gz
Upload date: Apr 18, 2025
Size: 7.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.9.12 Darwin/22.1.0

File hashes

Hashes for gptpdf-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`51f233ef28772739990ba0b2e0181c7ad786b32ddb74642c4dde20016b74ff07`
MD5	`41fd3f4ea9b3a083e3c189df287a5750`
BLAKE2b-256	`68c6db91903056c828781f929fac8704fd9da6158593e5fedebd9131b056fde5`
