Using GPT to parse PDF
gptpdf
Uses a vision large language model (VLLM), such as GPT-4o, to parse PDFs into Markdown.
Our approach is very simple (only 293 lines of code), yet it parses typography, math formulas, tables, pictures, charts, etc. almost perfectly.
Average cost per page: $0.013
This package uses the GeneralAgent library to interact with the OpenAI API.
pdfgpt-ui is a visual tool based on gptpdf.
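The average per-page cost quoted above gives a quick way to budget a parsing job before running it. The following is a minimal sketch; `estimate_cost` is a hypothetical helper and not part of gptpdf, and the real cost will vary with page content and model pricing.

```python
# Average cost per page quoted in this README (USD).
COST_PER_PAGE_USD = 0.013


def estimate_cost(num_pages: int, cost_per_page: float = COST_PER_PAGE_USD) -> float:
    """Return a rough cost estimate in USD for parsing `num_pages` pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    # Round to cents; this is an estimate, not a billing figure.
    return round(num_pages * cost_per_page, 2)


print(estimate_cost(100))  # 1.3
```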
Process steps
- Use the PyMuPDF library to parse the PDF, find all non-text areas, and mark them.
- Use a large visual model (such as GPT-4o) to parse and get a markdown file.
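The first step above can be sketched roughly as follows. This is an illustration under assumptions, not the code gptpdf actually ships: it treats embedded images and vector drawings as the "non-text areas", and the helper names `find_non_text_rects` and `mark_page` are hypothetical. It relies on the PyMuPDF calls `Page.get_image_info()`, `Page.get_drawings()`, `Page.draw_rect()`, and `Page.get_pixmap()`.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def find_non_text_rects(page) -> List[Rect]:
    """Collect bounding boxes of images and vector drawings on a PyMuPDF page."""
    rects: List[Rect] = []
    for info in page.get_image_info():   # embedded raster images
        rects.append(tuple(info["bbox"]))
    for drawing in page.get_drawings():  # vector graphics such as lines and shapes
        rects.append(tuple(drawing["rect"]))
    return rects


def mark_page(pdf_path: str, page_index: int, out_png: str) -> List[Rect]:
    """Outline each non-text area in red and save the page as a PNG."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no hard dependency

    doc = fitz.open(pdf_path)
    page = doc[page_index]
    rects = find_non_text_rects(page)
    for rect in rects:
        page.draw_rect(fitz.Rect(*rect), color=(1, 0, 0), width=1)
    # Render at 2x zoom so small text stays legible for the vision model.
    page.get_pixmap(matrix=fitz.Matrix(2, 2)).save(out_png)
    doc.close()
    return rects
```

The marked page image is what would then be passed to the vision model in the second step.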
DEMO
- examples/attention_is_all_you_need/output.md for PDF examples/attention_is_all_you_need.pdf.
- examples/rh/output.md for PDF examples/rh.pdf.
Installation
pip install gptpdf
Usage
Local Usage
from gptpdf import parse_pdf
api_key = 'Your OpenAI API Key'
pdf_path = 'examples/attention_is_all_you_need.pdf'  # path to any local PDF
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)
See more in test/test.py
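Instead of hard-coding the key, it can be read from the `OPENAI_API_KEY` environment variable, which the API section names as the fallback. A minimal sketch, assuming a hypothetical helper `resolve_api_key` that is not part of gptpdf:

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise fall back to OPENAI_API_KEY."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("Pass api_key or set the OPENAI_API_KEY environment variable")
    return key
```

The result can then be passed along as `parse_pdf(pdf_path, api_key=resolve_api_key())`.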
Google Colab
See examples/gptpdf_Quick_Tour.ipynb
API
parse_pdf
Function:
def parse_pdf(
    pdf_path: str,
    output_dir: str = './',
    api_key = None,
    base_url = None,
    model = 'gpt-4o',
    gpt_worker: int = 1,
    prompt = DEFAULT_PROMPT,
    rect_prompt = DEFAULT_RECT_PROMPT,
    role_prompt = DEFAULT_ROLE_PROMPT,
) -> Tuple[str, List[str]]:
Parses a PDF file into a Markdown file and returns the Markdown content along with all image paths.
Parameters:
- pdf_path: str
  Path to the PDF file.
- output_dir: str, default: './'
  Output directory in which all images and the Markdown file are stored.
- api_key: str
  OpenAI API key. If not provided through this parameter, it must be set via the OPENAI_API_KEY environment variable.
- base_url: str, optional
  OpenAI base URL. If not provided through this parameter, it can be set via the OPENAI_BASE_URL environment variable. Use it to configure a custom OpenAI API endpoint.
- model: str, default: 'gpt-4o'
  Multimodal large model that follows the OpenAI API format. Set this parameter if you need to use another model.
- gpt_worker: int, default: 1
  Number of GPT parsing worker threads. If your machine has better performance, you can increase this value to speed up parsing.
- prompt: str, default: uses built-in prompt
  Custom main prompt that guides the model on how to process and convert the text content in images.
- rect_prompt: str, default: uses built-in prompt
  Custom rectangle area prompt for cases where specific areas (such as tables or images) are marked in the image.
- role_prompt: str, default: uses built-in prompt
  Custom role prompt that defines the model's role so that it understands it is performing a PDF document parsing task.

You can customize these prompts to adapt to different models or specific needs, for example:
content, image_paths = parse_pdf(
    pdf_path=pdf_path,
    output_dir='./output',
    model="gpt-4o",
    prompt="Custom main prompt",
    rect_prompt="Custom rectangle area prompt",
    role_prompt="Custom role prompt",
    verbose=False,
)
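To illustrate what the `gpt_worker` parameter controls, the sketch below fans per-page work out over a thread pool and keeps the results in page order. It is an assumption about how such a worker setting could behave, not gptpdf's actual implementation; `parse_pages` and `parse_one` are hypothetical names, and `parse_one` stands in for the call that sends one page image to the model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parse_pages(
    page_images: List[str],
    parse_one: Callable[[str], str],
    gpt_worker: int = 1,
) -> List[str]:
    """Run `parse_one` on every page image using `gpt_worker` threads."""
    with ThreadPoolExecutor(max_workers=gpt_worker) as pool:
        # Executor.map yields results in input order, so pages stay in sequence.
        return list(pool.map(parse_one, page_images))


# Stand-in for a model call: upper-case the page name.
print(parse_pages(["p1", "p2", "p3"], str.upper, gpt_worker=2))  # ['P1', 'P2', 'P3']
```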
Join Us 👏🏻
Scan the QR code below with WeChat to join our group chat or contribute.
Source Distribution
File details
Details for the file gptpdf-0.1.1.tar.gz.
File metadata
- Download URL: gptpdf-0.1.1.tar.gz
- Upload date:
- Size: 7.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.9.12 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 51f233ef28772739990ba0b2e0181c7ad786b32ddb74642c4dde20016b74ff07 |
| MD5 | 41fd3f4ea9b3a083e3c189df287a5750 |
| BLAKE2b-256 | 68c6db91903056c828781f929fac8704fd9da6158593e5fedebd9131b056fde5 |