
Using GPT to parse PDF


gptpdf


Uses a visual large language model (VLLM) such as GPT-4o to parse PDFs into markdown.

The approach is very simple (only 293 lines of code), yet it parses typography, math formulas, tables, pictures, charts, and similar elements almost perfectly.

Average cost per page: $0.013
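As a rough budgeting aid, the per-page figure above can be turned into a quick estimate. This is only a sketch: the function name is hypothetical, and actual cost will vary with page content and model pricing.

```python
AVERAGE_COST_PER_PAGE_USD = 0.013  # average figure quoted in this README


def estimate_cost(page_count: int,
                  cost_per_page: float = AVERAGE_COST_PER_PAGE_USD) -> float:
    """Return a rough cost estimate in USD for parsing `page_count` pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    # Round to 4 decimals to hide floating-point noise in the product.
    return round(page_count * cost_per_page, 4)


# A 15-page paper at the quoted average rate
print(estimate_cost(15))
```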

This package uses the GeneralAgent library to interact with the OpenAI API.

Process steps

  1. Use the PyMuPDF library to parse the PDF, then find and mark all non-text areas.

  2. Use a large visual model (such as GPT-4o) to parse the pages and produce a markdown file.
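To give a feel for what "marking non-text areas" can involve, here is a minimal, library-free sketch that merges overlapping bounding boxes into larger regions. It is not the package's own code: it assumes the regions have already been extracted as (x0, y0, x1, y1) tuples, and the function names and the `margin` parameter are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Rect, b: Rect, margin: float = 0.0) -> bool:
    """True if the two rectangles touch or overlap, allowing a gap of `margin`."""
    return (a[0] - margin <= b[2] and b[0] - margin <= a[2]
            and a[1] - margin <= b[3] and b[1] - margin <= a[3])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_regions(rects: List[Rect], margin: float = 0.0) -> List[Rect]:
    """Repeatedly merge overlapping rectangles until none overlap."""
    merged = list(rects)
    changed = True
    while changed:
        changed = False
        result: List[Rect] = []
        for rect in merged:
            for i, existing in enumerate(result):
                if overlaps(rect, existing, margin):
                    result[i] = union(rect, existing)
                    changed = True
                    break
            else:
                # No overlap with any region collected so far
                result.append(rect)
        merged = result
    return merged


regions = [(0, 0, 10, 10), (8, 8, 20, 20), (50, 50, 60, 60)]
print(merge_regions(regions))
```

Each merged rectangle could then be outlined on the rendered page image before the page is sent to the vision model.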

DEMO

See examples/attention_is_all_you_need/output.md for PDF examples/attention_is_all_you_need.pdf.

Installation

pip install gptpdf

Usage

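from gptpdf import parse_pdf

api_key = 'Your OpenAI API Key'
pdf_path = 'examples/attention_is_all_you_need.pdf'  # path to the PDF to parse
# Returns the markdown content and the paths of all extracted images
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)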

See more in test/test.py
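When several PDFs need parsing, it may help to give each one its own output directory so images and markdown files do not collide. The sketch below uses only the documented pdf_path, output_dir, and api_key parameters; the helper name parse_many and its injectable `parse` argument are hypothetical and exist so the loop can be tried without calling the API.

```python
import os
from typing import Callable, Dict, Iterable, List, Optional, Tuple

ParseResult = Tuple[str, List[str]]  # (markdown content, image paths)


def parse_many(pdf_paths: Iterable[str],
               output_root: str,
               api_key: Optional[str] = None,
               parse: Optional[Callable[..., ParseResult]] = None
               ) -> Dict[str, ParseResult]:
    """Parse each PDF into its own sub-directory of `output_root`."""
    if parse is None:
        # Imported lazily so this helper can be exercised with a stand-in parser.
        from gptpdf import parse_pdf as parse
    results: Dict[str, ParseResult] = {}
    for pdf_path in pdf_paths:
        # e.g. "papers/foo.pdf" -> "<output_root>/foo"
        name = os.path.splitext(os.path.basename(pdf_path))[0]
        output_dir = os.path.join(output_root, name)
        os.makedirs(output_dir, exist_ok=True)
        results[pdf_path] = parse(pdf_path, output_dir=output_dir, api_key=api_key)
    return results
```

Calling parse_many(['examples/attention_is_all_you_need.pdf'], './out', api_key=api_key) would write that PDF's output under ./out/attention_is_all_you_need.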

API

parse_pdf(pdf_path, output_dir='./', api_key=None, base_url=None, model='gpt-4o', verbose=False)

Parses a PDF file into a markdown file and returns the markdown content and all image paths.

  • pdf_path: path to the PDF file.

  • output_dir: output directory, where all images and the markdown file are stored.

  • api_key: OpenAI API key (optional). If not provided, the OPENAI_API_KEY environment variable is used.

  • base_url: OpenAI base URL (optional). If not provided, the OPENAI_BASE_URL environment variable is used.

  • model: OpenAI vision model; the default is 'gpt-4o'. You can also use qwen-vl-max (not tested yet) or GLM-4V by changing OPENAI_BASE_URL or specifying base_url. To use Azure OpenAI, set base_url to https://xxxx.openai.azure.com/, api_key to your Azure API key, and model to 'azure_xxxx', where xxxx is the deployed model name (not the OpenAI model name).

  • verbose: enable verbose mode.

  • gpt_worker: number of GPT parsing workers; the default is 1. If your machine performs well, you can increase it to speed up parsing.
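The fallback rules for api_key and base_url, and the Azure naming convention for model, can be summarized in a small sketch. This mirrors the behavior described above rather than the package's internal code; both helper names and the `env` parameter are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_openai_settings(api_key: Optional[str] = None,
                            base_url: Optional[str] = None,
                            env: Optional[Mapping[str, str]] = None
                            ) -> Dict[str, Optional[str]]:
    """Explicit arguments win; otherwise fall back to environment variables."""
    env = os.environ if env is None else env
    return {
        "api_key": api_key or env.get("OPENAI_API_KEY"),
        "base_url": base_url or env.get("OPENAI_BASE_URL"),
    }


def azure_settings(resource: str, deployment: str, api_key: str) -> Dict[str, str]:
    """Build the keyword arguments described above for Azure OpenAI.

    `resource` is the xxxx in https://xxxx.openai.azure.com/ and
    `deployment` is the deployed model name (not the OpenAI model name).
    """
    return {
        "base_url": f"https://{resource}.openai.azure.com/",
        "api_key": api_key,
        "model": f"azure_{deployment}",
    }
```

The dictionary returned by azure_settings could be unpacked into the call, for example parse_pdf(pdf_path, **azure_settings('xxxx', 'my-deployment', 'Azure API Key')), where 'my-deployment' is a placeholder.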


Source Distribution

gptpdf-0.0.6.tar.gz (6.2 kB)
