A Python package that utilizes GPT-4V and other tools to convert PDFs into Markdown files.

gpt_pdf_md

gpt_pdf_md is a Python package that leverages GPT-4V and other tools to convert PDF files into Markdown. On its own, GPT-4V has two limitations here. First, its API does not accept PDF documents. Second, when it is prompted to convert a page containing figures to Markdown, the figures are not converted correctly because the Markdown has no image URLs to point to. gpt_pdf_md works around both, and its output is coming close to the OCR quality of Mathpix.

Features

  • Extracts figures from PDF files using the pdffigures2 Scala library.
  • Converts PDF pages to images and uploads them to a Google Cloud Bucket.
  • Uses GPT-4V to generate Markdown content from the PDF and then inserts the image URLs into the Markdown.
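
The last step, inserting image URLs into the Markdown, can be illustrated with a minimal sketch. This is not the package's actual implementation: the helper names `public_gcs_url` and `append_figures` are hypothetical, and appending the figures at the end of the document is a simplification. The sketch assumes the objects sit in a publicly readable bucket, where Google Cloud Storage serves them at `https://storage.googleapis.com/<bucket>/<object>`.

```python
from urllib.parse import quote


def public_gcs_url(bucket: str, object_name: str) -> str:
    """Build the public URL of an object in a publicly readable bucket."""
    # Percent-encode the object name, but keep "/" so folder-like names stay intact.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


def append_figures(markdown: str, bucket: str, object_names: list[str]) -> str:
    """Append one Markdown image reference per uploaded figure (hypothetical helper)."""
    lines = [markdown.rstrip()]
    for index, name in enumerate(object_names, start=1):
        lines.append("")  # blank line so each image becomes its own paragraph
        lines.append(f"![Figure {index}]({public_gcs_url(bucket, name)})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(append_figures("# Title\n\nSome text.", "my-bucket", ["paper/fig 1.png"]))
```

Without URLs like these, a Markdown viewer has nothing to load for the figures, which is why the bucket has to be publicly readable.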

Additional Dependencies

This package requires the pdffigures2 Scala library to extract figures from PDF files, along with all of that library's own dependencies. See the pdffigures2 documentation for details. Setup can be quite a hassle: because parts of the library are written in Scala, you need the correct versions of Java and Scala installed. We are looking for a more straightforward way to extract images from a PDF. If you have any ideas, feel free to open an issue.
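
Before installing, it can help to confirm that the build tools are reachable. The sketch below is an illustration, not part of gpt_pdf_md: `missing_tools` is a hypothetical helper, and the default tool names (`java` and `sbt`) are an assumption based on pdffigures2 being built with sbt. It only checks that the executables are on `PATH`; it does not verify their versions.

```python
import shutil


def missing_tools(tools=("java", "sbt")):
    """Return the names of required executables that are not on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("java and sbt were found on PATH")
```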

Installation

Once you have pdffigures2 set up, you can install gpt_pdf_md via pip:

pip install gpt-pdf-md

Set the required environment variables in your .env file. Write each entry as NAME=value, without spaces or unnecessary quotes:

OPENAI_API_KEY=open_ai_key
GOOGLE_ID=google_project_id
GOOGLE_BUCKET=google_bucket_name

NOTE: This project requires a public Google Cloud Storage bucket. The images are uploaded there so that they can be rendered in the generated Markdown.
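
A missing or empty variable is easier to diagnose before any API call is made. The following is a minimal sketch of such a check. `missing_env_vars` is a hypothetical helper that is not provided by gpt_pdf_md; the three variable names are the ones listed above.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "GOOGLE_ID", "GOOGLE_BUCKET")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty."""
    # Accept any mapping so the check can be tried without touching os.environ.
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set")
```

Run it after `load_dotenv()` so that values from the .env file are already in the environment.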

Usage

To process a PDF and generate Markdown content, place your Python script in the same directory as the pdffigures2 folder; the example below builds both paths relative to the script's location. You can use gpt_pdf_md as follows:

import os

from dotenv import load_dotenv
from gpt_pdf_md.reader import process_pdf

# Load OPENAI_API_KEY, GOOGLE_ID, and GOOGLE_BUCKET from the .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_ID = os.getenv("GOOGLE_ID")
GOOGLE_BUCKET = os.getenv("GOOGLE_BUCKET")

# Directory that contains this script
base_dir = os.path.dirname(os.path.abspath(__file__))
# Absolute path to the PDF file
PDF = os.path.join(base_dir, "example.pdf")
# Absolute path to the pdffigures2 folder (the empty last part keeps the trailing separator)
PDFFIGURES2_PATH = os.path.join(base_dir, "pdffigures2", "")

process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)

This processes the specified PDF and writes a Markdown file with the extracted information to the same directory. For an example, see output.md, which is the result of converting example.pdf with the example.py script.
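
Because the paths are derived from the script's location, a misplaced PDF or pdffigures2 folder is an easy mistake to make. As a minimal sketch, the paths can be validated before process_pdf is called. `check_inputs` is a hypothetical helper and not part of gpt_pdf_md; it only inspects the file system and does not call the package.

```python
from pathlib import Path


def check_inputs(pdf_path, pdffigures2_path):
    """Raise an error early if the PDF or the pdffigures2 folder is missing."""
    pdf = Path(pdf_path)
    figures_dir = Path(pdffigures2_path)
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf}")
    if pdf.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {pdf.name}")
    if not figures_dir.is_dir():
        raise FileNotFoundError(f"pdffigures2 folder not found: {figures_dir}")
```

Calling `check_inputs(PDF, PDFFIGURES2_PATH)` just before `process_pdf(...)` turns a wrong directory layout into an immediate, readable error.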

Next Steps

  • Try Rust vortex for PDF image extraction
  • Use GPT-4 128k for final formatting of Markdown
  • Create a clearer README to make it easier for everyone to use the Python package
  • Improve error handling

Contributing & Support

We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.

License

This project is licensed under the terms of the MIT License.
