Python library to abbreviate a PDF file to GPT 8k prompt length

PDFtoPrompt

Existing libraries for using GPT-4 to extract information from a PDF file typically combine GPT-4 with word searching, indexing, and segmentation. Those strategies work reasonably well, but they have one significant limitation: they deprive the LLM of "big picture" context.

PDFtoPrompt takes a different strategy. Inspired by Twitter user @gfodor's experiments with text compression, it uses GPT-4 to compress or distill a PDF file's entire informational content to below the length limit of a single ChatGPT prompt.

It achieves this by first calculating what compression factor is needed to get the text to the right length, then segmenting the PDF file and asking GPT-4 to compress each segment, and finally stitching the compressed segments back together. You should then be able to fit the full compressed text into a single ChatGPT prompt, with some room left over to ask a question.
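The three steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: every function name, the word-based token estimate, the 6,000-token target, and the segment size are assumptions made for the example, and compress_segment is a stand-in that simply truncates text instead of calling GPT-4.

```python
import math
from typing import List

# Assumed budget: comfortably below 8k tokens, leaving room for a question.
TARGET_TOKENS = 6000


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 0.75 words per token)."""
    return math.ceil(len(text.split()) / 0.75)


def compression_factor(text: str, target_tokens: int = TARGET_TOKENS) -> float:
    """Step 1: how many times shorter the text must become to fit the target."""
    return max(1.0, estimate_tokens(text) / target_tokens)


def split_into_segments(text: str, words_per_segment: int = 1500) -> List[str]:
    """Step 2: cut the text into fixed-size chunks of words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def compress_segment(segment: str, factor: float) -> str:
    """Stand-in for the GPT-4 call: keeps roughly the first 1/factor of the words."""
    words = segment.split()
    keep = max(1, math.floor(len(words) / factor))
    return " ".join(words[:keep])


def compress_text(text: str, target_tokens: int = TARGET_TOKENS) -> str:
    """Step 3: compress each segment by the same factor and stitch the results."""
    factor = compression_factor(text, target_tokens)
    segments = split_into_segments(text)
    return "\n".join(compress_segment(s, factor) for s in segments)
```

In a real pipeline, compress_segment would send the segment to GPT-4 along with the desired factor. Because the model is not guaranteed to hit that factor exactly, it is worth re-checking the stitched length before using it in a prompt.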

The process is, as Twitter user @gfodor notes, pretty "lossy," especially for longer texts. This tool may be best used in combination with tools that rely on other strategies.

Installation

  1. Install with pip:
pip install pdftoprompt

Usage

First, set your OpenAI API key (it must have GPT-4 access) with the set_openai_api_key function:

from pdftoprompt import set_openai_api_key

set_openai_api_key()

This function either takes your API key as a string argument or looks in the .env file in the current working directory to see if you have an OPENAI_API_KEY variable stored there. I recommend saving your API key in the .env file for your project so you can share your code without worrying about key security. If you're uploading code to GitHub, make sure to add .env to .gitignore.
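The lookup order described above (explicit argument first, then the .env file) might look something like the sketch below. It is not the library's code: resolve_api_key and read_env_file are hypothetical helpers written with the standard library only, and the simple KEY=VALUE parsing is an assumption about how the .env file is laid out.

```python
from pathlib import Path
from typing import Dict, Optional


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file; returns {} if it is missing."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(api_key: Optional[str] = None, env_path: str = ".env") -> str:
    """Prefer the explicit argument, then fall back to OPENAI_API_KEY in .env."""
    if api_key:
        return api_key
    key = read_env_file(env_path).get("OPENAI_API_KEY")
    if not key:
        raise ValueError("No API key given and OPENAI_API_KEY not found in " + env_path)
    return key
```

With this layout, a .env file containing a single OPENAI_API_KEY=... line is enough, and the key never has to appear in the code you share.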

Next, import the compress_pdf function from the pdftoprompt library and call it with the PDF URL or file path:

from pdftoprompt import compress_pdf

file_path = "path/to/your/pdf/file.pdf"
use_ocr = True  # Set to True if you want to use OCR

compressed_text = compress_pdf(file_path, use_ocr)
print(compressed_text)

In theory, you can enable OCR by passing the optional use_ocr argument (default is False).

To use OCR, install Tesseract OCR and add it to your system path.
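Before setting use_ocr to True, it can help to confirm that Tesseract is actually reachable from Python. The check below is a small sketch using only the standard library. The helper name tesseract_available is made up for this example, and it assumes the executable is named tesseract on your system.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on the system path."""
    return shutil.which("tesseract") is not None


# Only request OCR when the Tesseract binary can be located.
use_ocr = tesseract_available()
print("Tesseract found" if use_ocr else "Tesseract not found; OCR will be skipped")
```

The resulting use_ocr value can then be passed to compress_pdf in place of a hard-coded True.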

  1. Set an environment variable GPT_API_KEY with your OpenAI API key.

Contributing

If you'd like to contribute to this library, please submit a pull request on GitHub. We welcome any improvements, bug fixes, or new features.

License

This library is released under the MIT License.
