Python library to abbreviate a PDF file to GPT 8k prompt length
PDFtoPrompt
Existing libraries for using GPT-4 to extract information from a PDF file typically combine GPT-4 with word searching, indexing, and segmentation. Those strategies work reasonably well, but they have one significant limitation: they deprive the LLM of "big picture" context.
PDFtoPrompt takes a different strategy. Inspired by Twitter user @gfodor's experiments with text compression, it uses GPT-4 to compress or distill a PDF file's entire informational content to below the length limit of a single ChatGPT prompt.
It achieves this by first calculating what compression factor is needed to get the text to the right length, then segmenting the PDF file and asking GPT-4 to compress each segment, and finally stitching the compressed segments back together. You should then be able to fit the full compressed text into a single ChatGPT prompt, with some room left over to ask a question.
The process is, as Twitter user @gfodor notes, pretty "lossy," especially for longer texts. This tool may be best used alongside tools that rely on other strategies.
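The three steps described above (compute a compression factor, compress each segment, stitch the results) can be sketched roughly as follows. This is a minimal illustration, not PDFtoPrompt's actual implementation: every function name here is hypothetical, the 4-characters-per-token estimate is an assumed heuristic, and a simple truncation function stands in for the GPT-4 call so the sketch runs offline.

```python
import math
from typing import Callable, List

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def split_into_segments(text: str, max_segment_chars: int) -> List[str]:
    """Pack whitespace-separated words into segments under a character budget."""
    segments: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        added = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + added > max_segment_chars:
            segments.append(" ".join(current))
            current, current_len = [], 0
            added = len(word)
        current.append(word)
        current_len += added
    if current:
        segments.append(" ".join(current))
    return segments


def compress_text(
    text: str,
    target_tokens: int,
    compress_segment: Callable[[str, float], str],
    segment_tokens: int = 2000,
) -> str:
    """Compress text to roughly target_tokens, one segment at a time."""
    total_tokens = estimate_tokens(text)
    if total_tokens <= target_tokens:
        return text  # already short enough, nothing to do
    # Step 1: fraction of the original length each segment should keep.
    factor = target_tokens / total_tokens
    # Step 2: segment the text and compress each piece independently.
    segments = split_into_segments(text, segment_tokens * CHARS_PER_TOKEN)
    compressed = [compress_segment(segment, factor) for segment in segments]
    # Step 3: stitch the compressed segments back together.
    return "\n".join(compressed)


def truncate_stand_in(segment: str, factor: float) -> str:
    """Stand-in for the GPT-4 request: keep only the leading fraction."""
    return segment[: int(len(segment) * factor)]


if __name__ == "__main__":
    long_text = "word " * 10000
    short = compress_text(long_text, target_tokens=6000,
                          compress_segment=truncate_stand_in)
    print(estimate_tokens(long_text), "->", estimate_tokens(short))
```

In a real pipeline, the stand-in would be replaced by a function that sends the segment and the desired ratio to the model, and the target would be set somewhat below the prompt limit to leave room for the question.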
Installation
- Install with pip:
pip install pdftoprompt
Usage
Make sure to first set your GPT-4-approved OpenAI API key with the set_openai_api_key function:
from pdftoprompt import set_openai_api_key
set_openai_api_key()
This function either takes your API key as a string argument or looks in the .env file in the current working directory to see if you have an OPENAI_API_KEY variable stored there. I recommend saving your API key in the .env file for your project so you can share your code without worrying about key security. If you're uploading code to GitHub, make sure to add .env to .gitignore.
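The lookup order described above (an explicit argument first, then an OPENAI_API_KEY entry in .env) might look something like the sketch below. The function names resolve_api_key and parse_env_text are hypothetical and are not part of pdftoprompt; the parser handles only simple NAME=value lines and is an assumption about the file layout, not a full .env implementation.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple NAME=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(api_key: Optional[str] = None,
                    env_path: Path = Path(".env")) -> str:
    """Return the explicit key if given, else OPENAI_API_KEY from the .env file."""
    if api_key:
        return api_key
    file_values = parse_env_text(env_path.read_text(encoding="utf-8")) \
        if env_path.is_file() else {}
    key = file_values.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("No API key passed and OPENAI_API_KEY not found in .env")
    return key
```

With this layout, a .env file containing a line such as OPENAI_API_KEY=your-key-here stays out of version control once .env is listed in .gitignore, while the code that reads it can be shared freely.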
Next, import the compress_pdf function from the pdf_compressor library, and call it with the PDF URL or file path:
from pdf_compressor import compress_pdf
file_path = "path/to/your/pdf/file.pdf"
use_ocr = True # Set to True if you want to use OCR
compressed_text = compress_pdf(file_path, use_ocr)
print(compressed_text)
In theory, you should be able to use OCR through the optional use_ocr argument (default is False).
- Install Tesseract OCR and add it to your system path.
- Set an environment variable GPT_API_KEY with your OpenAI API key.
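Before enabling OCR, it may help to confirm that both prerequisites are in place. The sketch below is only an illustration: check_ocr_prerequisites is a hypothetical helper, not part of the library, and it assumes the Tesseract executable is named tesseract on your system.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def check_ocr_prerequisites(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report whether Tesseract is on the PATH and GPT_API_KEY is set."""
    env = os.environ if env is None else env
    return {
        # shutil.which returns None when the executable cannot be found.
        "tesseract_on_path": shutil.which("tesseract") is not None,
        "gpt_api_key_set": bool(env.get("GPT_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in check_ocr_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```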
Contributing
If you'd like to contribute to this library, please submit a pull request on GitHub. We welcome any improvements, bug fixes, or new features.
License
This library is released under the MIT License.
File details
Details for the file pdftoprompt-0.1.0.tar.gz.
File metadata
- Download URL: pdftoprompt-0.1.0.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.1 CPython/3.9.13 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | b568bf25083608d6aa328304aac30c378d78c3ddc29f6d33c81b4d7caaedf8af
MD5 | 7b6584d8d39a0edb678fb15feb679441
BLAKE2b-256 | 6a506d58868e0fe6f37149808770187774b3d3ef7ba4039ad810e3202191c7c5
File details
Details for the file pdftoprompt-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdftoprompt-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.1 CPython/3.9.13 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | f306bdcc42c0c740b159255dc21c1335bdf7c121e3247ef02512194f3adede51
MD5 | 126751e79ce1cb857ad4c22856ffdf76
BLAKE2b-256 | c6e9a1fa3ac283f23dec36283a465217193c1dfd72146a47057740f57ae77818