PyMuPDF Utilities for LLM/RAG
Using PyMuPDF in a RAG (Retrieval-Augmented Generation) Chatbot Environment
This repository contains examples showing how PyMuPDF can be used as a data feed for RAG-based chatbots.
Examples include scripts that start chatbots, either as simple CLI programs in REPL mode or as browser-based GUIs. Chatbot scripts follow this general structure:
- Extract Text: Use PyMuPDF to extract text from one or more pages of one or more PDFs. Depending on the requirement, this may be all text or only the text contained in tables, the Table of Contents, etc. This is generally implemented as one or more Python functions called by the steps below, which implement the actual chatbot functionality.
- Indexing the Extracted Text: Index the extracted text for efficient retrieval. This index will act as the knowledge base for the chatbot.
- Query Processing: When a user asks a question, process the query to determine the key information needed for a response.
- Retrieving Relevant Information: Search your indexed knowledge base for the most relevant pieces of information related to the user's query.
- Generating a Response: Use a generative model to generate a response based on the retrieved information.
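The steps after text extraction can be illustrated with a small, dependency-free sketch. It is not the code used by the scripts in this repository: the word-count cosine similarity stands in for whatever embedding or vector store a real chatbot would use, and all function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical. The sample chunks stand in for text that would normally come from PyMuPDF.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks):
    """Indexing: store a term-frequency vector next to every text chunk."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(index, query, top_k=2):
    """Query processing and retrieval: rank chunks by similarity to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, context_chunks):
    """Response generation: assemble the prompt a generative model would receive."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Stand-in data; in a real script these strings would be extracted with PyMuPDF.
chunks = [
    "PyMuPDF extracts text and tables from PDF pages.",
    "The index acts as the knowledge base for the chatbot.",
    "Bananas are rich in potassium.",
]
index = build_index(chunks)
question = "How does the chatbot use the index?"
top_chunks = retrieve(index, question, top_k=1)
prompt = build_prompt(question, top_chunks)
```

The resulting `prompt` string is what would be passed to a generative model of your choice; that call is left out here because it depends on the model provider.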
Installation
Notably, the "helpers" folder contains a script that converts PDF pages into text strings in Markdown format, presenting standard text and table-based text in one consistent, integrated view. This is especially important in RAG environments.
The Python package pdf4llm on PyPI provides easy access to this script:
$ pip install -U pdf4llm
Then, in your script:
import pdf4llm

# page_numbers: list of 0-based page numbers to convert
page_numbers = [0, 1]
md_text = pdf4llm.to_markdown("input.pdf", pages=page_numbers)
# work with the markdown text
Instead of the filename string shown above, you can also provide a PyMuPDF Document.
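Once the Markdown text is available, a common next step in a RAG pipeline is to split it into chunks before indexing. The following is a minimal sketch of one way to do that, assuming headings in the Markdown appear as lines starting with one to six `#` characters. The function name `split_markdown_by_heading` is hypothetical and not part of pdf4llm, and `sample_md` stands in for the converted text.

```python
import re


def split_markdown_by_heading(md_text):
    """Split Markdown text into chunks, starting a new chunk at each heading line."""
    chunks = []
    current = []
    for line in md_text.splitlines():
        # An ATX heading starts with 1-6 '#' characters followed by a space.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk]


# Stand-in for the md_text string produced by the conversion step.
sample_md = "# Title\nIntro text.\n\n## Tables\n|a|b|\n|---|---|\n|1|2|\n"
md_chunks = split_markdown_by_heading(sample_md)
```

Splitting at headings keeps each table together with the heading it belongs to, which tends to give the retrieval step more self-contained chunks than splitting at a fixed character count.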
File details
Details for the file pdf4llm-0.0.5.tar.gz.
File metadata
- Download URL: pdf4llm-0.0.5.tar.gz
- Upload date:
- Size: 18.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8f8d4ce0e6806404a19123512bfb5570ee4ce45a9ce4f090c5f37656f7adddfb
MD5 | f6e790a3ca47e8e77b07d8d25f43f276
BLAKE2b-256 | b38a6117efd1975b436335f511dad4a7e1cc1e3f86ede8278b8eb2524785816f
File details
Details for the file pdf4llm-0.0.5-py3-none-any.whl.
File metadata
- Download URL: pdf4llm-0.0.5-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7fce56c97c5381cc5513d3d2d72f2c7e4ff35f3de9c93ea6da14e57a206565e3
MD5 | 70d924e1030d21fa35e9e277dd72f773
BLAKE2b-256 | 8d2651d0f50b6e48cd9affb91dd2c83d6123601e40fe9a31c8a4051382e13ca7