PyMuPDF Utilities for LLM/RAG

Using PyMuPDF in a RAG (Retrieval-Augmented Generation) Chatbot Environment

This repository contains examples showing how PyMuPDF can be used as a data feed for RAG-based chatbots.

Examples include scripts that start chatbots, either as simple CLI programs in REPL mode or as browser-based GUIs. The chatbot scripts follow this general structure:

  1. Extracting Text: Use PyMuPDF to extract text from one or more pages of one or more PDFs. Depending on the requirement, this may be all text or only the text contained in tables, the Table of Contents, etc. Extraction is generally implemented as one or more Python functions that are called by the steps below, which implement the actual chatbot functionality.
  2. Indexing the Extracted Text: Index the extracted text for efficient retrieval. This index will act as the knowledge base for the chatbot.
  3. Query Processing: When a user asks a question, process the query to determine the key information needed for a response.
  4. Retrieving Relevant Information: Search your indexed knowledge base for the most relevant pieces of information related to the user's query.
  5. Generating a Response: Use a generative model to generate a response based on the retrieved information.
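
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by the example scripts: all function names are hypothetical, retrieval is plain word overlap rather than a real vector index, and the final step only assembles a prompt because the choice of generative model is left open. Only `extract_pages` needs PyMuPDF; the other functions use the standard library.

```python
import re
from collections import Counter


def extract_pages(pdf_path):
    """Step 1: return the plain text of every page in a PDF (needs PyMuPDF)."""
    import fitz  # PyMuPDF; recent releases can also be imported as "pymupdf"

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks):
    """Step 2: store each chunk together with its word counts."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(index, query, top_k=2):
    """Steps 3 and 4: score chunks by word overlap with the query."""
    query_words = set(tokenize(query))
    scored = []
    for chunk, counts in index:
        score = sum(counts[word] for word in query_words)
        if score > 0:
            scored.append((score, chunk))
    # Highest score first; chunks with equal scores keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(query, context_chunks):
    """Step 5 (input side): assemble the prompt a generative model would receive."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a chatbot loop, `build_index(extract_pages("input.pdf"))` would run once at startup, while `retrieve` and `build_prompt` would run for every user question. A production setup would typically replace the word-overlap scoring with embeddings and a vector store.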

Installation

As a special feature, the folder "helpers" contains a script that converts PDF pages into text strings in Markdown format. The output includes standard text as well as table-based text in a consistent and integrated view, which is especially important in RAG environments.

The Python package pdf4llm on PyPI provides easy access to this script:

$ pip install -U pdf4llm

Then, in your script:

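import pdf4llm

# page_numbers is an example value: a list of 0-based page numbers to convert
page_numbers = [0, 1]
md_text = pdf4llm.to_markdown("input.pdf", pages=page_numbers)

# work with the markdown text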

Instead of the filename string as above, you can also provide a PyMuPDF Document.
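
One possible way to "work with the markdown text" in a RAG setting is to split it into heading-delimited chunks before indexing. The sketch below is an illustration, not part of pdf4llm: the function name `split_markdown_by_heading` is hypothetical, and it assumes headings are written as lines starting with one to six `#` characters followed by a space. It does not treat fenced code blocks specially, so a `#` comment line inside a code fence would also start a new chunk.

```python
import re


def split_markdown_by_heading(md_text):
    """Split Markdown text into chunks that each start at a heading line."""
    chunks = []
    current = []
    for line in md_text.splitlines():
        # A line starting with 1-6 "#" characters and a space opens a new chunk.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk]
```

Calling `split_markdown_by_heading(md_text)` on the string returned by `pdf4llm.to_markdown` yields a list of strings that can be passed to whichever indexing step the chatbot uses.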

Download files

Source Distribution

pdf4llm-0.0.5.tar.gz (18.7 kB)

Uploaded Source

Built Distribution

pdf4llm-0.0.5-py3-none-any.whl (19.1 kB)

Uploaded Python 3

File details

Details for the file pdf4llm-0.0.5.tar.gz.

File metadata

  • Download URL: pdf4llm-0.0.5.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.1

File hashes

Hashes for pdf4llm-0.0.5.tar.gz
Algorithm     Hash digest
SHA256        8f8d4ce0e6806404a19123512bfb5570ee4ce45a9ce4f090c5f37656f7adddfb
MD5           f6e790a3ca47e8e77b07d8d25f43f276
BLAKE2b-256   b38a6117efd1975b436335f511dad4a7e1cc1e3f86ede8278b8eb2524785816f
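
A downloaded file can be checked against the published SHA256 digest with the standard library alone. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical and not part of pdf4llm or pip.

```python
import hashlib


def sha256_of_file(path, block_size=65536):
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published value, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("pdf4llm-0.0.5.tar.gz", "<SHA256 value from the table>")` should return `True` for an intact download of the source distribution.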

File details

Details for the file pdf4llm-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: pdf4llm-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 19.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.1

File hashes

Hashes for pdf4llm-0.0.5-py3-none-any.whl
Algorithm     Hash digest
SHA256        7fce56c97c5381cc5513d3d2d72f2c7e4ff35f3de9c93ea6da14e57a206565e3
MD5           70d924e1030d21fa35e9e277dd72f773
BLAKE2b-256   8d2651d0f50b6e48cd9affb91dd2c83d6123601e40fe9a31c8a4051382e13ca7
