PyMuPDF Utilities for LLM/RAG

Using PyMuPDF in a RAG (Retrieval-Augmented Generation) Chatbot Environment

This repository contains examples showing how PyMuPDF can be used as a data feed for RAG-based chatbots.

Examples include scripts that start chatbots, either as simple CLI programs in REPL mode or as browser-based GUIs. Chatbot scripts follow this general structure:

  1. Extracting Text: Use PyMuPDF to extract text from one or more pages of one or more PDFs. Depending on the requirement, this may be all text or only selected text, such as text contained in tables or the Table of Contents. Extraction is generally implemented as one or more Python functions, which are called by the steps below that implement the actual chatbot functionality.
  2. Indexing the Extracted Text: Index the extracted text for efficient retrieval. This index acts as the knowledge base for the chatbot.
  3. Processing the Query: When a user asks a question, process the query to determine the key information needed for a response.
  4. Retrieving Relevant Information: Search the indexed knowledge base for the pieces of information most relevant to the user's query.
  5. Generating a Response: Use a generative model to produce a response based on the retrieved information.
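
The five steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code of the repository's example scripts: the function names are hypothetical, the "index" is a plain word-count table standing in for a real vector store, and the last step stops at building the prompt because the choice of generative model is left open. Only `fitz.open()` and `page.get_text()` are PyMuPDF calls.

```python
import re
from collections import Counter


def extract_pages(pdf_path):
    """Step 1: return the plain text of every page in a PDF."""
    import fitz  # PyMuPDF; imported lazily so the other steps run without it

    doc = fitz.open(pdf_path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks):
    """Step 2: keep a word-count table next to each text chunk.

    Assumption: a toy keyword index; real chatbots usually use embeddings.
    """
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(index, question, top_k=2):
    """Steps 3 and 4: reduce the question to words, rank chunks by overlap."""
    query_words = set(tokenize(question))
    scored = [
        (sum(counts[word] for word in query_words), chunk)
        for chunk, counts in index
    ]
    # Drop chunks that share no word with the question, best match first.
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(question, context_chunks):
    """Step 5 (input side): assemble the text a generative model would receive."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real script, `build_index(extract_pages("some.pdf"))` would run once at startup, and `retrieve` plus `build_prompt` would run for every user question before the prompt is sent to the model.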

Notably, the "helpers" folder contains a script that converts PDF pages into text strings in Markdown format, presenting standard text and table-based text in a consistent, integrated view. This is especially important in RAG environments.
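
One reason Markdown output helps in a RAG setting is that headings give natural chunk boundaries and tables stay intact as text. The sketch below illustrates this under two assumptions: that the installed package exposes a `to_markdown()` function (check this against the version you have installed), and that splitting at heading lines is an acceptable chunking strategy. `split_markdown_sections` is a hypothetical helper, not part of the package.

```python
import re


def pdf_to_markdown(pdf_path):
    """Convert a PDF into one Markdown string.

    Assumption: the package offers to_markdown(); verify for your version.
    """
    import pdf4llm  # imported lazily so the splitter below works without it

    return pdf4llm.to_markdown(pdf_path)


def split_markdown_sections(md_text):
    """Split Markdown text into chunks that each begin at a heading line."""
    # Zero-width split in front of every line starting with 1 to 6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", md_text)
    return [part.strip() for part in parts if part.strip()]
```

Each returned section can then be indexed as a separate chunk, so a table is retrieved together with the heading that introduces it.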

Installation

The package can be installed under the name "pdf4llm" (PDF for LLM) in the usual way: pip install -U pdf4llm

Download files

Source Distribution

pdf4llm-0.0.1.tar.gz (19.3 kB, source)

Built Distribution

pdf4llm-0.0.1-py3-none-any.whl (19.9 kB, Python 3)
