PyMuPDF Utilities for LLM/RAG
Using PyMuPDF as Data Feeder in LLM / RAG Applications
This package converts the pages of a PDF to text in Markdown format using PyMuPDF.
Standard text and tables are detected, arranged in the correct reading order, and then converted together to GitHub-compatible Markdown text.
Header lines are identified via the font size and appropriately prefixed with one or more '#' tags.
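To illustrate the idea of deriving header levels from font sizes, here is a minimal sketch. It is not the package's actual implementation: the function names `build_header_map` and `to_markdown_line`, and the rule "every distinct size larger than the body size becomes one header level", are assumptions made for this example only.

```python
def build_header_map(font_sizes, body_size, max_levels=6):
    """Map each distinct font size above body_size to a '#' prefix.

    Hypothetical helper: the largest size becomes '# ', the next
    largest '## ', and so on, capped at max_levels levels.
    """
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    return {
        size: "#" * (level + 1) + " "
        for level, size in enumerate(larger[:max_levels])
    }


def to_markdown_line(text, font_size, header_map):
    """Prefix the line with its header tag, or leave body text unchanged."""
    return header_map.get(font_size, "") + text


# Font sizes seen in an imaginary document whose body text is 12 pt.
header_map = build_header_map([24, 18, 12, 12, 14], body_size=12)
title_line = to_markdown_line("Introduction", 18, header_map)
body_line = to_markdown_line("Plain paragraph text.", 12, header_map)
```

A real converter also has to decide which size counts as body text, for example by picking the most frequent size in the document.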
Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
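Because page numbers are 0-based, a page range as shown in a PDF viewer (which usually counts from 1) has to be shifted by one. The following sketch shows one way to do that; `zero_based_pages` and `convert_page_range` are hypothetical helper names introduced here, and only `pymupdf4llm.to_markdown` with its `pages` parameter comes from the package itself.

```python
def zero_based_pages(first, last):
    """Convert an inclusive 1-based page range to 0-based page numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def convert_page_range(pdf_path, first, last):
    """Convert viewer pages first..last (1-based, inclusive) to Markdown."""
    # Imported here so the helper above works without the package installed.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, pages=zero_based_pages(first, last))


# Viewer pages 1 to 3 correspond to 0-based page numbers 0, 1 and 2.
first_three = zero_based_pages(1, 3)
```

Calling `convert_page_range("input.pdf", 1, 3)` would then process only the first three pages of the file.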
Installation
$ pip install -U pymupdf4llm
Then, in your script:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf", pages=None)
# now work with the markdown text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
Instead of a filename string as above, you can also provide a PyMuPDF Document. The pages parameter may be a list of 0-based page numbers or None (the default), which includes all pages.
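As a sketch of the Document variant, the example below opens the file with PyMuPDF first and checks the requested page numbers against the document's page count before converting. The helpers `select_pages` and `document_to_markdown` are hypothetical names for this example, and the `import fitz` spelling is an assumption about the installed PyMuPDF version (newer releases also accept `import pymupdf`).

```python
def select_pages(page_count, pages=None):
    """Return the 0-based page numbers to process; None means all pages."""
    if pages is None:
        return list(range(page_count))
    out_of_range = [p for p in pages if not 0 <= p < page_count]
    if out_of_range:
        raise ValueError(f"page numbers out of range: {out_of_range}")
    return list(pages)


def document_to_markdown(pdf_path, pages=None):
    """Open the PDF as a PyMuPDF Document and convert the chosen pages."""
    # Imported here so select_pages works without the packages installed.
    import fitz  # PyMuPDF (assumed import name)
    import pymupdf4llm

    doc = fitz.open(pdf_path)
    try:
        wanted = select_pages(doc.page_count, pages)
        return pymupdf4llm.to_markdown(doc, pages=wanted)
    finally:
        doc.close()


# With a 3-page document and no explicit selection, all pages are chosen.
all_pages = select_pages(3)
```

Validating up front is a design choice for this sketch; it reports an out-of-range page number before any conversion work starts.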
Built Distribution
Hashes for pymupdf4llm-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e2c8bf6817b15b2a7d76b1a243f3e4e36f95c8a3c8eb571b445c617e080d3391
MD5 | 72ebe0114a1c8523f7b3164debbdc841
BLAKE2b-256 | 5beb59261c583122102bf39e65f3f4e4d0405e16c5f8fe3a890f421ee1254f57