PyMuPDF Utilities for LLM/RAG
Using PyMuPDF as Data Feeder in LLM / RAG Applications
This package converts the pages of a PDF to text in Markdown format using PyMuPDF.
Standard text and tables are detected, arranged in the correct reading order, and then converted together to GitHub-compatible Markdown text.
Header lines are identified via the font size and appropriately prefixed with one or more '#' tags.
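To illustrate the idea of deriving header levels from font sizes, here is a minimal sketch. It is not the package's actual implementation: the function name `header_prefixes`, the assumption that the most frequent font size is body text, and the rule that larger sizes map to higher header levels are all hypothetical choices made for this example.

```python
from collections import Counter


def header_prefixes(font_sizes: list[float], max_levels: int = 6) -> dict[float, str]:
    """Map font sizes larger than the body size to Markdown header prefixes.

    Assumption (for illustration only): the most frequent font size in the
    document is body text, and every distinct larger size is a header level,
    with the largest size becoming '# ', the next '## ', and so on.
    """
    if not font_sizes:
        return {}
    # The most common size is treated as the body text size.
    body_size = Counter(font_sizes).most_common(1)[0][0]
    # Distinct sizes above the body size, largest first.
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    # Markdown supports at most six header levels.
    return {
        size: "#" * (level + 1) + " "
        for level, size in enumerate(larger[:max_levels])
    }


# Hypothetical font sizes collected from the text spans of a document.
sizes = [10.0, 10.0, 10.0, 10.0, 14.0, 18.0, 10.0, 14.0]
prefixes = header_prefixes(sizes)
```

With these sample sizes, 10.0 is treated as body text, so only 18.0 and 14.0 receive a header prefix.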
Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
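Because page numbers are 0-based, a range read off a PDF viewer (which usually counts from 1) has to be shifted by one. The sketch below shows one way to do that; the helper names `zero_based_pages` and `convert_page_range` are hypothetical and not part of the package, while `pdf4llm.to_markdown` and its `pages` parameter are used as described in this document.

```python
def zero_based_pages(first: int, last: int) -> list[int]:
    """Convert an inclusive 1-based page range (as shown in a PDF viewer)
    into the list of 0-based page numbers expected by the `pages` parameter."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def convert_page_range(pdf_path: str, first: int, last: int) -> str:
    """Return Markdown for viewer pages first..last of the given PDF."""
    # Imported here so the helper above can be used without pdf4llm installed.
    import pdf4llm

    return pdf4llm.to_markdown(pdf_path, pages=zero_based_pages(first, last))
```

For example, `convert_page_range("input.pdf", 1, 3)` would request the first three pages, which the package addresses as pages 0, 1 and 2.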
Installation
$ pip install -U pdf4llm
Then, in your script:
import pdf4llm
md_text = pdf4llm.to_markdown("input.pdf", pages=None)
# now work with the markdown text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
Instead of the filename string as above, one can also provide a PyMuPDF `Document`. The `pages` parameter may be a list of 0-based page numbers or `None` (the default), which includes all pages.
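The following sketch shows how an already opened PyMuPDF `Document` might be passed in place of a filename, with the page list checked against the document's page count first. The helpers `resolve_pages` and `convert_document` are hypothetical names introduced for this example; the up-front validation is an extra safety step, not something the package is stated to require.

```python
from typing import Optional


def resolve_pages(page_count: int, pages: Optional[list[int]]) -> list[int]:
    """Return the 0-based page numbers to process.

    None selects all pages, mirroring the default described above.
    """
    if pages is None:
        return list(range(page_count))
    for p in pages:
        if not 0 <= p < page_count:
            raise ValueError(f"page {p} is outside 0..{page_count - 1}")
    return list(pages)


def convert_document(pdf_path: str, pages: Optional[list[int]] = None) -> str:
    """Open the PDF with PyMuPDF and pass the Document object to pdf4llm."""
    # Imported here so resolve_pages can be used without these packages installed.
    try:
        import pymupdf  # module name in recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # module name in older PyMuPDF releases
    import pdf4llm

    doc = pymupdf.open(pdf_path)
    try:
        selected = resolve_pages(doc.page_count, pages)
        return pdf4llm.to_markdown(doc, pages=selected)
    finally:
        doc.close()
```

Opening the document yourself is mainly useful when the same `Document` is needed for other PyMuPDF operations before or after the conversion.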