PyMuPDF Utilities for LLM/RAG
PyMuPDF4LLM
PyMuPDF4LLM is a lightweight extension for PyMuPDF that turns documents into clean, structured data with minimal setup. It includes layout analysis without any GPU requirement.
PyMuPDF4LLM makes it easy to extract document content in the format you need for LLM and RAG environments. It supports structured data extraction to Markdown, JSON and plain text (TXT), as well as LlamaIndex and LangChain integration.
Features
- Parsing of multiple document formats.
- Export of structured data as Markdown, JSON and plain text.
- Support for multi-column pages.
- Support for image and vector graphics extraction.
- Layout analysis for better semantic understanding of document structure.
- Support for page chunking output.
- Integration with popular AI frameworks.
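As a rough illustration of the page chunking feature, the sketch below assumes that `to_markdown` accepts a `page_chunks=True` option and then returns one dictionary per page with the Markdown stored under a `"text"` key. Neither detail is stated in this description, so check both against the API documentation for your installed version. The helper names `load_page_chunks` and `non_empty_texts` are hypothetical.

```python
def load_page_chunks(path):
    """Return per-page chunks for the document at `path`.

    Assumption: `page_chunks=True` makes `to_markdown` return a list of
    dictionaries (one per page) instead of a single string.
    """
    import pymupdf4llm  # imported lazily; only needed for real documents

    return pymupdf4llm.to_markdown(path, page_chunks=True)


def non_empty_texts(chunks):
    """Collect the stripped text of every chunk that has content.

    Assumption: each chunk stores its Markdown under the "text" key.
    """
    texts = []
    for chunk in chunks:
        text = chunk.get("text", "").strip()
        if text:
            texts.append(text)
    return texts
```

With a real file, `non_empty_texts(load_page_chunks("input.pdf"))` would yield one string per non-empty page, ready to pass to an embedding or indexing step.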
Installation
$ pip install -U pymupdf4llm
This command will automatically install or upgrade PyMuPDF as required.
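To confirm what the command installed, a minimal check with the standard library's `importlib.metadata` might look like the following. The helper name `installed_version` is hypothetical, and the distribution name `PyMuPDF` is assumed to be the one under which the dependency is published.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report the package itself and the dependency it pulls in.
for name in ("pymupdf4llm", "PyMuPDF"):
    print(name, installed_version(name) or "not installed")
```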
Execution
Markdown
import pathlib
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")
# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.md").write_text(md_text, encoding="utf-8")
JSON
import pathlib
import pymupdf4llm
json_text = pymupdf4llm.to_json("input.pdf")
# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.json").write_text(json_text, encoding="utf-8")
Plain Text
import pathlib
import pymupdf4llm
plain_text = pymupdf4llm.to_text("input.pdf")
# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.txt").write_text(plain_text, encoding="utf-8")
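The three snippets above differ only in the function called and the output suffix, so they can be folded into one helper. This is a sketch, not part of the package: `FORMATS`, `output_path` and `convert` are hypothetical names, and the only library calls used are the `to_markdown`, `to_json` and `to_text` functions shown above.

```python
import pathlib

# Maps an output format to (pymupdf4llm function name, file suffix).
# The three function names are the ones used in the examples above.
FORMATS = {
    "markdown": ("to_markdown", ".md"),
    "json": ("to_json", ".json"),
    "text": ("to_text", ".txt"),
}


def output_path(source, fmt):
    """Return the output path for `source` in format `fmt`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return pathlib.Path(source).with_suffix(FORMATS[fmt][1])


def convert(source, fmt="markdown"):
    """Convert `source` and write the result next to it, UTF-8 encoded."""
    target = output_path(source, fmt)  # validates fmt before importing
    import pymupdf4llm  # imported lazily so the helpers above work without it

    func = getattr(pymupdf4llm, FORMATS[fmt][0])
    target.write_text(func(source), encoding="utf-8")
    return target
```

Calling `convert("input.pdf", "json")` would write `input.json` beside the source file. Passing `encoding="utf-8"` explicitly matters because `write_text` otherwise falls back to the platform's default encoding.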
Documentation
Check out the PyMuPDF4LLM documentation for details on installation, features, sample code and the full API.
Examples
Find our examples on GitHub.
Integrations
For your AI application development, check out our integrations with popular frameworks.
Support
You can get support for PyMuPDF4LLM via a number of options:
Download files
File details
Details for the file pymupdf4llm-1.27.2.2.tar.gz.
File metadata
- Download URL: pymupdf4llm-1.27.2.2.tar.gz
- Upload date:
- Size: 72.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f95e113d434958f8c63393c836fe965ad398d1fc07e7807c0a627c9ec1946e9f |
| MD5 | 2710047555fc1686b1d4c08f2c95a98c |
| BLAKE2b-256 | f0e78b97bf223ea2fd72efd862af3210ae3aa2fb15b39b55767de9e0a2fd0985 |
File details
Details for the file pymupdf4llm-1.27.2.2-py3-none-any.whl.
File metadata
- Download URL: pymupdf4llm-1.27.2.2-py3-none-any.whl
- Upload date:
- Size: 84.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec3bbceed21c6f86289155f29c557aa54ae1c8282c4a45d6de984f16fb4c90cb |
| MD5 | cd84f0d0c93ccc01c5b4459a389b79ab |
| BLAKE2b-256 | 01fca4977b84f9a7e70aac4c9beed55d4693b985cef89fab7d49c896335bf158 |
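A downloaded file can be checked against the digests listed above with the standard library's `hashlib`. The sketch below is one way to do that. The helper names `file_digest` and `matches` are hypothetical, and it assumes the BLAKE2b-256 value is a BLAKE2b hash with a 32-byte digest.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`."""
    if algorithm == "blake2b-256":
        # Assumption: BLAKE2b-256 means BLAKE2b with a 32-byte digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in blocks so large files are not loaded into memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest equals the published value."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches("pymupdf4llm-1.27.2.2.tar.gz", "f95e113d434958f8c63393c836fe965ad398d1fc07e7807c0a627c9ec1946e9f")` compares a local copy of the source distribution with its published SHA256 digest, assuming the file was saved under that name in the current directory.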