PyMuPDF Utilities for LLM/RAG

PyMuPDF4LLM

License: MIT

PyMuPDF4LLM is a lightweight extension for PyMuPDF that turns documents into clean, structured data with minimal setup. It includes layout analysis without any GPU requirement.

PyMuPDF4LLM makes it easy to extract document content in the format you need for LLM and RAG environments. It supports structured data extraction to Markdown, JSON and plain text (TXT), and it integrates with LlamaIndex and LangChain.

Features

  • Parsing of multiple document formats.
  • Export of structured data as Markdown, JSON and plain text.
  • Support for multi-column pages.
  • Support for image and vector graphics extraction.
  • Layout analysis for better semantic understanding of document structure.
  • Support for page chunking output.
  • Integration with popular AI frameworks.

Installation

$ pip install -U pymupdf4llm

This command will automatically install or upgrade PyMuPDF as required.
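To confirm what the command above actually installed, a short check with the standard-library `importlib.metadata` module can help. This is only a sketch: the helper name `installed_version` is made up for illustration, and it assumes the distributions are registered under the names `pymupdf4llm` and `pymupdf`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report both packages; pymupdf is the dependency pulled in by pymupdf4llm.
for name in ("pymupdf4llm", "pymupdf"):
    print(name, installed_version(name) or "not installed")
```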

Execution

Markdown

import pathlib

import pymupdf4llm

# convert the document to a Markdown string
md_text = pymupdf4llm.to_markdown("input.pdf")

# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.md").write_text(md_text, encoding="utf-8")
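The page chunking output listed under Features is requested through the `page_chunks` argument of `to_markdown`, which returns one dictionary per page instead of a single string. The sketch below reduces those dictionaries to small page/text records. The helper names are illustrative, and the metadata key that holds the page number is an assumption (the code tries `page`, then `page_number`, then falls back to list position), so check the documentation for the version you have installed.

```python
from typing import Any, Dict, List


def chunks_to_records(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduce page chunks to small records holding the page label and text."""
    records = []
    for index, chunk in enumerate(chunks):
        meta = chunk.get("metadata", {})
        # Assumption: the page number lives under "page" or "page_number";
        # fall back to the 1-based list position if neither key is present.
        page = meta.get("page", meta.get("page_number", index + 1))
        text = chunk.get("text", "")
        if text.strip():  # skip pages without extractable text
            records.append({"page": page, "text": text})
    return records


def load_records(path: str) -> List[Dict[str, Any]]:
    """Convert a document to per-page Markdown records."""
    import pymupdf4llm  # imported lazily so chunks_to_records works without it

    # page_chunks=True asks for one dictionary per page
    chunks = pymupdf4llm.to_markdown(path, page_chunks=True)
    return chunks_to_records(chunks)
```

Calling `load_records("input.pdf")` would then give a list that is ready to hand to an embedding or indexing step, with each entry traceable back to its page.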

JSON

import pathlib

import pymupdf4llm

# convert the document to a JSON string
json_text = pymupdf4llm.to_json("input.pdf")

# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.json").write_text(json_text, encoding="utf-8")
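Because `to_json` returns a string in the example above, it can be parsed with the standard `json` module before further processing. The layout of the resulting object is not described here, so this sketch only reports the top-level shape rather than assuming any particular keys. `describe_json` is a hypothetical helper, and the sample string is a made-up stand-in, not the library's real output.

```python
import json
from typing import Any, Dict


def describe_json(json_text: str) -> Dict[str, Any]:
    """Parse a JSON string and report its top-level shape."""
    data = json.loads(json_text)
    summary: Dict[str, Any] = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["keys"] = sorted(data)  # top-level key names
    elif isinstance(data, list):
        summary["length"] = len(data)  # number of top-level items
    return summary


# Stand-in input; in practice pass the string returned by pymupdf4llm.to_json(...)
print(describe_json('{"pages": [], "metadata": {}}'))
```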

Plain Text

import pathlib

import pymupdf4llm

# convert the document to a plain text string
plain_text = pymupdf4llm.to_text("input.pdf")

# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.txt").write_text(plain_text, encoding="utf-8")
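The three conversions above share one shape: a file path goes in and a string comes out. That makes it straightforward to wrap them in a small batch helper. The following is a minimal sketch, not part of PyMuPDF4LLM. `convert_folder` and its parameters are hypothetical names, and the converter is passed in as an argument so any of the three functions can be used.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(
    src_dir: str,
    dst_dir: str,
    convert: Callable[[str], str],
    suffix: str,
    pattern: str = "*.pdf",
) -> List[Path]:
    """Run `convert` on every matching file and write the results as UTF-8."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob(pattern)):
        # same base name as the source file, new extension
        target = out_dir / (src.stem + suffix)
        target.write_text(convert(str(src)), encoding="utf-8")
        written.append(target)
    return written
```

A call such as `convert_folder("docs", "out", pymupdf4llm.to_markdown, ".md")` would write one Markdown file per PDF. Swapping in `pymupdf4llm.to_text` with `".txt"` covers the plain text case.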

Documentation

Check out the PyMuPDF4LLM documentation for details on installation, features, sample code and the full API.

Examples

Find our examples on GitHub.

Integrations

For your AI application development, check out our integrations with popular frameworks.

Support

You can get support for PyMuPDF4LLM via a number of options:


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymupdf4llm-1.27.2.2.tar.gz (72.9 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pymupdf4llm-1.27.2.2-py3-none-any.whl (84.3 kB)

Uploaded Python 3

File details

Details for the file pymupdf4llm-1.27.2.2.tar.gz.

File metadata

  • Download URL: pymupdf4llm-1.27.2.2.tar.gz
  • Upload date:
  • Size: 72.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pymupdf4llm-1.27.2.2.tar.gz
Algorithm    Hash digest
SHA256       f95e113d434958f8c63393c836fe965ad398d1fc07e7807c0a627c9ec1946e9f
MD5          2710047555fc1686b1d4c08f2c95a98c
BLAKE2b-256  f0e78b97bf223ea2fd72efd862af3210ae3aa2fb15b39b55767de9e0a2fd0985

File details

Details for the file pymupdf4llm-1.27.2.2-py3-none-any.whl.

File metadata

  • Download URL: pymupdf4llm-1.27.2.2-py3-none-any.whl
  • Upload date:
  • Size: 84.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pymupdf4llm-1.27.2.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       ec3bbceed21c6f86289155f29c557aa54ae1c8282c4a45d6de984f16fb4c90cb
MD5          cd84f0d0c93ccc01c5b4459a389b79ab
BLAKE2b-256  01fca4977b84f9a7e70aac4c9beed55d4693b985cef89fab7d49c896335bf158
