PyMuPDF Utilities for LLM/RAG

Using PyMuPDF as Data Feeder in LLM / RAG Applications

This package converts the pages of a PDF to text in Markdown format using PyMuPDF.

Standard text and tables are detected, arranged in the correct reading order, and then converted together to GitHub-compatible Markdown text.

Header lines are identified by their font size and prefixed with the appropriate number of '#' characters.
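
To illustrate the idea of font-size-based header detection, here is a minimal sketch. It is not the package's actual implementation: the function names `heading_prefixes` and `to_markdown_line` are hypothetical, and the sketch assumes that the most frequent font size is body text and that every larger size is a header level.

```python
from collections import Counter


def heading_prefixes(font_sizes, max_levels=6):
    """Map each font size larger than the body size to a '#' prefix.

    Assumption (not taken from the package): the most frequent rounded
    size is body text; larger sizes become headers, largest first.
    """
    rounded = [round(size) for size in font_sizes]
    if not rounded:
        return {}
    body_size = Counter(rounded).most_common(1)[0][0]
    larger = sorted({s for s in rounded if s > body_size}, reverse=True)
    return {
        size: "#" * (level + 1) + " "
        for level, size in enumerate(larger[:max_levels])
    }


def to_markdown_line(text, size, prefixes):
    """Prefix a line with its header tag, or leave it unchanged."""
    return prefixes.get(round(size), "") + text


# Hypothetical (text, font size) pairs standing in for extracted lines.
lines = [
    ("Introduction", 18.0),
    ("Background", 14.0),
    ("Body text one.", 10.0),
    ("Body text two.", 10.0),
    ("More body text.", 10.0),
]
prefixes = heading_prefixes([size for _, size in lines])
markdown = "\n".join(to_markdown_line(t, s, prefixes) for t, s in lines)
```

In this sketch the 18 pt line becomes a level-1 header, the 14 pt line a level-2 header, and the 10 pt lines stay plain text.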

Bold, italic, and mono-spaced text as well as code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.

By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
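
Because page numbers are 0-based, a page range quoted from a PDF viewer (which usually counts from 1) has to be shifted by one. The following sketch shows one way to do that; `zero_based_pages` and `convert_page_range` are hypothetical helper names, not part of the package.

```python
def zero_based_pages(first, last):
    """Convert an inclusive 1-based page range to a list of 0-based numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def convert_page_range(pdf_path, first, last):
    """Convert only the 1-based pages first..last of a PDF to Markdown."""
    # Imported lazily so the helper above is usable without the package.
    import pdf4llm

    return pdf4llm.to_markdown(pdf_path, pages=zero_based_pages(first, last))
```

For example, `zero_based_pages(3, 5)` yields `[2, 3, 4]`, which selects the third to fifth pages of the document.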

Installation

$ pip install -U pdf4llm

Then, in your script:

import pdf4llm

md_text = pdf4llm.to_markdown("input.pdf", pages=None)

# now work with the Markdown text, e.g. store it as a UTF-8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

Instead of a filename string as above, you can also pass a PyMuPDF Document. The pages parameter may be a list of 0-based page numbers or None (the default), which includes all pages.
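
As a sketch of passing a Document instead of a filename, the helpers below open the PDF with PyMuPDF, convert the first few pages, and write the result as UTF-8. The names `save_markdown` and `convert_first_pages` and the `converter` parameter are hypothetical and exist only for this illustration; the sketch also assumes a PyMuPDF release that supports `import pymupdf` (older releases use `import fitz`).

```python
import pathlib


def save_markdown(source, out_path, pages=None, converter=None):
    """Convert `source` (filename or PyMuPDF Document) and save the Markdown.

    `converter` is a hypothetical hook for substituting another function;
    by default pdf4llm.to_markdown is used.
    """
    if converter is None:
        import pdf4llm  # imported lazily, only when actually needed

        converter = pdf4llm.to_markdown
    md_text = converter(source, pages=pages)
    out = pathlib.Path(out_path)
    out.write_text(md_text, encoding="utf-8")
    return out


def convert_first_pages(pdf_path, out_path, count=2):
    """Open the PDF as a Document and convert its first `count` pages."""
    import pymupdf  # assumption: older PyMuPDF releases need `import fitz`

    with pymupdf.open(pdf_path) as doc:
        pages = list(range(min(count, doc.page_count)))
        return save_markdown(doc, out_path, pages=pages)
```

Opening the Document yourself is useful when the page count is needed before conversion, as in `convert_first_pages("input.pdf", "output.md")`.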
