C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

A "blazingly-fast" PDF extractor in C using MuPDF, inspired by pymupdf4llm. I took many of its heuristics and much of its approach, rewrote them in C, then bound the result to Python so it's easy to use.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

speed: ~300 pages/second on CPU, which works out to 1 million pages in ~55 minutes.

Measured on an AMD Ryzen 7 4800H (8 cores, 6 used) with a ~1600-page document heavy in tables and text.
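
For a rough sense of how the throughput figure turns into wall-clock time, here is the arithmetic as a tiny sketch. `estimated_minutes` is a made-up helper name, and the 300 pages/second default is just the benchmark number quoted above; your hardware and documents will differ.

```python
def estimated_minutes(pages, pages_per_second=300):
    """Convert a page count into minutes at a constant throughput."""
    return pages / pages_per_second / 60


# 1,000,000 pages at ~300 pages/second
print(round(estimated_minutes(1_000_000), 1))  # 55.6
```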

See Capabilities below for a comparison with other tools.

Primarily intended for use with Python bindings.


Installation

pip install pymupdf4llm-c

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


Capabilities

| Tool | Speed (pps) | Tables | Images (Figures) | OCR (Y/N) | JSON Output | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| pymupdf4llm-C | ~300 | Yes | No (WIP) | N | Yes (structured) | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (but not ML to get contents) | N | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | No, not by itself, requires more effort I believe | N | No (text only) | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents |

Trade-off: speed and control vs automatic extraction. Marker and Docling give higher fidelity if you have time.

what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figure or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output field; it defaults to <file>.json

collect all pages in memory

from pymupdf4llm_c import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)

# Access individual blocks; not every block type carries a `text` attribute
for block in pages[0]:
    print(f"{block.type}: {getattr(block, 'text', '')}")

This still saves it to result.path; it just allows you to load it into memory. If you don't want to write to disk at all, consider providing a special path.

This is only for smaller PDFs. For larger ones, this may result in crashes due to loading everything into RAM. See below for a solution.

stream pages (memory-efficient)

from pymupdf4llm_c import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
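
As a rough sketch of what the streaming loop is good for, the helper below tallies block types while only ever holding one page at a time. `count_block_types` is a hypothetical name, not part of the library; it only assumes that each page is iterable and that each block exposes `.type`, as in the loop above. Stand-in objects are used so the snippet runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types across pages, consuming one page at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in pages (assumption); with the real library you would pass `result`.
fake_pages = iter([
    [SimpleNamespace(type="heading"), SimpleNamespace(type="text")],
    [SimpleNamespace(type="text"), SimpleNamespace(type="table")],
])
print(count_block_types(fake_pages))
```

Because the function accepts any iterable, it should work the same on a collected page list, at the cost of the extra memory.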

convert to markdown

from pymupdf4llm_c import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

.markdown is a property, not a function

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, mono-space, etc.)

Note that a span represents a logical group of styling. In most blocks, there is likely only one span.
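
If you work with the JSON directly rather than through the Python objects, something like the following may be a starting point. It is a minimal sketch over a hand-written sample page in the shape described above (not real extractor output); `block_text` and `blocks_of_type` are hypothetical helper names.

```python
import json

# Hand-written sample page in the documented shape (assumption, not real output).
page = json.loads("""
[
  {"type": "heading", "bbox": [72, 50, 300, 80], "font_size": 24.0, "length": 5,
   "spans": [{"text": "Intro", "bold": true}]},
  {"type": "text", "bbox": [72, 100, 500, 160], "font_size": 12.0, "length": 12,
   "spans": [{"text": "Hello ", "bold": false}, {"text": "world.", "bold": true}]}
]
""")


def block_text(block):
    """Concatenate the text of every span in a block."""
    return "".join(span.get("text", "") for span in block.get("spans", []))


def blocks_of_type(blocks, wanted):
    """Return only the blocks whose `type` matches `wanted`."""
    return [b for b in blocks if b.get("type") == wanted]


for block in blocks_of_type(page, "text"):
    print(block_text(block))
```

List and table blocks keep their text under `items` and `rows` instead of top-level `spans`, so they need their own handling.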

Block types

Not real JSON; just to demonstrate output (pseudo).

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
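
One thing the `level` field makes easy is a quick document outline. The sketch below assumes heading blocks look like the example above and that `level` starts at 1; `outline` is a hypothetical helper, not a library function.

```python
def outline(blocks):
    """Return indented outline lines built from heading blocks only."""
    lines = []
    for block in blocks:
        if block.get("type") != "heading":
            continue
        title = "".join(s.get("text", "") for s in block.get("spans", []))
        level = max(int(block.get("level", 1)), 1)
        # Two spaces of indentation per heading level below the top one.
        lines.append("  " * (level - 1) + title.strip())
    return lines


# Hand-written sample blocks (assumption, not real extractor output).
sample = [
    {"type": "heading", "level": 1, "spans": [{"text": "Report"}]},
    {"type": "text", "spans": [{"text": "Body text."}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Methods"}]},
]
print("\n".join(outline(sample)))
```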

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
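
To show how the item fields fit together, here is a small sketch that turns a list block into Markdown-style lines. It assumes, based only on the example above, that `prefix` is false for bullets and a string such as "1." for numbered items, and that `indent` is a nesting depth. `render_list` is a hypothetical helper.

```python
def render_list(block):
    """Turn a list block into Markdown-style lines."""
    lines = []
    for item in block.get("items", []):
        text = "".join(s.get("text", "") for s in item.get("spans", []))
        # Fall back to a dash when there is no explicit prefix (assumed bullet).
        marker = item.get("prefix") or "-"
        pad = "  " * int(item.get("indent", 0))
        lines.append(f"{pad}{marker} {text}")
    return lines


# Hand-written sample mirroring the list block shown above.
sample_list = {
    "type": "list",
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 1, "prefix": "1."},
    ],
}
print("\n".join(render_list(sample_list)))
```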

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
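
Since tables arrive as rows of cells of spans, flattening them into plain strings is usually the first step. The sketch below assumes the shape shown above and treats the first row as the header, which is a guess that will not hold for every table. `table_to_rows` and `rows_to_markdown` are hypothetical helpers.

```python
def table_to_rows(block):
    """Flatten a table block into a list of rows of cell strings."""
    rows = []
    for row in block.get("rows", []):
        rows.append([
            "".join(s.get("text", "") for s in cell.get("spans", [])).strip()
            for cell in row.get("cells", [])
        ])
    return rows


def rows_to_markdown(rows):
    """Render rows as a pipe table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


# Hand-written sample mirroring the table block shown above.
sample_table = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]}, {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]}, {"spans": [{"text": "2"}]}]},
    ],
}
print(rows_to_markdown(table_to_rows(sample_table)))
```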

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
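
If you want your own inline rendering instead of the built-in `.markdown` property, the style flags are enough to go on. This is only a sketch of one possible mapping, not how the library renders Markdown internally; `span_to_markdown` is a hypothetical helper, and superscript/subscript are ignored here.

```python
def span_to_markdown(span):
    """Wrap a span's text in Markdown markers according to its style flags."""
    text = span.get("text", "")
    if not text.strip():
        return text  # leave whitespace-only spans untouched
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    uri = span.get("uri")
    # `uri` is false when the span is not linked, so check for a real string.
    if span.get("link") and isinstance(uri, str):
        text = f"[{text}]({uri})"
    return text


print(span_to_markdown({"text": "docs", "bold": True, "link": True,
                        "uri": "https://example.com"}))
```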

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
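
A minimal sketch of that idea, assuming block dicts shaped like the output above, a single-column page, and y coordinates that grow downward (MuPDF's usual convention). `chunk_by_gaps` and the `gap_factor` threshold are made up for illustration; tune or replace them for your documents.

```python
def chunk_by_gaps(blocks, gap_factor=1.5):
    """Split one page's blocks into chunks at large vertical gaps or headings."""
    ordered = sorted(blocks, key=lambda b: b["bbox"][1])  # top-to-bottom
    chunks, current = [], []
    prev_bottom = None
    for block in ordered:
        top, bottom = block["bbox"][1], block["bbox"][3]
        gap = 0.0 if prev_bottom is None else top - prev_bottom
        # A gap counts as "large" relative to the block's own font size.
        threshold = gap_factor * block.get("font_size", 12.0)
        starts_section = block.get("type") == "heading"
        if current and (gap > threshold or starts_section):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


# Hand-written sample page (assumption, not real extractor output).
sample_page = [
    {"type": "heading", "bbox": [72, 50, 300, 80], "font_size": 24.0},
    {"type": "text", "bbox": [72, 90, 500, 150], "font_size": 12.0},
    {"type": "text", "bbox": [72, 160, 500, 200], "font_size": 12.0},
    {"type": "text", "bbox": [72, 260, 500, 300], "font_size": 12.0},
    {"type": "heading", "bbox": [72, 320, 300, 340], "font_size": 18.0},
]
print([len(chunk) for chunk in chunk_by_gaps(sample_page)])
```

Multi-column layouts would need the blocks grouped by x position first, since sorting on y alone interleaves the columns.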

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

are there trade-offs for the speed gains? surely you lost some fidelity compared to pymupdf4llm?
If we're talking about trade-offs in comparison to PyMuPDF4LLM:

Not as much as you'd think.

PyMuPDF4LLM wasn't slow because of its quality. It was slow because of an inefficient code-base: O(n^2) algorithms, raw number-crunching in Python, and generally unoptimized code in a language that is bad at lots of maths.

This isn't a trade-off of the project itself, but there may still be minor cases where I haven't 100% copied the heuristics.

If we're talking about trade-offs in comparison to tools like Paddle, Marker & Docling:

It does not do any fancy ML. It's just some basic geometric maths. Therefore it won't handle:

  • scanned pages; no OCR
  • complex tables or tables without some form of edges

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all. But I just got bored of waiting for chunking the PDFs every time I made a minor change. I couldn't find anything with even 50% of the quality that would be faster. And anyway, my chunks were trash. So it was either: raw text, or ML, and I didn't want either of them.


Licensing and Links

licensing

TL;DR: use it all you want in OSS. if you buy a license for MuPDF from Artifex, you are exempt from all AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use requires license from Artifex.

modifications and enhancements specific to this library are © 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • pymupdf4llm_c-1.6.4-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (77.4 MB): CPython 3.14; manylinux, glibc 2.27+ / 2.28+, x86-64
  • pymupdf4llm_c-1.6.4-cp314-cp314-macosx_15_0_arm64.whl (79.4 MB): CPython 3.14; macOS 15.0+, ARM64
  • pymupdf4llm_c-1.6.4-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (77.4 MB): CPython 3.13; manylinux, glibc 2.27+ / 2.28+, x86-64
  • pymupdf4llm_c-1.6.4-cp313-cp313-macosx_15_0_arm64.whl (79.4 MB): CPython 3.13; macOS 15.0+, ARM64
  • pymupdf4llm_c-1.6.4-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (77.4 MB): CPython 3.12; manylinux, glibc 2.27+ / 2.28+, x86-64
  • pymupdf4llm_c-1.6.4-cp312-cp312-macosx_15_0_arm64.whl (79.4 MB): CPython 3.12; macOS 15.0+, ARM64
  • pymupdf4llm_c-1.6.4-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (77.4 MB): CPython 3.11; manylinux, glibc 2.27+ / 2.28+, x86-64
  • pymupdf4llm_c-1.6.4-cp311-cp311-macosx_15_0_arm64.whl (79.4 MB): CPython 3.11; macOS 15.0+, ARM64
  • pymupdf4llm_c-1.6.4-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (77.4 MB): CPython 3.10; manylinux, glibc 2.27+ / 2.28+, x86-64
  • pymupdf4llm_c-1.6.4-cp310-cp310-macosx_15_0_arm64.whl (79.4 MB): CPython 3.10; macOS 15.0+, ARM64
  • pymupdf4llm_c-1.6.4-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (77.4 MB): CPython 3.9; manylinux, glibc 2.27+ / 2.28+, x86-64
  • pymupdf4llm_c-1.6.4-cp39-cp39-macosx_15_0_arm64.whl (79.4 MB): CPython 3.9; macOS 15.0+, ARM64
