C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

A "blazingly-fast" PDF extractor written in C on top of MuPDF, inspired by pymupdf4llm. I borrowed many of its heuristics and much of its approach, rewrote them in C, then bound the result to Python so it's easy to use.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

speed: ~300 pages/second on CPU, or 1 million pages in ~55 minutes.

Benchmarked on an AMD Ryzen 7 4800H (8 cores, 6 used) with a ~1600-page, table- and text-heavy document.
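
The 55-minute figure follows from the quoted throughput. A quick check, assuming the ~300 pages/second rate stays constant over the whole corpus (the function name is only illustrative):

```python
def estimated_minutes(total_pages: int, pages_per_second: float) -> float:
    """Return the wall-clock minutes needed at a constant throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / pages_per_second / 60


# 1,000,000 pages at ~300 pages/second
print(f"{estimated_minutes(1_000_000, 300):.1f} minutes")  # prints "55.6 minutes"
```

Real runs will vary with hardware, core count, and how table-heavy the documents are.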

Capabilities and comparisons to other tools are listed under Capabilities.

Primarily intended for use with Python bindings.


Installation

pip install pymupdf4llm-c

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


Capabilities

| Tool | Speed (pps) | Tables | Images (Figures) | OCR (Y/N) | JSON Output | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| pymupdf4llm-C | ~300 | Yes | No (WIP) | N | Yes (structured) | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (but not ML to get contents) | N | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | No, not by itself, requires more effort I believe | N | No (text only) | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents |

Trade-off: speed and control vs automatic extraction. Marker and Docling give higher fidelity if you have time.

what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

from pymupdf4llm_c import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)

# Access individual blocks (fall back to '' when a block has no .text)
for block in pages[0]:
    print(f"{block.type}: {getattr(block, 'text', '')}")

collect() still saves the JSON to result.path; it additionally loads the pages into memory. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. With larger ones, loading everything into RAM may crash the process; stream pages instead, as described in the next section.

stream pages (memory-efficient)

from pymupdf4llm_c import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
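
As a sketch of what streaming enables, the helper below tallies block types while consuming one page at a time. It relies only on pages being iterable and blocks exposing .type, as in the loop above; count_block_types and the stand-in pages are illustrative and not part of the library.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def count_block_types(pages: Iterable[Iterable]) -> Counter:
    """Tally block types while holding only one page in memory at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in pages so the sketch runs without a PDF; with the library you
# would pass the object returned by to_json(...) instead.
fake_pages = [
    [SimpleNamespace(type="heading"), SimpleNamespace(type="text")],
    [SimpleNamespace(type="table"), SimpleNamespace(type="text")],
]
print(count_block_types(fake_pages))
```

Because the helper never builds a list of pages, its memory use stays roughly constant regardless of document length.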

convert to markdown

from pymupdf4llm_c import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

.markdown is a property, not a method; access it without parentheses.

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]
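
To run the command-line entry point from a script, one option is to build the argument list in Python. This is a minimal sketch that uses only the invocation shown above; build_command and the example paths are hypothetical.

```python
import sys
from pathlib import Path


def build_command(pdf, output_dir):
    """Return the argv list for `python -m pymupdf4llm_c.main input.pdf output_dir`."""
    return [
        sys.executable,  # the interpreter running this script
        "-m",
        "pymupdf4llm_c.main",
        str(Path(pdf)),
        str(Path(output_dir)),
    ]


# Hypothetical paths, shown only to illustrate the resulting command.
cmd = build_command("input.pdf", "out")
print(" ".join(cmd))
```

Each list can then be passed to subprocess.run(cmd, check=True), for example once per file in a directory of PDFs.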

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, monospace, etc.)

Note that a span represents a run of text that shares the same styling. Most blocks are likely to contain only one span.
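
A minimal sketch of reading these fields from a page that has already been loaded as a list of block dicts. The sample data is invented, and joining span texts with no separator is an assumption about how spans split a block's text.

```python
import json


def block_text(block: dict) -> str:
    """Concatenate the text of every span in a block (assumes no separator is needed)."""
    return "".join(span.get("text", "") for span in block.get("spans", []))


def bbox_size(block: dict) -> tuple:
    """Return (width, height) of a block from its [x0, y0, x1, y1] bbox."""
    x0, y0, x1, y1 = block["bbox"]
    return (x1 - x0, y1 - y0)


# Invented sample page: one JSON array of blocks.
page = json.loads("""
[
  {"type": "heading", "bbox": [72.0, 50.0, 300.0, 80.0], "font_size": 24.0,
   "length": 5, "spans": [{"text": "Intro", "bold": true}]},
  {"type": "text", "bbox": [72.0, 100.0, 500.0, 160.0], "font_size": 12.0,
   "length": 11, "spans": [{"text": "Hello "}, {"text": "world"}]}
]
""")

for block in page:
    print(block["type"], bbox_size(block), repr(block_text(block)))
```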

Block types

The examples below are pseudo-JSON (they contain comments) and only demonstrate the output shape.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
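
Heading blocks carry a level, which makes a simple outline straightforward. The sketch below assumes blocks arrive in reading order as dicts shaped like the example above; outline is an illustrative helper, not a library function.

```python
def outline(blocks):
    """Return indented heading titles, two spaces per level below 1."""
    lines = []
    for block in blocks:
        if block.get("type") != "heading":
            continue
        title = "".join(s.get("text", "") for s in block.get("spans", [])).strip()
        level = max(int(block.get("level", 1)), 1)  # treat a missing level as 1
        lines.append("  " * (level - 1) + title)
    return lines


# Invented sample blocks.
sample = [
    {"type": "heading", "level": 1, "spans": [{"text": "Report"}]},
    {"type": "text", "spans": [{"text": "Body text"}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Methods"}]},
]
print("\n".join(outline(sample)))
```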

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
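
One way to turn a list block back into text lines is sketched below. It assumes indent counts nesting levels and that prefix holds the original marker for numbered items (false otherwise), as the example suggests; list_block_to_lines is a hypothetical helper.

```python
def list_block_to_lines(block):
    """Render each list item as an indented line with a bullet or its numeric prefix."""
    lines = []
    for item in block.get("items", []):
        text = "".join(s.get("text", "") for s in item.get("spans", []))
        prefix = item.get("prefix")
        # Numbered items keep their original prefix; everything else gets "-".
        if item.get("list_type") == "numbered" and prefix:
            marker = prefix
        else:
            marker = "-"
        lines.append("  " * int(item.get("indent", 0)) + f"{marker} {text}")
    return lines


# Same shape as the list example above.
list_block = {
    "type": "list",
    "spans": [],
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 0, "prefix": "1."},
    ],
}
print("\n".join(list_block_to_lines(list_block)))
```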

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
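
For downstream processing, a table block can be flattened into a plain grid of strings. This is a sketch under the assumption that rows and cells appear in visual order and that the first row is a header; both helpers are illustrative only.

```python
def table_to_rows(block):
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        [
            "".join(span.get("text", "") for span in cell.get("spans", []))
            for cell in row.get("cells", [])
        ]
        for row in block.get("rows", [])
    ]


def rows_to_markdown(rows):
    """Render a grid as a pipe table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Invented two-row table in the shape shown above.
table = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]}, {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]}, {"spans": [{"text": "2"}]}]},
    ],
}
print(rows_to_markdown(table_to_rows(table)))
```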

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
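
The style flags are enough to write your own renderer. The sketch below maps a span dict to inline Markdown; it is not the library's .markdown implementation, and the nesting order of the markers is an arbitrary choice.

```python
def span_to_markdown(span: dict) -> str:
    """Wrap a span's text in Markdown markers according to its style flags."""
    text = span.get("text", "")
    if not text.strip():
        return text  # leave empty or whitespace-only spans untouched
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    uri = span.get("uri")
    if uri:  # uri is false when the span is not linked
        text = f"[{text}]({uri})"
    return text


print(span_to_markdown({"text": "docs", "bold": True, "uri": "https://example.com"}))
```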

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
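
A minimal sketch of that idea: start a new chunk at every heading, or when the vertical gap to the previous block is large relative to the font size. It assumes a single-column page, blocks in reading order, and y growing downward; chunk_by_gaps and the gap_factor default are hypothetical and would need tuning per corpus.

```python
def chunk_by_gaps(blocks, gap_factor=1.5):
    """Group blocks into chunks, splitting at headings and at large vertical gaps."""
    chunks = []
    current = []
    prev_bottom = None
    for block in blocks:
        _, y0, _, y1 = block["bbox"]
        gap = 0.0 if prev_bottom is None else y0 - prev_bottom
        threshold = gap_factor * block.get("font_size", 12.0)
        if current and (block.get("type") == "heading" or gap > threshold):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = y1
    if current:
        chunks.append(current)
    return chunks


# Invented blocks: a heading, two close paragraphs, a distant one, a new heading.
blocks = [
    {"type": "heading", "bbox": [72, 50, 400, 80], "font_size": 24.0},
    {"type": "text", "bbox": [72, 90, 500, 150], "font_size": 12.0},
    {"type": "text", "bbox": [72, 160, 500, 200], "font_size": 12.0},
    {"type": "text", "bbox": [72, 260, 500, 300], "font_size": 12.0},
    {"type": "heading", "bbox": [72, 320, 400, 340], "font_size": 18.0},
]
print([len(chunk) for chunk in chunk_by_gaps(blocks)])  # prints [3, 1, 1]
```

Indentation (bbox x0) and font-size changes can be added as further split conditions in the same loop.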

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

any trade-offs from the speed gains? you must have lost some fidelity compared to pymupdf4llm?

If we're talking about trade-offs in comparison to PyMuPDF4LLM:

Not as much as you'd think.

PyMuPDF4LLM isn't slow because of its quality. It is slow because of an inefficient codebase: O(n^2) algorithms, raw number-crunching in Python, and generally unoptimized code in a language that is a poor fit for heavy maths.

This isn't a trade-off of the project itself, but there may still be minor cases where I haven't 100% copied the heuristics.

If we're talking about trade-offs in comparison to tools like Paddle, Marker & Docling:

It does not do any fancy ML. It's just some basic geometric maths. Therefore it won't handle:

  • scanned pages; no OCR
  • complex tables or tables without some form of edges

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all, but I got bored of waiting for the PDFs to be chunked every time I made a minor change. I couldn't find anything faster with even 50% of the quality. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either of them.


Licensing and Links

licensing

TL;DR: use it all you want in OSS software. if you buy a license for MuPDF from Artifex, you are exempt from the AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use outside the AGPL's terms requires a license from Artifex.

modifications and enhancements specific to this library are © 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.

