C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

A "blazingly-fast" PDF extractor in C using MuPDF, inspired by pymupdf4llm. I took many of its heuristics and approach but rewrote it in C, then bound it to Python so it's easy to use.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

speed: ~300 pages/second on CPU. 1 million pages in ~55 minutes.

Benchmarked on an AMD Ryzen 7 4800H (8 cores, 6 used) with a ~1600-page, table- and text-heavy document.
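
The 55-minute figure follows from the quoted rate. A quick check, assuming the ~300 pages/second throughput stays constant over the whole run (the function name is only illustrative):

```python
def minutes_to_process(pages: int, pages_per_second: float) -> float:
    """Wall-clock minutes needed at a constant throughput."""
    return pages / pages_per_second / 60


estimate = minutes_to_process(1_000_000, 300)
print(f"{estimate:.1f} minutes")  # prints "55.6 minutes"
```

Real throughput will vary with hardware and document layout, so treat this as an order-of-magnitude estimate.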

Capabilities and comparisons to other tools are in the Capabilities section.

Primarily intended for use with Python bindings.


Installation

pip install pymupdf4llm-c

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


Capabilities

| Tool | Speed (pps) | Tables | Images (Figures) | OCR (Y/N) | JSON Output | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| pymupdf4llm-C | ~300 | Yes | No (WIP) | N | Yes (structured) | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (but not ML to get contents) | N | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | No, not by itself, requires more effort I believe | N | No (text only) | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents |

Trade-off: speed and control vs automatic extraction. Marker and Docling give higher fidelity if you have time.

what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

from pymupdf4llm_c import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)

# Access individual blocks on the first page
for block in pages[0]:
    # Not every block exposes .text, so fall back to an empty string
    print(f"{block.type}: {getattr(block, 'text', '')}")

This still saves the JSON to result.path; collect() only loads it into memory as well. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading everything into RAM may crash the process; stream the pages instead, as described in the next section.

stream pages (memory-efficient)

from pymupdf4llm_c import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
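
Because streaming yields one page at a time, running totals can be kept without holding the document in memory. The sketch below counts block types over any iterable of pages. It uses stand-in objects so that it runs on its own; `count_block_types` and `fake_pages` are illustrative names, not part of the library, and the only assumption about real blocks is the `.type` attribute shown above.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types across pages, consuming one page at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in pages for illustration; in practice pass the object returned by to_json(...).
fake_pages = iter([
    [SimpleNamespace(type="heading"), SimpleNamespace(type="text")],
    [SimpleNamespace(type="text"), SimpleNamespace(type="table")],
])

counts = count_block_types(fake_pages)
print(counts.most_common())
```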

convert to markdown

from pymupdf4llm_c import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

.markdown is a property, not a method, so access it without parentheses.

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, monospace, etc.)

Note that a span represents a run of text that shares the same styling. In most blocks there is likely only one span.
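
If you work from the JSON file directly instead of the Python objects, a block's full text is the concatenation of its spans. A minimal sketch, assuming a page is a list of block dictionaries with the fields listed above; the inline payload is made up for illustration:

```python
import json
from collections import Counter

# Hypothetical single-page payload that follows the documented block fields.
page_json = """
[
  {"type": "heading", "bbox": [72.0, 50.0, 300.0, 80.0], "font_size": 24.0,
   "length": 5, "spans": [{"text": "Intro", "bold": true}]},
  {"type": "text", "bbox": [72.0, 100.0, 540.0, 140.0], "font_size": 12.0,
   "length": 11, "spans": [{"text": "Hello ", "bold": false},
                           {"text": "world", "bold": true}]}
]
"""


def block_text(block: dict) -> str:
    """Concatenate the text of every span in a block."""
    return "".join(span.get("text", "") for span in block.get("spans", []))


blocks = json.loads(page_json)
type_counts = Counter(block["type"] for block in blocks)
texts = [block_text(block) for block in blocks]
print(type_counts)
print(texts)
```

List and table blocks keep their text in items and rows, so they need their own handling.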

Block types

The examples below are pseudo-JSON, not valid JSON: they use comments in place of repeated fields and only demonstrate the output shape.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
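
The level field makes it straightforward to build a document outline. A sketch under the assumption that level starts at 1 for top-level headings; `build_outline` and the sample blocks are illustrative, not part of the library:

```python
def build_outline(blocks):
    """Return an indented bullet outline built from heading blocks only."""
    outline = []
    for block in blocks:
        if block.get("type") != "heading":
            continue
        title = "".join(span.get("text", "") for span in block.get("spans", []))
        depth = max(block.get("level", 1), 1) - 1  # level 1 gets no indentation
        outline.append("  " * depth + "- " + title.strip())
    return outline


# Made-up blocks that follow the heading shape shown above.
sample_blocks = [
    {"type": "heading", "level": 1, "spans": [{"text": "Report"}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Methods"}]},
    {"type": "text", "spans": [{"text": "Body text"}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Results"}]},
]

outline = build_outline(sample_blocks)
print("\n".join(outline))
```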

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
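
A list block can be turned back into plain list lines from items alone. This sketch assumes indent is a nesting depth (0 for top-level) and that prefix holds the original marker for numbered items, as the example above suggests; `list_block_to_lines` is an illustrative helper, not part of the library:

```python
def list_block_to_lines(block):
    """Render each list item as an indented line with a bullet or number marker."""
    lines = []
    for item in block.get("items", []):
        text = "".join(span.get("text", "") for span in item.get("spans", []))
        prefix = item.get("prefix")
        if item.get("list_type") == "numbered":
            marker = prefix if prefix else "1."  # fall back when no prefix was captured
        else:
            marker = "-"
        lines.append("  " * item.get("indent", 0) + f"{marker} {text}")
    return lines


# The list block from the example above, reduced to the fields used here.
list_block = {
    "type": "list",
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 0, "prefix": "1."},
    ],
}

lines = list_block_to_lines(list_block)
print("\n".join(lines))
```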

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
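
For downstream use it is often easier to flatten a table block into a grid of strings first. A sketch that reads rows and cells as shown above and then renders a pipe table; treating the first row as the header is an assumption of this example, and both helper names are illustrative:

```python
def table_to_grid(block):
    """Flatten a table block into a list of rows, each a list of cell strings."""
    grid = []
    for row in block.get("rows", []):
        cells = []
        for cell in row.get("cells", []):
            text = "".join(span.get("text", "") for span in cell.get("spans", []))
            cells.append(text.strip())
        grid.append(cells)
    return grid


def grid_to_markdown(grid):
    """Render a grid as a pipe table, treating the first row as the header."""
    if not grid:
        return ""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up table block with one header row and one data row.
table_block = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]}, {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]}, {"spans": [{"text": "2"}]}]},
    ],
}

grid = table_to_grid(table_block)
table_markdown = grid_to_markdown(grid)
print(table_markdown)
```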

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
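
The style flags are enough to rebuild inline formatting yourself when the built-in .markdown output is not what you want. A minimal sketch of one possible mapping; `span_to_markdown` is an illustrative helper and the wrapping order is a choice made here, not the library's behaviour:

```python
def span_to_markdown(span: dict) -> str:
    """Wrap a span's text in Markdown markers according to its style flags."""
    text = span.get("text", "")
    if not text.strip():
        return text  # leave whitespace-only spans untouched
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


bold_text = span_to_markdown({"text": "fast", "bold": True, "italic": False})
linked_text = span_to_markdown(
    {"text": "docs", "link": True, "uri": "https://example.com"}
)
print(bold_text)    # prints "**fast**"
print(linked_text)  # prints "[docs](https://example.com)"
```

Superscript and subscript are left out because plain Markdown has no standard syntax for them.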

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
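
As a starting point, the sketch below splits one page into chunks whenever a heading appears or the vertical gap to the previous block exceeds a threshold. It assumes a single-column page, y coordinates that grow downward, and an arbitrary 24-point threshold; `chunk_by_vertical_gap` is an illustrative name, and multi-column layouts would need x-aware ordering.

```python
def chunk_by_vertical_gap(blocks, gap_threshold=24.0):
    """Group blocks into chunks, breaking at headings and at large y-gaps."""
    chunks, current = [], []
    prev_bottom = None
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):  # top-to-bottom order
        top, bottom = block["bbox"][1], block["bbox"][3]
        is_heading = block.get("type") == "heading"
        big_gap = prev_bottom is not None and top - prev_bottom > gap_threshold
        if current and (is_heading or big_gap):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


# Made-up page: a heading, two close paragraphs, a distant paragraph, a new heading.
page_blocks = [
    {"type": "heading", "bbox": [72, 50, 300, 80]},
    {"type": "text", "bbox": [72, 90, 540, 140]},
    {"type": "text", "bbox": [72, 150, 540, 200]},
    {"type": "text", "bbox": [72, 260, 540, 300]},
    {"type": "heading", "bbox": [72, 310, 300, 330]},
]

chunks = chunk_by_vertical_gap(page_blocks)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # prints [3, 1, 1]
```

Font size changes and indentation can be added as further break conditions in the same loop.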

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as they would with ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

any trade-offs from the speed gains? you must have lost some fidelity compared to pymupdf4llm?
If we're talking about trade-offs in comparison to PyMuPDF4LLM:

Not as much as you'd think.

PyMuPDF4LLM wasn't slow because of its quality. It was slow because of an inefficient code-base: O(n^2) algorithms, number crunching in pure Python, and generally unoptimized code in a language that is a poor fit for lots of maths.

This isn't a trade-off of the project itself, but there may still be minor cases where I haven't 100% copied the heuristics.

If we're talking about trade-offs in comparison to tools like Paddle, Marker & Docling:

It does not do any fancy ML. It's just some basic geometric maths. Therefore it won't handle:

  • scanned pages; no OCR
  • complex tables, or tables without some form of edges

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all, but I got bored of waiting for the PDFs to be chunked every time I made a minor change. I couldn't find anything faster with even 50% of the quality. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either of them.


Licensing and Links

licensing

TL;DR: use it all you want in OSS software. if you buy a license for MuPDF from Artifex, you are exempt from all AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use requires a license from Artifex.

modifications and enhancements specific to this library are copyright 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.

