C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

A "blazingly-fast" PDF extractor in C using MuPDF, inspired by pymupdf4llm. I took many of its heuristics and approach but rewrote it in C, then bound it to Python so it's easy to use.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

speed: ~300 pages/second on CPU. 1 million pages in ~55 minutes.

Benchmarked on an AMD Ryzen 7 4800H (8 cores, 6 used) with a ~1600-page, table- and text-heavy document.
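As a quick sanity check on the numbers above, here is the arithmetic spelled out. The helper name is made up for illustration; the only input taken from this README is the ~300 pages/second figure.

```python
def minutes_to_process(pages: int, pages_per_second: float) -> float:
    """Return the wall-clock minutes needed at a steady throughput."""
    return pages / pages_per_second / 60


# 1,000,000 pages at ~300 pages/second
print(round(minutes_to_process(1_000_000, 300), 1))  # 55.6
```

Real throughput will vary with hardware and document content, so treat this as a rough estimate only.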

For capabilities and comparisons to other tools, see the Capabilities section.

Primarily intended for use with Python bindings.


Installation

pip install pymupdf4llm-c

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


Capabilities

| Tool | Speed (pps) | Tables | Images (Figures) | OCR (Y/N) | JSON Output | Best For |
|---|---|---|---|---|---|---|
| pymupdf4llm-C | ~300 | Yes | No (WIP) | N | Yes (structured) | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (but not ML to get contents) | N | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | No, not by itself, requires more effort I believe | N | No (text only) | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents |

Trade-off: speed and control vs automatic extraction. Marker and Docling give higher fidelity if you have time.

what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

from pymupdf4llm_c import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)

# Access individual blocks of the first page
for block in pages[0]:
    # Some block types may not carry a text attribute, so fall back to ''
    print(f"{block.type}: {getattr(block, 'text', '')}")

collect() still writes the JSON to result.path; it just also loads the pages into memory. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading everything into RAM may crash the process; streaming pages, described next, avoids that.

stream pages (memory-efficient)

from pymupdf4llm_c import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
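Streaming pairs well with running aggregates, since nothing needs to stay in memory beyond the current page. The sketch below tallies block types. It only assumes what the loop above already shows (pages are iterable, blocks have a .type attribute); count_block_types and the stand-in data are invented for illustration, so it runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types across an iterable of pages, one page at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in pages (hypothetical data). With the library you would pass the
# object returned by to_json(...) instead.
fake_pages = [
    [SimpleNamespace(type="heading"), SimpleNamespace(type="text")],
    [SimpleNamespace(type="text"), SimpleNamespace(type="table")],
]
print(count_block_types(fake_pages))
```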

convert to markdown

from pymupdf4llm_c import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

.markdown is a property, not a method; access it without parentheses.

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, mono-space, etc.)

Note that a span is a run of text that shares the same styling. Most blocks are likely to contain only one span.
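If you work on the raw JSON rather than the Python objects, a page is just a list of dicts with the fields above. Here is a minimal sketch of two helpers for that case. block_text and blocks_of_type are hypothetical names, the sample page is made up, and joining spans without a separator assumes each span keeps its own whitespace.

```python
def block_text(block: dict) -> str:
    """Join the text of every span in a block (no separator added)."""
    return "".join(span.get("text", "") for span in block.get("spans", []))


def blocks_of_type(page: list, block_type: str) -> list:
    """Return the blocks on a page whose 'type' matches block_type."""
    return [block for block in page if block.get("type") == block_type]


# Made-up page following the documented block fields.
page = [
    {"type": "heading", "bbox": [72.0, 50.0, 300.0, 80.0], "font_size": 24.0,
     "length": 5, "spans": [{"text": "Intro"}]},
    {"type": "text", "bbox": [72.0, 90.0, 500.0, 150.0], "font_size": 12.0,
     "length": 12, "spans": [{"text": "Hello, "}, {"text": "world", "bold": True}]},
]

print(block_text(page[1]))
print([block_text(b) for b in blocks_of_type(page, "heading")])
```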

Block types

The examples below are pseudo-JSON (they contain comments) and are only meant to demonstrate the output shape.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
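Heading blocks carry a level, which is enough to build a quick outline of a page. A minimal sketch, assuming level is a 1-based depth as in the example above; page_outline and the sample blocks are invented for illustration.

```python
def page_outline(page: list) -> list:
    """Return one markdown-style heading line per heading block."""
    lines = []
    for block in page:
        if block.get("type") != "heading":
            continue
        title = "".join(span.get("text", "") for span in block.get("spans", []))
        level = int(block.get("level", 1))  # assumed to be 1-based
        lines.append("#" * level + " " + title)
    return lines


# Made-up blocks; only the fields used by the sketch are filled in.
sample_page = [
    {"type": "heading", "level": 1, "spans": [{"text": "Report"}]},
    {"type": "text", "spans": [{"text": "Body text."}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Summary"}]},
]
print(page_outline(sample_page))
```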

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
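To show how the items fields fit together, here is a rough sketch that renders a list block as markdown-style lines. It is not how the library's own .markdown works. list_to_markdown is a made-up helper, and treating indent as a nesting depth rendered with two spaces per level is an assumption.

```python
def list_to_markdown(block: dict) -> str:
    """Render a list block's items as simple markdown-style lines."""
    lines = []
    for item in block.get("items", []):
        text = "".join(span.get("text", "") for span in item.get("spans", []))
        # 'prefix' is false for bulleted items and a string such as "1." otherwise
        marker = item.get("prefix") or "-"
        # Assumption: 'indent' is a nesting depth, shown as two spaces per level
        lines.append("  " * int(item.get("indent", 0)) + f"{marker} {text}")
    return "\n".join(lines)


# Same shape as the example above, with the styling flags left out.
list_block = {
    "type": "list",
    "spans": [],
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 0, "prefix": "1."},
    ],
}
print(list_to_markdown(list_block))
```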

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
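Since tables come out as rows of cells of spans, flattening one into a plain grid or CSV takes a few lines. A sketch under the assumption that a cell's text is the concatenation of its spans. table_to_rows and table_to_csv are hypothetical helpers and the second row of the sample is made up.

```python
import csv
import io


def table_to_rows(block: dict) -> list:
    """Flatten a table block into a list of rows of cell strings."""
    rows = []
    for row in block.get("rows", []):
        rows.append([
            "".join(span.get("text", "") for span in cell.get("spans", []))
            for cell in row.get("cells", [])
        ])
    return rows


def table_to_csv(block: dict) -> str:
    """Serialize a table block as CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table_to_rows(block))
    return buffer.getvalue()


# Made-up table following the structure above (bbox and flags omitted).
table_block = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]}, {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]}, {"spans": [{"text": "2"}]}]},
    ],
}
print(table_to_csv(table_block))
```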

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
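The style flags are enough to rebuild inline formatting yourself. Here is one possible mapping from a span to inline markdown. It is a sketch only, not the library's own .markdown logic. span_to_markdown is an invented name, and superscript/subscript are ignored because plain markdown has no syntax for them.

```python
def span_to_markdown(span: dict) -> str:
    """Wrap a span's text in inline markdown according to its style flags."""
    text = span.get("text", "")
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    # 'uri' is false when the span is not linked
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


print(span_to_markdown({"text": "hi", "bold": True}))
print(span_to_markdown({"text": "docs", "link": True, "uri": "https://example.com"}))
```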

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
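As a starting point, here is a minimal sketch of that idea using only bbox and type. It starts a new chunk at every heading or whenever the vertical gap to the previous block exceeds a threshold. chunk_page and the 30-point max_gap are made up; the sketch also assumes a single-column page and that y grows downward. Font size and indentation are left out to keep it short.

```python
def chunk_page(blocks: list, max_gap: float = 30.0) -> list:
    """Group blocks into chunks, splitting at headings and large y-gaps."""
    chunks = []
    current = []
    prev_bottom = None
    # Assumption: single column, y increasing down the page
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        top, bottom = block["bbox"][1], block["bbox"][3]
        gap = 0.0 if prev_bottom is None else top - prev_bottom
        if current and (block.get("type") == "heading" or gap > max_gap):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


# Made-up blocks: a heading, two paragraphs, a big gap, then a new heading.
demo_blocks = [
    {"type": "heading", "bbox": [72, 50, 300, 80]},
    {"type": "text", "bbox": [72, 90, 500, 150]},
    {"type": "text", "bbox": [72, 160, 500, 200]},
    {"type": "text", "bbox": [72, 260, 500, 300]},
    {"type": "heading", "bbox": [72, 310, 300, 330]},
    {"type": "text", "bbox": [72, 340, 500, 380]},
]
print([len(chunk) for chunk in chunk_page(demo_blocks)])  # [3, 1, 2]
```

The threshold is the knob to tune per document set; a value relative to font_size may generalize better than a fixed number of points.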

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

are there trade-offs for the speed gains? surely you lost some fidelity compared to pymupdf4llm?
If we're talking about trade-offs in comparison to PyMuPDF4LLM:

Not as much as you'd think.

PyMuPDF4LLM isn't slow because of its quality. It's slow because of an inefficient code-base: O(n^2) algorithms, number crunching in pure Python, and generally unoptimized code in a language that is a poor fit for lots of maths.

This isn't a trade-off of the project itself, but there may still be minor cases where I haven't 100% copied the heuristics.

If we're talking about trade-offs in comparison to tools like Paddle, Marker & Docling:

It does not do any fancy ML. It's just some basic geometric maths. Therefore it won't handle:

  • scanned pages; no OCR
  • complex tables, or tables without some form of edges

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all. But I just got bored of waiting for the PDFs to be chunked every time I made a minor change. I couldn't find anything faster with even 50% of the quality. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either of them.


Licensing and Links

licensing

TL;DR: use it all you want in OSS software. if you buy a license for MuPDF from Artifex, you are exempt from the AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use requires license from Artifex.

modifications and enhancements specific to this library are © 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.


