C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

This project's C extension has been rewritten in Go. Performance, output quality, and code quality have all improved, while the Python API remains the same.

A "blazingly-fast" (oh wait, this isn't in Rust..) PDF extractor for Python written in Go using MuPDF in the backend, inspired by pymupdf4llm. I took many of its heuristics and approaches. Initially, it was supposed to be a 1:1 port (just generating the same Markdown output), but I later pivoted.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

Speed (averaged): ~520 pages/second on CPU. 1 million pages in ~32 minutes.
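
The million-page figure follows from simple division. A minimal sketch of the arithmetic, assuming throughput holds steady at the measured average (the function name is illustrative and not part of the library):

```python
def minutes_for(pages: int, pages_per_second: float = 520.0) -> float:
    """Estimated wall-clock minutes to extract `pages` at a steady throughput."""
    return pages / pages_per_second / 60.0


estimate = minutes_for(1_000_000)
print(f"{estimate:.1f} minutes")  # 32.1 minutes
```

Real runs will vary with hardware, core count, and document complexity.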

Full performance breakdown: see the Performance Breakdown section below.

Capabilities and comparisons to other tools: see the Capabilities section below.

Primarily intended for use with Python bindings.


Installation

pip install pymupdf4llm-c

Use the equivalent command for whatever tool you prefer (uv, poetry, etc.).

Release 2.0.0 ships wheels for Python 3.9 through 3.14 on macOS (ARM64) and Linux (x86-64, glibc 2.27+).

To build from source, see BUILD.md.


Capabilities

Tool | Speed (pps) | Tables | Images (Figures) | OCR (Y/N) | Output | Best For
pymupdf4llm-C | ~520 | Yes | No (WIP) | N | JSON (structured) | RAG, high volume
pymupdf4llm | ~10 | Yes | Yes (but no ML to read their contents) | N | Markdown | General extraction
pymupdf (alone) | ~250 | No | No (not by itself; needs extra work, I believe) | N | Text only | Basic text extraction
marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity
docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence
PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents

Trade-off: speed and control vs automatic extraction. Marker and Docling give higher fidelity if you have time.

what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

from pymupdf4llm_c import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Each page exposes its Markdown rendering
for page in pages:
    print(page.markdown)

# Iterate the blocks of the first page; some block types may not carry .text
for block in pages[0]:
    print(f"{block.type}: {getattr(block, 'text', '')}")

collect() still writes the JSON to result.path; it additionally loads every page into memory. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading everything into RAM may exhaust memory and crash the process. The streaming approach below avoids this.

stream pages (memory-efficient)

from pymupdf4llm_c import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
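
Because streaming yields one page at a time, aggregate statistics can be computed without holding the whole document in memory. The sketch below is illustrative only: count_block_types is a hypothetical helper, and the stand-in pages are built with SimpleNamespace so the example runs without a PDF. It relies solely on the behaviour shown above (pages are iterable and each block has a .type attribute).

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def count_block_types(pages: Iterable) -> Counter:
    """Tally block types while consuming pages one at a time."""
    counts: Counter = Counter()
    for page in pages:        # only the current page is referenced here
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in data (an assumption for the demo, not real library output).
fake_pages = iter([
    [SimpleNamespace(type="heading"), SimpleNamespace(type="text")],
    [SimpleNamespace(type="text")],
])
demo_counts = count_block_types(fake_pages)
print(demo_counts)
```

With the library installed, passing the result of to_json(...) in place of fake_pages should work the same way, assuming iteration behaves as described above.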

convert to markdown

from pymupdf4llm_c import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

Note: .markdown is a property, not a method, so access it without parentheses.

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, mono-space, etc.)

Note that a span represents a run of text with uniform styling. Most blocks are likely to contain only one span.
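
As a rough illustration of working with these fields, the sketch below operates on plain dicts shaped like the documented blocks. The sample page is hand-written, and the helper names (block_text, block_area) are hypothetical; it assumes the output file, once parsed with the json module, yields lists of such dicts.

```python
from collections import Counter

# Hand-written sample page: a list of block dicts with the documented fields.
page = [
    {"type": "heading", "bbox": [72.0, 50.0, 400.0, 80.0], "font_size": 24.0,
     "length": 12, "spans": [{"text": "Introduction", "bold": True}]},
    {"type": "text", "bbox": [72.0, 100.0, 540.0, 200.0], "font_size": 12.0,
     "length": 20, "spans": [{"text": "Plain ", "bold": False},
                             {"text": "and bold text.", "bold": True}]},
]


def block_text(block: dict) -> str:
    """Concatenate the text of every span in a block."""
    return "".join(span.get("text", "") for span in block.get("spans", []))


def block_area(block: dict) -> float:
    """Area of the block's bounding box, in square points."""
    x0, y0, x1, y1 = block["bbox"]
    return (x1 - x0) * (y1 - y0)


type_counts = Counter(block["type"] for block in page)
for block in page:
    print(block["type"], repr(block_text(block)), block_area(block))
```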

Block types

The examples below are pseudo-JSON: they contain comments and elided fields, and only illustrate the shape of the output.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
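
The level field makes it straightforward to derive a document outline. A minimal sketch, assuming blocks have been loaded as dicts shaped like the example above (build_outline is a hypothetical helper, not a library function):

```python
def build_outline(blocks: list) -> list:
    """Return indented heading titles, two spaces per nesting level."""
    lines = []
    for block in blocks:
        if block.get("type") != "heading":
            continue
        title = "".join(s.get("text", "") for s in block.get("spans", []))
        level = max(1, int(block.get("level", 1)))  # treat missing level as 1
        lines.append("  " * (level - 1) + title.strip())
    return lines


sample_blocks = [
    {"type": "heading", "level": 1, "spans": [{"text": "Intro"}]},
    {"type": "text", "spans": [{"text": "Body text."}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Scope "}]},
]
outline = build_outline(sample_blocks)
print("\n".join(outline))
```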

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
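
List items can be flattened back into text lines using indent and prefix. The sketch below assumes the shape shown above, where prefix is false for bulleted items and a string such as "1." for numbered ones; render_list is a hypothetical helper.

```python
def render_list(block: dict) -> list:
    """Turn a list block's items into indented, prefixed text lines."""
    lines = []
    for item in block.get("items", []):
        text = "".join(s.get("text", "") for s in item.get("spans", []))
        marker = item.get("prefix") or "-"  # falsy prefix: fall back to a dash
        lines.append("  " * item.get("indent", 0) + f"{marker} {text}")
    return lines


sample_list = {
    "type": "list",
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 1, "prefix": "1."},
    ],
}
rendered = render_list(sample_list)
print("\n".join(rendered))
```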

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
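
Since each cell carries its own spans, a table block can be reduced to a grid of strings and exported. A minimal sketch using only the rows, cells, and spans keys shown above (table_to_rows and table_to_csv are hypothetical helpers):

```python
import csv
import io


def table_to_rows(block: dict) -> list:
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        ["".join(s.get("text", "") for s in cell.get("spans", []))
         for cell in row.get("cells", [])]
        for row in block.get("rows", [])
    ]


def table_to_csv(block: dict) -> str:
    """Serialize the table grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table_to_rows(block))
    return buffer.getvalue()


sample_table = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]},
                   {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]},
                   {"spans": [{"text": "2"}]}]},
    ],
}
csv_text = table_to_csv(sample_table)
print(csv_text)
```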

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
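
To show how the style flags can be consumed, here is a rough sketch that renders spans as inline Markdown. This is not the library's own .markdown implementation; span_to_markdown is a hypothetical helper, and it ignores details such as trailing whitespace inside emphasized spans.

```python
def span_to_markdown(span: dict) -> str:
    """Wrap a span's text in Markdown markers according to its style flags."""
    text = span.get("text", "")
    if not text.strip():
        return text  # leave whitespace-only spans untouched
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


sample_spans = [
    {"text": "See ", "bold": False, "link": False, "uri": False},
    {"text": "the docs", "link": True, "uri": "https://example.com"},
]
line = "".join(span_to_markdown(s) for s in sample_spans)
print(line)  # See [the docs](https://example.com)
```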

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
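
A minimal sketch of that idea: start a new chunk at each heading or whenever the vertical gap to the previous block exceeds a threshold. The names and the 24-point default are assumptions for illustration, and the sort assumes a single-column page whose y coordinates grow downward.

```python
def chunk_blocks(blocks: list, gap_threshold: float = 24.0) -> list:
    """Group blocks into chunks, splitting on headings and large y-gaps."""
    chunks, current, prev_bottom = [], [], None
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        top, bottom = block["bbox"][1], block["bbox"][3]
        gap = 0.0 if prev_bottom is None else top - prev_bottom
        if current and (block["type"] == "heading" or gap > gap_threshold):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


sample = [
    {"type": "heading", "bbox": [72, 50, 400, 80]},
    {"type": "text", "bbox": [72, 90, 540, 150]},
    {"type": "text", "bbox": [72, 160, 540, 200]},
    {"type": "text", "bbox": [72, 300, 540, 340]},   # 100 pt gap: new chunk
    {"type": "heading", "bbox": [72, 350, 400, 380]},
    {"type": "text", "bbox": [72, 390, 540, 420]},
]
chunk_sizes = [len(chunk) for chunk in chunk_blocks(sample)]
print(chunk_sizes)  # [3, 1, 2]
```

Font size changes and indentation could be added as further split conditions in the same loop.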

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

Any trade-offs due to the speed gains? You must have lost some fidelity compared to pymupdf4llm?
If we're talking about trade-offs in comparison to PyMuPDF4LLM:

Not as much as you'd think.

PyMuPDF4LLM is slow because of an inefficient codebase, not because of its quality: O(n^2) algorithms, number crunching in pure Python, and generally unoptimized code in a language poorly suited to heavy maths.

This isn't a trade-off of the project itself, but there may still be minor cases where I haven't 100% copied the heuristics.

If we're talking about trade-offs in comparison to tools like Paddle, Marker & Docling:

It does not do any fancy ML. It's just some basic geometric maths. Therefore it won't handle:

  • scanned pages; no OCR
  • complex tables, or tables without some form of visible edges

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all. But I just got bored of waiting for the PDFs to be chunked every time I made a minor change. I couldn't find anything with even 50% of the quality that would be faster. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either of them.


Performance Breakdown

Using go/cmd/tomd/main.go with input_pdf [output_dir], I measured performance on:

  • ~1600 page document (path not available)
  • ~150 page document (test_data/pdfs/nist.pdf)

Performance depends on document size and available cores. With more pages to saturate your cores, you may see better throughput. Throughput should scale approximately linearly with core count.

Test system: AMD Ryzen 7 4800H (8 cores, 6 used)

Runtime breakdown:

  • Go code: ~25% of runtime
  • MuPDF: ~75% of runtime

On the NIST document (150 pages): Go spent 78ms out of 363ms total (21%), MuPDF spent 285ms (79%).

Calculated average:

  • 1600 pages in 3000ms + 150 pages in 350ms = 1750 pages in 3350ms
  • ~520 pages/second
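
The average above is pooled (total pages over total time) rather than a mean of the two per-document rates. The arithmetic, using only the figures reported in this section, can be checked as follows:

```python
# (pages, seconds) for the two benchmark documents reported above
runs = [(1600, 3.000), (150, 0.350)]

total_pages = sum(pages for pages, _ in runs)
total_seconds = sum(seconds for _, seconds in runs)
throughput = total_pages / total_seconds  # pooled pages per second

# Share of the NIST run spent in Go code (78 ms of 363 ms)
go_share = 78 / 363

print(f"{total_pages} pages in {total_seconds:.2f}s = {throughput:.0f} pages/s")
print(f"Go share on NIST run: {go_share:.0%}")
```

Averaging the two per-document rates instead would give a lower figure, because the shorter document runs at a lower rate.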

Licensing and Links

licensing

TL;DR: use it all you want in OSS. if you buy a license for MuPDF from Artifex, you are exempt from all AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use outside the AGPL's terms requires a license from Artifex.

modifications and enhancements specific to this library are copyright 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.
