C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

fast PDF extractor in C using MuPDF. Outputs structured JSON with layout metadata. ~300 pages/second.

primarily intended for use through the Python bindings, but i got bored and added Rust ones too if ya want.


what this is

a PDF extractor in C using MuPDF, inspired by pymupdf4llm. i borrowed many of its heuristics and its overall approach but rewrote it in C for speed, then bound it to Python and Rust so it's easy to use.

outputs JSON for every block: text, type, bounding box, font metrics, tables. you get the raw data to process however you need.

speed: ~300 pages/second on CPU. 1 million pages in ~55 minutes.
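
a quick sanity check of that arithmetic, as a minimal sketch. the helper name `estimated_minutes` is made up for this example, and it assumes a steady single-process throughput with no I/O stalls:

```python
def estimated_minutes(pages: int, pages_per_second: float = 300.0) -> float:
    """Return the minutes needed to process `pages` at a steady throughput."""
    return pages / pages_per_second / 60.0


# 1,000,000 pages at ~300 pages/second
print(round(estimated_minutes(1_000_000), 1))  # 55.6
```

real throughput will depend on your hardware and on how dense the pages are, so treat the number as a rough budget.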


the problem

most extractors give you raw text (fast but useless) or over-engineered solutions (slow, opinionated, not built for what you need). you want structured data. you want to know where things are, what they are, whether they're headers or body text. and you want this fast if you're processing large volumes.


what you get

JSON with geometry, typography, and structure. use bounding boxes to find natural document boundaries. detect headers and footers by coordinates. reconstruct tables properly. you decide what to do with it.

{
  "type": "heading",
  "text": "Step 1. Gather threat intelligence",
  "bbox": [64.00, 173.74, 491.11, 218.00],
  "font_size": 21.64,
  "font_weight": "bold"
}

instead of splitting on word count and getting mid-sentence breaks, you use layout to chunk semantically.
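
one way to chunk on layout instead of word count, as a rough sketch: start a new chunk at every block typed `heading`. `chunk_by_heading` is a hypothetical helper (not part of the library), and the sample paragraphs are invented for illustration:

```python
def chunk_by_heading(blocks):
    """Group blocks into chunks, starting a new chunk at every heading."""
    chunks = []
    current = None
    for block in blocks:
        text = block.get("text", "")
        if block.get("type") == "heading":
            current = {"title": text, "body": []}
            chunks.append(current)
        elif text:
            if current is None:
                # text that appears before the first heading gets an untitled chunk
                current = {"title": None, "body": []}
                chunks.append(current)
            current["body"].append(text)
    return chunks


# made-up sample blocks in the shape shown above
blocks = [
    {"type": "heading", "text": "Step 1. Gather threat intelligence"},
    {"type": "paragraph", "text": "Collect feeds."},
    {"type": "paragraph", "text": "Deduplicate them."},
    {"type": "heading", "text": "Step 2. Triage"},
    {"type": "paragraph", "text": "Rank by severity."},
]

for chunk in chunk_by_heading(blocks):
    print(chunk["title"], "->", " ".join(chunk["body"]))
```

blocks without text (tables, for example) are skipped here; you would probably want to serialize them into the chunk yourself.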


comparison

| Tool | Speed (pps) | Quality | Tables | Output | Use Case |
| --- | --- | --- | --- | --- | --- |
| pymupdf4llm-C | ~300 | Good | Yes | Structured | High volume, full control |
| pymupdf4llm | ~10 | Good | Yes | Markdown | General |
| pymupdf | ~250 | Subpar | No | Text only | Basic extraction |
| marker | ~0.5-1 | Excellent | Yes | Markdown | Maximum accuracy |
| docling | ~2-5 | Excellent | Yes | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Good (OCR) | Yes | Text | Scanned documents |

tradeoff: speed and control vs automatic extraction. marker and docling give higher fidelity if you have time.


what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

installation

pip install pymupdf4llm-c

or Rust:

cargo add pymupdf4llm-c

wheels for Python 3.10–3.13 on macOS (ARM/x64) and Linux (glibc 2.17+). no Windows wheels; see BUILD.md to compile from source.


usage

Python

basic

from pymupdf4llm_c import to_json

output_file = to_json("example.pdf")
print(f"Extracted to: {output_file}")

collect in memory

from pymupdf4llm_c import to_json

pages = to_json("report.pdf", collect=True)

for page_obj in pages:
    blocks = page_obj.get("data", [])
    for block in blocks:
        print(f"{block.get('type')}: {block.get('text', '')}")

large files (streaming)

from pymupdf4llm_c import iterate_json_pages

for page_blocks in iterate_json_pages("large.pdf"):
    for block in page_blocks:
        print(f"Block type: {block['type']}")
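
a small sketch of what you might do while streaming: tally block types without holding the whole document in memory. `count_block_types` is a hypothetical helper that works on any iterable of pages, and it assumes each page is a list of block dicts as in the loop above:

```python
from collections import Counter


def count_block_types(pages):
    """Tally block types over an iterable of pages (each a list of block dicts)."""
    counts = Counter()
    for page_blocks in pages:
        for block in page_blocks:
            counts[block.get("type", "unknown")] += 1
    return counts


# stand-in data so the example runs without a PDF
sample_pages = [
    [{"type": "heading"}, {"type": "paragraph"}],
    [{"type": "paragraph"}, {"type": "table"}],
]
print(count_block_types(sample_pages))
```

if the generator behaves as described, `count_block_types(iterate_json_pages("large.pdf"))` should work the same way.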

per-page files

from pymupdf4llm_c import to_json
json_files = to_json("example.pdf", output_dir="output_json")

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Rust

use pymupdf4llm_c::{to_json, to_json_collect, PdfError};

fn main() -> Result<(), PdfError> {
    let paths = to_json("example.pdf", None)?;
    println!("Generated {} files", paths.len());

    let pages = to_json_collect("example.pdf", None)?;
    println!("Parsed {} pages", pages.len());

    Ok(())
}

output structure

each page is a JSON array of blocks:

[
  {
    "type": "heading",
    "text": "Introduction",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 21.64,
    "font_weight": "bold",
    "page_number": 0
  },
  {
    "type": "paragraph",
    "text": "This document describes...",
    "bbox": [72.0, 140.5, 523.5, 200.2],
    "font_size": 12.0,
    "page_number": 0
  },
  {
    "type": "table",
    "bbox": [72.0, 220.0, 523.5, 400.0],
    "row_count": 3,
    "col_count": 2,
    "rows": [
      {
        "cells": [
          { "text": "Header A", "bbox": [72.0, 220.0, 297.75, 250.0] },
          { "text": "Header B", "bbox": [297.75, 220.0, 523.5, 250.0] }
        ]
      }
    ]
  }
]
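
a table block like the one above can be flattened into plain rows with a few lines. this is a sketch, not library API: `table_to_rows` and `rows_to_markdown` are hypothetical helpers, and the second row in the sample is made up:

```python
def table_to_rows(table_block):
    """Flatten a table block into a list of rows, each a list of cell strings."""
    return [
        [cell.get("text", "") for cell in row.get("cells", [])]
        for row in table_block.get("rows", [])
    ]


def rows_to_markdown(rows):
    """Render rows as a pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # pad short rows so every line has the same number of columns
    padded = [row + [""] * (width - len(row)) for row in rows]
    lines = [
        "| " + " | ".join(padded[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


table = {
    "type": "table",
    "rows": [
        {"cells": [{"text": "Header A"}, {"text": "Header B"}]},
        {"cells": [{"text": "a1"}, {"text": "b1"}]},  # made-up data row
    ],
}
print(rows_to_markdown(table_to_rows(table)))
```

cell text containing `|` is not escaped here, so add that if your tables need it.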

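fields: type (text, heading, paragraph, table, list, code), text, bbox (x0, y0, x1, y1), font_size, font_weight, page_number, spans (when styled). table blocks also carry row_count, col_count, and rows of cells, each cell with its own text and bbox.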
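
with bbox (x0, y0, x1, y1) on every block, headers and footers can be separated by position alone. a minimal sketch under two assumptions: y grows downward from the top of the page (which the sample coordinates suggest), and you supply the page height yourself, since it isn't among the fields listed. `split_margins` and the 8% margin are made up for illustration:

```python
def split_margins(blocks, page_height, margin=0.08):
    """Split blocks into (header, body, footer) lists by vertical position.

    Assumes y grows downward and page_height uses the same units as bbox.
    """
    top = page_height * margin
    bottom = page_height * (1.0 - margin)
    header, body, footer = [], [], []
    for block in blocks:
        _, y0, _, y1 = block["bbox"]
        if y1 <= top:
            header.append(block)  # entirely inside the top margin band
        elif y0 >= bottom:
            footer.append(block)  # entirely inside the bottom margin band
        else:
            body.append(block)
    return header, body, footer


# made-up blocks on a 792pt-tall (US Letter) page
page = [
    {"text": "ACME Corp", "bbox": [72.0, 30.0, 300.0, 42.0]},
    {"text": "Introduction", "bbox": [72.0, 100.5, 523.5, 130.2]},
    {"text": "Page 1", "bbox": [280.0, 750.0, 320.0, 762.0]},
]
header, body, footer = split_margins(page, page_height=792.0)
print([b["text"] for b in header], [b["text"] for b in body], [b["text"] for b in footer])
```

a sturdier version would also check that the same text repeats at the same position across pages before dropping it.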


faq

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
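
a rough sketch of the y-gap idea, for a single-column page. the threshold (1.5x the font size of the block above) is an invented heuristic, `split_on_vertical_gaps` is a hypothetical helper, and it assumes y grows downward as in the sample output:

```python
def split_on_vertical_gaps(blocks, gap_factor=1.5):
    """Split one page's blocks into groups wherever the vertical gap is large.

    A gap counts as large when it exceeds gap_factor times the font size
    of the block above it (falls back to 12.0 when font_size is missing).
    """
    ordered = sorted(blocks, key=lambda b: b["bbox"][1])  # top to bottom
    groups = []
    for block in ordered:
        if groups:
            prev = groups[-1][-1]
            gap = block["bbox"][1] - prev["bbox"][3]
            threshold = gap_factor * prev.get("font_size", 12.0)
            if gap <= threshold:
                groups[-1].append(block)
                continue
        groups.append([block])
    return groups


# made-up blocks: a 10pt gap, then a 60pt gap
blocks = [
    {"text": "a", "bbox": [72.0, 100.0, 500.0, 130.0], "font_size": 12.0},
    {"text": "b", "bbox": [72.0, 140.0, 500.0, 200.0], "font_size": 12.0},
    {"text": "c", "bbox": [72.0, 260.0, 500.0, 300.0], "font_size": 12.0},
]
for group in split_on_vertical_gaps(blocks):
    print([b["text"] for b in group])
```

multi-column layouts need a column pass first (group by x0), otherwise blocks from different columns get interleaved.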

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL v3 or with a license from Artifex (MuPDF's creators). see LICENSE.


building from source

see BUILD.md.


development

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

build native extractor, then:

./lint.sh
pytest

license

AGPL v3. commercial use outside the AGPL's terms requires a license from Artifex.


links

feedback welcome.
