C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

fast PDF extractor in C using MuPDF. Outputs structured JSON with layout metadata. ~300 pages/second.

primarily intended for use through the Python bindings. but for some reason i got bored and added Rust ones too if ya want.


what this is

a PDF extractor in C using MuPDF, inspired by pymupdf4llm. i borrowed many of its heuristics and its general approach, rewrote them in C for speed, then bound the result to Python and Rust so it's easy to use.

outputs JSON for every block: text, type, bounding box, font metrics, tables. you get the raw data to process however you need.

speed: ~300 pages/second on CPU. 1 million pages in ~55 minutes.
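the arithmetic behind that estimate, as a quick sanity check. `minutes_for` is a hypothetical helper written for this illustration, not part of the library, and the 300 pages/second figure is the approximate number quoted above, so treat the result as a rough estimate for your own hardware.

```python
# Back-of-the-envelope check of the quoted throughput figure.
PAGES_PER_SECOND = 300  # approximate figure quoted above, not a guarantee


def minutes_for(pages: int, pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Estimate wall-clock minutes needed to process `pages` pages."""
    seconds = pages / pages_per_second
    return seconds / 60


print(f"{minutes_for(1_000_000):.1f} minutes")  # 55.6 minutes
```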


the problem

most extractors give you raw text (fast but useless) or over-engineered solutions (slow, opinionated, not built for what you need). you want structured data. you want to know where things are, what they are, whether they're headers or body text. and you want this fast if you're processing large volumes.


what you get

JSON with geometry, typography, and structure. use bounding boxes to find natural document boundaries. detect headers and footers by coordinates. reconstruct tables properly. you decide what to do with it.

{
  "type": "heading",
  "text": "Step 1. Gather threat intelligence",
  "bbox": [64.00, 173.74, 491.11, 218.00],
  "font_size": 21.64,
  "font_weight": "bold"
}

instead of splitting on word count and getting mid-sentence breaks, you use layout to chunk semantically.
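detecting headers and footers by coordinates could look roughly like the sketch below. it is only an illustration: `split_margins` is a hypothetical helper, the 792-point page height (US Letter) and 50-point margin are assumptions you would tune, and it assumes y grows downward from the top of the page, which matches the sample output where later blocks have larger y values.

```python
def split_margins(blocks, page_height=792.0, margin=50.0):
    """Split blocks into (header, body, footer) by vertical position.

    Assumes bbox is [x0, y0, x1, y1] with y increasing downward.
    page_height and margin are illustrative defaults, not library values.
    """
    header, body, footer = [], [], []
    for block in blocks:
        _, y0, _, y1 = block["bbox"]
        if y1 <= margin:
            # Block ends inside the top margin band.
            header.append(block)
        elif y0 >= page_height - margin:
            # Block starts inside the bottom margin band.
            footer.append(block)
        else:
            body.append(block)
    return header, body, footer


sample = [
    {"type": "text", "text": "ACME internal", "bbox": [72.0, 20.0, 300.0, 40.0]},
    {"type": "paragraph", "text": "This document describes...", "bbox": [72.0, 140.5, 523.5, 200.2]},
    {"type": "text", "text": "page 3", "bbox": [280.0, 760.0, 320.0, 775.0]},
]
header, body, footer = split_margins(sample)
print(len(header), len(body), len(footer))  # 1 1 1
```

a fixed margin is the simplest rule. if your documents vary in page size, you would need the real page dimensions from somewhere else, since the sample output shown here does not include them.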


comparison

Tool          | Speed (pages/s) | Quality    | Tables | JSON       | Use Case
--------------|-----------------|------------|--------|------------|--------------------------
pymupdf4llm-C | ~300            | Good       | Yes    | Structured | High volume, full control
pymupdf4llm   | ~10             | Good       | Yes    | Markdown   | General
pymupdf       | ~250            | Subpar     | No     | Text only  | Basic extraction
marker        | ~0.5-1          | Excellent  | Yes    | Markdown   | Maximum accuracy
docling       | ~2-5            | Excellent  | Yes    | JSON       | Document intelligence
PaddleOCR     | ~20-50          | Good (OCR) | Yes    | Text       | Scanned documents

tradeoff: speed and control vs automatic extraction. marker and docling give higher fidelity if you have time.


what it handles well

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

installation

pip install pymupdf4llm-c

or Rust:

cargo add pymupdf4llm-c

wheels for this release (1.2.1): CPython 3.9–3.12 on macOS 11.0+ (ARM64) and Linux x86-64 (glibc 2.17+). no Windows wheels; see BUILD.md to compile.


usage

Python

basic

from pymupdf4llm_c import to_json

output_file = to_json("example.pdf")
print(f"Extracted to: {output_file}")

collect in memory

from pymupdf4llm_c import to_json

pages = to_json("report.pdf", collect=True)

for page_obj in pages:
    blocks = page_obj.get("data", [])
    for block in blocks:
        print(f"{block.get('type')}: {block.get('text', '')}")
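once the pages are in memory, a common next step is a quick summary. the sketch below counts block types and pulls out a heading outline. `summarize` is a hypothetical helper, and it assumes the shape used in the loop above (each page object holds its blocks under a "data" key).

```python
from collections import Counter


def summarize(pages):
    """Return (block type counts, heading outline) for collected pages.

    Assumes each page object stores its blocks under "data", as in the
    loop above. The outline is a list of (page_number, heading text).
    """
    counts = Counter()
    outline = []
    for page_obj in pages:
        for block in page_obj.get("data", []):
            block_type = block.get("type", "unknown")
            counts[block_type] += 1
            if block_type == "heading":
                outline.append((block.get("page_number"), block.get("text", "")))
    return counts, outline


# Hand-written sample data in the documented shape, for illustration only.
sample_pages = [
    {"data": [
        {"type": "heading", "text": "Introduction", "page_number": 0},
        {"type": "paragraph", "text": "This document describes...", "page_number": 0},
    ]},
    {"data": [
        {"type": "heading", "text": "Methods", "page_number": 1},
    ]},
]
counts, outline = summarize(sample_pages)
print(dict(counts))
print(outline)
```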

large files (streaming)

from pymupdf4llm_c import iterate_json_pages

for page_blocks in iterate_json_pages("large.pdf"):
    for block in page_blocks:
        print(f"Block type: {block['type']}")
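to keep memory flat while streaming, you can filter lazily instead of building lists. a minimal sketch: `iter_blocks_of_type` is a hypothetical helper that accepts any iterable of per-page block lists, which is the shape the loop above iterates over.

```python
def iter_blocks_of_type(page_iter, block_type):
    """Lazily yield (page_index, block) for blocks matching block_type.

    page_iter is any iterable of per-page block lists. page_index is just
    the enumeration order, not a field read from the output.
    """
    for page_index, page_blocks in enumerate(page_iter):
        for block in page_blocks:
            if block.get("type") == block_type:
                yield page_index, block


# Stand-in for a streamed document: two pages of hand-written blocks.
fake_stream = [
    [{"type": "heading", "text": "Introduction"}, {"type": "table", "row_count": 3}],
    [{"type": "paragraph", "text": "Body text"}, {"type": "table", "row_count": 5}],
]
for page_index, table in iter_blocks_of_type(fake_stream, "table"):
    print(page_index, table["row_count"])
```

in real use you would pass the streaming iterator in place of `fake_stream`, so only one page of blocks is held at a time.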

per-page files

json_files = to_json("example.pdf", output_dir="output_json")
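reading those files back is plain JSON loading. the sketch below assumes the call returns an iterable of paths to per-page JSON files, each holding an array of blocks; that return shape is an assumption here (the Rust example suggests a list of paths), and `load_page_files` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_page_files(json_files):
    """Load per-page JSON files into a list of parsed pages.

    Assumes json_files is an iterable of file paths and that each file
    contains one JSON document (an array of blocks, per the output
    structure section).
    """
    pages = []
    for path in json_files:
        with Path(path).open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages
```

usage would be `pages = load_page_files(json_files)`, after which each entry can be processed like the in-memory examples above.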

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Rust

use pymupdf4llm_c::{to_json, to_json_collect, PdfError};

fn main() -> Result<(), PdfError> {
    let paths = to_json("example.pdf", None)?;
    println!("Generated {} files", paths.len());

    let pages = to_json_collect("example.pdf", None)?;
    println!("Parsed {} pages", pages.len());

    Ok(())
}

output structure

each page is a JSON array of blocks:

[
  {
    "type": "heading",
    "text": "Introduction",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 21.64,
    "font_weight": "bold",
    "page_number": 0
  },
  {
    "type": "paragraph",
    "text": "This document describes...",
    "bbox": [72.0, 140.5, 523.5, 200.2],
    "font_size": 12.0,
    "page_number": 0
  },
  {
    "type": "table",
    "bbox": [72.0, 220.0, 523.5, 400.0],
    "row_count": 3,
    "col_count": 2,
    "rows": [
      {
        "cells": [
          { "text": "Header A", "bbox": [72.0, 220.0, 297.75, 250.0] },
          { "text": "Header B", "bbox": [297.75, 220.0, 523.5, 250.0] }
        ]
      }
    ]
  }
]

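fields: type (text, heading, paragraph, table, list, code), text, bbox (x0, y0, x1, y1), font_size, font_weight, page_number, spans (when styled). table blocks carry row_count, col_count, and rows of cells (each cell with text and bbox).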
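turning a table block into something tabular is a matter of walking rows and cells. a minimal sketch using the structure shown in the sample output. `table_to_rows` and `rows_to_csv` are hypothetical helpers, and merged or empty cells may need handling that this sketch ignores.

```python
import csv
import io


def table_to_rows(table_block):
    """Flatten a table block into a list of rows of cell text."""
    return [
        [cell.get("text", "") for cell in row.get("cells", [])]
        for row in table_block.get("rows", [])
    ]


def rows_to_csv(rows):
    """Serialize rows of strings to CSV text with "\n" line endings."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Table block in the documented shape; the second row is made up for the demo.
table = {
    "type": "table",
    "rows": [
        {"cells": [{"text": "Header A"}, {"text": "Header B"}]},
        {"cells": [{"text": "1"}, {"text": "x, y"}]},
    ],
}
print(rows_to_csv(table_to_rows(table)))
```

the csv module quotes cells that contain commas, so the output stays parseable even when cell text is messy.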


faq

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
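one way to turn those signals into chunks, as a rough sketch: start a new chunk at every heading or whenever the vertical gap to the previous block exceeds a threshold. `chunk_by_layout` is a hypothetical helper, the 30-point threshold is an arbitrary assumption to tune per corpus, and it assumes a single page with a single column and y increasing downward.

```python
def chunk_by_layout(blocks, gap_threshold=30.0):
    """Group one page's blocks into chunks using headings and y-gaps.

    A new chunk starts at each heading, or when the gap between the
    previous block's bottom (y1) and this block's top (y0) exceeds
    gap_threshold. Single-column layout is assumed.
    """
    chunks = []
    current = []
    prev_bottom = None
    # Sort top to bottom so gaps are measured between vertical neighbours.
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        top, bottom = block["bbox"][1], block["bbox"][3]
        gap = top - prev_bottom if prev_bottom is not None else 0.0
        if current and (block.get("type") == "heading" or gap > gap_threshold):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


page = [
    {"type": "heading", "text": "Introduction", "bbox": [72.0, 100.0, 523.5, 130.0]},
    {"type": "paragraph", "text": "First paragraph.", "bbox": [72.0, 140.0, 523.5, 200.0]},
    {"type": "paragraph", "text": "After a big gap.", "bbox": [72.0, 300.0, 523.5, 350.0]},
    {"type": "heading", "text": "Details", "bbox": [72.0, 360.0, 523.5, 380.0]},
    {"type": "paragraph", "text": "More text.", "bbox": [72.0, 390.0, 523.5, 420.0]},
]
print([len(chunk) for chunk in chunk_by_layout(page)])  # [2, 1, 2]
```

font size changes and indentation (x0) could be added as extra split conditions in the same loop. multi-column pages would need the blocks grouped by column first.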

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as they do with ML tools.

commercial use?
only under AGPL v3 or with a license from Artifex (MuPDF's creators). see LICENSE.


building from source

see BUILD.md.


development

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

build the native extractor (see BUILD.md), then:

./lint.sh
pytest

license

AGPL v3. commercial use outside the AGPL's terms requires a license from Artifex.


links

feedback welcome.

