C-backed PDF to structured JSON extractor.
PyMuPDF4LLM-C
fast PDF extractor in C using MuPDF. Outputs structured JSON with layout metadata. ~300 pages/second.
primarily intended for use with python bindings. but for some reason i got bored and added Rust ones too if ya want.
what this is
a PDF extractor in C using MuPDF, inspired by pymupdf4llm. i took many of its heuristics and approach but rewrote it in C for speed, then bound it to Python and Rust so it's easy to use.
outputs JSON for every block: text, type, bounding box, font metrics, tables. you get the raw data to process however you need.
speed: ~300 pages/second on CPU. 1 million pages in ~55 minutes.
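the arithmetic behind that estimate, as a tiny sketch. `estimated_minutes` is a made-up helper for illustration, not part of the library, and it assumes a steady rate with no I/O stalls:

```python
def estimated_minutes(pages: int, pages_per_second: float = 300.0) -> float:
    """Wall-clock minutes to process `pages` at a steady extraction rate."""
    return pages / pages_per_second / 60.0


# 1,000,000 pages / 300 pages per second = ~3333 s, i.e. roughly 55 minutes
print(round(estimated_minutes(1_000_000), 1))
```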
the problem
most extractors give you raw text (fast but useless) or over-engineered solutions (slow, opinionated, not built for what you need). you want structured data. you want to know where things are, what they are, whether they're headers or body text. and you want this fast if you're processing large volumes.
what you get
JSON with geometry, typography, and structure. use bounding boxes to find natural document boundaries. detect headers and footers by coordinates. reconstruct tables properly. you decide what to do with it.
{
  "type": "heading",
  "text": "Step 1. Gather threat intelligence",
  "bbox": [64.00, 173.74, 491.11, 218.00],
  "font_size": 21.64,
  "font_weight": "bold"
}
instead of splitting on word count and getting mid-sentence breaks, you use layout to chunk semantically.
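a minimal sketch of what that can look like: start a new chunk at every block whose `type` is `heading`. `chunk_by_heading` is a hypothetical helper written for this example (it is not shipped with the package), and the sample blocks are invented; only the `type` and `text` keys come from the output format.

```python
def chunk_by_heading(blocks):
    """Group blocks into sections, starting a new section at each heading."""
    sections = []
    current = {"title": None, "text": []}
    for block in blocks:
        if block.get("type") == "heading":
            # Close the running section before opening a new one.
            if current["title"] is not None or current["text"]:
                sections.append(current)
            current = {"title": block.get("text", ""), "text": []}
        elif block.get("text"):
            current["text"].append(block["text"])
    if current["title"] is not None or current["text"]:
        sections.append(current)
    return sections


# Invented sample data in the shape shown above.
blocks = [
    {"type": "heading", "text": "Step 1. Gather threat intelligence"},
    {"type": "paragraph", "text": "Collect feeds from your sources."},
    {"type": "paragraph", "text": "Deduplicate the indicators."},
    {"type": "heading", "text": "Step 2. Triage"},
    {"type": "paragraph", "text": "Rank findings by severity."},
]
sections = chunk_by_heading(blocks)
for section in sections:
    print(section["title"], "->", len(section["text"]), "paragraph(s)")
```

text that appears before the first heading ends up in a section with `title` set to `None`, so nothing is dropped.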
comparison
| Tool | Speed (pps) | Quality | Tables | JSON | Use Case |
|---|---|---|---|---|---|
| pymupdf4llm-C | ~300 | Good | Yes | Structured | High volume, full control |
| pymupdf4llm | ~10 | Good | Yes | Markdown | General |
| pymupdf | ~250 | Subpar | No | Text only | Basic extraction |
| marker | ~0.5-1 | Excellent | Yes | Markdown | Maximum accuracy |
| docling | ~2-5 | Excellent | Yes | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Good (OCR) | Yes | Text | Scanned documents |
tradeoff: speed and control vs automatic extraction. marker and docling give higher fidelity if you have time.
what it handles well
- millions of pages, fast
- custom parsing logic; you own the rules
- document archives, chunking strategies, any structured extraction
- CPU only; no expensive inference
- iterating on parsing logic without waiting hours
what it doesn't handle
- scanned or image-heavy PDFs (no OCR)
- 99%+ accuracy on edge cases; trades precision for speed
- figures or image extraction
installation
pip install pymupdf4llm-c
or Rust:
cargo add pymupdf4llm-c
wheels for Python 3.10–3.13 on macOS (ARM/x64) and Linux (glibc 2.17+, per the manylinux_2_17 wheel tags). no Windows wheels; see BUILD.md to compile from source.
usage
Python
basic
from pymupdf4llm_c import to_json
output_file = to_json("example.pdf")
print(f"Extracted to: {output_file}")
collect in memory
from pymupdf4llm_c import to_json

# collect=True returns the parsed pages instead of writing files
pages = to_json("report.pdf", collect=True)
for page_obj in pages:
    blocks = page_obj.get("data", [])
    for block in blocks:
        print(f"{block.get('type')}: {block.get('text', '')}")
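once the pages are in memory you can post-process them however you like. as one example, here is a rough sketch for spotting running headers and footers by coordinates: text that shows up at the same vertical position on most pages. `repeated_lines` and `drop_repeated` are hypothetical helpers, the 60% threshold is arbitrary, and the page shape (`{"data": [...]}`) is assumed from the loop above. the sample pages are invented.

```python
from collections import Counter


def repeated_lines(pages, min_share=0.6):
    """Find (text, rounded y0) pairs that recur on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        keys = set()  # count each key at most once per page
        for block in page.get("data", []):
            text = block.get("text", "").strip()
            bbox = block.get("bbox")
            if text and bbox:
                keys.add((text, round(bbox[1])))
        counts.update(keys)
    needed = min_share * len(pages)
    return {key for key, n in counts.items() if n >= needed}


def drop_repeated(pages, repeated):
    """Return each page's blocks with the repeated lines removed."""
    cleaned = []
    for page in pages:
        kept = []
        for block in page.get("data", []):
            bbox = block.get("bbox")
            key = (block.get("text", "").strip(), round(bbox[1])) if bbox else None
            if key not in repeated:
                kept.append(block)
        cleaned.append(kept)
    return cleaned


# Invented sample: the same banner at y0=20 on every page, plus unique body text.
sample_pages = [
    {"data": [
        {"type": "text", "text": "ACME Confidential", "bbox": [72, 20, 200, 30]},
        {"type": "paragraph", "text": f"Body of page {i}", "bbox": [72, 100, 500, 200]},
    ]}
    for i in range(3)
]
repeated = repeated_lines(sample_pages)
cleaned = drop_repeated(sample_pages, repeated)
```

this only catches lines whose text is identical across pages; footers that embed a changing page number would need a looser match.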
large files (streaming)
from pymupdf4llm_c import iterate_json_pages
for page_blocks in iterate_json_pages("large.pdf"):
    for block in page_blocks:
        print(f"Block type: {block['type']}")
per-page files
json_files = to_json("example.pdf", output_dir="output_json")
command-line
python -m pymupdf4llm_c.main input.pdf [output_dir]
Rust
use pymupdf4llm_c::{to_json, to_json_collect, PdfError};
fn main() -> Result<(), PdfError> {
    let paths = to_json("example.pdf", None)?;
    println!("Generated {} files", paths.len());
    let pages = to_json_collect("example.pdf", None)?;
    println!("Parsed {} pages", pages.len());
    Ok(())
}
output structure
each page is a JSON array of blocks:
[
  {
    "type": "heading",
    "text": "Introduction",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 21.64,
    "font_weight": "bold",
    "page_number": 0
  },
  {
    "type": "paragraph",
    "text": "This document describes...",
    "bbox": [72.0, 140.5, 523.5, 200.2],
    "font_size": 12.0,
    "page_number": 0
  },
  {
    "type": "table",
    "bbox": [72.0, 220.0, 523.5, 400.0],
    "row_count": 3,
    "col_count": 2,
    "rows": [
      {
        "cells": [
          { "text": "Header A", "bbox": [72.0, 220.0, 297.75, 250.0] },
          { "text": "Header B", "bbox": [297.75, 220.0, 523.5, 250.0] }
        ]
      }
    ]
  }
]
fields: type (text, heading, paragraph, table, list, code), bbox (x0, y0, x1, y1), font_size, font_weight, spans (when styled).
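table blocks nest their text under `rows` and `cells`, so a common first step is flattening them into plain rows. a small sketch, assuming the table shape shown in the example above; `table_to_rows` is a hypothetical helper, not a library function, and the second row of the sample is invented:

```python
def table_to_rows(table_block):
    """Turn a table block into a list of rows, each a list of cell strings."""
    return [
        [cell.get("text", "") for cell in row.get("cells", [])]
        for row in table_block.get("rows", [])
    ]


table = {
    "type": "table",
    "rows": [
        {"cells": [{"text": "Header A"}, {"text": "Header B"}]},
        {"cells": [{"text": "1"}, {"text": "2"}]},
    ],
}
rows = table_to_rows(table)
print(rows)
```

from there the rows can go straight into `csv.writer` or a dataframe.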
faq
why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.
how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
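a minimal sketch of the y-gap idea for a single page. it assumes `bbox` is `(x0, y0, x1, y1)` with y growing down the page (as the sample output suggests) and a single-column layout; `split_on_vertical_gaps` and the `min_gap` value of 24 are made up for illustration and will need tuning per document set.

```python
def split_on_vertical_gaps(blocks, min_gap=24.0):
    """Split one page's blocks into groups wherever the vertical gap is large.

    The gap is the next block's y0 minus the previous block's y1, in the
    same units as bbox.
    """
    ordered = sorted(blocks, key=lambda b: b["bbox"][1])  # top to bottom
    groups = []
    for block in ordered:
        if groups and block["bbox"][1] - groups[-1][-1]["bbox"][3] <= min_gap:
            groups[-1].append(block)  # close enough: same group
        else:
            groups.append([block])  # first block or a large gap: new group
    return groups


# Invented sample: gaps of 10 and then 60 units between consecutive blocks.
page_blocks = [
    {"type": "heading", "text": "Introduction", "bbox": [72.0, 100.0, 523.0, 130.0]},
    {"type": "paragraph", "text": "First topic.", "bbox": [72.0, 140.0, 523.0, 200.0]},
    {"type": "paragraph", "text": "Second topic.", "bbox": [72.0, 260.0, 523.0, 300.0]},
]
groups = split_on_vertical_gaps(page_blocks)
print([len(g) for g in groups])
```

the same loop can compare `font_size` or `bbox[0]` between neighbours to pick up section changes and indentation.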
will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.
commercial use?
only under AGPL v3 or with a license from Artifex (MuPDF's creators). see LICENSE.
building from source
see BUILD.md.
development
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
build native extractor, then:
./lint.sh
pytest
license
AGPL v3. commercial use requires license from Artifex.
links
- repo: github.com/intercepted16/pymupdf4llm-C
- pypi: pymupdf4llm-C
feedback welcome.