C-backed PDF to structured JSON extractor.
PyMuPDF4LLM-C
PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.
Highlights
- Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
- Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
- Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.
Installation
Install the Python package from PyPI:
pip install pymupdf4llm-c
For Rust, install with Cargo:
cargo add pymupdf4llm-c
Building the Native Extractor
For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.
Usage
Python Usage
Basic usage
from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json
pdf_path = Path("example.pdf")
try:
    # Extract to a merged JSON file (default)
    output_file = to_json(pdf_path)
    print(f"Extracted to: {output_file}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")
Collecting parsed blocks in memory
Use collect=True to get parsed JSON in memory instead of writing to a file:
from pymupdf4llm_c import to_json
# Returns list of page data (merged JSON structure)
pages = to_json("report.pdf", collect=True)
for page_obj in pages:
    page_num = page_obj.get("page", 0)
    blocks = page_obj.get("data", [])
    print(f"Page {page_num}: {len(blocks)} blocks")
    for block in blocks:
        print(f"  Type: {block.get('type')}, Text: {block.get('text', '')}")
Memory and Validation:
- collect=True validates the JSON structure and raises ValueError if it is invalid
- For PDFs larger than ~100MB, a warning is logged recommending iterate_json_pages() instead
- Disable the warning with warn_large_collect=False:
# Suppress memory warning for large PDFs
pages = to_json("large_document.pdf", collect=True, warn_large_collect=False)
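One way to apply the guidance above is to check the file size before choosing a mode. The sketch below is illustrative only: `choose_mode` and `LARGE_PDF_BYTES` are hypothetical names, and the threshold simply mirrors the approximate 100MB figure mentioned above, so the library's exact cutoff may differ.

```python
import os
import tempfile

# Assumed threshold mirroring the "~100MB" figure above; the library's
# actual cutoff is not documented here and may differ.
LARGE_PDF_BYTES = 100 * 1024 * 1024


def choose_mode(pdf_path, threshold=LARGE_PDF_BYTES):
    """Return "iterate" for files above the threshold, else "collect"."""
    size = os.path.getsize(pdf_path)
    return "iterate" if size > threshold else "collect"


# Demonstration with a tiny temporary file standing in for a PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n")
    demo_path = tmp.name

small_mode = choose_mode(demo_path)
tiny_threshold_mode = choose_mode(demo_path, threshold=4)
os.remove(demo_path)
```

A caller could then use `to_json(path, collect=True)` when the result is "collect" and fall back to writing to disk plus `iterate_json_pages()` otherwise.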
Iterating pages with validation
For validation and type-safe iteration over JSON page files, use the helper:
from pymupdf4llm_c import iterate_json_pages
# Yields each page as a typed Block list
for page_blocks in iterate_json_pages("path/to/page_001.json"):
    for block in page_blocks:
        print(f"Block: {block['type']}")
        if block['type'] == 'table':
            print(f"  Table: {block.get('row_count')}x{block.get('col_count')}")
Memory-Efficient Iteration:
This generator is recommended for large PDFs that would consume significant memory with collect=True. It validates JSON structure on-the-fly and yields pages one at a time:
from pathlib import Path
from pymupdf4llm_c import to_json, iterate_json_pages
# Extract PDF (writes to disk, low memory)
output_file = to_json("large_document.pdf")
# Iterate pages without loading all into memory
for page_blocks in iterate_json_pages(output_file):
    # Process each page individually (process_page is your own handler)
    process_page(page_blocks)
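The `process_page` handler in the snippet above is left to the caller. As a rough sketch of what such a handler might look like, the function below counts blocks by type and totals the text length for one page. It assumes only that each block is a dict with optional `type` and `text` keys, as in the examples in this README; the sample data is made up for illustration.

```python
from collections import Counter


def process_page(page_blocks):
    """Count blocks by type and total the text length for one page."""
    type_counts = Counter(block.get("type", "unknown") for block in page_blocks)
    text_chars = sum(len(block.get("text", "")) for block in page_blocks)
    return {"types": dict(type_counts), "text_chars": text_chars}


# Made-up sample page; real pages come from iterate_json_pages().
sample_page = [
    {"type": "heading", "text": "Intro"},
    {"type": "paragraph", "text": "Hello world"},
    {"type": "paragraph", "text": "More text"},
    {"type": "table", "row_count": 2, "col_count": 3},
]
summary = process_page(sample_page)
```

Because the handler sees one page at a time, its memory use stays proportional to a single page rather than the whole document.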
Legacy per-page output
Extract to individual per-page JSON files:
output_dir = Path("output_json")
json_files = to_json(pdf_path, output_dir=output_dir)
print(f"Generated {len(json_files)} files")
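When working with per-page files it can help to order them by page number explicitly. The sketch below assumes the `page_XXX.json` naming described in this README; `sort_page_files` is a hypothetical helper, and whether `to_json` already returns paths in page order is not stated here.

```python
import re
from pathlib import Path

# Matches names such as page_001.json and captures the page number.
PAGE_FILE = re.compile(r"page_(\d+)\.json$")


def sort_page_files(paths):
    """Order per-page files by the number embedded in their names.

    Paths that do not match the page_XXX.json pattern are skipped.
    """
    numbered = []
    for path in paths:
        match = PAGE_FILE.search(Path(path).name)
        if match:
            numbered.append((int(match.group(1)), Path(path)))
    return [path for _, path in sorted(numbered)]


files = [
    "out/page_010.json",
    "out/page_002.json",
    "out/notes.txt",
    "out/page_001.json",
]
ordered = [p.name for p in sort_page_files(files)]
```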
Override the shared library location
config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
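In deployments where the library location varies, the path passed to `ConversionConfig(lib_path=...)` could be read from the environment. The following is only a sketch: `resolve_lib_path` and the variable name `MYAPP_LIBTOMD_PATH` are hypothetical and are not part of the package.

```python
import tempfile
from pathlib import Path


def resolve_lib_path(env, var_name="MYAPP_LIBTOMD_PATH"):
    """Return the library path named in an environment mapping, or None.

    var_name is an application-level convention chosen for this sketch,
    not something the package itself reads.
    """
    raw = env.get(var_name, "").strip()
    if not raw:
        return None
    path = Path(raw).expanduser()
    return path if path.is_file() else None


# Demonstration with an empty file standing in for the shared library.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_lib = Path(tmp_dir) / "libtomd.so"
    fake_lib.write_bytes(b"")
    found = resolve_lib_path({"MYAPP_LIBTOMD_PATH": str(fake_lib)})
    found_name = found.name if found else None
    missing = resolve_lib_path({"MYAPP_LIBTOMD_PATH": str(fake_lib) + ".absent"})
unset = resolve_lib_path({})
```

In real code the mapping would be `os.environ`, and a `None` result would mean falling back to the default library discovery.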
Rust Usage
Basic usage
use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};
fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");
    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!(" - {:?}", path);
    }
    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());
    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);
    Ok(())
}
- Error handling – all functions return Result<_, PdfError>
- Memory-safe – FFI confined internally, no unsafe needed at the call site
- Output – file paths or in-memory JSON (serde_json::Value)
Output Structure
JSON Output Structure
With per-page output, each PDF page is written to a separate JSON file (e.g., page_001.json) containing an array of block objects:
[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  },
  {
    "type": "text",
    "text": "Bold example text",
    "bbox": [72.0, 140.5, 523.5, 155.2],
    "font_size": 12.0,
    "font_weight": "bold",
    "page_number": 0,
    "length": 17,
    "spans": [
      {
        "text": "Bold example text",
        "bold": true,
        "font_size": 12.0
      }
    ]
  }
]
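A page file like the one above can be read with Python's standard `json` module, without the package itself. The sketch below parses the same sample and filters blocks by `font_weight`. It also checks `length` against the character count of `text`, which holds for this sample; that reading of `length` is an assumption rather than a documented guarantee.

```python
import json

# The sample page from above, embedded as a string for a self-contained demo.
sample = """
[
  {"type": "paragraph", "text": "Extracted text content",
   "bbox": [72.0, 100.5, 523.5, 130.2], "font_size": 11.0,
   "font_weight": "normal", "page_number": 0, "length": 22},
  {"type": "text", "text": "Bold example text",
   "bbox": [72.0, 140.5, 523.5, 155.2], "font_size": 12.0,
   "font_weight": "bold", "page_number": 0, "length": 17,
   "spans": [{"text": "Bold example text", "bold": true, "font_size": 12.0}]}
]
"""

blocks = json.loads(sample)

# Select the text of every block whose font_weight is "bold".
bold_texts = [b["text"] for b in blocks if b.get("font_weight") == "bold"]

# Assumption: "length" is the character count of "text" (true for this sample).
length_matches = [b["length"] == len(b["text"]) for b in blocks]
```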
Key Fields:
- type – text, heading, paragraph, table, figure, list, code
- bbox – Bounding box [x0, y0, x1, y1]
- font_size – Average font size in points
- font_weight – normal, bold, or other weights
- spans – (Optional) Array of styled text segments. Only present when:
  - There are multiple text segments with different styling, OR
  - The text has applied styling (bold, italic, monospace, etc.)
  Plain unstyled text blocks will not include the spans array to avoid duplication.
Tables include row_count, col_count, and confidence scores.
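Since `spans` is optional, downstream code usually needs a fallback for plain blocks. The sketch below shows one possible approach alongside a small `bbox` helper. Both `bbox_size` and `iter_spans` are hypothetical helpers, and the shape of the synthesized fallback span (keys `text`, `bold`, `font_size`) is an assumption modelled on the sample output above.

```python
def bbox_size(bbox):
    """Return (width, height) for a [x0, y0, x1, y1] bounding box."""
    x0, y0, x1, y1 = bbox
    return (x1 - x0, y1 - y0)


def iter_spans(block):
    """Yield a block's spans, or one synthesized span for plain blocks."""
    spans = block.get("spans")
    if spans:
        yield from spans
    else:
        # Fallback shape is an assumption modelled on the sample spans.
        yield {
            "text": block.get("text", ""),
            "bold": block.get("font_weight") == "bold",
            "font_size": block.get("font_size"),
        }


plain = {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
}
styled = {
    "type": "text",
    "text": "Bold example text",
    "bbox": [72.0, 140.5, 523.5, 155.2],
    "font_size": 12.0,
    "font_weight": "bold",
    "spans": [{"text": "Bold example text", "bold": True, "font_size": 12.0}],
}

width, height = bbox_size(plain["bbox"])
plain_spans = list(iter_spans(plain))
styled_spans = list(iter_spans(styled))
```

With this fallback, consumers can iterate spans uniformly regardless of whether the extractor emitted them.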
Command-line Usage (Python)
python -m pymupdf4llm_c.main input.pdf [output_dir]
If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
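To anticipate where the default output lands, a script might compute the expected directory itself. The sketch below is a guess at the naming rule: it assumes the directory is the PDF's stem plus `_json`, placed next to the PDF. `default_output_dir` is a hypothetical helper, and the actual rule should be confirmed against the destination the command prints.

```python
from pathlib import Path


def default_output_dir(pdf_path):
    """Guess the default output directory: <pdf stem>_json beside the PDF.

    This naming rule is an assumption based on the description above.
    """
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + "_json")


guess = default_output_dir("docs/report.pdf")
```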
Development Workflow
- Create and activate a virtual environment, then install dev extras:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
- Build the native extractor (see BUILD.md)
- Run linting and tests:
./lint.sh
pytest
Troubleshooting
- Library not found – Build libtomd and ensure it is discoverable.
- Build failures – Check MuPDF headers/libraries.
- Different JSON output – Heuristics live in C code under src/; rebuild after changes.
License
AGPL v3, required because MuPDF is licensed under the AGPL.
If your project is free and open source, you can use this library as long as your project is also AGPL-licensed. For commercial projects, you need a license from Artifex, the creators of MuPDF.
See LICENSE for full details.