C-backed PDF to Markdown conversion with Python fallbacks
Project description
PyMuPDF4LLM-C
PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.
Highlights
- Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
- Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
- Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.
Installation
Install the Python package from PyPI:
pip install pymupdf4llm-c
For Rust, install with Cargo:
cargo add pymupdf4llm-c
Building the Native Extractor
For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.
Usage
Python Usage
Basic usage
from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json
pdf_path = Path("example.pdf")
# Write the per-page JSON files to a sibling "<stem>_json" directory.
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")
try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    # Raised when the native extractor cannot process the PDF.
    print(f"Extraction failed: {exc}")
Advanced features
Collect parsed JSON structures:
results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")
Override the shared library location:
config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
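Once pages are collected, the nested list can be processed with ordinary Python. The sketch below assumes the shape implied by the loop above: a list of pages, each a list of block dictionaries with a type key and an optional text key. page_texts is a hypothetical helper, not part of the package, and the sample data is hand-written to stand in for real to_json(..., collect=True) output.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def page_texts(pages: List[List[Block]]) -> List[str]:
    """Join the non-empty text of each page's blocks with blank lines."""
    texts = []
    for page_blocks in pages:
        # Blocks without a "text" key (for example figures) contribute nothing.
        parts = [block.get("text", "").strip() for block in page_blocks]
        texts.append("\n\n".join(part for part in parts if part))
    return texts


# Hand-written stand-in for collected results (assumed shape).
sample_pages = [
    [
        {"type": "heading", "text": "Introduction"},
        {"type": "paragraph", "text": "First paragraph."},
        {"type": "figure"},
    ],
    [{"type": "paragraph", "text": "Second page."}],
]

for number, text in enumerate(page_texts(sample_pages)):
    print(f"--- page {number} ---")
    print(text)
```

Keeping the helper independent of the extractor means it can be exercised on fixture data without the native library installed.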
Rust Usage
Basic usage
use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}
- Error handling – all functions return Result<_, PdfError>
- Memory-safe – FFI confined internally, no unsafe needed at the call site
- Output – file paths or in-memory JSON (serde_json::Value)
Output Structure
JSON Output Structure
Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:
[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]
- Block types: paragraph, heading, table, list, figure
- Key fields: bbox (bounding box), type, font_size, font_weight
- Tables include row_count, col_count, confidence
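A page file in this shape can be inspected with the standard library alone. The following is a minimal sketch, not code from the package: the sample JSON is hand-written, the helper names are hypothetical, the bbox is assumed to be ordered [x0, y0, x1, y1], and confidence is assumed to be a score between 0 and 1.

```python
import json
from collections import Counter
from typing import Any, Dict, List

Block = Dict[str, Any]

# Hand-written sample page; field values are illustrative only.
PAGE_JSON = """
[
  {"type": "heading", "text": "Results", "bbox": [72.0, 60.0, 300.0, 80.0], "font_size": 16.0},
  {"type": "paragraph", "text": "Extracted text content", "bbox": [72.0, 100.5, 523.5, 130.2], "font_size": 11.0},
  {"type": "table", "bbox": [72.0, 150.0, 523.5, 300.0], "row_count": 4, "col_count": 3, "confidence": 0.9},
  {"type": "table", "bbox": [72.0, 320.0, 523.5, 400.0], "row_count": 2, "col_count": 2, "confidence": 0.4}
]
"""


def count_by_type(blocks: List[Block]) -> Counter:
    """Count how many blocks of each type a page contains."""
    return Counter(block["type"] for block in blocks)


def confident_tables(blocks: List[Block], min_confidence: float = 0.5) -> List[Block]:
    """Keep table blocks whose confidence meets the (assumed 0 to 1) threshold."""
    return [
        block
        for block in blocks
        if block["type"] == "table" and block.get("confidence", 0.0) >= min_confidence
    ]


def bbox_area(block: Block) -> float:
    """Area of the bounding box, assuming [x0, y0, x1, y1] ordering."""
    x0, y0, x1, y1 = block["bbox"]
    return (x1 - x0) * (y1 - y0)


blocks = json.loads(PAGE_JSON)
print(count_by_type(blocks))
for table in confident_tables(blocks):
    print(table["row_count"], table["col_count"], bbox_area(table))
```

For a file on disk, replacing json.loads(PAGE_JSON) with json.loads(Path("page_001.json").read_text()) would read one of the generated page files instead.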
Command-line Usage (Python)
python -m pymupdf4llm_c.main input.pdf [output_dir]
If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
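The default destination described above can be sketched as a small path calculation. This mirrors the expression used in the basic Python example and is only an illustration of the documented behaviour; resolve_output_dir is a hypothetical name, not the function the command actually calls.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_output_dir(pdf_path: PathLike, output_dir: Optional[PathLike] = None) -> Path:
    """Return the explicit output directory, or a sibling '<stem>_json' directory."""
    if output_dir is not None:
        return Path(output_dir)
    pdf_path = Path(pdf_path)
    # with_name swaps only the final path component, so the parent is preserved.
    return pdf_path.with_name(f"{pdf_path.stem}_json")


print(resolve_output_dir("docs/report.pdf"))
print(resolve_output_dir("docs/report.pdf", "out"))
```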
Development Workflow
- Create and activate a virtual environment, then install dev extras:
  python -m venv .venv
  source .venv/bin/activate
  pip install -e ".[dev]"
- Build the native extractor (see BUILD.md).
- Run linting and tests:
  ./lint.sh
  pytest
Troubleshooting
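- Library not found – Build libtomd and ensure it is discoverable, or point to it explicitly with ConversionConfig(lib_path=...).
- Build failures – Check that the MuPDF headers and libraries are available to the build.
- Different JSON output – Heuristics live in C code under src/; rebuild after changes.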
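When the library is reported as not found, it can help to check the candidate locations by hand before overriding the path. The sketch below is an assumption-laden helper rather than the package's own lookup logic: find_library, the file names in LIB_NAMES, and the example directories are all hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional, Sequence, Union

# Assumed file names; the real name may differ by platform and build setup.
LIB_NAMES = ("libtomd.so", "libtomd.dylib")


def find_library(
    search_dirs: Iterable[Union[str, Path]],
    names: Sequence[str] = LIB_NAMES,
) -> Optional[Path]:
    """Return the first existing library file found in search_dirs, else None."""
    for directory in search_dirs:
        for name in names:
            candidate = Path(directory) / name
            if candidate.is_file():
                return candidate
    return None


# Example directories only; adjust to wherever the build placed the library.
found = find_library(["/opt/lib", "build", "lib"])
print(found if found is not None else "libtomd not found in the given directories")
```

A path found this way could then be passed as ConversionConfig(lib_path=found), as in the override example under Advanced features.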
License
Licensed under AGPL v3, which is required because MuPDF is itself AGPL-licensed.
Free and open-source projects can use this package as long as they are also licensed under the AGPL. Commercial projects need a license from Artifex, the creators of MuPDF.
See LICENSE for full details.