C-backed PDF to Markdown conversion with Python fallbacks
PyMuPDF4LLM-C
PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. The Python package layers a small ctypes shim and a convenience API on top.
Highlights
- Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artefacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
- Python-friendly API – pymupdf4llm_c.to_json() returns the generated JSON paths or (optionally) the parsed payloads so it slots into existing tooling.
- Single source of truth – All heuristics, normalisation, and JSON serialisation now live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.
Installation
Install the published wheel or sdist directly from PyPI:
pip install pymupdf4llm-c
The wheel bundles a prebuilt libtomd for common platforms. If the shared
library cannot be located at runtime you will receive a LibraryLoadError.
Provide the path manually via ConversionConfig(lib_path=...) or the
PYMUPDF4LLM_C_LIB environment variable.
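The two override mechanisms above can be combined in calling code. The following is a minimal sketch of one way to choose a path before building a ConversionConfig. The helper pick_lib_path is hypothetical and not part of the package, and the precedence it applies (explicit argument first, then the environment variable) is an assumption; the package's own lookup order may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "PYMUPDF4LLM_C_LIB"  # environment variable named in the text above


def pick_lib_path(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return a candidate libtomd path, or None to use the default search.

    Assumption: an explicit argument wins over the environment variable.
    This helper is illustrative and does not mirror the package internals.
    """
    env = os.environ if environ is None else environ
    candidate = explicit or env.get(ENV_VAR)
    return Path(candidate).expanduser() if candidate else None
```

A non-None result can then be passed as ConversionConfig(lib_path=...); when the helper returns None, leave the configuration at its defaults.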
Building the native extractor
When working from source (or on an unsupported platform) build the C library before invoking the Python API:
./build.sh # Release build in build/native
BUILD_DIR=build/debug ./build.sh # Custom build directory
CMAKE_BUILD_TYPE=Debug ./build.sh
The script configures CMake, compiles libtomd, and leaves the artefact under
build/ so the Python package can find it. The headers are under include/
if you need to consume the C API directly.
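After a source build it can be handy to confirm where the shared library ended up. The sketch below searches a build directory for a file named libtomd with a shared-library suffix. The helper find_built_library is hypothetical, and the assumption that the file is called libtomd.so, libtomd.dylib, or libtomd.dll is mine, based only on the libtomd.* pattern mentioned in this README.

```python
from pathlib import Path
from typing import Optional

# Shared-library suffixes on Linux, macOS, and Windows.
LIB_SUFFIXES = {".so", ".dylib", ".dll"}


def find_built_library(build_dir: Path) -> Optional[Path]:
    """Return the first libtomd shared library found under build_dir, if any."""
    matches = sorted(
        p
        for p in build_dir.rglob("libtomd.*")  # assumed file name pattern
        if p.is_file() and p.suffix in LIB_SUFFIXES
    )
    return matches[0] if matches else None
```

For example, find_built_library(Path("build")) returns a path that can be handed to ConversionConfig(lib_path=...), or None if nothing was built yet.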
Python quick start
from pathlib import Path

from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")
Pass collect=True to to_json if you want the parsed JSON structures
returned instead of file paths. The optional ConversionConfig lets you
override the shared library location:
config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
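Once the page files exist, downstream code typically reads them back and inspects the blocks. The sketch below tallies blocks by type across a set of page-level JSON files. The schema is an assumption: it treats each file as a JSON list of block objects with a "type" key, which this README does not confirm, so adjust the key names to the real output. The helper count_block_types is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_block_types(json_files: Iterable[Path]) -> Counter:
    """Tally blocks by type across page-level JSON files.

    Assumptions: each file holds a JSON list of block objects, and each
    block carries a "type" key. Blocks without one count as "unknown".
    """
    counts: Counter = Counter()
    for path in sorted(json_files):
        blocks = json.loads(Path(path).read_text(encoding="utf-8"))
        for block in blocks:
            counts[block.get("type", "unknown")] += 1
    return counts
```

The list of paths returned by to_json(pdf_path, output_dir=output_dir) can be passed straight in as json_files.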
Command-line usage
The package includes a minimal CLI that mirrors the Python API:
python -m pymupdf4llm_c.main input.pdf [output_dir]
If output_dir is omitted a sibling directory suffixed with _json is
created. The command prints the destination and each JSON file that was
written.
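For scripted use, the same invocation can be assembled from Python. This is a sketch only: both helpers are hypothetical, and default_output_dir assumes the CLI names the sibling directory the same way the quick start does (PDF stem plus _json).

```python
import sys
from pathlib import Path
from typing import List, Optional


def default_output_dir(pdf_path: Path) -> Path:
    """Sibling directory named after the PDF stem plus "_json" (assumed)."""
    return pdf_path.with_name(f"{pdf_path.stem}_json")


def build_cli_command(pdf_path: Path, output_dir: Optional[Path] = None) -> List[str]:
    """Assemble the argument list for the module CLI described above."""
    command = [sys.executable, "-m", "pymupdf4llm_c.main", str(pdf_path)]
    if output_dir is not None:
        command.append(str(output_dir))  # optional second positional argument
    return command
```

The returned list can be passed to subprocess.run(command, check=True), which raises CalledProcessError if the process exits with a non-zero status.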
Development workflow
- Create and activate a virtual environment, then install the project in editable mode with the dev extras:
  python -m venv .venv
  source .venv/bin/activate
  pip install -e .[dev]
- Build the native extractor (./build.sh) so tests can load libtomd.
- Run linting and the test suite:
  ./lint.sh
  pytest
requirements-test.txt lists the testing dependencies if you prefer manual
installation.
Troubleshooting
- Library not found – Build the extractor and ensure the resulting libtomd.* is on disk. Set PYMUPDF4LLM_C_LIB or ConversionConfig(lib_path=...) if the default search paths do not apply to your environment.
- Build failures – Verify MuPDF development headers and libraries are installed and on the compiler's search path. Consult CMakeLists.txt for the expected dependencies.
- Different JSON output – The heuristics live entirely inside the C code under src/. Adjust them there and rebuild to change behaviour.
License
See LICENSE for details.
File details
Details for the file pymupdf4llm_c-1.0.0.tar.gz.
File metadata
- Download URL: pymupdf4llm_c-1.0.0.tar.gz
- Upload date:
- Size: 36.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98d564441f378c69a94aa2d7594ad7c8695afe2bd8b1738b9db39dddbaa8be8d |
| MD5 | cd2ac6de2a32ca41beab9b9c859329b8 |
| BLAKE2b-256 | f74dab87efa6545b66c782bb734c79d927374e9eb1be79fb0c18d97402b88a8f |
Provenance
The following attestation bundles were made for pymupdf4llm_c-1.0.0.tar.gz:
Publisher: publish.yml on intercepted16/pymupdf4llm-C
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pymupdf4llm_c-1.0.0.tar.gz
  - Subject digest: 98d564441f378c69a94aa2d7594ad7c8695afe2bd8b1738b9db39dddbaa8be8d
  - Sigstore transparency entry: 618532502
  - Sigstore integration time:
- Permalink: intercepted16/pymupdf4llm-C@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/intercepted16
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Trigger Event: push
File details
Details for the file pymupdf4llm_c-1.0.0-py3-none-any.whl.
File metadata
- Download URL: pymupdf4llm_c-1.0.0-py3-none-any.whl
- Upload date:
- Size: 60.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 31afb01ef6241a74d3340601815fdac7c0055a8fe67ebdf4a3f16a822e313ee2 |
| MD5 | 4643f22ea571c079d71d7080e37079c1 |
| BLAKE2b-256 | 86075f71e7471c4fa1fde71e97bb9410f93df339e2b7eaea6b7070f4fd02bfac |
Provenance
The following attestation bundles were made for pymupdf4llm_c-1.0.0-py3-none-any.whl:
Publisher: publish.yml on intercepted16/pymupdf4llm-C
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pymupdf4llm_c-1.0.0-py3-none-any.whl
  - Subject digest: 31afb01ef6241a74d3340601815fdac7c0055a8fe67ebdf4a3f16a822e313ee2
  - Sigstore transparency entry: 618532506
  - Sigstore integration time:
- Permalink: intercepted16/pymupdf4llm-C@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/intercepted16
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Trigger Event: push