C-backed PDF to Markdown conversion with Python fallbacks

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. The Python package layers a small ctypes shim and a convenience API on top.

Highlights

  • Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artefacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
  • Python-friendly API – pymupdf4llm_c.to_json() returns the generated JSON paths or (optionally) the parsed payloads so it slots into existing tooling.
  • Single source of truth – All heuristics, normalisation, and JSON serialisation now live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the published wheel or sdist directly from PyPI:

pip install pymupdf4llm-c

The wheel bundles a prebuilt libtomd for common platforms. If the shared library cannot be located at runtime, a LibraryLoadError is raised. In that case, provide the path manually via ConversionConfig(lib_path=...) or the PYMUPDF4LLM_C_LIB environment variable.
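The two override mechanisms can be combined in calling code. The sketch below is a hypothetical helper (pick_lib_path is not part of the package) that prefers an explicit path and otherwise falls back to the environment variable. That precedence is an assumption made for illustration; the package's own lookup order is not documented here.

```python
import os
from pathlib import Path
from typing import Optional

ENV_VAR = "PYMUPDF4LLM_C_LIB"  # environment variable named in the text above


def pick_lib_path(explicit: Optional[str] = None) -> Optional[Path]:
    """Hypothetical helper: explicit path first, then the env var, else None.

    The precedence is an assumption of this sketch, not documented behaviour.
    """
    candidate = explicit or os.environ.get(ENV_VAR)
    return Path(candidate) if candidate else None
```

When the helper returns a path, it can be passed on as ConversionConfig(lib_path=...); when it returns None, calling to_json without a config leaves the default search in place.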

Building the native extractor

When working from source (or on an unsupported platform) build the C library before invoking the Python API:

./build.sh                         # Release build in build/native
BUILD_DIR=build/debug ./build.sh   # Custom build directory
CMAKE_BUILD_TYPE=Debug ./build.sh  # Debug build instead of Release

The script configures CMake, compiles libtomd, and leaves the artefact under build/ so the Python package can find it. The headers are under include/ if you need to consume the C API directly.
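After a build it can be useful to confirm that the artefact really exists before calling the Python API. The following is a minimal sketch, assuming the library file is named libtomd with a platform-specific suffix and sits somewhere under build/; find_built_libs is a hypothetical helper, and the exact file name on your platform may differ.

```python
from pathlib import Path
from typing import List


def find_built_libs(build_root: Path = Path("build")) -> List[Path]:
    """Return every libtomd.* file found under build_root, sorted by path."""
    if not build_root.is_dir():
        return []
    # rglob searches build_root and all of its subdirectories.
    return sorted(p for p in build_root.rglob("libtomd.*") if p.is_file())
```

An empty result suggests the build did not complete or wrote to a different BUILD_DIR.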

Python quick start

from pathlib import Path

from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Pass collect=True to to_json if you want the parsed JSON structures returned instead of file paths. The optional ConversionConfig lets you override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
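If you keep the default behaviour and receive file paths, the payloads can still be parsed afterwards with the standard library. This sketch is a plain-Python stand-in, not the package's implementation of collect=True; load_page_payloads is a hypothetical name, and it assumes the page files sort into page order by file name (as zero-padded names like page_XXX.json would). It makes no assumption about the keys inside each payload.

```python
import json
from pathlib import Path
from typing import Any, List


def load_page_payloads(output_dir: Path) -> List[Any]:
    """Parse every *.json file in output_dir, ordered by file name."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(output_dir.glob("*.json"))
    ]
```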

Command-line usage

The package includes a minimal CLI that mirrors the Python API:

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
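To drive the CLI from another Python process, the argument list can be built programmatically. The helper below is a hypothetical convenience (cli_command is not shipped with the package); it only reproduces the invocation shown above, using the current interpreter.

```python
import sys
from pathlib import Path
from typing import List, Optional


def cli_command(pdf: Path, output_dir: Optional[Path] = None) -> List[str]:
    """Build the argv for `python -m pymupdf4llm_c.main input.pdf [output_dir]`."""
    argv = [sys.executable, "-m", "pymupdf4llm_c.main", str(pdf)]
    if output_dir is not None:
        argv.append(str(output_dir))
    return argv
```

The resulting list can be handed to subprocess.run(...). How the CLI reports failures through its exit status is not stated here, so check the printed output rather than relying on the return code alone.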

Development workflow

  1. Create and activate a virtual environment, then install the project in editable mode with the dev extras:
    python -m venv .venv
    source .venv/bin/activate
    pip install -e .[dev]
    
  2. Build the native extractor (./build.sh) so tests can load libtomd.
  3. Run linting and the test suite:
    ./lint.sh
    pytest
    

requirements-test.txt lists the testing dependencies if you prefer manual installation.

Troubleshooting

  • Library not found – Build the extractor and ensure the resulting libtomd.* is on disk. Set PYMUPDF4LLM_C_LIB or ConversionConfig(lib_path=...) if the default search paths do not apply to your environment.
  • Build failures – Verify MuPDF development headers and libraries are installed and on the compiler's search path. Consult CMakeLists.txt for the expected dependencies.
  • Different JSON output – The heuristics live entirely inside the C code under src/. Adjust them there and rebuild to change behaviour.
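For the "Library not found" case, loading the file directly with ctypes can help separate a wrong path from a library that exists but fails to load (for example because of a missing MuPDF dependency). This is a diagnostic sketch that bypasses the package entirely; probe_library is a hypothetical helper, and the error text comes from the operating system's loader.

```python
import ctypes
from pathlib import Path
from typing import Optional


def probe_library(path: Path) -> Optional[str]:
    """Try to load a shared library; return None on success, else the loader's error text."""
    try:
        ctypes.CDLL(str(path))
    except OSError as exc:
        return str(exc)
    return None
```

A None result means the loader accepted the file, so the same path should be usable for PYMUPDF4LLM_C_LIB or ConversionConfig(lib_path=...).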

License

See LICENSE for details.
