C-backed PDF to Markdown conversion with Python fallbacks

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

Project description

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. The Python package layers a small ctypes shim and a convenience API on top.

Highlights

Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artefacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
Python-friendly API – pymupdf4llm_c.to_json() returns the generated JSON paths or (optionally) the parsed payloads so it slots into existing tooling.
Single source of truth – All heuristics, normalisation, and JSON serialisation now live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.
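
As a rough illustration of how a retrieval pipeline might consume the page_XXX.json artefacts, the sketch below tallies block types across an output directory. The JSON schema is not documented here, so two things are assumptions: that each page file holds a list of block objects, and that the block type sits under a "type" key. The helper name count_block_types is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_block_types(output_dir):
    """Tally block types across every page_*.json file in output_dir."""
    counts = Counter()
    for page_file in sorted(Path(output_dir).glob("page_*.json")):
        payload = json.loads(page_file.read_text(encoding="utf-8"))
        # Assumption: each page file is a JSON list of block objects.
        blocks = payload if isinstance(payload, list) else []
        for block in blocks:
            # Assumption: the block type is stored under a "type" key.
            if isinstance(block, dict):
                counts[block.get("type", "unknown")] += 1
    return counts
```

If the real payload nests blocks differently, only the two lines marked as assumptions should need to change.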

Installation

Install the published wheel or sdist directly from PyPI:

pip install pymupdf4llm-c

The wheel bundles a prebuilt libtomd for common platforms. If the shared library cannot be located at runtime, a LibraryLoadError is raised. In that case, provide the path manually via ConversionConfig(lib_path=...) or the PYMUPDF4LLM_C_LIB environment variable.
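
One way to set the environment variable from Python is sketched below. The helper point_to_library and the recursive search are illustrative assumptions rather than part of the package. Only the variable name PYMUPDF4LLM_C_LIB comes from this README. It is not stated when the package reads the variable, so setting it before importing pymupdf4llm_c is the cautious choice.

```python
import os
from pathlib import Path


def point_to_library(search_root, env=os.environ):
    """Set PYMUPDF4LLM_C_LIB to the first libtomd.* file under search_root.

    Returns the chosen path, or None when nothing matches.
    """
    candidates = sorted(
        p for p in Path(search_root).rglob("libtomd.*") if p.is_file()
    )
    if not candidates:
        return None
    # Hypothetical policy: take the first match in sorted order.
    env["PYMUPDF4LLM_C_LIB"] = str(candidates[0])
    return candidates[0]
```

For example, point_to_library("build") would pick up a library produced by a local source build, if one exists under that directory.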

Building the native extractor

When working from source (or on an unsupported platform) build the C library before invoking the Python API:

./build.sh                      # Release build in build/native
BUILD_DIR=build/debug ./build.sh # Custom build directory
CMAKE_BUILD_TYPE=Debug ./build.sh

The script configures CMake, compiles libtomd, and leaves the artefact under build/ so the Python package can find it. The headers are under include/ if you need to consume the C API directly.

Python quick start

from pathlib import Path

from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Pass collect=True to to_json if you want the parsed JSON structures returned instead of file paths. The optional ConversionConfig lets you override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
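
When converting several PDFs, it can help to collect failures instead of stopping at the first one. The sketch below is a hypothetical wrapper, not part of the package. It takes the converter as a parameter, which keeps it testable without the native library, and it assumes only the call shape used in the quick start: convert(pdf_path, output_dir=...).

```python
from pathlib import Path


def convert_many(pdf_paths, convert, error_type=Exception):
    """Run convert on each PDF and return (results, failures) keyed by path."""
    results, failures = {}, {}
    for pdf in map(Path, pdf_paths):
        # Same sibling-directory naming as the quick start example.
        output_dir = pdf.with_name(f"{pdf.stem}_json")
        try:
            results[pdf] = convert(pdf, output_dir=output_dir)
        except error_type as exc:
            # Record the message and keep going with the remaining files.
            failures[pdf] = str(exc)
    return results, failures
```

With the package installed, a call such as convert_many(["a.pdf", "b.pdf"], to_json, error_type=ExtractionError) would route extraction errors into the failures mapping.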

Command-line usage

The package includes a minimal CLI that mirrors the Python API:

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
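
To drive the CLI from another Python process, the argument list can be built as in the sketch below. The helper name cli_command is hypothetical. The module path and argument order are taken from the command shown above. Using sys.executable ensures the same interpreter, and therefore the same installed package, is used.

```python
import sys


def cli_command(pdf_path, output_dir=None):
    """Build the argument list for `python -m pymupdf4llm_c.main`."""
    cmd = [sys.executable, "-m", "pymupdf4llm_c.main", str(pdf_path)]
    if output_dir is not None:
        # The output directory is optional, matching the CLI synopsis.
        cmd.append(str(output_dir))
    return cmd
```

The result can be passed to subprocess.run(cli_command("input.pdf"), check=True), which raises CalledProcessError if the command exits with a non-zero status.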

Development workflow

Create and activate a virtual environment, then install the project in editable mode with the dev extras:
```
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```
Build the native extractor (./build.sh) so tests can load libtomd.
Run linting and the test suite:
```
./lint.sh
pytest
```

requirements-test.txt lists the testing dependencies if you prefer manual installation.

Troubleshooting

Library not found – Build the extractor and ensure the resulting libtomd.* is on disk. Set PYMUPDF4LLM_C_LIB or ConversionConfig(lib_path=...) if the default search paths do not apply to your environment.
Build failures – Verify MuPDF development headers and libraries are installed and on the compiler's search path. Consult CMakeLists.txt for the expected dependencies.
Different JSON output – The heuristics live entirely inside the C code under src/. Adjust them there and rebuild to change behaviour.

License

See LICENSE for details.

Project details

Maintainers

adit-bajaj

Release history

2.0.1 (Feb 7, 2026)
2.0.0 (Jan 31, 2026)
1.6.4 (Jan 28, 2026)
1.6.2 (Jan 28, 2026)
1.6.1 (Jan 28, 2026)
1.6.0 (Jan 11, 2026)
1.4.1 (Jan 2, 2026)
1.4.0 (Jan 2, 2026)
1.3.0 (Dec 31, 2025)
1.2.1 (Dec 31, 2025)
1.2.0 (Dec 15, 2025)
1.1.1 (Dec 13, 2025)
1.1.0 (Dec 13, 2025)
1.0.6 (Nov 24, 2025)
1.0.5 (Nov 23, 2025)
1.0.4 (Nov 22, 2025)
1.0.3 (Nov 22, 2025)
1.0.1 (Nov 17, 2025)
1.0.0 (Oct 17, 2025), this version

Download files

Download the file for your platform.

Source Distribution

pymupdf4llm_c-1.0.0.tar.gz (36.3 kB)

Uploaded Oct 17, 2025 Source

Built Distribution

pymupdf4llm_c-1.0.0-py3-none-any.whl (60.1 MB)

Uploaded Oct 17, 2025 Python 3

File details

Details for the file pymupdf4llm_c-1.0.0.tar.gz.

File metadata

Download URL: pymupdf4llm_c-1.0.0.tar.gz
Upload date: Oct 17, 2025
Size: 36.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pymupdf4llm_c-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`98d564441f378c69a94aa2d7594ad7c8695afe2bd8b1738b9db39dddbaa8be8d`
MD5	`cd2ac6de2a32ca41beab9b9c859329b8`
BLAKE2b-256	`f74dab87efa6545b66c782bb734c79d927374e9eb1be79fb0c18d97402b88a8f`
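
A downloaded file can be checked against the published SHA256 digest with the standard library alone. The sketch below is a generic illustration and the helper names are hypothetical. It reads the file in chunks so that large wheels do not need to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Pass the path of the downloaded sdist and the SHA256 value from the table above as expected_hex. A False result means the file differs from the one that was published.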

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.0.0.tar.gz:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pymupdf4llm_c-1.0.0.tar.gz
- Subject digest: 98d564441f378c69a94aa2d7594ad7c8695afe2bd8b1738b9db39dddbaa8be8d
- Sigstore transparency entry: 618532502
- Sigstore integration time: Oct 17, 2025
Source repository:
- Permalink: intercepted16/pymupdf4llm-C@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/intercepted16
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Trigger Event: push

File details

Details for the file pymupdf4llm_c-1.0.0-py3-none-any.whl.

File metadata

Download URL: pymupdf4llm_c-1.0.0-py3-none-any.whl
Upload date: Oct 17, 2025
Size: 60.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pymupdf4llm_c-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`31afb01ef6241a74d3340601815fdac7c0055a8fe67ebdf4a3f16a822e313ee2`
MD5	`4643f22ea571c079d71d7080e37079c1`
BLAKE2b-256	`86075f71e7471c4fa1fde71e97bb9410f93df339e2b7eaea6b7070f4fd02bfac`

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.0.0-py3-none-any.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pymupdf4llm_c-1.0.0-py3-none-any.whl
- Subject digest: 31afb01ef6241a74d3340601815fdac7c0055a8fe67ebdf4a3f16a822e313ee2
- Sigstore transparency entry: 618532506
- Sigstore integration time: Oct 17, 2025
Source repository:
- Permalink: intercepted16/pymupdf4llm-C@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/intercepted16
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@16b7f07b36a37da1fe4061dc1bb951727258cec1
- Trigger Event: push
