C-backed PDF to Markdown conversion with Python fallbacks

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.


Highlights

  • Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
  • Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
  • Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.


Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")
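
The generated files are plain JSON, so they can be read back with the standard library alone. The sketch below assumes the zero-padded page_XXX.json naming described in the highlights and that each file holds a list of block objects; `load_pages` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def load_pages(output_dir):
    """Return a list of pages, each a list of block dicts, in file-name order."""
    pages = []
    # Assumption: names are zero-padded (page_001.json, page_002.json, ...),
    # so plain string sorting matches page order.
    for path in sorted(Path(output_dir).glob("page_*.json")):
        with path.open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages
```

Calling `load_pages(output_dir)` after a successful `to_json` run should give one list of blocks per page file found in that directory.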

Advanced features

Collect parsed JSON structures:

results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")
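
Once the blocks are in memory, a quick summary can help check that extraction produced what you expect. This is a minimal sketch that assumes the collected result is a list of pages, each a list of dicts with a `type` key and an optional `text` key, as in the loop above. `summarize_blocks` and the sample data are illustrative, not part of the package or real extractor output.

```python
from collections import Counter


def summarize_blocks(pages):
    """Count blocks per type and total text characters across all pages."""
    type_counts = Counter()
    total_chars = 0
    for page_blocks in pages:
        for block in page_blocks:
            type_counts[block.get("type", "unknown")] += 1
            total_chars += len(block.get("text", ""))
    return type_counts, total_chars


# Hand-written sample shaped like the collected results (not real output).
sample_pages = [
    [{"type": "heading", "text": "Intro"}, {"type": "paragraph", "text": "Hello"}],
    [{"type": "table", "row_count": 2, "col_count": 3}],
]
counts, chars = summarize_blocks(sample_pages)
print(dict(counts), chars)
```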

Override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)

Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}
  • Error handling – all functions return Result<_, PdfError>
  • Memory-safe – FFI confined internally, no unsafe needed at the call site
  • Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

JSON Output Structure

Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]
  • Block types: paragraph, heading, table, list, figure
  • Key fields: bbox (bounding box), type, font_size, font_weight
  • Tables include row_count, col_count, confidence
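
As an illustration of consuming this structure, the sketch below turns one page's block list into rough Markdown. The mapping is an assumption made for this example, not the package's own renderer: headings become level-two headings, list blocks are taken to hold one item per line of text, and tables and figures are reduced to placeholders.

```python
def blocks_to_markdown(blocks):
    """Render one page's block list as Markdown text (illustrative mapping only)."""
    parts = []
    for block in blocks:
        kind = block.get("type")
        text = block.get("text", "").strip()
        if kind == "heading":
            parts.append(f"## {text}")
        elif kind == "list":
            # Assumption: one list item per line of the block's text.
            items = [f"- {item.strip()}" for item in text.splitlines() if item.strip()]
            if items:
                parts.append("\n".join(items))
        elif kind == "table":
            rows, cols = block.get("row_count"), block.get("col_count")
            parts.append(f"[table: {rows} rows x {cols} cols]")
        elif kind == "figure":
            parts.append("[figure]")
        elif text:
            # Paragraphs and any unrecognized block type fall back to plain text.
            parts.append(text)
    return "\n\n".join(parts)
```

Keeping the renderer in Python and downstream of the JSON leaves the C heuristics as the single source of truth while letting each pipeline choose its own output format.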

Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
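
For batch jobs, the same entry point can be driven from a script. The sketch below builds the documented command line and runs it once per PDF. It assumes the command exits with a non-zero status when extraction fails, which the text above does not state; `build_command` and `convert_many` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_dir=None):
    """Build the argv list for the documented module entry point."""
    command = [sys.executable, "-m", "pymupdf4llm_c.main", str(pdf_path)]
    if output_dir is not None:
        command.append(str(output_dir))
    return command


def convert_many(pdf_paths):
    """Run the CLI once per PDF and return the PDFs whose run exited non-zero."""
    failed = []
    for pdf_path in pdf_paths:
        # Assumption: a non-zero exit status signals a failed extraction.
        result = subprocess.run(build_command(pdf_path), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(Path(pdf_path))
    return failed
```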


Development Workflow

  1. Create and activate a virtual environment, then install dev extras:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
  2. Build the native extractor (see BUILD.md).

  3. Run linting and tests:

./lint.sh
pytest

Troubleshooting

  • Library not found – Build libtomd and make sure it is discoverable, or pass its path explicitly via ConversionConfig(lib_path=...).
  • Build failures – Check MuPDF headers/libraries.
  • Different JSON output – Heuristics live in C code under src/; rebuild after changes.
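
When chasing a "library not found" error, it can help to check whether a given shared library file exists and actually loads. The sketch below uses only `ctypes` from the standard library; `find_loadable_library` is a hypothetical helper and the candidate paths are examples, not locations the package is known to search.

```python
import ctypes
from pathlib import Path


def find_loadable_library(candidates):
    """Return the first candidate path that exists and loads via ctypes, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if not path.is_file():
            continue
        try:
            ctypes.CDLL(str(path))
        except OSError:
            # The file exists but cannot be loaded, for example because of a
            # wrong architecture or a missing dependency such as libmupdf.
            continue
        return path
    return None


# Example candidates only; adjust to wherever your build placed libtomd.
found = find_loadable_library(["/opt/lib/libtomd.so", "build/libtomd.so"])
print(found if found else "no loadable libtomd found in the candidate paths")
```

A path returned here can then be passed as `lib_path` to `ConversionConfig`, as shown in the Python usage section.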

License

AGPL v3, which is required because MuPDF itself is licensed under the AGPL.

If your project is free and open source, you can use this package as long as your project is also licensed under the AGPL. Commercial projects need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.
