Skip to main content

C-backed PDF to Markdown conversion with Python fallbacks

Project description

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.


Highlights

  • Native extractorlibtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
  • Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
  • Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.


Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Advanced features

Collect parsed JSON structures:

results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")

Override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}
  • Error handling – all functions return Result<_, PdfError>
  • Memory-safe – FFI confined internally, no unsafe needed at the call site
  • Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

JSON Output Structure

Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]
  • Block types: paragraph, heading, table, list, figure
  • Key fields: bbox (bounding box), type, font_size, font_weight
  • Tables include row_count, col_count, confidence

Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.


Development Workflow

  1. Create and activate a virtual environment, then install dev extras:
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
  1. Build the native extractor (see BUILD.md)

  2. Run linting and tests:

./lint.sh
pytest

Troubleshooting

  • Library not found – Build libtomd and ensure it is discoverable.
  • Build failures – Check MuPDF headers/libraries.
  • Different JSON output – Heuristics live in C code under src/; rebuild after changes.

License

AGPL v3. Needed because MuPDF is AGPL.

If your project is free and OSS you can use it as long as it’s also AGPL licensed. For commercial projects, you need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pymupdf4llm_c-1.0.5-cp311-cp311-manylinux_2_28_x86_64.whl (77.3 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.28+ x86-64

pymupdf4llm_c-1.0.5-cp311-cp311-macosx_11_0_arm64.whl (38.6 MB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

pymupdf4llm_c-1.0.5-cp310-cp310-macosx_11_0_arm64.whl (40.9 MB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

File details

Details for the file pymupdf4llm_c-1.0.5-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for pymupdf4llm_c-1.0.5-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 cf70aa66ad4952b010696157e062351c4531e05a16d53c8efe40dfb4ac9081bb
MD5 7798c6e95ce31268f8b7c7bb30ada28b
BLAKE2b-256 4626732e2b2160cd5947e23daaae2f149c8b295f074ae47cf8018f9b99f33eb4

See more details on using hashes here.

File details

Details for the file pymupdf4llm_c-1.0.5-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pymupdf4llm_c-1.0.5-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 95ed76eff79f290f80efd187dc215cca884bdc9b9a1c564e6f70a9183a1262ef
MD5 87107ff6b989f793d9d76fa99280042d
BLAKE2b-256 a2582f191b1f5a645b484510bd4f55260536a0a0fe1dcfe6fa7941816c45f52d

See more details on using hashes here.

File details

Details for the file pymupdf4llm_c-1.0.5-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pymupdf4llm_c-1.0.5-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 81789a4391c25269b250a99cd58d7f7b48511a9a9516d3b5fbde8f6fd8929596
MD5 4bb547e30100a4ccb1ad9ef4012f02ac
BLAKE2b-256 ef0e73c262e031e3885dbf7ceb94fcb6903ab563ed5f23f844165ed11d613f82

See more details on using hashes here.

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.0.5-cp310-cp310-macosx_11_0_arm64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page