C-backed PDF to Markdown conversion with Python fallbacks

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.


Highlights

  • Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
  • Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
  • Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.


Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")
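Once the files are written, each one can be read back with the standard json module. The following is a minimal sketch, assuming each file holds a JSON array of block objects as described under Output Structure; load_pages is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def load_pages(json_files):
    """Read each page file and return a list of block lists, in the given order."""
    pages = []
    for path in json_files:
        # Each file is assumed to contain one JSON array of block dicts.
        with Path(path).open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages
```

Calling load_pages(json_files) on the list returned by to_json gives one list of blocks per page, in the order the paths were returned.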

Advanced features

Collect parsed JSON structures:

results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")

Override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
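The collected structure (a list of pages, each a list of block dictionaries) can be post-processed in plain Python. The sketch below renders it as rough Markdown; blocks_to_markdown is a hypothetical helper, the "##" heading level is an arbitrary choice, and it assumes blocks without a text field (for example figures) can simply be skipped.

```python
def blocks_to_markdown(pages):
    """Render collected pages (lists of block dicts) as rough Markdown text."""
    lines = []
    for page_blocks in pages:
        for block in page_blocks:
            text = block.get("text", "").strip()
            if not text:
                continue  # assumption: blocks without text carry nothing to render
            if block.get("type") == "heading":
                lines.append(f"## {text}")  # heading level chosen arbitrarily
            else:
                lines.append(text)
    return "\n\n".join(lines)


# Hand-written sample shaped like the collected output; values are made up.
sample = [[
    {"type": "heading", "text": "Introduction"},
    {"type": "paragraph", "text": "Extracted text content"},
    {"type": "figure"},
]]
print(blocks_to_markdown(sample))
```

With real data, pass the value returned by to_json(..., collect=True) instead of sample.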

Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}
  • Error handling – all functions return Result<_, PdfError>
  • Memory-safe – FFI confined internally, no unsafe needed at the call site
  • Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

JSON Output Structure

Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]
  • Block types: paragraph, heading, table, list, figure
  • Key fields: bbox (bounding box), type, font_size, font_weight
  • Tables include row_count, col_count, confidence
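The fields listed above make it straightforward to filter blocks after loading a page. This is a minimal sketch that keeps only table blocks above a confidence threshold. It assumes confidence is a number where higher means more certain; the 0.8 default and the 0 to 1 scale are guesses, and confident_tables is a hypothetical helper.

```python
def confident_tables(blocks, min_confidence=0.8):
    """Return table blocks whose confidence meets the (assumed) threshold."""
    return [
        block
        for block in blocks
        if block.get("type") == "table"
        and block.get("confidence", 0.0) >= min_confidence
    ]


# Hand-written sample shaped like the page JSON above; the values are made up.
page_blocks = [
    {"type": "paragraph", "text": "Extracted text content",
     "bbox": [72.0, 100.5, 523.5, 130.2]},
    {"type": "table", "bbox": [72.0, 150.0, 523.5, 300.0],
     "row_count": 4, "col_count": 3, "confidence": 0.92},
    {"type": "table", "bbox": [72.0, 320.0, 523.5, 400.0],
     "row_count": 2, "col_count": 2, "confidence": 0.35},
]

for table in confident_tables(page_blocks):
    print(f"{table['row_count']}x{table['col_count']} table at {table['bbox']}")
```

Tune min_confidence against your own documents, since the detector's score distribution is not documented here.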

Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
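The default destination can be reproduced in Python if a script needs to know where the files will land. The sketch below mirrors the naming used in the basic usage example; default_output_dir is a hypothetical helper, and the command's internal logic may differ in detail.

```python
from pathlib import Path


def default_output_dir(pdf_path):
    """Return the sibling directory named '<stem>_json' next to the PDF."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(f"{pdf_path.stem}_json")


print(default_output_dir("reports/example.pdf"))
```

For reports/example.pdf this yields a directory named example_json inside reports.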


Development Workflow

  1. Create and activate a virtual environment, then install dev extras:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

  2. Build the native extractor (see BUILD.md).

  3. Run linting and tests:

./lint.sh
pytest

Troubleshooting

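  • Library not found – Build libtomd and ensure it is discoverable, or pass its location explicitly via ConversionConfig(lib_path=...).
  • Build failures – Check that the MuPDF headers and libraries are present and were built as described in BUILD.md.
  • Different JSON output – Heuristics live in C code under src/; rebuild the shared library after changing them.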
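For the "library not found" case, a quick existence check before calling the extractor can give a clearer error. This is a minimal sketch under stated assumptions: find_library is a hypothetical helper, and the candidate paths are examples only (/opt/lib/libtomd.so is taken from the usage section, build/libtomd.so is a guess at a local build location).

```python
from pathlib import Path


def find_library(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Hypothetical search locations; adjust to wherever your build places the library.
lib = find_library(["build/libtomd.so", "/opt/lib/libtomd.so"])
if lib is None:
    print("libtomd not found; build it first (see BUILD.md)")
else:
    print(f"Using {lib}")
```

A path found this way can then be passed as ConversionConfig(lib_path=lib), as shown under Advanced features.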

License

Licensed under AGPL v3. This is required because MuPDF itself is AGPL-licensed.

If your project is free and open source, you can use this library as long as your project is also AGPL-licensed. For commercial projects, you need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.
