C-backed PDF to Markdown conversion with Python fallbacks

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.

Highlights

Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.

Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Advanced features

Collect parsed JSON structures:

results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")
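
As a rough sketch of working with the collected structure, the snippet below counts block types and joins the text of one page. It uses hand-written sample data shaped like the loop above expects (a list of pages, each a list of block dicts) rather than calling to_json, and the helper names summarize_block_types and page_text are hypothetical, not part of pymupdf4llm_c.

```python
from collections import Counter

# Sample data standing in for to_json(..., collect=True) output.
# Assumed shape: a list of pages, each page a list of block dicts.
sample_results = [
    [
        {"type": "heading", "text": "Introduction"},
        {"type": "paragraph", "text": "First paragraph."},
    ],
    [
        {"type": "paragraph", "text": "Second paragraph."},
        {"type": "figure"},
    ],
]


def summarize_block_types(results):
    """Count how many blocks of each type appear across all pages."""
    return Counter(block["type"] for page in results for block in page)


def page_text(page_blocks):
    """Join the non-empty text of one page's blocks, one block per line."""
    texts = (block.get("text", "") for block in page_blocks)
    return "\n".join(text for text in texts if text)


counts = summarize_block_types(sample_results)
print(dict(counts))
print(page_text(sample_results[0]))
```

Using block.get("text", "") mirrors the loop above, so blocks without text (such as the figure in the sample) are skipped rather than raising a KeyError.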

Override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)
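
If the library may live in one of several places, one option is to probe a list of candidate paths and pass the first hit to ConversionConfig. The helper below is only a sketch: first_existing_library and the candidate paths are hypothetical, and only the lib_path parameter comes from the example above.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing_library(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a regular file, or None."""
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Hypothetical search order; adjust to wherever your build places libtomd.
candidates = [
    Path("build/libtomd.so"),
    Path("/usr/local/lib/libtomd.so"),
    Path("/opt/lib/libtomd.so"),
]

lib_path = first_existing_library(candidates)
if lib_path is None:
    print("libtomd not found in any candidate location")
else:
    # With the package installed, this path could be passed on as
    # ConversionConfig(lib_path=lib_path).
    print(f"Using {lib_path}")
```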

Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}

Error handling – all functions return Result<_, PdfError>
Memory-safe – FFI confined internally, no unsafe needed at the call site
Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

JSON Output Structure

Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]

Block types: paragraph, heading, table, list, figure
Key fields: bbox (bounding box), type, font_size, font_weight
Tables include row_count, col_count, confidence
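
To illustrate consuming these files, the sketch below reads every page_*.json file in a directory in name order and keeps the tables at or above a confidence threshold. The function names load_pages and confident_tables, the 0.8 threshold, and the assumption that confidence is a number are illustrative only; the field names come from the structure above. The demo writes its own sample file, so it runs without the extractor.

```python
import json
import tempfile
from pathlib import Path


def load_pages(output_dir):
    """Load each page_*.json file (sorted by name) as a list of block dicts."""
    pages = []
    for path in sorted(Path(output_dir).glob("page_*.json")):
        with path.open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages


def confident_tables(pages, min_confidence=0.8):
    """Return table blocks whose confidence is at least min_confidence.

    Assumes confidence is numeric; tables without the field are skipped.
    """
    return [
        block
        for page in pages
        for block in page
        if block.get("type") == "table"
        and block.get("confidence", 0.0) >= min_confidence
    ]


# Demo with hand-written sample data (not real extractor output).
with tempfile.TemporaryDirectory() as tmp:
    sample = [
        {"type": "paragraph", "text": "Intro", "bbox": [72.0, 100.5, 523.5, 130.2]},
        {"type": "table", "row_count": 3, "col_count": 2, "confidence": 0.92},
        {"type": "table", "row_count": 1, "col_count": 1, "confidence": 0.30},
    ]
    (Path(tmp) / "page_001.json").write_text(json.dumps(sample), encoding="utf-8")
    demo_pages = load_pages(tmp)
    demo_tables = confident_tables(demo_pages)

print(f"{len(demo_pages)} page(s), {len(demo_tables)} confident table(s)")
```

Sorting by file name keeps pages in order only because the names are zero-padded (page_001.json, page_002.json, and so on).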

Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
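
The default destination described above can be mirrored in a few lines, which helps when a script needs to know where the files will land before invoking the command. This is a sketch under assumptions: default_output_dir and build_command are hypothetical helpers, and the naming rule is taken from the Python usage example rather than from the CLI source.

```python
import sys
from pathlib import Path


def default_output_dir(pdf_path):
    """Sibling directory named after the PDF stem with a _json suffix."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(f"{pdf_path.stem}_json")


def build_command(pdf_path, output_dir=None):
    """Assemble the argument list for the module invocation shown above."""
    command = [sys.executable, "-m", "pymupdf4llm_c.main", str(pdf_path)]
    if output_dir is not None:
        command.append(str(output_dir))
    return command


print(default_output_dir("docs/report.pdf"))
print(build_command("docs/report.pdf")[1:])
```

Once the package is installed, the returned list can be handed to subprocess.run(command, check=True).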

Development Workflow

Create and activate a virtual environment, then install dev extras:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Build the native extractor (see BUILD.md)
Run linting and tests:

./lint.sh
pytest

Troubleshooting

Library not found – Build libtomd and make sure the loader can find it, or pass its location explicitly with ConversionConfig(lib_path=...).
Build failures – Check MuPDF headers/libraries.
Different JSON output – Heuristics live in C code under src/; rebuild after changes.

License

This project is licensed under AGPL v3, which is required because MuPDF itself is AGPL-licensed.

If your project is free and open source, you can use this package as long as your project is also AGPL-licensed. Commercial projects need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.

Release history

2.0.1 – Feb 7, 2026
2.0.0 – Jan 31, 2026
1.6.4 – Jan 28, 2026
1.6.2 – Jan 28, 2026
1.6.1 – Jan 28, 2026
1.6.0 – Jan 11, 2026
1.4.1 – Jan 2, 2026
1.4.0 – Jan 2, 2026
1.3.0 – Dec 31, 2025
1.2.1 – Dec 31, 2025
1.2.0 – Dec 15, 2025
1.1.1 – Dec 13, 2025
1.1.0 – Dec 13, 2025
1.0.6 – Nov 24, 2025
1.0.5 – Nov 23, 2025
1.0.4 – Nov 22, 2025 (this version)
1.0.3 – Nov 22, 2025
1.0.1 – Nov 17, 2025
1.0.0 – Oct 17, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

pymupdf4llm_c-1.0.4-cp311-cp311-manylinux_2_28_x86_64.whl
Upload date: Nov 22, 2025
Size: 77.3 MB
Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7
SHA256: `7bbd67b13a10b26ce3d04e71d09e92f016718c474d9c75bca9da5f903ac0e34a`
MD5: `0690a3f691de3bff419c5e84adac4bba`
BLAKE2b-256: `d2fdc2f960b7967f4bc443b4d254b6e6577299381b4c423e4b8c116db62f4bad`

pymupdf4llm_c-1.0.4-cp311-cp311-macosx_15_0_arm64.whl
Upload date: Nov 22, 2025
Size: 38.6 MB
Tags: CPython 3.11, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7
SHA256: `964ba62086fee432d3dcbdca2d89b36a4030ac7d6069082a87149f342736da55`
MD5: `b530489791f4b80fd1cf3b22f45e8d6e`
BLAKE2b-256: `035b8768f3b5b790ae2b488550aa85190e545af6590fe7cbede0bb079b7a7662`
