C-backed PDF to Markdown conversion with Python fallbacks

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.


Highlights

  • Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
  • Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
  • Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.


Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Advanced features

Collect parsed JSON structures:

results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")
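
Once collected, the result is ordinary Python data, so it can be post-processed without touching the extractor. The sketch below assumes the shape implied by the loop above: a list of pages, each a list of block dictionaries with a `type` and an optional `text`. `page_texts` is a hypothetical helper, not part of the package, and the sample data is invented for illustration.

```python
from typing import Any, Optional

Block = dict[str, Any]


def page_texts(pages: list[list[Block]], wanted: Optional[set[str]] = None) -> list[str]:
    """Join the text of each page's blocks, optionally keeping only some block types."""
    texts = []
    for page_blocks in pages:
        parts = [
            block.get("text", "")
            for block in page_blocks
            if wanted is None or block.get("type") in wanted
        ]
        # Drop blocks without text (figures, for example) so no empty lines pile up.
        texts.append("\n".join(part for part in parts if part))
    return texts


# Stand-in for `to_json("report.pdf", collect=True)`; the values are invented.
sample_pages = [
    [
        {"type": "heading", "text": "Introduction"},
        {"type": "paragraph", "text": "Extracted text content"},
        {"type": "figure"},
    ],
    [{"type": "paragraph", "text": "Second page"}],
]

print(page_texts(sample_pages))
print(page_texts(sample_pages, wanted={"heading"}))
```

A page with no matching blocks yields an empty string, so the returned list always has one entry per page.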

Override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)

Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}

  • Error handling – all functions return Result<_, PdfError>
  • Memory-safe – FFI confined internally, no unsafe needed at the call site
  • Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

JSON Output Structure

Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]

  • Block types: paragraph, heading, table, list, figure
  • Key fields: bbox (bounding box), type, font_size, font_weight
  • Tables include row_count, col_count, confidence
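
Because each page file is a plain JSON array, downstream filtering needs only the standard library. The sketch below assumes the `page_NNN.json` naming and the fields listed above. `load_pages` and `confident_tables` are hypothetical helpers, `confidence` is assumed to be a number where higher is better, and the 0.8 threshold is an arbitrary illustration rather than a value recommended by the project.

```python
import json
import tempfile
from pathlib import Path


def load_pages(output_dir: Path) -> list[list[dict]]:
    """Read every page_*.json file in name order and return one block list per page."""
    # Name order matches page order only because the page numbers are zero-padded.
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(output_dir.glob("page_*.json"))
    ]


def confident_tables(blocks: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep table blocks whose confidence is at least `threshold` (assumed numeric)."""
    return [
        block
        for block in blocks
        if block.get("type") == "table" and block.get("confidence", 0.0) >= threshold
    ]


# Demo on invented data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "page_002.json").write_text(json.dumps([{"type": "paragraph", "text": "b"}]))
    (out / "page_001.json").write_text(
        json.dumps(
            [
                {"type": "table", "row_count": 3, "col_count": 2, "confidence": 0.9},
                {"type": "table", "row_count": 1, "col_count": 1, "confidence": 0.4},
            ]
        )
    )
    pages = load_pages(out)

kept = confident_tables(pages[0])
print(len(pages), [table["row_count"] for table in kept])
```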

Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
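
The default destination can be reproduced with `pathlib`. This sketch mirrors the `with_name` expression from the basic Python example; it is an assumption about what the command does, not a copy of its code, and `default_output_dir` is a hypothetical name.

```python
from pathlib import Path


def default_output_dir(pdf_path: Path) -> Path:
    """Return the sibling directory named after the PDF stem plus '_json'."""
    return pdf_path.with_name(f"{pdf_path.stem}_json")


print(default_output_dir(Path("reports/input.pdf")))
```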


Development Workflow

  1. Create and activate a virtual environment, then install dev extras:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

  2. Build the native extractor (see BUILD.md)

  3. Run linting and tests:

./lint.sh
pytest

Troubleshooting

  • Library not found – Build libtomd and ensure it is discoverable.
  • Build failures – Check MuPDF headers/libraries.
  • Different JSON output – Heuristics live in C code under src/; rebuild after changes.
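
For the "Library not found" case, one way to check discoverability before converting is to probe a few candidate locations and hand the first hit to `ConversionConfig(lib_path=...)`. The sketch below is only an illustration: `first_existing` is a hypothetical helper, `build/libtomd.so` is an invented candidate path, and the package's own lookup order is not described in this README.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None if none do."""
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Candidate locations are assumptions for illustration only.
candidates = [
    Path("/opt/lib/libtomd.so"),
    Path("build/libtomd.so"),
]

lib_path = first_existing(candidates)
if lib_path is None:
    print("libtomd not found in any candidate location; build it first (see BUILD.md)")
else:
    # The path could then be passed as ConversionConfig(lib_path=lib_path).
    print(f"Using {lib_path}")
```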

License

Licensed under AGPL v3, which is required because MuPDF is AGPL-licensed.

If your project is free and open source, you can use this package as long as your project is also AGPL-licensed. For commercial projects, you need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.
