C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.


Highlights

  • Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and the basic heuristics used by retrieval pipelines.
  • Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
  • Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.


Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")
output_dir = pdf_path.with_name(f"{pdf_path.stem}_json")

try:
    json_files = to_json(pdf_path, output_dir=output_dir)
    print(f"Generated {len(json_files)} files:")
    for path in json_files:
        print(f"  - {path}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Advanced features

Collect parsed JSON structures:

results = to_json("report.pdf", collect=True)
for page_blocks in results:
    for block in page_blocks:
        print(f"Block type: {block['type']}, Text: {block.get('text', '')}")
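
The collected structure is a list of pages, each holding a list of block dictionaries, so it can be flattened into plain text with a few lines. The following is a minimal sketch, not part of the library: `pages_to_text` is a hypothetical helper, and the sample data is hand-written in the documented shape so the snippet runs without a PDF.

```python
from typing import Any

Block = dict[str, Any]


def pages_to_text(pages: list[list[Block]]) -> str:
    """Join the text of every block, separating pages with a blank line."""
    page_texts = []
    for page_blocks in pages:
        # Blocks without text (for example figures) are skipped.
        lines = [block["text"] for block in page_blocks if block.get("text")]
        page_texts.append("\n".join(lines))
    return "\n\n".join(page_texts)


# Hand-written sample; real data would come from
# to_json("report.pdf", collect=True).
sample_pages = [
    [
        {"type": "heading", "text": "Introduction"},
        {"type": "paragraph", "text": "Extracted text content"},
    ],
    [
        {"type": "figure"},
        {"type": "paragraph", "text": "Second page"},
    ],
]

print(pages_to_text(sample_pages))
```

Using `block.get("text")` mirrors the basic example above, which also treats the text field as optional.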

Override the shared library location:

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)

Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}
  • Error handling – all functions return Result<_, PdfError>
  • Memory-safe – FFI confined internally, no unsafe needed at the call site
  • Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

Each PDF page is extracted to a separate JSON file (e.g., page_001.json) containing an array of block objects:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  }
]
  • Block types: paragraph, heading, table, list, figure
  • Key fields: bbox (bounding box), type, font_size, font_weight
  • Tables include row_count, col_count, confidence
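
As an illustration of how these fields might be consumed, the sketch below counts blocks by type, keeps only tables above a confidence threshold, and derives a block's size from its bbox. It rests on assumptions that are not confirmed above: that bbox is ordered [x0, y0, x1, y1], that confidence is a number where higher means more certain, and that 0.5 is a sensible cutoff. The helper names and the table entries in the sample are hypothetical.

```python
from collections import Counter


def count_block_types(blocks):
    """Count how many blocks of each type a page contains."""
    return Counter(block["type"] for block in blocks)


def confident_tables(blocks, min_confidence=0.5):
    """Keep table blocks whose confidence reaches the (assumed) threshold."""
    return [
        block
        for block in blocks
        if block["type"] == "table"
        and block.get("confidence", 0.0) >= min_confidence
    ]


def bbox_size(bbox):
    """Return (width, height), assuming bbox is [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


# The paragraph is copied from the example above; the tables are made up.
sample_blocks = [
    {"type": "paragraph", "text": "Extracted text content",
     "bbox": [72.0, 100.5, 523.5, 130.2]},
    {"type": "table", "bbox": [72.0, 200.0, 523.5, 320.0],
     "row_count": 3, "col_count": 2, "confidence": 0.9},
    {"type": "table", "bbox": [72.0, 400.0, 300.0, 430.0],
     "row_count": 1, "col_count": 2, "confidence": 0.2},
]

print(count_block_types(sample_blocks))
print(len(confident_tables(sample_blocks)), "table(s) kept")
print(bbox_size(sample_blocks[0]["bbox"]))
```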

Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.
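
Once the files have been written, they can be read back in page order. This is a minimal sketch under one assumption: file names follow the zero-padded page_XXX.json pattern, so a plain string sort matches page order (this would break if the padding width varied). `load_pages` is a hypothetical helper, not part of pymupdf4llm_c, and the demo uses a temporary directory in place of a real output directory.

```python
import json
import tempfile
from pathlib import Path


def load_pages(output_dir):
    """Return the parsed block list of every page_*.json file, in page order."""
    pages = []
    # Zero-padded names (page_001.json, page_002.json, ...) sort correctly as strings.
    for path in sorted(Path(output_dir).glob("page_*.json")):
        with path.open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages


# Demo on a throwaway directory that stands in for a real output_dir.
with tempfile.TemporaryDirectory() as tmp:
    for number, text in [(2, "second"), (1, "first")]:
        block = {"type": "paragraph", "text": text}
        Path(tmp, f"page_{number:03d}.json").write_text(
            json.dumps([block]), encoding="utf-8"
        )
    pages = load_pages(tmp)
    print([page[0]["text"] for page in pages])
```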


Development Workflow

  1. Create and activate a virtual environment, then install dev extras:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

  2. Build the native extractor (see BUILD.md).

  3. Run linting and tests:

./lint.sh
pytest

Troubleshooting

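  • Library not found – Build libtomd (see BUILD.md) and make sure it is discoverable, or point to it explicitly with ConversionConfig(lib_path=...).
  • Build failures – Check that the MuPDF headers and libraries are available; BUILD.md covers building MuPDF from the submodule.
  • Different JSON output – The heuristics live in the C code under src/; rebuild the extractor after changing them.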
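
For the "library not found" case, one option is to look for the shared library in a few candidate directories and pass the result on explicitly. The sketch below is only an illustration: `find_libtomd` is a hypothetical helper, the candidate directories are up to you, and the `libtomd.dylib` name for macOS is an assumption (only `libtomd.so` appears in this README).

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional

# "libtomd.so" is the name used in this README; ".dylib" is an assumed macOS name.
LIB_NAMES = ("libtomd.so", "libtomd.dylib")


def find_libtomd(search_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first existing libtomd file found in search_dirs, else None."""
    for directory in search_dirs:
        for name in LIB_NAMES:
            candidate = Path(directory) / name
            if candidate.is_file():
                return candidate
    return None


# Demo with a temporary directory standing in for a real build folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "libtomd.so").touch()
    found = find_libtomd([Path(tmp) / "missing", Path(tmp)])
    print(found.name if found else "libtomd not found")
```

A path found this way can be handed to `ConversionConfig(lib_path=...)` as shown in the advanced Python usage.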

License

Licensed under AGPL v3, which is required because MuPDF itself is AGPL-licensed.

If your project is free and open source, you can use this package as long as your project is also AGPL-licensed. For commercial projects, you need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • pymupdf4llm_c-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.1 MB) – CPython 3.12, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.1.1-cp312-cp312-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.12, macOS 11.0+ ARM64
  • pymupdf4llm_c-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.1 MB) – CPython 3.11, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.1.1-cp311-cp311-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.11, macOS 11.0+ ARM64
  • pymupdf4llm_c-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.1 MB) – CPython 3.10, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.1.1-cp310-cp310-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.10, macOS 11.0+ ARM64
  • pymupdf4llm_c-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.1 MB) – CPython 3.9, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.1.1-cp39-cp39-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.9, macOS 11.0+ ARM64

File details

Details for the file pymupdf4llm_c-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d23b8220c57e6c5702d1d83eec8e57e619cb7da31e1d13d952eac213093a094b
MD5 58ace4277191256a541a79d290eff518
BLAKE2b-256 b0d196b04ba8ed767be5f30e23a53336f2ae7d8417068910bf3013bbf9dbf466

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pymupdf4llm_c-1.1.1-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 31e76f04ac348801f96f0ec8405c3daa4b12f2f1be611d4130ae377ff96bc94b
MD5 5120af79960bf6b5e4f2d0445be4df38
BLAKE2b-256 1c5bb1bbbd2bfa5ad6fbaca1bff760e2f5071422e75cb748b9327bdd92787f78

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

File details

Details for the file pymupdf4llm_c-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5f5116c8cd1dadde8dcc172358dc74f5c4ddebc0c4d09512bc15db7a03295891
MD5 195d84a2d9380833a4a0650607fead6e
BLAKE2b-256 c8c5da5d1bbef7e811c945f008b782fd9fa465d601e864622699789093f4520e

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

File details

Details for the file pymupdf4llm_c-1.1.1-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d1dfccd297653843a2ca198c2c39c9b574c82be76e7f00347030c7bf15450d1b
MD5 4e1483c40c24215b8eb71e1d8723e432
BLAKE2b-256 246cae0a182f6370fb2e0479edb94f937f1c53355f893fd685c512a787a370eb

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

File details

Details for the file pymupdf4llm_c-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 62faa53669171799f918d9470e3b201a7d2ab09e91741535293d8dce927610ca
MD5 16c57f756fe6c558d9602f9935b54bf7
BLAKE2b-256 53f064916080d7c6517533c5372fa67d114ace2b706530baa21303ca3f20602a

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

File details

Details for the file pymupdf4llm_c-1.1.1-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0d9e36c5754da6ca0a4b1a74e6bb57663649a6d73f6f5108826e67489a7cfdd6
MD5 dfcc8b4307520b7a42dd6eb63fb47a4c
BLAKE2b-256 255733e9d79721bcc1df1b3075df44c69852f1c6b693628ec296c91d0bab1436

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp310-cp310-macosx_11_0_arm64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

File details

Details for the file pymupdf4llm_c-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a72d43460854a85fe9e5cd78f1501691d45c42a46bb08d1af805476207a5a22c
MD5 540cf951a15b8d1ef4edc284329fa7ed
BLAKE2b-256 7e8a4090d88ce33d761399329c6cb73f881f8919db5dc334fed4a7ef177b6f0d

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C

File details

Details for the file pymupdf4llm_c-1.1.1-cp39-cp39-macosx_11_0_arm64.whl.

File hashes

Hashes for pymupdf4llm_c-1.1.1-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ca8c1b4295cf80eff3a8958b1b2cf24ea019f0cda5e88f041826332a14030a47
MD5 5973b2db53ba57326a952d903d78f882
BLAKE2b-256 e2d5aedbdf1a1290d5d0e5def1d8c501db290b4325ff571c3ae0858a246d2aaf

Provenance

The following attestation bundles were made for pymupdf4llm_c-1.1.1-cp39-cp39-macosx_11_0_arm64.whl:

Publisher: publish.yml on intercepted16/pymupdf4llm-C
