C-backed PDF to structured JSON extractor.

PyMuPDF4LLM-C

PyMuPDF4LLM-C provides a high-throughput C extractor for MuPDF that emits page-level JSON describing text, layout metadata, figures, and detected tables. It exposes both Python and Rust bindings for safe and ergonomic access.


Highlights

  • Native extractor – libtomd walks each PDF page with MuPDF and writes page_XXX.json artifacts containing block type, geometry, font metrics, and basic heuristics used by retrieval pipelines.
  • Safe, idiomatic bindings – Python (pymupdf4llm_c) and Rust (pymupdf4llm-c) APIs provide easy, memory-safe access without exposing raw C pointers.
  • Single source of truth – All heuristics, normalization, and JSON serialization live in dedicated C modules under src/, with public headers exposed via include/ for downstream extensions.

Installation

Install the Python package from PyPI:

pip install pymupdf4llm-c

For Rust, install with Cargo:

cargo add pymupdf4llm-c

Building the Native Extractor

For instructions on building the C extractor, see the dedicated BUILD.md file. This covers building MuPDF from the submodule, compiling the shared library, and setting up libmupdf.so.


Usage

Python Usage

Basic usage

from pathlib import Path
from pymupdf4llm_c import ConversionConfig, ExtractionError, to_json

pdf_path = Path("example.pdf")

try:
    # Extract to a merged JSON file (default)
    output_file = to_json(pdf_path)
    print(f"Extracted to: {output_file}")
except ExtractionError as exc:
    print(f"Extraction failed: {exc}")

Collecting parsed blocks in memory

Use collect=True to get parsed JSON in memory instead of writing to a file:

from pymupdf4llm_c import to_json

# Returns list of page data (merged JSON structure)
pages = to_json("report.pdf", collect=True)

for page_obj in pages:
    page_num = page_obj.get("page", 0)
    blocks = page_obj.get("data", [])
    print(f"Page {page_num}: {len(blocks)} blocks")
    
    for block in blocks:
        print(f"  Type: {block.get('type')}, Text: {block.get('text', '')}")
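Once the pages are in memory, ordinary Python is enough to summarize them. The sketch below assumes the merged structure used in the example above (a list of objects with page and data keys). The sample data and the helper name count_block_types are illustrative and not part of the library.

```python
from collections import Counter


def count_block_types(pages):
    """Return {page value: Counter of block types} for merged page data."""
    summary = {}
    for page_obj in pages:
        blocks = page_obj.get("data", [])
        summary[page_obj.get("page", 0)] = Counter(
            block.get("type", "unknown") for block in blocks
        )
    return summary


# Hand-written sample in the merged shape (page values are illustrative);
# real data would come from to_json(..., collect=True).
sample_pages = [
    {"page": 0, "data": [
        {"type": "heading", "text": "Intro"},
        {"type": "paragraph", "text": "Hello"},
        {"type": "paragraph", "text": "World"},
    ]},
    {"page": 1, "data": [{"type": "table", "row_count": 3, "col_count": 2}]},
]

for page_num, counts in count_block_types(sample_pages).items():
    print(page_num, dict(counts))
```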

Memory and Validation:

  • collect=True validates the JSON structure and raises ValueError if invalid
  • For PDFs larger than ~100MB, a warning is logged recommending iterate_json_pages() instead
  • Disable the warning with warn_large_collect=False:

# Suppress memory warning for large PDFs
pages = to_json("large_document.pdf", collect=True, warn_large_collect=False)
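The exact checks performed by collect=True are not spelled out here. As a rough illustration of the kind of structural validation described above, the following sketch raises ValueError when the data is not a list of page objects that each hold a list of block objects. validate_pages is a hypothetical helper, not the library's implementation.

```python
def validate_pages(pages):
    """Raise ValueError unless `pages` looks like merged page data."""
    if not isinstance(pages, list):
        raise ValueError("expected a list of page objects")
    for index, page_obj in enumerate(pages):
        if not isinstance(page_obj, dict):
            raise ValueError(f"page entry {index} is not an object")
        blocks = page_obj.get("data")
        if not isinstance(blocks, list):
            raise ValueError(f"page entry {index} has no 'data' list")
        for block in blocks:
            # Assumption: every block is an object carrying a 'type' key
            if not isinstance(block, dict) or "type" not in block:
                raise ValueError(f"page entry {index} has a block without 'type'")
    return pages


try:
    validate_pages([{"page": 0, "data": [{"text": "missing type"}]}])
except ValueError as exc:
    print(f"Invalid structure: {exc}")
```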

Iterating pages with validation

For validation and type-safe iteration over JSON page files, use the helper:

from pymupdf4llm_c import iterate_json_pages

# Yields each page as a typed Block list
for page_blocks in iterate_json_pages("path/to/page_001.json"):
    for block in page_blocks:
        print(f"Block: {block['type']}")
        if block['type'] == 'table':
            print(f"  Table: {block.get('row_count')}x{block.get('col_count')}")

Memory-Efficient Iteration: This generator is recommended for large PDFs that would consume significant memory with collect=True. It validates the JSON structure on the fly and yields pages one at a time:

from pymupdf4llm_c import to_json, iterate_json_pages

def process_page(page_blocks):
    # Placeholder handler; replace with your own per-page logic
    print(f"{len(page_blocks)} blocks")

# Extract PDF (writes to disk, low memory)
output_file = to_json("large_document.pdf")

# Iterate pages without loading them all into memory
for page_blocks in iterate_json_pages(output_file):
    process_page(page_blocks)

Legacy per-page output

Extract to individual per-page JSON files:

output_dir = Path("output_json")
json_files = to_json(pdf_path, output_dir=output_dir)
print(f"Generated {len(json_files)} files")
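
To read per-page files back later, the standard json and pathlib modules are sufficient. This sketch assumes the files follow the page_XXX.json naming mentioned earlier and that each holds an array of blocks. load_page_files is an illustrative helper, not a library function, and the demonstration writes its own sample files to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def load_page_files(output_dir):
    """Yield (path, blocks) for each page_*.json file, in filename order."""
    for json_path in sorted(Path(output_dir).glob("page_*.json")):
        with json_path.open(encoding="utf-8") as handle:
            yield json_path, json.load(handle)


# Demonstration with two hand-written page files
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "page_002.json": [{"type": "text", "text": "second page"}],
        "page_001.json": [{"type": "heading", "text": "first page"}],
    }
    for name, blocks in samples.items():
        (Path(tmp) / name).write_text(json.dumps(blocks), encoding="utf-8")

    for json_path, blocks in load_page_files(tmp):
        print(f"{json_path.name}: {len(blocks)} blocks")
```

Sorting by filename keeps pages in order as long as the page numbers are zero-padded to the same width.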

Override the shared library location

config = ConversionConfig(lib_path=Path("/opt/lib/libtomd.so"))
results = to_json("report.pdf", config=config, collect=True)

Rust Usage

Basic usage

use std::path::Path;
use pymupdf4llm_c::{to_json, to_json_collect, extract_page_json, PdfError};

fn main() -> Result<(), PdfError> {
    let pdf_path = Path::new("example.pdf");

    // Extract to files
    let paths = to_json(pdf_path, None)?;
    println!("Generated {} JSON files:", paths.len());
    for path in &paths {
        println!("  - {:?}", path);
    }

    // Collect JSON in memory
    let pages = to_json_collect(pdf_path, None)?;
    println!("Parsed {} pages in memory", pages.len());

    // Extract single page
    let page_json = extract_page_json(pdf_path, 0)?;
    println!("First page JSON: {}", page_json);

    Ok(())
}

  • Error handling – all functions return Result<_, PdfError>
  • Memory-safe – FFI confined internally, no unsafe needed at the call site
  • Output – file paths or in-memory JSON (serde_json::Value)

Output Structure

JSON Output Structure

In the merged output, each page object's data key holds an array of block objects. In per-page output, each PDF page is written to its own JSON file (e.g., page_001.json) containing the same kind of array:

[
  {
    "type": "paragraph",
    "text": "Extracted text content",
    "bbox": [72.0, 100.5, 523.5, 130.2],
    "font_size": 11.0,
    "font_weight": "normal",
    "page_number": 0,
    "length": 22
  },
  {
    "type": "text",
    "text": "Bold example text",
    "bbox": [72.0, 140.5, 523.5, 155.2],
    "font_size": 12.0,
    "font_weight": "bold",
    "page_number": 0,
    "length": 17,
    "spans": [
      {
        "text": "Bold example text",
        "bold": true,
        "font_size": 12.0
      }
    ]
  }
]

Key Fields:

  • type – text, heading, paragraph, table, figure, list, code

  • bbox – Bounding box [x0, y0, x1, y1]

  • font_size – Average font size in points

  • font_weight – normal, bold, or other weights

  • spans – (Optional) Array of styled text segments. Only present when:

    • There are multiple text segments with different styling, OR
    • The text has applied styling (bold, italic, monospace, etc.)

    Plain unstyled text blocks will not include the spans array to avoid duplication.
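
Because spans is optional, consumers usually need a fallback. The sketch below returns the block's own spans when present and otherwise synthesizes a single span from the block-level text, font_size, and font_weight fields shown in the sample above. The helper name iter_spans and the shape of the synthesized span are assumptions made for illustration.

```python
def iter_spans(block):
    """Return the block's spans, or one synthesized span for plain blocks."""
    spans = block.get("spans")
    if spans:
        return spans
    # Plain blocks omit 'spans', so rebuild one from block-level fields
    return [{
        "text": block.get("text", ""),
        "bold": block.get("font_weight") == "bold",
        "font_size": block.get("font_size"),
    }]


plain = {"type": "paragraph", "text": "Extracted text content",
         "font_size": 11.0, "font_weight": "normal"}
styled = {"type": "text", "text": "Bold example text", "font_weight": "bold",
          "spans": [{"text": "Bold example text", "bold": True, "font_size": 12.0}]}

for block in (plain, styled):
    for span in iter_spans(block):
        print(span["text"], span["bold"])
```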

Tables include row_count, col_count, and confidence scores.
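
A small sketch of how these fields might be consumed: it computes the width and height of a bbox given as [x0, y0, x1, y1] and picks out table blocks. The helper names and sample blocks are illustrative. The row_count and col_count keys come from the description above; the name of the confidence key is not shown in this document, so it is left out.

```python
def bbox_size(bbox):
    """Return (width, height) for a [x0, y0, x1, y1] bounding box."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


def tables(blocks):
    """Return only the blocks whose type is 'table'."""
    return [block for block in blocks if block.get("type") == "table"]


sample_blocks = [
    {"type": "paragraph", "text": "Extracted text content",
     "bbox": [72.0, 100.5, 523.5, 130.2]},
    {"type": "table", "bbox": [72.0, 200.0, 523.5, 320.0],
     "row_count": 4, "col_count": 3},
]

for block in sample_blocks:
    width, height = bbox_size(block["bbox"])
    print(f"{block['type']}: {width:.1f} x {height:.1f}")

for table in tables(sample_blocks):
    print(f"table with {table['row_count']} rows and {table['col_count']} columns")
```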


Command-line Usage (Python)

python -m pymupdf4llm_c.main input.pdf [output_dir]

If output_dir is omitted, a sibling directory suffixed with _json is created. The command prints the destination and each JSON file that was written.


Development Workflow

  1. Create and activate a virtual environment, then install dev extras:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

  2. Build the native extractor (see BUILD.md).

  3. Run linting and tests:

./lint.sh
pytest

Troubleshooting

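  • Library not found – Build libtomd (see BUILD.md) and make sure the loader can find it, or pass its location explicitly with ConversionConfig(lib_path=...).
  • Build failures – Check that the MuPDF headers and libraries are present; BUILD.md covers building MuPDF from the submodule.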
  • Different JSON output – Heuristics live in C code under src/; rebuild after changes.
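
When diagnosing a "library not found" error, it can help to check whether the dynamic loader accepts a given file at all. The following sketch uses only the standard ctypes module. The path is a placeholder, and can_load_library is an illustrative helper rather than part of pymupdf4llm_c.

```python
import ctypes


def can_load_library(path):
    """Return True if the shared library at `path` can be loaded."""
    try:
        ctypes.CDLL(str(path))
    except OSError as exc:
        # Raised when the file is missing or one of its dependencies cannot be resolved
        print(f"Cannot load {path}: {exc}")
        return False
    return True


# Placeholder path; substitute the location of your own libtomd build
print(can_load_library("/opt/lib/libtomd.so"))
```

If the library loads here but extraction still fails, passing the same path through ConversionConfig(lib_path=...) rules out a search-path problem.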

License

AGPL v3. Needed because MuPDF is AGPL.

Free and open-source projects can use it provided they are also licensed under the AGPL. Commercial projects need a license from Artifex, the creators of MuPDF.

See LICENSE for full details.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • pymupdf4llm_c-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.2 MB) – CPython 3.12, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.2.0-cp312-cp312-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.12, macOS 11.0+ ARM64
  • pymupdf4llm_c-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.2 MB) – CPython 3.11, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.2.0-cp311-cp311-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.11, macOS 11.0+ ARM64
  • pymupdf4llm_c-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.2 MB) – CPython 3.10, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.2.0-cp310-cp310-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.10, macOS 11.0+ ARM64
  • pymupdf4llm_c-1.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77.2 MB) – CPython 3.9, manylinux: glibc 2.17+ x86-64
  • pymupdf4llm_c-1.2.0-cp39-cp39-macosx_11_0_arm64.whl (40.9 MB) – CPython 3.9, macOS 11.0+ ARM64
