A fast PDF extractor; a 500 pages/s alternative to Marker, Docling, PyMUPDF4LLM & others.

Project description

FibrumPDF

This project's C extension has now been rewritten in Go. Performance, output quality, and code quality have all improved, while the Python API remains the same. It has also been renamed from PyMUPDF4LLM-C, because that was way too close to PyMuPDF4LLM.

A fast PDF extractor for Python written in Go using MuPDF in the backend, inspired by pymupdf4llm. I took many of its heuristics and approaches. Initially, it was supposed to be a 1:1 port (just generating the same Markdown output), but I later pivoted.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

Speed (averaged): ~520 pages/second on CPU. 1 million pages in ~32 minutes.

For the full numbers, see the Performance Breakdown section.


Installation

pip install pymupdf4llm-c

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


What it's good at

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

What it's bad at

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

from pymupdf4llm_c import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)
    
# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

This still saves the JSON to result.path; collect() just loads it into memory as well. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading everything into RAM may crash the process. Stream the pages instead, as described in the next section.

stream pages (memory-efficient)

from pymupdf4llm_c import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
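
A streamed result can feed an aggregate without holding every page in memory. The sketch below tallies block types. `count_block_types` is a hypothetical helper, not part of the library; it only assumes that each page is iterable and that each block exposes `.type`, as in the loop above. The stand-in pages exist only so the snippet runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types across pages, consuming one page at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in data shaped like the streamed pages (illustrative only).
fake_pages = [
    [SimpleNamespace(type="heading"), SimpleNamespace(type="text")],
    [SimpleNamespace(type="text"), SimpleNamespace(type="table")],
]
print(count_block_types(fake_pages))
```

With a real document, passing the object returned by to_json in place of `fake_pages` should work the same way, provided iteration behaves as described above.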

convert to markdown

from pymupdf4llm_c import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

.markdown is a property, not a function, so access it without parentheses.

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, mono-space, etc.)

Note that a span represents a run of text that shares the same styling. Most blocks are likely to contain only one span.
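
Because the output is plain JSON, it can also be processed without the library's page objects. The following is a minimal sketch that assumes the file's top level is a list of pages, each a list of blocks; that layout is an assumption based on the description above. `block_text` and `blocks_of_type` are hypothetical helpers, and the sample data is hand-written rather than real output.

```python
import json

# A tiny hand-written sample in the documented block shape (not real output).
sample = """
[
  [
    {"type": "heading", "bbox": [72.0, 50.0, 300.0, 80.0], "font_size": 24.0,
     "length": 5, "spans": [{"text": "Intro", "bold": true}]},
    {"type": "text", "bbox": [72.0, 100.0, 540.0, 130.0], "font_size": 12.0,
     "length": 11, "spans": [{"text": "Hello "}, {"text": "world", "italic": true}]}
  ]
]
"""


def block_text(block):
    """Concatenate the text of every span in a block."""
    return "".join(span["text"] for span in block.get("spans", []))


def blocks_of_type(pages, wanted):
    """Yield (page_index, block) for blocks whose type matches `wanted`."""
    for page_index, page in enumerate(pages):
        for block in page:
            if block["type"] == wanted:
                yield page_index, block


pages = json.loads(sample)
for page_index, block in blocks_of_type(pages, "text"):
    print(page_index, block_text(block))
```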

Block types

The examples below are pseudo-JSON (they contain comments) and only illustrate the shape of the output.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
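
Heading blocks carry a level, which is enough to build a rough outline. This is a sketch under two assumptions: levels start at 1 (as in the example above) and pages are plain lists of block dictionaries. `outline` is a hypothetical helper, and the sample pages are made up.

```python
def outline(pages):
    """Return indented outline lines built from heading blocks."""
    lines = []
    for page_number, page in enumerate(pages, start=1):
        for block in page:
            if block["type"] != "heading":
                continue
            title = "".join(span["text"] for span in block["spans"])
            # One indent step per level below the top level.
            indent = "  " * (block["level"] - 1)
            lines.append(f"{indent}{title} (p. {page_number})")
    return lines


pages = [
    [{"type": "heading", "level": 1, "spans": [{"text": "Heading Text"}]}],
    [
        {"type": "text", "spans": [{"text": "Body"}]},
        {"type": "heading", "level": 2, "spans": [{"text": "Details"}]},
    ],
]
print("\n".join(outline(pages)))
```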

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
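
A list block keeps its text in items rather than in the top-level spans. The sketch below renders one as Markdown lines. It assumes that indent is an integer nesting depth and that prefix holds the printed marker for numbered items and false otherwise, as the example above suggests. `render_list` is a hypothetical helper, not the library's own Markdown conversion.

```python
def render_list(block, indent_width=2):
    """Render a list block as Markdown lines."""
    lines = []
    for item in block["items"]:
        text = "".join(span["text"] for span in item["spans"])
        # Numbered items carry their marker in "prefix"; bulleted items have false.
        marker = item["prefix"] if item["prefix"] else "-"
        pad = " " * (indent_width * item["indent"])
        lines.append(f"{pad}{marker} {text}")
    return lines


list_block = {
    "type": "list",
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 0, "prefix": "1."},
    ],
}
print("\n".join(render_list(list_block)))
```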

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
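
Table text lives in rows, then cells, then spans. The sketch below flattens a table block into rows of strings and renders them as a Markdown table. Treating the first row as the header is an assumption, the second row of the sample is invented, and `table_to_rows` and `rows_to_markdown` are hypothetical helpers.

```python
def table_to_rows(block):
    """Flatten a table block into a list of rows of cell text."""
    return [
        ["".join(span["text"] for span in cell["spans"]) for cell in row["cells"]]
        for row in block["rows"]
    ]


def rows_to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


table_block = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]}, {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]}, {"spans": [{"text": "2"}]}]},
    ],
}
print(rows_to_markdown(table_to_rows(table_block)))
```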

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
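
The style flags are enough to write a custom renderer. The sketch below maps the flags listed above to Markdown markers. `span_to_markdown` is a hypothetical helper and is not claimed to match what the library's own .markdown property produces; the URL is a placeholder.

```python
def span_to_markdown(span):
    """Wrap span text in Markdown markers according to its style flags."""
    text = span["text"]
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    # "uri" is a string when the span is linked, otherwise false.
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


print(span_to_markdown({"text": "fast", "bold": True}))
print(span_to_markdown({"text": "docs", "link": True, "uri": "https://example.com"}))
```

Superscript and subscript are left out because Markdown has no standard syntax for them.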

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
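
a minimal sketch of the y-gap idea, assuming bbox is [x0, y0, x1, y1] with y growing downward and that the blocks of a page arrive in reading order. `chunk_by_vertical_gap` and the 24-point threshold are illustrative choices, not part of the library.

```python
def chunk_by_vertical_gap(blocks, max_gap=24.0):
    """Group blocks into chunks, starting a new chunk at each large y-gap."""
    chunks = []
    previous_bottom = None
    for block in blocks:
        top, bottom = block["bbox"][1], block["bbox"][3]
        # The first block, or a gap larger than max_gap, opens a new chunk.
        if previous_bottom is None or top - previous_bottom > max_gap:
            chunks.append([])
        chunks[-1].append(block)
        previous_bottom = bottom
    return chunks


blocks = [
    {"type": "heading", "bbox": [72, 50, 300, 80]},
    {"type": "text", "bbox": [72, 90, 540, 200]},
    {"type": "text", "bbox": [72, 260, 540, 330]},
]
chunks = chunk_by_vertical_gap(blocks)
print([len(chunk) for chunk in chunks])
```

the same loop could also split on a heading block or a font_size change; the threshold will need tuning per document set.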

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all. But I just got bored of waiting for chunking the PDFs every time I made a minor change. I couldn't find anything with even 50% of the quality that would be faster. And anyway, my chunks were trash. So it was either: raw text, or ML, and I didn't want either of them.


Performance Breakdown

Using go/cmd/tomd/main.go with input_pdf [output_dir], I measured performance on:

  • ~1600 page document (path not available)
  • ~150 page document (test_data/pdfs/nist.pdf)

Performance depends on document size and available cores. With more pages to saturate your cores, you may see better throughput. Throughput should scale approximately linearly with core count, so wall-clock time should shrink roughly in proportion.

Test system: AMD Ryzen 7 4800H (8 cores, 6 used)

Runtime breakdown:

  • Go code: ~25% of runtime
  • MuPDF: ~75% of runtime

On the NIST document (150 pages): Go spent 78ms out of 363ms total (21%), MuPDF spent 285ms (79%).

Calculated average:

  • 1600 pages in 3000ms + 150 pages in 350ms = 1750 pages in 3350ms
  • ~520 pages/second
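
The average above pools both runs (total pages divided by total time) rather than averaging the two per-document rates. The arithmetic, using only the figures quoted above, can be checked like this:

```python
# (pages, milliseconds) for the two benchmark documents quoted above.
runs = [(1600, 3000), (150, 350)]

total_pages = sum(pages for pages, _ in runs)
total_seconds = sum(ms for _, ms in runs) / 1000

pages_per_second = total_pages / total_seconds
minutes_per_million = 1_000_000 / pages_per_second / 60

print(f"{pages_per_second:.0f} pages/s")
print(f"{minutes_per_million:.1f} minutes per million pages")
```

This gives roughly 522 pages per second and about 32 minutes per million pages, consistent with the rounded figures quoted in this document.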

Licensing and Links

licensing

TL;DR: use it all you want in OSS software. if you buy a license for MuPDF from Artifex, you are exempt from the AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use requires license from Artifex.

modifications and enhancements specific to this library are copyright 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • fibrum_pdf-1.0.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (40.4 MB): CPython 3.14, manylinux (glibc 2.27+ and glibc 2.28+), x86-64
  • fibrum_pdf-1.0.1-cp314-cp314-macosx_15_0_arm64.whl (41.7 MB): CPython 3.14, macOS 15.0+, ARM64
  • fibrum_pdf-1.0.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (40.4 MB): CPython 3.13, manylinux (glibc 2.27+ and glibc 2.28+), x86-64
  • fibrum_pdf-1.0.1-cp313-cp313-macosx_15_0_arm64.whl (41.7 MB): CPython 3.13, macOS 15.0+, ARM64
  • fibrum_pdf-1.0.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (40.4 MB): CPython 3.12, manylinux (glibc 2.27+ and glibc 2.28+), x86-64
  • fibrum_pdf-1.0.1-cp312-cp312-macosx_15_0_arm64.whl (41.7 MB): CPython 3.12, macOS 15.0+, ARM64
  • fibrum_pdf-1.0.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (40.4 MB): CPython 3.11, manylinux (glibc 2.27+ and glibc 2.28+), x86-64
  • fibrum_pdf-1.0.1-cp311-cp311-macosx_15_0_arm64.whl (41.7 MB): CPython 3.11, macOS 15.0+, ARM64
  • fibrum_pdf-1.0.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (40.4 MB): CPython 3.10, manylinux (glibc 2.27+ and glibc 2.28+), x86-64
  • fibrum_pdf-1.0.1-cp310-cp310-macosx_15_0_arm64.whl (41.7 MB): CPython 3.10, macOS 15.0+, ARM64
  • fibrum_pdf-1.0.1-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (40.4 MB): CPython 3.9, manylinux (glibc 2.27+ and glibc 2.28+), x86-64
  • fibrum_pdf-1.0.1-cp39-cp39-macosx_15_0_arm64.whl (41.7 MB): CPython 3.9, macOS 15.0+, ARM64

File details

Details for the file fibrum_pdf-1.0.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fibrum_pdf-1.0.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 5116ed5436cdabc938c09e8241b24b1b6ff3401872491d5b948ca8ef124f9560
MD5 db3f8dc28d09a32acb9bf49895a8fd2f
BLAKE2b-256 943c0923966584c8778f537ce22b64871437cc3b7be93350061944a219074647

See more details on using hashes here.

Provenance

The following attestation bundles were made for fibrum_pdf-1.0.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on intercepted16/fibrumpdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
