
A fast PDF extractor; a ~500 pages/s alternative to Marker, Docling, PyMuPDF4LLM & others.

FibrumPDF

This project's C extension has now been rewritten in Go. Performance, output quality, and code quality have all improved, while the Python API remains the same. The project has also been renamed from PyMUPDF4LLM-C, because that name was way too close to PyMuPDF4LLM.

A fast PDF extractor for Python written in Go using MuPDF in the backend, inspired by pymupdf4llm. I took many of its heuristics and approaches. Initially, it was supposed to be a 1:1 port (just generating the same Markdown output), but I later pivoted.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

Speed (averaged): ~520 pages/second on CPU. 1 million pages in ~32 minutes.

Full performance breakdown: see the Performance Breakdown section.


Installation

pip install pymupdf4llm-c

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


What it's good at

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

What it's bad at

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figures or image extraction

Usage

basic

from pymupdf4llm_c import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)
    
# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

This still writes the JSON to result.path; collect() just loads it into memory as well. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading everything into RAM may crash the process. See below for a solution.

stream pages (memory-efficient)

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")

convert to markdown

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

.markdown is a property, not a function, so access it without parentheses.

command-line

python -m pymupdf4llm_c.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, mono-space, etc.)

Note that a span represents a run of text that shares the same styling. Most blocks are likely to contain only one span.
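As a rough illustration of working with this structure, here is a minimal sketch using only the standard library. It assumes a page has already been loaded as a JSON array of blocks with the fields listed above. The sample data and the helpers `block_text` and `summarize_page` are hypothetical and not part of the library.

```python
import json
from collections import Counter

# Hypothetical sample page: a JSON array of blocks with the documented fields.
SAMPLE_PAGE = json.loads("""
[
  {"type": "heading", "bbox": [72.0, 50.0, 500.0, 80.0], "font_size": 24.0,
   "length": 5, "spans": [{"text": "Intro", "bold": true}]},
  {"type": "text", "bbox": [72.0, 100.0, 500.0, 200.0], "font_size": 12.0,
   "length": 11, "spans": [{"text": "Hello ", "bold": false},
                           {"text": "world", "bold": true}]}
]
""")


def block_text(block):
    """Join the text of every span in a block."""
    return "".join(span.get("text", "") for span in block.get("spans", []))


def summarize_page(blocks):
    """Count blocks per type and sum the character counts from `length`."""
    counts = Counter(block["type"] for block in blocks)
    total_chars = sum(block.get("length", 0) for block in blocks)
    return dict(counts), total_chars


counts, total_chars = summarize_page(SAMPLE_PAGE)
print(counts, total_chars)
print([block_text(block) for block in SAMPLE_PAGE])
```

Because everything is plain dicts and lists, the same helpers work on data loaded from the output file with `json.load`.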

Block types

The examples below are pseudo-JSON (the comments are not valid JSON); they only demonstrate the shape of the output.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
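One thing the `level` field makes easy is building a document outline. The sketch below assumes heading blocks shaped like the example above, given as Python dicts; `build_outline` and the sample blocks are hypothetical names for illustration.

```python
def build_outline(blocks):
    """Return indented outline lines built from heading blocks."""
    lines = []
    for block in blocks:
        if block.get("type") != "heading":
            continue  # only headings contribute to the outline
        title = "".join(span.get("text", "") for span in block.get("spans", []))
        # Assumption: `level` starts at 1 for top-level headings.
        indent = "  " * (block.get("level", 1) - 1)
        lines.append(f"{indent}- {title.strip()}")
    return lines


# Hypothetical blocks in reading order.
sample_blocks = [
    {"type": "heading", "level": 1, "spans": [{"text": "Guide"}]},
    {"type": "text", "spans": [{"text": "Some body text."}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Setup"}]},
    {"type": "heading", "level": 2, "spans": [{"text": "Usage"}]},
]

outline = build_outline(sample_blocks)
print("\n".join(outline))
```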

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
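To show how the `items` array might be consumed, here is a small sketch that turns a list block into Markdown-style lines. It assumes `prefix` is either false or the literal marker string (as in the example above) and treats `indent` as a nesting depth, which is a guess about its unit. `render_list` is a hypothetical helper, not the library's own Markdown conversion.

```python
def render_list(block):
    """Render a list block's items as Markdown-style lines."""
    lines = []
    for item in block.get("items", []):
        text = "".join(span.get("text", "") for span in item.get("spans", []))
        # `prefix` is False for bulleted items, or a string such as "1.".
        marker = item.get("prefix") or "-"
        # Assumption: `indent` counts nesting levels (two spaces each here).
        indent = "  " * item.get("indent", 0)
        lines.append(f"{indent}{marker} {text}")
    return "\n".join(lines)


# Sample block mirroring the pseudo-JSON above.
sample_list = {
    "type": "list",
    "spans": [],
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted",
         "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered",
         "indent": 0, "prefix": "1."},
        {"spans": [{"text": "Nested item"}], "list_type": "bulleted",
         "indent": 1, "prefix": False},
    ],
}

rendered = render_list(sample_list)
print(rendered)
```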

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
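Since rows and cells keep their own spans, a table block can be flattened into a plain grid of strings. The sketch below assumes the `rows` / `cells` / `spans` nesting shown above and treats the first row as the header, which is an assumption rather than something the output guarantees. `table_to_grid` and `grid_to_markdown` are hypothetical helpers.

```python
def table_to_grid(block):
    """Flatten a table block into a list of rows of cell strings."""
    grid = []
    for row in block.get("rows", []):
        cells = []
        for cell in row.get("cells", []):
            text = "".join(span.get("text", "") for span in cell.get("spans", []))
            cells.append(text.strip())
        grid.append(cells)
    return grid


def grid_to_markdown(grid):
    """Render a grid as a Markdown table, treating the first row as the header."""
    if not grid:
        return ""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Sample block mirroring the pseudo-JSON above, plus one data row.
sample_table = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]},
                   {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]},
                   {"spans": [{"text": "2"}]}]},
    ],
}

grid = table_to_grid(sample_table)
table_md = grid_to_markdown(grid)
print(table_md)
```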

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
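As an example of using the style flags, the sketch below converts a single span into inline Markdown. It is only an illustration of the fields listed above and is not how the library's own `.markdown` property is implemented; `span_to_markdown` is a hypothetical helper and the URL is a placeholder.

```python
def span_to_markdown(span):
    """Wrap a span's text in inline Markdown according to its style flags."""
    text = span.get("text", "")
    if not text.strip():
        return text  # leave whitespace-only spans untouched
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    # `uri` is a string when the span is linked, otherwise False.
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


spans = [
    {"text": "plain", "bold": False, "italic": False, "link": False, "uri": False},
    {"text": "strong", "bold": True, "italic": False, "link": False, "uri": False},
    {"text": "docs", "bold": False, "italic": False, "link": True,
     "uri": "https://example.com"},
]

converted = [span_to_markdown(span) for span in spans]
print(converted)
```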

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
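a minimal sketch of that idea: split a page into chunks whenever a heading starts or the vertical gap between consecutive blocks exceeds a threshold. it assumes a single-column page, y coordinates that grow downward, and a hand-picked threshold of 24 points; `chunk_by_gaps` and the sample blocks are hypothetical, so tune the rules for your own documents.

```python
def chunk_by_gaps(blocks, gap_threshold=24.0):
    """Split blocks into chunks at headings and at large vertical gaps."""
    chunks, current, prev_bottom = [], [], None
    # Assumption: single column, so sorting by the top edge gives reading order.
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        top, bottom = block["bbox"][1], block["bbox"][3]
        gap = top - prev_bottom if prev_bottom is not None else 0.0
        if current and (block["type"] == "heading" or gap > gap_threshold):
            chunks.append(current)
            current = []
        current.append(block)
        prev_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


# Hypothetical page: bbox is [x0, y0, x1, y1].
sample_blocks = [
    {"type": "heading", "bbox": [72, 50, 500, 80]},
    {"type": "text", "bbox": [72, 90, 500, 150]},
    {"type": "text", "bbox": [72, 160, 500, 200]},
    {"type": "text", "bbox": [72, 300, 500, 350]},   # large gap above
    {"type": "heading", "bbox": [72, 400, 500, 420]},
]

chunks = chunk_by_gaps(sample_blocks)
print([len(chunk) for chunk in chunks])
```

font size changes and indentation (the x0 of each bbox) can be added as extra split conditions in the same loop.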

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all. But I just got bored of waiting for chunking the PDFs every time I made a minor change. I couldn't find anything with even 50% of the quality that would be faster. And anyway, my chunks were trash. So it was either: raw text, or ML, and I didn't want either of them.


Performance Breakdown

Using go/cmd/tomd/main.go with input_pdf [output_dir], I measured performance on:

  • ~1600 page document (path not available)
  • ~150 page document (test_data/pdfs/nist.pdf)

Performance depends on document size and available cores. With more pages to saturate your cores, you may see better throughput. Throughput should scale approximately linearly with core count, so wall-clock time should drop roughly in proportion.

Test system: AMD Ryzen 7 4800H (8 cores, 6 used)

Runtime breakdown:

  • Go code: ~25% of runtime
  • MuPDF: ~75% of runtime

On the NIST document (150 pages): Go spent 78ms out of 363ms total (21%), MuPDF spent 285ms (79%).

Calculated average:

  • 1600 pages in 3000ms + 150 pages in 350ms = 1750 pages in 3350ms
  • ~520 pages/second
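The arithmetic behind these figures can be reproduced directly. The snippet below is only a sanity check using the rounded timings quoted above; the variable names are illustrative.

```python
# (pages, seconds) for the two benchmark documents, as quoted above.
runs = [(1600, 3.000), (150, 0.350)]

total_pages = sum(pages for pages, _ in runs)
total_seconds = sum(seconds for _, seconds in runs)
pages_per_second = total_pages / total_seconds

# Projected time for one million pages at that rate.
minutes_per_million = 1_000_000 / pages_per_second / 60

# Share of runtime spent in Go code on the NIST document (78 ms of 363 ms).
go_share = 78 / 363

print(f"{pages_per_second:.0f} pages/s")
print(f"{minutes_per_million:.1f} minutes per million pages")
print(f"Go share: {go_share:.0%}")
```

This gives a little over 520 pages/second and roughly 32 minutes per million pages, matching the rounded numbers in this README.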

Licensing and Links

licensing

TL;DR: use it all you want in OSS software. if you buy a license for MuPDF from Artifex, you are exempt from the AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use requires license from Artifex.

modifications and enhancements specific to this library are © 2026 Adit Bajaj.

see LICENSE for the legal stuff.

links

feedback welcome.
