A fast PDF extractor; a 500 pages/s alternative to Marker, Docling, PyMUPDF4LLM & others.

FibrumPDF

This project's C extension has now been rewritten in Go. Performance, output quality, and code quality have all improved, while the Python API remains the same. The project has also been renamed from PyMUPDF4LLM-C, because that was way too close to PyMuPDF4LLM.

A fast PDF extractor for Python written in Go using MuPDF in the backend, inspired by pymupdf4llm. I took many of its heuristics and approaches. Initially, it was supposed to be a 1:1 port (just generating the same Markdown output), but I later pivoted.

Most extractors give you raw text (fast but useless) or full-on OCR/ML. This is a middle ground.

Outputs JSON for every block: text, type, bounding box, font metrics, tables. You get the raw data to process however you need.

Speed (averaged): ~520 pages/second on CPU. 1 million pages in ~32 minutes.

The full performance breakdown is in the Performance Breakdown section below.


Installation

pip install fibrum-pdf

You can prefix this with whatever tools you use, like uv, poetry, etc.

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


What it's good at

  • millions of pages, fast
  • custom parsing logic; you own the rules
  • document archives, chunking strategies, any structured extraction
  • CPU only; no expensive inference
  • iterating on parsing logic without waiting hours

What it's bad at

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on edge cases; trades precision for speed
  • figure or image extraction

Usage

basic

from fibrum_pdf import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

collect all pages in memory

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)
    
# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

This still saves it to result.path; it just allows you to load it into memory. If you don't want to write to disk at all, consider providing a special path.

This is only for smaller PDFs. For larger ones, this may result in crashes due to loading everything into RAM. See below for a solution.

stream pages (memory-efficient)

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")

convert to markdown

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

Note that .markdown is a property, not a function, so access it without parentheses.

command-line

python -m fibrum_pdf.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, mono-space, etc.)

Note that a span represents a run of text that shares the same styling. Most blocks are likely to contain only one span.
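
As a rough sketch of working with this structure as plain data, the snippet below tallies block types across pages. It assumes the parsed output is a list of pages, each a list of block dicts; the top-level layout of the file is an assumption here, and `count_block_types` is a hypothetical helper, not part of fibrum_pdf.

```python
from collections import Counter


def count_block_types(pages):
    """Tally block types across pages (each page is a list of block dicts)."""
    counts = Counter()
    for page in pages:
        for block in page:
            # Fall back to "unknown" if a block has no "type" key.
            counts[block.get("type", "unknown")] += 1
    return counts


# Hand-written sample data in the documented shape; not real extractor output.
sample_pages = [
    [{"type": "heading", "length": 12}, {"type": "text", "length": 300}],
    [{"type": "text", "length": 120}, {"type": "table", "length": 40}],
]

block_counts = count_block_types(sample_pages)
print(dict(block_counts))
```

A tally like this is a quick sanity check before writing custom parsing rules: if a document yields almost no heading blocks, heading-based chunking will not help much.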

Block types

The examples below are pseudo-JSON (they contain comments) and only illustrate the shape of the output.

text/paragraph/code blocks:

{
  "type": "text",
  "bbox": [72.03, 132.66, 542.7, 352.22],
  "font_size": 12.0,
  "length": 1145,
  "lines": 14,
  "spans": [
    {
      "text": "Block content here...",
      "font_size": 12.0,
      "bold": false,
      "italic": false,
      "monospace": false,
      "strikeout": false,
      "superscript": false,
      "subscript": false,
      "link": false,
      "uri": false
    }
  ]
}

headings:

{
  "type": "heading",
  "bbox": [111.80, 187.53, 509.10, 217.56],
  "font_size": 32.0,
  "length": 25,
  "level": 1,
  "spans": [
    {
      "text": "Heading Text",
      // all styling flags (as seen in the above)
    }
  ]
}
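
One possible use of the level field is building a document outline. The sketch below assumes pages are already loaded as lists of block dicts in the shape shown above; `heading_outline` is a hypothetical helper and the sample data is hand-written.

```python
def heading_outline(pages):
    """Return (level, title, page_number) tuples for every heading block."""
    outline = []
    for page_number, page in enumerate(pages, start=1):
        for block in page:
            if block.get("type") != "heading":
                continue
            # A heading may consist of several spans; join their text.
            title = "".join(span.get("text", "") for span in block.get("spans", []))
            outline.append((block.get("level", 1), title.strip(), page_number))
    return outline


# Hand-written sample data; not real extractor output.
sample_pages = [
    [
        {"type": "heading", "level": 1, "spans": [{"text": "Introduction"}]},
        {"type": "text", "spans": [{"text": "Some body text."}]},
    ],
    [
        {"type": "heading", "level": 2, "spans": [{"text": "Back"}, {"text": "ground"}]},
    ],
]

outline = heading_outline(sample_pages)
for level, title, page_number in outline:
    print("  " * (level - 1) + f"{title} (page {page_number})")
```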

lists:

{
  "type": "list",
  "bbox": [40.44, 199.44, 107.01, 345.78],
  "font_size": 11.04,
  "length": 89,
  "spans": [],
  "items": [
    {
      "spans": [
        {
          "text": "First item",
          // all styling flags.
        }
      ],
      "list_type": "bulleted",
      "indent": 0,
      "prefix": false
    },
    {
      "spans": [
        {
          "text": "Second item",
          // all styling flags.
        }
      ],
      "list_type": "numbered",
      "indent": 0,
      "prefix": "1."
    }
  ]
}
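
To illustrate how list_type, indent, and prefix might be combined, here is a minimal sketch that renders a list block as Markdown. It assumes that numbered items carry their marker as a string in prefix and that bulleted items have prefix set to false, as in the example above; `list_block_to_markdown` is a hypothetical helper and may not match what the library's own .markdown property produces.

```python
def list_block_to_markdown(block):
    """Render a list block as Markdown lines, one per item."""
    lines = []
    for item in block.get("items", []):
        text = "".join(span.get("text", "") for span in item.get("spans", []))
        prefix = item.get("prefix")
        # Assumption: a string prefix (e.g. "1.") is the item's own marker;
        # anything else (false/missing) is rendered as a bullet.
        marker = prefix if isinstance(prefix, str) and prefix else "-"
        indent = "  " * int(item.get("indent", 0))
        lines.append(f"{indent}{marker} {text.strip()}")
    return "\n".join(lines)


# Hand-written sample mirroring the example above.
sample_list = {
    "type": "list",
    "items": [
        {"spans": [{"text": "First item"}], "list_type": "bulleted", "indent": 0, "prefix": False},
        {"spans": [{"text": "Second item"}], "list_type": "numbered", "indent": 0, "prefix": "1."},
        {"spans": [{"text": "Nested item"}], "list_type": "bulleted", "indent": 1, "prefix": False},
    ],
}

list_markdown = list_block_to_markdown(sample_list)
print(list_markdown)
```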

tables:

{
  "type": "table",
  "bbox": [72.0, 220.0, 523.5, 400.0],
  "font_size": 12.0,
  "length": 256,
  "row_count": 3,
  "col_count": 2,
  "cell_count": 2,
  "spans": [],
  "rows": [
    {
      "bbox": [72.0, 220.0, 523.5, 250.0],
      "cells": [
        {
          "bbox": [72.0, 220.0, 297.75, 250.0],
          "spans": [
            {
              "text": "Header A",
              // all styling flags.
            }
          ]
        },
        {
          "bbox": [297.75, 220.0, 523.5, 250.0],
          "spans": [
            {
              "text": "Header B",
              // all styling flags.
            }
          ]
        }
      ]
    }
  ]
}
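
A table block can be flattened into rows of plain strings, which are easy to hand to csv or a dataframe library. The sketch below assumes the rows/cells/spans nesting shown above and treats the first row as the header, which is an assumption and not something the output guarantees; both helpers are hypothetical.

```python
def table_to_rows(block):
    """Flatten a table block into a list of rows, each a list of cell strings."""
    rows = []
    for row in block.get("rows", []):
        cells = []
        for cell in row.get("cells", []):
            text = "".join(span.get("text", "") for span in cell.get("spans", []))
            cells.append(text.strip())
        rows.append(cells)
    return rows


def rows_to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


# Hand-written sample in the documented shape; not real extractor output.
sample_table = {
    "type": "table",
    "rows": [
        {"cells": [{"spans": [{"text": "Header A"}]}, {"spans": [{"text": "Header B"}]}]},
        {"cells": [{"spans": [{"text": "1"}]}, {"spans": [{"text": "2"}]}]},
    ],
}

table_rows = table_to_rows(sample_table)
table_markdown = rows_to_markdown(table_rows)
print(table_markdown)
```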

Span fields

all text spans contain:

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false
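
As a sketch of how these flags can drive your own rendering, the helper below wraps a span's text in Markdown markers. It is an illustration only: `span_to_markdown` is a hypothetical name, the order in which markers are applied is an arbitrary choice, and the result may differ from the library's own .markdown output.

```python
def span_to_markdown(span):
    """Wrap a span's text in Markdown markers based on its style flags."""
    text = span.get("text", "")
    if not text.strip():
        # Do not wrap whitespace-only spans in markers.
        return text
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    uri = span.get("uri")
    # "uri" is false when the span is not linked, so check for a string.
    if span.get("link") and isinstance(uri, str):
        text = f"[{text}]({uri})"
    return text


bold_text = span_to_markdown({"text": "hi", "bold": True, "italic": False})
linked_text = span_to_markdown(
    {"text": "docs", "link": True, "uri": "https://example.com"}
)
plain_text = span_to_markdown({"text": "plain", "link": False, "uri": False})
print(bold_text, linked_text, plain_text)
```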

FAQ

why not marker/docling?
if you have time and need maximum accuracy, use those. this is for when you're processing millions of pages or iterating on extraction logic quickly.

how do i use bounding boxes for semantic chunking?
large y-gaps indicate topic breaks. font size changes show sections. indentation shows hierarchy. you write the logic using the metadata.
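
a minimal sketch of that idea, assuming blocks arrive in reading order and that y grows downward (so a block's top is bbox[1] and its bottom is bbox[3]). `chunk_blocks` and the 24-point threshold are hypothetical; tune the threshold per document, and note that multi-column layouts need extra handling.

```python
def chunk_blocks(blocks, gap_threshold=24.0):
    """Group one page's blocks into chunks, splitting on headings and large y-gaps."""
    chunks = []
    current = []
    previous_bottom = None
    for block in blocks:
        _, top, _, bottom = block["bbox"]
        gap = 0.0 if previous_bottom is None else top - previous_bottom
        # Start a new chunk at a heading or after a large vertical gap.
        if current and (block.get("type") == "heading" or gap > gap_threshold):
            chunks.append(current)
            current = []
        current.append(block)
        previous_bottom = bottom
    if current:
        chunks.append(current)
    return chunks


# Hand-written sample page; not real extractor output.
sample_page = [
    {"type": "heading", "bbox": [72, 100, 500, 120]},
    {"type": "text", "bbox": [72, 125, 500, 200]},
    {"type": "text", "bbox": [72, 260, 500, 300]},  # 60 pt below the previous block
    {"type": "heading", "bbox": [72, 310, 500, 330]},
    {"type": "text", "bbox": [72, 335, 500, 400]},
]

chunk_sizes = [len(chunk) for chunk in chunk_blocks(sample_page)]
print(chunk_sizes)
```

the same loop can be extended with font_size changes or x0 indentation as extra split signals.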

will this handle my complex PDF?
optimized for well-formed digital PDFs. scanned documents, complex table structures, and image-heavy layouts won't extract as well as ML tools.

commercial use?
only under AGPL-v3 or with a license from Artifex (MuPDF's creators). see LICENSE

why did you build this?
Dumb reason. I was building a RAG project with my dad (I'm 15). He did not care about speed at all. But I just got bored of waiting for chunking the PDFs every time I made a minor change. I couldn't find anything with even 50% of the quality that would be faster. And anyway, my chunks were trash. So it was either: raw text, or ML, and I didn't want either of them.


Performance Breakdown

Using go/cmd/tomd/main.go with input_pdf [output_dir], I measured performance on:

  • ~1600 page document (path not available)
  • ~150 page document (test_data/pdfs/nist.pdf)

Performance depends on document size and available cores. With more pages to saturate your cores, you may see better throughput. Throughput should scale approximately linearly with core count.

Test system: AMD Ryzen 7 4800H (8 cores, 6 used)

Runtime breakdown:

  • Go code: ~25% of runtime
  • MuPDF: ~75% of runtime

On the NIST document (150 pages): Go spent 78ms out of 363ms total (21%), MuPDF spent 285ms (79%).

Calculated average:

  • 1600 pages in 3000ms + 150 pages in 350ms = 1750 pages in 3350ms
  • ~520 pages/second
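
The arithmetic behind that average, and behind the "1 million pages in ~32 minutes" estimate, can be reproduced in a few lines. The timings are the rounded figures quoted above; the variable names are just for illustration.

```python
# (pages, seconds) for the two benchmark documents, using the rounded timings above.
measurements = [(1600, 3.000), (150, 0.350)]

total_pages = sum(pages for pages, _ in measurements)
total_seconds = sum(seconds for _, seconds in measurements)

pages_per_second = total_pages / total_seconds
minutes_per_million = 1_000_000 / pages_per_second / 60

print(f"{pages_per_second:.0f} pages/second")
print(f"{minutes_per_million:.1f} minutes per million pages")
```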

Licensing and Links

licensing

TL;DR: use it all you want in OSS software. if you buy a license for MuPDF from Artifex, you are exempt from the AGPL requirements.

  • derived work of mupdf.
  • inspired by pymupdf4llm; i have used it as a reference

AGPL v3. commercial use outside the AGPL's terms requires a license from Artifex.

modifications and enhancements specific to this library are © 2026 Adit Bajaj.

see LICENSE for the legal stuff.
