A fast PDF extractor: a 500 pages/s alternative to Marker, Docling, PyMuPDF4LLM, and others.

FibrumPDF

Extract 500+ pages a second on CPU.

It gives you tables, text, and their formatting, plus lower-level information such as bounding boxes and font sizes.

It outputs JSON for programmatic use and can also render Markdown.

It is a Python package with a Go core and a small amount of C to interface with MuPDF.

See Performance Breakdown for the full numbers.


Installation

pip install fibrum-pdf

Wheels are available for Python 3.9 through 3.14 (every minor version in that range) on macOS (ARM/x64) and modern Linux distributions.

To build from source, see BUILD.md.


What it's good at

  • Speed.
  • Custom logic.
  • No GPU needed.
  • Iterating on parsing logic without waiting hours.

What it's bad at

  • Scanned PDFs and images. It neither extracts nor parses images.
  • Complex layouts (think forms and spreadsheet-style documents).

Usage

Basic

from fibrum_pdf import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

Collect all pages in memory

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)
    
# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

collect() still saves the JSON to result.path; it additionally loads every page into memory. If you don't want to write to disk at all, consider providing a special path (for example, one on an in-memory filesystem).

Use collect() only for smaller PDFs. Larger ones may crash because everything is loaded into RAM. Stream the pages instead, as described in the next section.

Stream pages (memory-efficient)

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
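
As one illustration of why streaming is useful, the sketch below tallies block types across a document while holding only one page at a time. count_block_types is a hypothetical helper, not part of fibrum_pdf. It assumes only what the example above shows: iterating the result yields pages, iterating a page yields blocks, and each block has a type attribute. The stand-in pages are built with SimpleNamespace so the sketch runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block.type over an iterable of pages, one page at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in pages for demonstration only; with the real library you would
# pass the object returned by to_json(...) instead (assumption: it iterates
# page by page, as in the streaming example above).
fake_pages = iter([
    [SimpleNamespace(type="heading"), SimpleNamespace(type="paragraph")],
    [SimpleNamespace(type="paragraph"), SimpleNamespace(type="table")],
])
print(count_block_types(fake_pages))
```

Because the helper consumes a plain iterable, it should work the same way whether the pages come from a stream or from a collected list.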

Markdown

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

Command-line

python -m fibrum_pdf.main input.pdf [output_dir]
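
For batch jobs, the same command can be driven from Python. The sketch below is a minimal example, not part of fibrum_pdf: build_command and convert_directory are hypothetical helpers, and the use of check=True assumes the CLI exits with a non-zero status on failure, which the text above does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_dir=None):
    """Return the argv list for one CLI invocation."""
    cmd = [sys.executable, "-m", "fibrum_pdf.main", str(pdf_path)]
    if output_dir is not None:
        cmd.append(str(output_dir))
    return cmd


def convert_directory(pdf_dir, output_dir):
    """Run the CLI once per PDF found directly inside pdf_dir."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True raises CalledProcessError on a non-zero exit status
        # (assumption: the CLI signals failure that way).
        subprocess.run(build_command(pdf_path, output_dir), check=True)


print(build_command("input.pdf", "out"))
```

Using sys.executable keeps the subprocess on the same interpreter, and therefore the same installed package, as the calling script.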

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, monospace, etc.)

Note that a span represents a logical group of styling. You'll find that most blocks only have one span.
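
To make the block fields concrete, here is a small sketch that filters a page by block type and derives a width and height from bbox. The sample page is hypothetical: its values are invented for illustration, spans are omitted for brevity, and real output may carry additional fields. blocks_of_type and bbox_size are illustrative helpers, not library functions.

```python
import json

# Hypothetical page shaped like the block fields listed above.
# All values are invented; spans are omitted for brevity.
SAMPLE_PAGE = json.loads("""
[
  {"type": "heading", "bbox": [72.0, 72.0, 300.0, 96.0],
   "font_size": 18.0, "length": 12},
  {"type": "paragraph", "bbox": [72.0, 110.0, 520.0, 150.0],
   "font_size": 11.0, "length": 140}
]
""")


def blocks_of_type(page, block_type):
    """Return the blocks on a page whose type matches block_type."""
    return [block for block in page if block["type"] == block_type]


def bbox_size(block):
    """Return (width, height) from a [x0, y0, x1, y1] bounding box."""
    x0, y0, x1, y1 = block["bbox"]
    return (x1 - x0, y1 - y0)


for block in blocks_of_type(SAMPLE_PAGE, "heading"):
    print(block["font_size"], bbox_size(block))
```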

Span fields

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false

See models.py.
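
The span flags are enough to write your own rendering logic. The sketch below turns spans into Markdown-style text using only the fields listed above. It is an illustration, not the library's own Markdown conversion: span_to_markdown is a hypothetical helper, and the sample spans are invented.

```python
def span_to_markdown(span):
    """Wrap a span's text in Markdown markers according to its style flags."""
    text = span["text"]
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    # uri is a string when the span is linked, otherwise false.
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


# Hypothetical spans using the fields listed above; values are invented.
spans = [
    {"text": "See ", "font_size": 11.0, "bold": False, "italic": False,
     "monospace": False, "link": False, "uri": False},
    {"text": "the docs", "font_size": 11.0, "bold": True, "italic": False,
     "monospace": False, "link": True, "uri": "https://example.com"},
]
print("".join(span_to_markdown(s) for s in spans))
```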


FAQ

Why not XXX? Some tools produce much higher-quality output, but they typically rely on some sort of ML or OCR, which makes them slow and GPU-dependent. Others are extremely fast but only give you raw text, which isn't helpful. Hopefully, this is fast and good enough.

Will this handle my XXX PDF? It won't handle scanned documents, images, or unusual layouts and elements (think PDF forms and spreadsheet-like documents).

Commercial use? This project uses MuPDF, which is under the AGPL-v3 license, or optionally a paid license from Artifex Software.

Motivations? I got bored waiting for my documents to get chunked again.


Performance Breakdown

Using go/cmd/tomd/main.go with input_pdf [output_dir], I measured performance on:

  • ~1600 page document (path not available)
  • ~150 page document (test_data/pdfs/nist.pdf)

Performance depends on document size and available cores. Larger documents keep more cores busy, so you may see better throughput with more pages. Throughput should scale approximately linearly with core count.

Test system: AMD Ryzen 7 4800H (8 cores, 6 used)

Runtime breakdown:

  • Go code: ~25% of runtime
  • MuPDF: ~75% of runtime

On the NIST document (150 pages): Go spent 78ms out of 363ms total (21%), MuPDF spent 285ms (79%).

Calculated average:

  • 1600 pages in 3000ms + 150 pages in 350ms = 1750 pages in 3350ms
  • ~520 pages/second
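
The arithmetic behind these figures can be reproduced in a few lines. This sketch uses only the numbers quoted above; the variable names are illustrative.

```python
# (pages, milliseconds) pairs taken from the measurements quoted above.
runs = [(1600, 3000), (150, 350)]

total_pages = sum(pages for pages, _ in runs)
total_ms = sum(ms for _, ms in runs)
pages_per_second = total_pages / (total_ms / 1000)

# Share of the NIST run spent in Go code versus MuPDF.
go_ms, mupdf_ms = 78, 285
go_share = go_ms / (go_ms + mupdf_ms)

print(f"{total_pages} pages in {total_ms} ms = {pages_per_second:.0f} pages/s")
print(f"Go share on NIST run: {go_share:.0%}")
```

Note that the combined figure is pooled (total pages over total time), not an average of the two per-document rates, so the larger document dominates it.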

Licensing

This project uses MuPDF, which is under the AGPL-v3 license, or optionally a paid license from Artifex Software.

See LICENSE for details.


Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

fibrum_pdf-1.1.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (41.1 MB)
Uploaded: CPython 3.14; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

fibrum_pdf-1.1.0-cp314-cp314-macosx_15_0_arm64.whl (42.1 MB)
Uploaded: CPython 3.14; macOS 15.0+ ARM64

fibrum_pdf-1.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (41.1 MB)
Uploaded: CPython 3.13; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

fibrum_pdf-1.1.0-cp313-cp313-macosx_15_0_arm64.whl (42.1 MB)
Uploaded: CPython 3.13; macOS 15.0+ ARM64

fibrum_pdf-1.1.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (41.1 MB)
Uploaded: CPython 3.12; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

fibrum_pdf-1.1.0-cp312-cp312-macosx_15_0_arm64.whl (42.1 MB)
Uploaded: CPython 3.12; macOS 15.0+ ARM64

fibrum_pdf-1.1.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (41.1 MB)
Uploaded: CPython 3.11; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

fibrum_pdf-1.1.0-cp311-cp311-macosx_15_0_arm64.whl (42.1 MB)
Uploaded: CPython 3.11; macOS 15.0+ ARM64

fibrum_pdf-1.1.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (41.1 MB)
Uploaded: CPython 3.10; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

fibrum_pdf-1.1.0-cp310-cp310-macosx_15_0_arm64.whl (42.1 MB)
Uploaded: CPython 3.10; macOS 15.0+ ARM64

fibrum_pdf-1.1.0-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (41.1 MB)
Uploaded: CPython 3.9; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

fibrum_pdf-1.1.0-cp39-cp39-macosx_15_0_arm64.whl (42.1 MB)
Uploaded: CPython 3.9; macOS 15.0+ ARM64
