A fast PDF extractor: a 500 pages/s alternative to Marker, Docling, PyMuPDF4LLM and others.

FibrumPDF

Extract 500+ pages a second on CPU.

Gives you tables, text, and their formatting, plus lower level information like bounding boxes and font sizes.

It outputs JSON for programmatic use, but still allows for Markdown.

The API is Python, the core is written in Go, and a small amount of C interfaces with MuPDF.

For the full numbers, see the Performance Breakdown section.


Installation

pip install fibrum-pdf

Wheels are available for every Python minor version from 3.9 through 3.14, on macOS (ARM/x64) and all modern Linux distributions.

To build from source, see BUILD.md.


What it's good at

  • Speed.
  • Custom logic.
  • No GPU needed.
  • Iterating on parsing logic without waiting hours.

What it's bad at

  • Scanned PDFs and images. It neither extracts nor parses images.
  • Complex layouts (think forms and spreadsheet-style documents).

Usage

Basic

from fibrum_pdf import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

Collect all pages in memory

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)
    
# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

This still saves the JSON to result.path; collect() only loads it into memory as well. If you don't want to write to disk at all, consider providing a special path.

Use collect() only for smaller PDFs. With larger ones, loading everything into RAM may crash the process; stream the pages instead, as shown in the next section.

Stream pages (memory-efficient)

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
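As one illustration of streaming, the loop above can feed an aggregate such as a block-type histogram without holding every page in memory. The sketch below is not part of FibrumPDF: count_block_types is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real pages and blocks so the snippet runs without a PDF. It assumes only what the examples above show, namely that iterating a page yields blocks with a type attribute.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types across an iterable of pages, one page at a time."""
    counts = Counter()
    for page in pages:
        for block in page:
            counts[block.type] += 1
    return counts


# Stand-in data (hypothetical): each page is a list of objects with a .type attribute.
fake_pages = [
    [SimpleNamespace(type="heading"), SimpleNamespace(type="paragraph")],
    [SimpleNamespace(type="paragraph"), SimpleNamespace(type="table")],
]

# iter(...) mimics a stream that can only be consumed once.
print(count_block_types(iter(fake_pages)))
```

With the real library, passing the result returned by to_json in place of fake_pages should work the same way, since result is iterated page by page.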

Markdown

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown
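A common follow-up is writing one Markdown file per page. The sketch below is an assumption-laden example rather than library code: write_page_markdown and the page_0001.md naming scheme are hypothetical, and the SimpleNamespace objects stand in for real pages. It relies only on each page exposing a markdown string, as in the snippet above.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def write_page_markdown(pages, out_dir):
    """Write each page's markdown to page_0001.md, page_0002.md, ... in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, page in enumerate(pages, start=1):
        target = out_dir / f"page_{number:04d}.md"
        target.write_text(page.markdown, encoding="utf-8")
        written.append(target)
    return written


# Stand-in pages (hypothetical): only the .markdown attribute is used.
fake_pages = [
    SimpleNamespace(markdown="# Title\n"),
    SimpleNamespace(markdown="Body text.\n"),
]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_page_markdown(fake_pages, tmp)
    print([path.name for path in paths])
```

Because the helper only iterates its input, it should accept either the collected pages or the streaming result.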

Command-line

python -m fibrum_pdf.main input.pdf [output_dir]
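To run the command over a directory of PDFs, it can be wrapped in a short script. This is a minimal sketch: build_command and convert_all are hypothetical helpers, and check=True assumes the command exits with a non-zero status on failure, which the text above does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_dir=None):
    """Return the argv list for the documented module entry point."""
    command = [sys.executable, "-m", "fibrum_pdf.main", str(pdf_path)]
    if output_dir is not None:
        command.append(str(output_dir))
    return command


def convert_all(input_dir, output_dir):
    """Run the command once per PDF in input_dir, stopping at the first failure."""
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        # Assumption: a failed conversion is reported through the exit status.
        subprocess.run(build_command(pdf_path, output_dir), check=True)


# Show the arguments that would follow the interpreter path.
print(build_command("input.pdf", "out")[1:])
```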

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, monospace, etc.)

Note that a span represents a logical group of styling. You'll find that most blocks only have one span.
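The block fields above are enough for simple layout queries. The following sketch works on plain dictionaries shaped like the documented blocks; the sample values are invented for illustration, and bbox_size and blocks_of_type are hypothetical helpers, not part of the library.

```python
def bbox_size(bbox):
    """Return (width, height) of an [x0, y0, x1, y1] bounding box."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


def blocks_of_type(page, block_type):
    """Return the blocks on a page whose type field matches block_type."""
    return [block for block in page if block["type"] == block_type]


# Hypothetical page: a list of blocks using the documented field names.
sample_page = [
    {"type": "heading", "bbox": [72.0, 72.0, 300.0, 96.0],
     "font_size": 18.0, "length": 12, "spans": []},
    {"type": "paragraph", "bbox": [72.0, 110.0, 540.0, 180.0],
     "font_size": 11.0, "length": 240, "spans": []},
]

for block in blocks_of_type(sample_page, "heading"):
    width, height = bbox_size(block["bbox"])
    print(block["type"], block["font_size"], width, height)
```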

Span fields

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false

See models.py.
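As a sketch of how the span fields might be used, the example below rebuilds a block's text and collects its hyperlinks from plain dictionaries. The sample block and the helpers make_span, block_text and linked_uris are hypothetical; joining span texts without a separator is an assumption about how spans split a block, so check it against real output.

```python
def make_span(text, **overrides):
    """Build a span dict with the documented fields; style flags default to False."""
    span = {
        "text": text, "font_size": 11.0,
        "bold": False, "italic": False, "monospace": False,
        "strikeout": False, "superscript": False, "subscript": False,
        "link": False, "uri": False,
    }
    span.update(overrides)
    return span


def block_text(block):
    """Concatenate span texts (assumes spans carry their own whitespace)."""
    return "".join(span["text"] for span in block["spans"])


def linked_uris(block):
    """Return the URIs of all spans flagged as links."""
    return [span["uri"] for span in block["spans"] if span["link"]]


# Hypothetical block with three spans, the middle one bold and linked.
sample_block = {
    "type": "paragraph",
    "spans": [
        make_span("See the "),
        make_span("project page", bold=True, link=True, uri="https://example.com/"),
        make_span(" for details."),
    ],
}

print(block_text(sample_block))
print(linked_uris(sample_block))
```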


FAQ

Why not XXX? There are tools that produce much better quality. These typically rely on some sort of ML or OCR, which makes them slow and GPU-dependent. There are also tools that are extremely fast but only give you raw text, which isn't helpful. Hopefully, this is fast and good enough.

Will this handle my XXX PDF? It won't handle scanned documents, images, or unusual layouts and elements (think forms in PDFs and spreadsheet-like documents).

Commercial use? This project uses MuPDF, which is available under the AGPL v3 license or, alternatively, a paid license from Artifex Software.

Motivations? I got bored waiting for my documents to get chunked again.


Performance Breakdown

Using go/cmd/tomd/main.go with input_pdf [output_dir], I measured performance on:

  • ~1600 page document (path not available)
  • ~150 page document (test_data/pdfs/nist.pdf)

Performance depends on document size and available cores. With more pages to saturate your cores, you may see better throughput. Throughput should scale approximately linearly with core count.

Test system: AMD Ryzen 7 4800H (8 cores, 6 used)

Runtime breakdown:

  • Go code: ~25% of runtime
  • MuPDF: ~75% of runtime

On the NIST document (150 pages): Go spent 78ms out of 363ms total (21%), MuPDF spent 285ms (79%).
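The percentages for the NIST run follow directly from the two timings quoted above; a few lines of Python reproduce them.

```python
# Timings for the NIST document, taken from the text above (milliseconds).
go_ms = 78
mupdf_ms = 285
total_ms = go_ms + mupdf_ms

go_share = go_ms / total_ms
mupdf_share = mupdf_ms / total_ms

print(f"total: {total_ms} ms, Go: {go_share:.0%}, MuPDF: {mupdf_share:.0%}")
```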

Calculated average:

  • 1600 pages in 3000ms + 150 pages in 350ms = 1750 pages in 3350ms
  • ~520 pages/second
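The combined figure is total pages divided by total time, not the mean of the two per-document rates. The sketch below recomputes it from the rounded numbers in the list above and also prints each document's own rate.

```python
# (pages, milliseconds) for the two test documents, as listed above.
runs = [(1600, 3000), (150, 350)]

total_pages = sum(pages for pages, _ in runs)
total_ms = sum(ms for _, ms in runs)
pages_per_second = total_pages / (total_ms / 1000)

for pages, ms in runs:
    print(f"{pages} pages: {pages / (ms / 1000):.0f} pages/s")
print(f"combined: {pages_per_second:.0f} pages/s")
```

The smaller document comes out slower per page, which is consistent with the note that larger documents saturate the cores better.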

Licensing

This project uses MuPDF, which is available under the AGPL v3 license or, alternatively, a paid license from Artifex Software.

See LICENSE for details.
