A fast PDF extractor; a 200 pages/s alternative to Marker, Docling, PyMuPDF4LLM, and others.

FibrumPDF

Extract 200+ pages per second on CPU.

Gives you tables, text, and their formatting, plus lower-level information such as bounding boxes and font sizes.

It outputs JSON for programmatic use, but still allows for Markdown.

Written for Python, with Go at the core and a touch of C to interface with MuPDF.

  • Roughly 50× or more faster than pymupdf4llm and Docling
  • Moderate drop in table precision/recall
  • Comparable structural extraction quality (TEDS)

[Figures: speed and quality benchmark charts.] Full benchmark information is in the Benchmark information section below.


Installation

pip install fibrum-pdf

Wheels are available for Python 3.9–3.14 (all minor versions in that range) on macOS (ARM/x64), all modern Linux distributions, and Windows x64.

To build from source, see BUILD.md.


What it's good at

  • Speed.
  • Custom logic.
  • No GPU needed.
  • Iterating on parsing logic without waiting hours.

What it's bad at

  • Scanned PDFs and images. It neither extracts nor parses images.
  • Complex layouts (think forms and spreadsheet-style documents).
  • Tables: precision and recall are lower than with ML-based extractors.
  • Code blocks: it doesn't extract them, and it is (very) slightly behind on formatting.

Usage

Basic

from fibrum_pdf import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

Collect all pages in memory

from fibrum_pdf import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)
    
# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

This still saves the JSON to result.path; collect() only loads it into memory as well. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading everything into RAM may crash the process; stream the pages instead, as described next.

Stream pages (memory-efficient)

from fibrum_pdf import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")

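As a rough illustration of why streaming keeps memory flat, the sketch below tallies block types while touching one page at a time. The helper name count_block_types and the stand-in data are hypothetical; the only thing assumed from FibrumPDF is what the examples above show, namely that iterating a result yields pages, iterating a page yields blocks, and each block has a type attribute.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types across pages, holding only one page at a time."""
    counts = Counter()
    for page in pages:       # each page is an iterable of blocks
        for block in page:   # each block exposes a .type attribute
            counts[block.type] += 1
    return counts


# Stand-in data so the sketch runs without a PDF. With FibrumPDF you would
# pass the object returned by to_json(...) instead of this list.
fake_pages = [
    [SimpleNamespace(type="heading"), SimpleNamespace(type="paragraph")],
    [SimpleNamespace(type="paragraph"), SimpleNamespace(type="table")],
]

for block_type, count in count_block_types(fake_pages).items():
    print(f"{block_type}: {count}")
```

Because the function only keeps a small counter, its memory use does not grow with the number of pages, which is the point of iterating instead of calling collect().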
Markdown

from fibrum_pdf import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

Command-line

python -m fibrum_pdf.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, monospace, etc.)

Note that a span represents a logical group of styling. You'll find that most blocks only have one span.
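To make the block fields concrete, here is a minimal sketch that reads a page in the shape described above using only the standard json module. The sample values and the helper names bbox_size and blocks_of_type are made up for illustration; real output will contain more fields per span.

```python
import json

# Hypothetical page in the documented shape (all values are invented).
page_json = """
[
  {"type": "heading", "bbox": [72.0, 72.0, 300.0, 96.0], "font_size": 18.0,
   "length": 12, "spans": [{"text": "Introduction", "bold": true}]},
  {"type": "paragraph", "bbox": [72.0, 110.0, 523.0, 180.0], "font_size": 11.0,
   "length": 11, "spans": [{"text": "Hello world"}]}
]
"""


def bbox_size(bbox):
    """Return (width, height) of an [x0, y0, x1, y1] bounding box."""
    x0, y0, x1, y1 = bbox
    return (x1 - x0, y1 - y0)


def blocks_of_type(page, block_type):
    """Return the blocks on a page whose type matches block_type."""
    return [block for block in page if block["type"] == block_type]


page = json.loads(page_json)
headings = blocks_of_type(page, "heading")
for block in headings:
    width, height = bbox_size(block["bbox"])
    text = "".join(span["text"] for span in block["spans"])
    print(f"{text!r}: {width} x {height}, font size {block['font_size']}")
```

Filtering on type and bbox like this is the kind of custom logic the JSON output is meant to support, for example dropping headers and footers by position before chunking.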

Span fields

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false

See models.py for the full definitions.
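The span flags map naturally onto inline Markdown. The sketch below shows one way to render a block's spans by hand; it is not the library's own markdown conversion, and the function names span_to_markdown and block_to_markdown are hypothetical. It assumes spans are plain dictionaries with the fields listed above and treats a missing flag as false.

```python
def span_to_markdown(span):
    """Wrap a span's text in inline Markdown according to its style flags."""
    text = span["text"]
    if span.get("monospace"):
        text = f"`{text}`"
    if span.get("bold"):
        text = f"**{text}**"
    if span.get("italic"):
        text = f"*{text}*"
    if span.get("strikeout"):
        text = f"~~{text}~~"
    # uri is a string when linked and false otherwise, so check both fields.
    if span.get("link") and span.get("uri"):
        text = f"[{text}]({span['uri']})"
    return text


def block_to_markdown(block):
    """Concatenate the rendered spans of one block."""
    return "".join(span_to_markdown(span) for span in block["spans"])


# Hypothetical block with two spans; the URL is a placeholder.
block = {
    "type": "paragraph",
    "spans": [
        {"text": "See ", "link": False, "uri": False},
        {"text": "the docs", "bold": True, "link": True,
         "uri": "https://example.com"},
    ],
}
print(block_to_markdown(block))
```

Superscript and subscript are left out because plain Markdown has no standard syntax for them.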


FAQ

Why not XXX? There are tools that are much better in quality. These typically rely on some sort of ML or OCR, which makes them slow and GPU-dependent. There are also tools that are extremely fast but only give you raw text, which isn't helpful. Hopefully, this is fast and good enough.

Will this handle my XXX PDF? It won't handle scanned documents, images, or unusual layouts and elements (think forms in PDFs and spreadsheet-like documents).

Commercial use? This project uses MuPDF, which is under the AGPL-v3 license, or optionally a paid license from Artifex Software.

Motivations? I got bored waiting for my documents to get chunked again.


Benchmark information

The benchmarks use the datalab-to/marker-dataset from Hugging Face. Results are generated by the code in benchmark/.

Test system: AMD Ryzen 7 4800H (8 cores, 16 threads), GTX 1650 Ti (for Docling)

You can review all the code and run it yourself via the Typer CLI, or inspect the benchmark/results/ directory.

[Figure: benchmark results]


Optimization (how is it fast?)

Most of the performance benefits are thanks to others' hard work :)

  • MuPDF is written in pure C, has many performance micro-optimizations, and is extremely high quality. It does all the hard PDF work, so it is a major reason for the speed.

  • Compared to Docling and pymupdf4llm, Fibrum avoids ML/AI and instead uses heuristics. This trades some accuracy for significantly better performance.

  • Go and C are compiled, and because the logic is CPU-heavy, the gain over an interpreted language such as Python is large.

  • MuPDF cannot be safely multithreaded with shared state, so parallelism is achieved with fork. Each process has its own memory space, allowing near-linear scaling with core count.

  • The parallelism is aggressive. We intentionally oversubscribe goroutines (using 2-3× more) so that the CPU stays fully saturated: when one goroutine stalls, for example during a GC pause or while waiting on RAM, another is always available to take its place for a while.

  • Instead of relying on CGO/FFI, intermediate data is written as raw buffers and read by Go using zero-copy-style views. This avoids repeated boundary crossings and large memory copies. The workload is CPU-bound, so disk I/O (largely handled by the OS page cache) is not the bottleneck.
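The oversubscription idea from the list above can be sketched in a few lines. This is only an illustration of the work-splitting arithmetic, written in Python rather than Go, and it is not FibrumPDF's implementation; the function name plan_chunks and its parameters are hypothetical. It cuts a document into more page ranges than there are workers, so an idle worker always has another range to pick up.

```python
import os


def plan_chunks(page_count, workers=None, oversubscribe=3):
    """Split pages [0, page_count) into about workers * oversubscribe ranges.

    Returns a list of (start, end) tuples with end exclusive. Chunk sizes
    differ by at most one page, and no chunk is empty.
    """
    if page_count <= 0:
        return []
    workers = workers or os.cpu_count() or 1
    n_chunks = max(1, min(page_count, workers * oversubscribe))
    base, extra = divmod(page_count, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        chunks.append((start, start + size))
        start += size
    return chunks


# 10 pages, 2 workers, 2x oversubscription gives 4 page ranges.
for start, end in plan_chunks(10, workers=2, oversubscribe=2):
    print(f"pages {start}..{end - 1}")
```

Smaller, more numerous ranges also even out load when some pages (dense tables, for instance) take much longer than others, at the cost of a little scheduling overhead.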

Licensing

This project uses MuPDF, which is under the AGPL-v3 license, or optionally a paid license from Artifex Software.

See LICENSE for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • fibrum_pdf-1.1.2-cp314-cp314-win_amd64.whl (41.5 MB): CPython 3.14, Windows x86-64
  • fibrum_pdf-1.1.2-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.14, manylinux (glibc 2.27+ and glibc 2.28+) x86-64
  • fibrum_pdf-1.1.2-cp314-cp314-macosx_15_0_arm64.whl (80.6 MB): CPython 3.14, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.2-cp313-cp313-win_amd64.whl (41.1 MB): CPython 3.13, Windows x86-64
  • fibrum_pdf-1.1.2-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.13, manylinux (glibc 2.27+ and glibc 2.28+) x86-64
  • fibrum_pdf-1.1.2-cp313-cp313-macosx_15_0_arm64.whl (80.6 MB): CPython 3.13, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.2-cp312-cp312-win_amd64.whl (41.1 MB): CPython 3.12, Windows x86-64
  • fibrum_pdf-1.1.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.12, manylinux (glibc 2.27+ and glibc 2.28+) x86-64
  • fibrum_pdf-1.1.2-cp312-cp312-macosx_15_0_arm64.whl (80.6 MB): CPython 3.12, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.2-cp311-cp311-win_amd64.whl (41.1 MB): CPython 3.11, Windows x86-64
  • fibrum_pdf-1.1.2-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.11, manylinux (glibc 2.27+ and glibc 2.28+) x86-64
  • fibrum_pdf-1.1.2-cp311-cp311-macosx_15_0_arm64.whl (80.6 MB): CPython 3.11, macOS 15.0+ ARM64

File details

Details for the file fibrum_pdf-1.1.2-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: fibrum_pdf-1.1.2-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 41.5 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fibrum_pdf-1.1.2-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 d636e781937d4a191f42e10991d11181407c2bc8fb64b3e96ebfb56f52b0488c
MD5 bba80e254aedc7b4000fa8459eec1f56
BLAKE2b-256 b6690139bd6122184d60cc0b584cf6ce80016b5fe31ace6c062ae081e58b5ac1

See more details on using hashes here.

Provenance

The following attestation bundles were made for fibrum_pdf-1.1.2-cp314-cp314-win_amd64.whl:

Publisher: publish.yml on intercepted16/fibrumpdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
