A fast PDF extractor; a 500 pages/s alternative to Marker, Docling, PyMUPDF4LLM & others.

Project description

FibrumPDF

Extract 200+ pages per second on CPU.

Gives you tables, text, and their formatting, plus lower-level information such as bounding boxes and font sizes.

It outputs JSON for programmatic use and can also produce Markdown.

Built for Python, with Go at the core and a touch of C to interface with MuPDF.

  • Roughly 50× or more faster than pymupdf4llm and Docling
  • Moderate drop in table precision/recall
  • Comparable structural extraction quality (TEDS)

[Charts: Speed, Quality]

Full benchmark information is given in the Benchmark information section below.


Installation

pip install fibrum-pdf

There are wheels for Python 3.9–3.14 (inclusive of minor versions) on macOS (ARM/x64), all modern Linux distributions, and Windows x64.

To build from source, see BUILD.md.


What it's good at

  • Speed.
  • Custom logic.
  • No GPU needed.
  • Iterating on parsing logic without waiting hours.

What it's bad at

  • Scanned PDFs and images. It neither extracts nor parses images.
  • Complex layouts (think forms and spreadsheet-style documents)
  • Lower precision and recall for tables compared to ML-based extractors
  • Code blocks. It doesn't extract them, and it is (very) slightly behind on formatting.

Usage

Basic

from fibrum_pdf import to_json

result = to_json("example.pdf", output="example.json")
print(f"Extracted to: {result.path}")

You can omit the output argument; it defaults to <file>.json.

Collect all pages in memory

from fibrum_pdf import to_json

result = to_json("report.pdf", output="report.json")
pages = result.collect()

# Access pages as objects with markdown conversion
for page in pages:
    print(page.markdown)

# Access individual blocks
for block in pages[0]:
    print(f"Block type: {block.type}")
    print(f"Has {len(block.spans)} spans")

This still saves it to result.path; it just allows you to load it into memory. If you don't want to write to disk at all, consider providing a special path.

Use this only for smaller PDFs. For larger ones, loading every page into RAM may crash the process. Streaming, described next, avoids that.

Stream pages (memory-efficient)

from fibrum_pdf import to_json

result = to_json("large.pdf", output="large.json")

# Iterate one page at a time without loading everything
for page in result:
    for block in page:
        print(f"Block type: {block.type}")
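
Because each page can be dropped once it has been handled, aggregate statistics can be gathered without keeping the whole document in memory. The following is a minimal sketch, assuming only what the loop above shows: pages are iterable and each block exposes a type attribute. The count_block_types helper and the stand-in fake_pages generator are hypothetical and not part of fibrum_pdf.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Tally block types while looking at one page at a time."""
    counts = Counter()
    for page in pages:  # any iterable of pages works here
        for block in page:
            counts[block.type] += 1
    return counts


def fake_pages():
    """Stand-in data so the sketch runs without a PDF (made-up values)."""
    yield [SimpleNamespace(type="heading"), SimpleNamespace(type="text")]
    yield [SimpleNamespace(type="table"), SimpleNamespace(type="text")]


# In real use, the object returned by to_json(...) would be passed instead.
print(count_block_types(fake_pages()))
```

Only the running Counter survives between pages, so memory use depends on the number of distinct block types rather than on the page count.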

Markdown

from fibrum_pdf import to_json

result = to_json("document.pdf", output="document.json")
pages = result.collect()

# Full document as markdown
full_markdown = pages.markdown

# Single page as markdown
page_markdown = pages[0].markdown

# Single block as markdown
block_markdown = pages[0][0].markdown

Command-line

python -m fibrum_pdf.main input.pdf [output_dir]

Output structure

Each page is a JSON array of blocks. Every block has:

  • type: block type (text, heading, paragraph, list, table, code)
  • bbox: [x0, y0, x1, y1] bounding box coordinates
  • font_size: font size in points (average for multi-span blocks)
  • length: character count
  • spans: array of styled text spans with style flags (bold, italic, monospace, etc.)

Note that a span represents a logical group of styling. You'll find that most blocks only have one span.
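
To make the block layout concrete, here is a small sketch that works directly on the JSON with the standard library. The sample page is hand-written in the documented shape with made-up values, and the helpers block_text and bbox_size are hypothetical names, not part of fibrum_pdf.

```python
import json

# Hand-written sample in the documented shape; the values are made up.
PAGE_JSON = """
[
  {"type": "heading", "bbox": [72.0, 72.0, 300.0, 96.0], "font_size": 18.0,
   "length": 12,
   "spans": [{"text": "Introduction", "font_size": 18.0, "bold": true}]},
  {"type": "text", "bbox": [72.0, 110.0, 540.0, 170.0], "font_size": 11.0,
   "length": 11,
   "spans": [{"text": "Hello ", "font_size": 11.0, "bold": false},
             {"text": "world", "font_size": 11.0, "bold": true}]}
]
"""


def block_text(block):
    """Join the text of every span in a block."""
    return "".join(span["text"] for span in block["spans"])


def bbox_size(block):
    """Return (width, height) from an [x0, y0, x1, y1] bounding box."""
    x0, y0, x1, y1 = block["bbox"]
    return (x1 - x0, y1 - y0)


blocks = json.loads(PAGE_JSON)
headings = [block_text(b) for b in blocks if b["type"] == "heading"]
print(headings)
print(bbox_size(blocks[0]))
```

Filtering on type and measuring bbox like this is usually enough for simple custom logic, such as dropping headers and footers by position.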

Span fields

  • text: span content
  • font_size: size in points
  • bold, italic, monospace, strikeout, superscript, subscript: boolean style flags
  • link: boolean indicating if span contains a hyperlink
  • uri: URI string if linked, otherwise false

See models.py.
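
The span flags can drive custom rendering. The sketch below turns a list of span dictionaries into inline Markdown using only the fields listed above. It is an illustration, not the library's own Markdown conversion, and span_to_markdown and spans_to_markdown are hypothetical helper names.

```python
def span_to_markdown(span):
    """Render one span dict as inline Markdown based on its style flags."""
    raw = span["text"]
    core = raw.strip()
    if not core:
        return raw  # whitespace-only spans pass through unchanged
    # Keep surrounding whitespace outside the markers so that a span such as
    # "bold " becomes "**bold** " rather than "**bold **".
    lead = raw[: len(raw) - len(raw.lstrip())]
    trail = raw[len(raw.rstrip()):]
    if span.get("monospace"):
        core = f"`{core}`"
    if span.get("bold"):
        core = f"**{core}**"
    if span.get("italic"):
        core = f"*{core}*"
    if span.get("strikeout"):
        core = f"~~{core}~~"
    # uri is documented as a string when linked and false otherwise.
    if span.get("link") and span.get("uri"):
        core = f"[{core}]({span['uri']})"
    return lead + core + trail


def spans_to_markdown(spans):
    """Concatenate the rendered spans of one block."""
    return "".join(span_to_markdown(s) for s in spans)


sample = [
    {"text": "See ", "bold": False, "link": False, "uri": False},
    {"text": "the docs", "bold": True, "link": True, "uri": "https://example.com"},
    {"text": " now", "italic": True, "link": False, "uri": False},
]
print(spans_to_markdown(sample))
```

Superscript and subscript are left out because plain Markdown has no standard syntax for them.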


FAQ

Why not XXX? There are tools that are much better in quality. These typically rely on some sort of ML or OCR, which makes them slow and GPU-dependent. There are also tools that are extremely fast but only give you raw text, which isn't helpful. Hopefully, this is fast and good enough.

Will this handle my XXX PDF? It won't handle scanned documents, images, or unusual layouts and elements (think PDF forms and spreadsheet-like documents).

Commercial use? This project uses MuPDF, which is under the AGPL-v3 license, or optionally a paid license from Artifex Software.

Motivations? I got bored waiting for my documents to get chunked again.


Benchmark information

The benchmarks use the datalab-to/marker-dataset dataset from Hugging Face. Results are generated by the code in benchmark/.

Test system: AMD Ryzen 7 4800H (8 cores, 16 threads), GTX 1650 Ti (for Docling)

You can review all the code and run it yourself via the Typer CLI, or inspect the existing output in the benchmark/results/ directory.

Benchmark results


Optimization (how is it fast?)

Most of the performance benefits are thanks to others' hard work :)

  • MuPDF is written in pure C, is heavily micro-optimized, and is extremely high quality. It does all the hard PDF work, which makes it a major reason for the speed.

  • Compared to Docling and pymupdf4llm, Fibrum avoids ML/AI and instead uses heuristics. This trades some accuracy for significantly better performance.

  • Go and C are compiled languages. Since the logic is CPU-heavy, the gain over an interpreted language such as Python is large.

  • MuPDF cannot be safely multithreaded with shared state, so parallelism is achieved with fork. Each process has its own memory space, allowing near-linear scaling with core count.

  • The parallelism is aggressive. We intentionally oversubscribe goroutines, running 2-3x more than strictly needed, so that the CPU stays fully saturated: when one goroutine stalls on the GC or on memory, another is always available to take its place for a while.

  • Instead of relying on CGO/FFI, intermediate data is written as raw buffers and read by Go using zero-copy-style views. This avoids repeated boundary crossings and large memory copies. The workload is CPU-bound, so disk I/O (largely handled by the OS page cache) is not the bottleneck.
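
The raw-buffer idea from the last point can be illustrated in Python, even though Fibrum itself does this between C and Go. This is a sketch under stated assumptions: the record layout (four little-endian float32 values per bounding box) and the names RECORD, pack_boxes, and iter_boxes are hypothetical and do not describe Fibrum's actual intermediate format.

```python
import struct

# Hypothetical record layout, for illustration only:
# four little-endian float32 values (x0, y0, x1, y1) per bounding box.
RECORD = struct.Struct("<4f")


def pack_boxes(boxes):
    """Producer side: write fixed-size records into one flat buffer."""
    buf = bytearray(RECORD.size * len(boxes))
    for i, box in enumerate(boxes):
        RECORD.pack_into(buf, i * RECORD.size, *box)
    return buf


def iter_boxes(buf):
    """Consumer side: decode records through a view, without copying buf."""
    view = memoryview(buf)
    for offset in range(0, len(view), RECORD.size):
        yield RECORD.unpack_from(view, offset)


boxes = [(0.0, 0.0, 10.0, 20.0), (1.5, 2.5, 3.5, 4.5)]
buf = pack_boxes(boxes)
decoded = list(iter_boxes(buf))

# The view shares memory with the buffer: a later write is visible through it.
view = memoryview(buf)
RECORD.pack_into(buf, 0, 9.0, 9.0, 9.0, 9.0)
first_after_write = RECORD.unpack_from(view, 0)
print(decoded, first_after_write)
```

With fixed-size records the reader can jump to any record by offset, and the only per-record cost is decoding the handful of fields it actually needs.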

Licensing

This project uses MuPDF, which is under the AGPL-v3 license, or optionally a paid license from Artifex Software.

See LICENSE for details.


Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • fibrum_pdf-1.1.1-cp314-cp314-win_amd64.whl (41.5 MB): CPython 3.14, Windows x86-64
  • fibrum_pdf-1.1.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.14, manylinux (glibc 2.27+ and 2.28+) x86-64
  • fibrum_pdf-1.1.1-cp314-cp314-macosx_15_0_arm64.whl (80.6 MB): CPython 3.14, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.1-cp313-cp313-win_amd64.whl (41.1 MB): CPython 3.13, Windows x86-64
  • fibrum_pdf-1.1.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.13, manylinux (glibc 2.27+ and 2.28+) x86-64
  • fibrum_pdf-1.1.1-cp313-cp313-macosx_15_0_arm64.whl (80.6 MB): CPython 3.13, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.1-cp312-cp312-win_amd64.whl (41.1 MB): CPython 3.12, Windows x86-64
  • fibrum_pdf-1.1.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.12, manylinux (glibc 2.27+ and 2.28+) x86-64
  • fibrum_pdf-1.1.1-cp312-cp312-macosx_15_0_arm64.whl (80.6 MB): CPython 3.12, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.1-cp311-cp311-win_amd64.whl (41.1 MB): CPython 3.11, Windows x86-64
  • fibrum_pdf-1.1.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (79.7 MB): CPython 3.11, manylinux (glibc 2.27+ and 2.28+) x86-64
  • fibrum_pdf-1.1.1-cp311-cp311-macosx_15_0_arm64.whl (80.6 MB): CPython 3.11, macOS 15.0+ ARM64
  • fibrum_pdf-1.1.1-cp310-cp310-macosx_15_0_arm64.whl (80.6 MB): CPython 3.10, macOS 15.0+ ARM64
