High-performance PDF text parser for Swarmauri using PyMuPDF with aggregated whole-document output.

Swarmauri Parser Fitz PDF

swarmauri_parser_fitzpdf is the Swarmauri PDF parser for high-performance text extraction using PyMuPDF. It opens a PDF, extracts text from every page, and returns a single Swarmauri Document with the aggregated content and source metadata.

Why Use Swarmauri Parser Fitz PDF

  • Use PyMuPDF's fast document engine for PDF extraction inside Swarmauri ingestion and indexing pipelines.
  • Produce one normalized Document for whole-file workflows such as summarization, classification, or chunking after parse.
  • Keep PDF parsing logic aligned with the Swarmauri parser interface used by other loaders and processors.
  • Leave room to adopt PyMuPDF-specific extraction modes, or to add an OCR step upstream, without changing the parser interface your pipeline calls.

FAQ

What does this parser return?
A list containing one Swarmauri Document whose content holds the combined extracted text for the PDF.

Does it return one document per page?
No. This parser aggregates all page text into a single document.

Can it parse scanned PDFs with no text layer?
Not by itself. PyMuPDF extracts text objects already present in the document. Scan-only PDFs should be OCR'd first.
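One way to decide when a file should be routed to OCR is to check how much text came back from the parse. The sketch below is a rough heuristic and not part of this package: `needs_ocr` and its `min_chars` threshold are hypothetical names, and the function only looks at a string you pass in (for example `docs[0].content`).

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Return True when extracted text looks too sparse to be a real text layer.

    Hypothetical helper: it counts non-whitespace characters and compares the
    count with an arbitrary threshold that should be tuned on your own files.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


if __name__ == "__main__":
    # An empty string stands in for the output of a scan-only PDF.
    print(needs_ocr(""))
    print(needs_ocr("Quarterly revenue grew in every region this year."))
```

A character count is a blunt signal. A PDF with a few stamped words on otherwise scanned pages can still pass the check, so verify the threshold on representative documents.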

What input type does it expect?
A file path string pointing to a local PDF.
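If paths come from user input or a crawl, a small pre-flight check can reject obviously wrong values before they reach the parser. The following is a minimal sketch under stated assumptions: `check_pdf_path` is a hypothetical helper, and the rules it enforces (string type, `.pdf` suffix, file exists) are illustrative choices rather than checks documented by this package.

```python
from pathlib import Path


def check_pdf_path(path: str) -> str:
    """Validate a candidate input before handing it to a parser.

    Hypothetical pre-flight helper: raises ValueError for anything that is
    not a string path to an existing file with a .pdf suffix.
    """
    if not isinstance(path, str):
        raise ValueError(f"expected a file path string, got {type(path).__name__}")
    candidate = Path(path)
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"not a .pdf file: {path}")
    if not candidate.is_file():
        raise ValueError(f"file not found: {path}")
    # Return the original string so the call can be chained into parse().
    return path
```

Because the helper returns the path unchanged, it can wrap the argument directly, as in `parser.parse(check_pdf_path(user_supplied_path))`.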

Features

  • Aggregated PDF text extraction through PyMuPDF.
  • Preserves the original source path in document metadata.
  • Uses a lightweight Swarmauri parser surface for document pipelines.
  • Appropriate for whole-document ingestion, chunking, and retrieval setup.
  • Supports Python 3.10, 3.11, 3.12, 3.13, and 3.14.

Installation

# with uv
uv add swarmauri_parser_fitzpdf

# or with pip
pip install swarmauri_parser_fitzpdf

Usage

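from swarmauri_parser_fitzpdf import FitzPdfParser

parser = FitzPdfParser()

# parse() takes a local file path and returns a list with one Document
# that holds the text of the whole PDF.
documents = parser.parse("reports/quarterly.pdf")

for document in documents:
    print(document.metadata["source"])  # original file path
    print(document.content[:500])       # first 500 characters of the text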

Examples

Parse a PDF into a single document

from swarmauri_parser_fitzpdf import FitzPdfParser

parser = FitzPdfParser()
docs = parser.parse("whitepapers/roadmap.pdf")

if docs:
    print(len(docs[0].content))
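Once the aggregated text is available, chunking it for retrieval can be as simple as a sliding window over the string. The sketch below shows one possible approach and is not something this package provides: `chunk_text`, `size`, and `overlap` are hypothetical names, and character-based windows are an assumption (token-based or sentence-based splitting may suit your pipeline better).

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: consecutive chunks share `overlap` characters so a
    sentence cut at a boundary still appears whole in one of the two chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")

    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end, so no trailing chunk
        # consists only of already-covered overlap.
        if start + size >= len(text):
            break
        start += step
    return chunks
```

With the example above, a call such as `chunk_text(docs[0].content)` would yield the list of windows to embed or index.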

Handle invalid input safely

from swarmauri_parser_fitzpdf import FitzPdfParser

parser = FitzPdfParser()

try:
    docs = parser.parse("missing.pdf")
    # An empty list means parsing failed or no text was extracted.
    if not docs:
        print("Parsing failed or returned no text.")
except ValueError as exc:
    # Raised for invalid input.
    print(exc)
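The same two failure signals (an empty list and a `ValueError`) can be collected across a batch so that one bad file does not stop an ingestion run. This is a minimal sketch under stated assumptions: `parse_many` is a hypothetical helper, and it accepts any callable with the same shape as `parser.parse`, which keeps it independent of the parser class itself.

```python
from typing import Callable, Iterable


def parse_many(
    parse: Callable[[str], list],
    paths: Iterable[str],
) -> tuple[dict[str, list], dict[str, str]]:
    """Parse several files and separate successes from failures.

    Hypothetical helper: `parse` is any callable that takes a path and
    returns a list of documents, such as a parser's bound parse method.
    """
    parsed: dict[str, list] = {}
    failed: dict[str, str] = {}
    for path in paths:
        try:
            docs = parse(path)
        except ValueError as exc:
            # Invalid input: record the message and move on.
            failed[path] = str(exc)
            continue
        if docs:
            parsed[path] = docs
        else:
            # An empty result is treated as a failure, as in the example above.
            failed[path] = "no documents returned"
    return parsed, failed
```

A call might look like `parsed, failed = parse_many(parser.parse, ["a.pdf", "b.pdf"])`, after which `failed` maps each problem path to a short reason.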

Related Packages

Swarmauri Foundations

More Documentation

Best Practices

  • Use this parser when you want a whole-document text payload rather than page-by-page output.
  • Use OCR earlier in the flow for scan-only documents that have no extractable text layer.
  • Cache parse output for large PDFs if the same files are processed repeatedly.
  • If reading order matters, verify the extracted output on representative documents because PDF text order depends on document structure.
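The caching suggestion above can be sketched with a content hash as the cache key, so a file is only parsed again when its bytes change. Everything here is an assumption made for illustration: `cached_text`, the `extract` callable, and the `.pdf_cache` directory are hypothetical names, and the cache stores plain text only, not document metadata.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_text(
    pdf_path: str,
    extract: Callable[[str], str],
    cache_dir: str = ".pdf_cache",
) -> str:
    """Return extracted text, reusing a cached copy when the file is unchanged.

    Hypothetical helper: `extract` is any callable that maps a file path to
    its text. The cache key is the SHA-256 digest of the file's bytes.
    """
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.txt"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    text = extract(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

To plug in the parser, `extract` could be a small function that calls `parser.parse(path)` and returns the first document's `content`, guarding against an empty list. Hashing reads the whole file into memory, which is usually acceptable but worth revisiting for very large PDFs.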

License

This project is licensed under the Apache-2.0 License.
