High-performance PDF text parser for Swarmauri using PyMuPDF with aggregated whole-document output.
Swarmauri Parser Fitz PDF
swarmauri_parser_fitzpdf is the Swarmauri PDF parser for high-performance text extraction using PyMuPDF. It opens a PDF, extracts text from every page, and returns a single Swarmauri Document with the aggregated content and source metadata.
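The open, extract, and aggregate flow described above can be approximated directly with PyMuPDF. This is a minimal sketch, not the package's actual implementation: the helper names `read_page_texts` and `aggregate_pages` are hypothetical, and a plain dict stands in for a Swarmauri Document.

```python
from typing import Iterable


def read_page_texts(path: str) -> list[str]:
    """Return the extracted text of each page, in page order, using PyMuPDF."""
    # Imported lazily so the pure helper below works without PyMuPDF installed.
    try:
        import pymupdf  # current import name
    except ImportError:
        import fitz as pymupdf  # legacy import name

    with pymupdf.open(path) as pdf:
        return [page.get_text() for page in pdf]


def aggregate_pages(page_texts: Iterable[str], source: str) -> dict:
    """Combine per-page text into one document-like record.

    Assumption: a plain dict stands in for a Swarmauri Document.
    """
    return {"content": "".join(page_texts), "metadata": {"source": source}}
```

Calling `aggregate_pages(read_page_texts(path), source=path)` yields one record per file, which mirrors the single-document shape this parser returns.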
Why Use Swarmauri Parser Fitz PDF
- Use PyMuPDF's fast document engine for PDF extraction inside Swarmauri ingestion and indexing pipelines.
- Produce one normalized Document for whole-file workflows such as summarization, classification, or chunking after parse.
- Keep PDF parsing logic aligned with the Swarmauri parser interface used by other loaders and processors.
- Stay flexible if you later need PyMuPDF-specific extraction modes or OCR augmentation upstream.
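Chunking after parse, mentioned in the list above, could look like the following. This is a sketch under stated assumptions: `chunk_text` is a hypothetical helper that is not part of this package, and the character-based window with its default sizes is only one of many possible strategies.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of up to `size` characters sharing `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(text):
            break
        start += step
    return chunks
```

With a parsed result in `docs`, the aggregated text would be passed in as `chunk_text(docs[0].content)`.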
FAQ
What does this parser return?
A list containing one Swarmauri Document whose content holds the combined extracted text for the PDF.
Does it return one document per page?
No. This parser aggregates all page text into a single document.
Can it parse scanned PDFs with no text layer?
Not by itself. PyMuPDF extracts text objects already present in the document. Scan-only PDFs should be OCR'd first.
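One way to route scan-only files to OCR is to check whether the extracted text is effectively empty. The heuristic below is an illustrative sketch: `looks_scan_only` and its `min_chars` threshold are assumptions, not part of this package, and the threshold should be tuned on your own documents.

```python
def looks_scan_only(text: str, min_chars: int = 20) -> bool:
    """Return True when the text has fewer than `min_chars` non-whitespace characters."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

A pipeline could call `looks_scan_only(docs[0].content)` after parsing and send matching files through an OCR step before retrying.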
What input type does it expect?
A file path string pointing to a local PDF.
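Because the parser expects a path string, it can help to validate the input before parsing. The sketch below uses only the standard library. `check_pdf_path` is a hypothetical helper, and the `%PDF-` header test is a heuristic that does not guarantee a well-formed PDF.

```python
from pathlib import Path


def check_pdf_path(path: str) -> str:
    """Return `path` if it is a string naming an existing file that starts like a PDF."""
    if not isinstance(path, str):
        raise TypeError("expected a file path string")

    candidate = Path(path)
    if not candidate.is_file():
        raise FileNotFoundError(path)

    # PDF files carry a "%PDF-" marker near the start of the file.
    with candidate.open("rb") as handle:
        if b"%PDF-" not in handle.read(1024):
            raise ValueError(f"{path} does not look like a PDF")
    return path
```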
Features
- Aggregated PDF text extraction through PyMuPDF.
- Preserves the original source path in document metadata.
- Uses a lightweight Swarmauri parser surface for document pipelines.
- Appropriate for whole-document ingestion, chunking, and retrieval setup.
- Supports Python 3.10, 3.11, 3.12, 3.13, and 3.14.
Installation
uv add swarmauri_parser_fitzpdf
pip install swarmauri_parser_fitzpdf
Usage
from swarmauri_parser_fitzpdf import FitzPdfParser
parser = FitzPdfParser()
documents = parser.parse("reports/quarterly.pdf")
for document in documents:
    print(document.metadata["source"])
    print(document.content[:500])
Examples
Parse a PDF into a single document
from swarmauri_parser_fitzpdf import FitzPdfParser
parser = FitzPdfParser()
docs = parser.parse("whitepapers/roadmap.pdf")
if docs:
    print(len(docs[0].content))
Handle invalid input safely
from swarmauri_parser_fitzpdf import FitzPdfParser
parser = FitzPdfParser()
try:
    docs = parser.parse("missing.pdf")
    if not docs:
        print("Parsing failed or returned no text.")
except ValueError as exc:
    print(exc)
Related Packages
Swarmauri Foundations
More Documentation
Best Practices
- Use this parser when you want a whole-document text payload rather than page-by-page output.
- Use OCR earlier in the flow for scan-only documents that have no extractable text layer.
- Cache parse output for large PDFs if the same files are processed repeatedly.
- If reading order matters, verify the extracted output on representative documents because PDF text order depends on document structure.
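The caching advice above can be sketched with a content-addressed file cache. Everything here is an assumption made for illustration: `file_digest`, `cached_text`, the `.pdf_cache` directory, and the JSON layout are hypothetical and not provided by this package. The `extract` callable is whatever function turns a path into text.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_text(path: str, extract: Callable[[str], str], cache_dir: str = ".pdf_cache") -> str:
    """Return extracted text, reusing a cached copy when the file bytes are unchanged."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_digest(path)}.json"

    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["content"]

    text = extract(path)
    # Skip caching empty output so a failed parse is retried next time.
    if text:
        entry.write_text(json.dumps({"source": path, "content": text}), encoding="utf-8")
    return text
```

With this parser, `extract` could be `lambda p: "".join(d.content for d in FitzPdfParser().parse(p))`. Keying on the file digest rather than the path means renamed copies share one cache entry and edited files are re-parsed.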
License
This project is licensed under the Apache-2.0 License.
File details
Details for the file swarmauri_parser_fitzpdf-0.11.0.dev1.tar.gz.
File metadata
- Download URL: swarmauri_parser_fitzpdf-0.11.0.dev1.tar.gz
- Upload date:
- Size: 8.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70acf139f20bc5312846d241267667836025c7d1fdb25a9f1e0e78b21c0eb608 |
| MD5 | 8a8bd5ae79ec0e53cb1a6e6fe5c72663 |
| BLAKE2b-256 | 86467d5237bfad7ebcb1eb0a0fa266b0eac50da29852609f355cae2776cd8d53 |
File details
Details for the file swarmauri_parser_fitzpdf-0.11.0.dev1-py3-none-any.whl.
File metadata
- Download URL: swarmauri_parser_fitzpdf-0.11.0.dev1-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c79423454b8bfc5ee0ea9c2c7397505f67b72c8b4be2519279c4f190614d6933 |
| MD5 | d4d3a918c2b282b435c1fc65e5e6c124 |
| BLAKE2b-256 | dd0588786d654b76facd241edfa146de90a2f271ded1c0ecbe0d4aba78059c2c |