Page-by-page PDF text parser for Swarmauri using PyPDF2 with path and bytes input support.
Swarmauri Parser PyPDF2
swarmauri_parser_pypdf2 is the Swarmauri PDF text parser for page-by-page
extraction using PyPDF2. It converts a PDF
file path or PDF bytes into a list of Swarmauri Document objects, preserving
page numbers and source metadata for downstream chunking, indexing, and agent
workflows.
Why Use Swarmauri Parser PyPDF2
- Extract embedded PDF text without introducing OCR when the document already contains readable text objects.
- Preserve page boundaries by returning one Document per page.
- Accept either on-disk PDFs or PDF bytes already loaded in memory.
- Keep document ingestion aligned with the Swarmauri parser interface used across other loaders and converters.
FAQ
What does this parser return?
A list of Swarmauri Document objects, usually one for each page that contains extractable text.
Can it parse PDF bytes instead of a file path?
Yes. parse() accepts either a string path or bytes.
Does it perform OCR on scanned PDFs?
No. PyPDF2 extracts embedded text. Scanned PDFs without text objects should be handled with OCR first.
What metadata is attached to each page?
Each returned document includes page_number and source.
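Because every page carries page_number and source, parsed output can be regrouped by file for quick lookup. The following is a minimal sketch, not part of the package: group_by_source is a hypothetical helper, and the SimpleNamespace objects only stand in for parsed documents so the example runs without a PDF. It assumes metadata is a dict holding those two keys, as the usage examples suggest.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def group_by_source(documents: Iterable[Any]) -> Dict[str, Dict[int, str]]:
    """Map each source to a {page_number: content} dict (hypothetical helper)."""
    grouped: Dict[str, Dict[int, str]] = defaultdict(dict)
    for doc in documents:
        meta = doc.metadata
        grouped[meta["source"]][meta["page_number"]] = doc.content
    return dict(grouped)


# Stand-in objects (not real parser output) with illustrative values.
docs = [
    SimpleNamespace(content="first", metadata={"source": "a.pdf", "page_number": 1}),
    SimpleNamespace(content="second", metadata={"source": "a.pdf", "page_number": 2}),
    SimpleNamespace(content="other", metadata={"source": "b.pdf", "page_number": 1}),
]
index = group_by_source(docs)
print(index["a.pdf"][2])  # second
```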
Features
- Page-by-page PDF text extraction through PyPDF2.
- Supports parsing from local file paths and raw PDF bytes.
- Returns Swarmauri Document objects with content, page_number, and source.
- Clean fit for ingestion pipelines, retrieval systems, and document analysis workflows.
- Supports Python 3.10, 3.11, 3.12, 3.13, and 3.14.
Installation
uv add swarmauri_parser_pypdf2
pip install swarmauri_parser_pypdf2
Usage
from swarmauri_parser_pypdf2 import PyPDF2Parser

parser = PyPDF2Parser()
documents = parser.parse("manuals/device.pdf")

for document in documents:
    print(document.metadata["page_number"])
    print(document.content[:160])
Examples
Parse a PDF from bytes
from pathlib import Path

from swarmauri_parser_pypdf2 import PyPDF2Parser

pdf_bytes = Path("reports/statement.pdf").read_bytes()
parser = PyPDF2Parser()
pages = parser.parse(pdf_bytes)

for page in pages:
    print(page.metadata)
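Since parse() is documented as taking a string path or bytes, callers that hold a pathlib.Path, bytearray, or memoryview may want to coerce the value first. The helper below is a hedged sketch: to_parser_input is a hypothetical name, not something the package provides, and whether the parser already accepts these other types is not stated here.

```python
from pathlib import Path
from typing import Union

PdfSource = Union[str, Path, bytes, bytearray, memoryview]


def to_parser_input(source: PdfSource) -> Union[str, bytes]:
    """Coerce common input types to a str path or bytes (hypothetical helper)."""
    if isinstance(source, bytes):
        return source
    if isinstance(source, (bytearray, memoryview)):
        return bytes(source)  # copy into an immutable bytes object
    if isinstance(source, Path):
        return str(source)  # string path form
    if isinstance(source, str):
        return source
    raise TypeError(f"Unsupported PDF source type: {type(source).__name__}")
```

With this in place, a call such as parser.parse(to_parser_input(Path("reports/statement.pdf"))) passes a plain string path, and an in-memory buffer is passed as bytes.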
Send parsed pages to downstream chunking
from swarmauri_parser_pypdf2 import PyPDF2Parser

parser = PyPDF2Parser()
pages = parser.parse("contracts/master-service-agreement.pdf")

for page in pages:
    if page.content.strip():
        print(page.metadata["page_number"], len(page.content))
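The example above only reports page sizes. One way to carry on to actual chunking is a fixed-size character window that copies each page's metadata onto its chunks, so page_number and source survive into an index. This is a sketch under assumptions: Chunk and chunk_pages are hypothetical names, the window sizes are arbitrary, and the SimpleNamespace page stands in for parser output so the code runs without a PDF. Swarmauri's own chunkers may be a better fit in a real pipeline.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator


@dataclass
class Chunk:
    """Hypothetical container for a slice of page text plus its provenance."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def chunk_pages(pages: Iterable[Any], size: int = 500, overlap: int = 50) -> Iterator[Chunk]:
    """Yield overlapping character windows per page, skipping empty pages."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for page in pages:
        text = page.content.strip()
        if not text:
            continue
        for index, start in enumerate(range(0, len(text), step)):
            # Copy the page metadata so every chunk keeps page_number and source.
            yield Chunk(text[start:start + size], {**page.metadata, "chunk_index": index})
            if start + size >= len(text):
                break  # this window already reached the end of the page


# Stand-in page (not real parser output): 30 characters of text.
demo_page = SimpleNamespace(
    content="abcdefghij" * 3,
    metadata={"page_number": 1, "source": "demo.pdf"},
)
demo_chunks = list(chunk_pages([demo_page], size=12, overlap=2))
print(len(demo_chunks))  # 3
```

Chunks never cross a page boundary here, which keeps the page_number on each chunk exact at the cost of splitting sentences that span two pages.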
Related Packages
Swarmauri Foundations
More Documentation
Best Practices
- Use this parser when the PDF already contains embedded text.
- Use OCR first for scan-only PDFs, then parse or post-process the OCR output.
- Validate encrypted, malformed, or image-only PDFs earlier in the ingestion pipeline so downstream processing can route them correctly.
- Keep in mind that page text extraction quality depends on the PDF's internal structure and reading order.
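The routing advice above can be sketched as a small pre-check. This is an illustration under assumptions, not the package's behavior: looks_like_pdf and route_pdf are hypothetical helpers, the "reject", "ocr", and "index" labels are made up for the example, and the SimpleNamespace pages stand in for parser output. The header check relies on PDF files starting with the %PDF- marker, which many readers tolerate within the first 1024 bytes. It does not detect encryption.

```python
from types import SimpleNamespace
from typing import Any, Sequence

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: the PDF header marker should appear near the start."""
    return PDF_MAGIC in data[:1024]


def route_pdf(data: bytes, pages: Sequence[Any]) -> str:
    """Return a hypothetical routing label: 'reject', 'ocr', or 'index'."""
    if not looks_like_pdf(data):
        return "reject"  # not a PDF, or the header is missing
    if not any(page.content.strip() for page in pages):
        return "ocr"  # no extractable text came back, likely image-only
    return "index"


# Stand-in page objects (not real parser output) to exercise the routing.
text_page = SimpleNamespace(content="Quarterly totals", metadata={"page_number": 1})
blank_page = SimpleNamespace(content="   ", metadata={"page_number": 2})

print(route_pdf(b"%PDF-1.7\n...", [text_page, blank_page]))  # index
print(route_pdf(b"%PDF-1.7\n...", [blank_page]))  # ocr
print(route_pdf(b"<html>", []))  # reject
```

In a real pipeline the pages argument would come from parser.parse(data). Malformed or encrypted files may raise during parsing, so wrapping that call in error handling and routing failures separately is a reasonable extension; the exact exception types are not documented here.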
License
This project is licensed under the Apache-2.0 License.