PyPDF2 Parser for Swarmauri.
Swarmauri Parser PyPDF2
Lightweight PDF parser for Swarmauri that uses PyPDF2 to extract text from each page. Returns a Document per page with metadata describing the source file and page number.
Features
- Handles PDF input from file paths or raw bytes.
- Produces one Document per page, storing text in content and metadata fields (page_number, source).
- Gracefully returns an empty list if PyPDF2 cannot extract text from a page (e.g., scanned PDFs without OCR).
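Because an empty or blank result usually means the PDF has no text layer, it can help to check the parser output before passing it on. The following is a minimal sketch, not part of this package: `PageDoc` is a hypothetical stand-in for the returned documents (only `content` and `metadata` are assumed), the helper names are made up for illustration, and the 1-based page numbers in the sample are an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a parsed page: text plus metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


def pages_needing_ocr(documents, min_chars=1):
    """Return page numbers whose stripped text is shorter than min_chars."""
    return [
        doc.metadata.get("page_number")
        for doc in documents
        if len(doc.content.strip()) < min_chars
    ]


def needs_ocr(documents):
    """True when nothing was extracted at all, or every page is blank."""
    return not documents or all(not doc.content.strip() for doc in documents)


# Sample data standing in for parser output (values are illustrative).
sample = [
    PageDoc("Introduction to the device", {"page_number": 1, "source": "device.pdf"}),
    PageDoc("   ", {"page_number": 2, "source": "device.pdf"}),
]

print(pages_needing_ocr(sample))  # [2]
print(needs_ocr(sample))          # False
print(needs_ocr([]))              # True
```

With real parser output, the same helpers should work on the list returned by `parser.parse(...)`, as long as each item exposes `content` and `metadata` as described above.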
Prerequisites
- Python 3.10 or newer.
- PyPDF2 (installed automatically as a dependency). Encrypted PDFs must be decrypted with their password before parsing.
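One way to handle an encrypted file is to decrypt it with PyPDF2 yourself and hand the parser an unencrypted copy as bytes. The sketch below assumes PyPDF2 2.x or 3.x (`PdfReader`, `PdfWriter`); the helper name `decrypt_pdf_to_bytes` is hypothetical and not part of this package.

```python
import io


def decrypt_pdf_to_bytes(path: str, password: str) -> bytes:
    """Return an unencrypted copy of the PDF at `path` as raw bytes."""
    # Imported here so the helper can be defined even where PyPDF2 is missing.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value (0) when the password is wrong.
        if not reader.decrypt(password):
            raise ValueError(f"Could not decrypt {path!r} with the given password")

    # Copy every page into a fresh writer, which is not encrypted by default.
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()
```

The returned bytes can then be passed to the parser in the same way as any other raw PDF bytes, for example `parser.parse(decrypt_pdf_to_bytes("statements/bank.pdf", "secret"))`, where `"secret"` is a placeholder password.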
Installation
# pip
pip install swarmauri_parser_pypdf2
# poetry
poetry add swarmauri_parser_pypdf2
# uv (pyproject-based projects)
uv add swarmauri_parser_pypdf2
Quickstart
from swarmauri_parser_pypdf2 import PyPDF2Parser

parser = PyPDF2Parser()
documents = parser.parse("manuals/device.pdf")

# Print the page number and the first 120 characters of each page.
for doc in documents:
    print(doc.metadata["page_number"], doc.content[:120])
Parsing PDF Bytes
from swarmauri_parser_pypdf2 import PyPDF2Parser

# Read the whole PDF into memory as raw bytes.
with open("statements/bank.pdf", "rb") as f:
    pdf_bytes = f.read()

parser = PyPDF2Parser()
pages = parser.parse(pdf_bytes)
print(len(pages), "pages parsed from bytes")
Tips
- PyPDF2 extracts text only when the PDF contains accessible text objects. For scanned documents, run OCR first (e.g., with swarmauri_ocr_pytesseract).
- Remove or handle password protection before parsing; PyPDF2 cannot decrypt files without the password.
- Combine this parser with Swarmauri chunkers or summarizers to process large documents efficiently.
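As an illustration of the last tip, per-page output can be packed into larger chunks before it is sent to a summarizer. This is a plain-Python sketch rather than a Swarmauri chunker: `PageDoc` is a hypothetical stand-in for the parsed documents, and `group_pages` and its `max_chars` budget are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a parsed page: text plus metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


def group_pages(documents, max_chars=2000):
    """Greedily pack consecutive pages into chunks.

    The combined page text of a chunk stays within max_chars (newline
    separators are not counted). A single page longer than max_chars
    becomes its own chunk.
    """
    chunks = []
    texts, page_numbers, total = [], [], 0
    for doc in documents:
        # Start a new chunk when adding this page would exceed the budget.
        if texts and total + len(doc.content) > max_chars:
            chunks.append({"text": "\n".join(texts), "pages": page_numbers})
            texts, page_numbers, total = [], [], 0
        texts.append(doc.content)
        page_numbers.append(doc.metadata.get("page_number"))
        total += len(doc.content)
    if texts:
        chunks.append({"text": "\n".join(texts), "pages": page_numbers})
    return chunks


# Three illustrative pages of 600 characters each.
pages = [
    PageDoc("a" * 600, {"page_number": 1}),
    PageDoc("b" * 600, {"page_number": 2}),
    PageDoc("c" * 600, {"page_number": 3}),
]
chunks = group_pages(pages, max_chars=1300)
print([chunk["pages"] for chunk in chunks])  # [[1, 2], [3]]
```

Keeping the page numbers alongside each chunk makes it possible to trace a summary back to the pages it came from.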
Want to help?
If you want to contribute to swarmauri-sdk, read up on our guidelines for contributing that will help you get started.