PyPDF2 Parser for Swarmauri.

Swarmauri Parser PyPDF2

Lightweight PDF parser for Swarmauri that uses PyPDF2 to extract text from each page. Returns a Document per page with metadata describing the source file and page number.

Features

  • Handles PDF input from file paths or raw bytes.
  • Produces one Document per page: the extracted text is stored in content, and metadata carries the page_number and source fields.
  • Gracefully returns an empty list if PyPDF2 cannot extract text from a page (e.g., scanned PDFs without OCR).
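
Because a scanned PDF can come back with no usable text, it can help to check the result before passing it downstream. The following is a minimal sketch, not part of this package: has_extractable_text is a hypothetical helper, and SimpleNamespace objects stand in for the parser's Document, assuming only the content attribute described above.

```python
from types import SimpleNamespace


def has_extractable_text(documents, min_chars=1):
    """Return True if the parsed pages contain at least `min_chars`
    non-whitespace-trimmed characters in total.

    Hypothetical helper: it only assumes each item has a `content` attribute.
    """
    total = sum(len((doc.content or "").strip()) for doc in documents)
    return total >= min_chars


# Stand-ins for parser output (assumption: real Documents expose `content`).
scanned = []  # nothing extracted at all
blank = [SimpleNamespace(content="  \n", metadata={"page_number": 1})]
normal = [SimpleNamespace(content="Hello", metadata={"page_number": 1})]

for label, docs in [("scanned", scanned), ("blank", blank), ("normal", normal)]:
    status = "ok" if has_extractable_text(docs) else "needs OCR"
    print(label, status)
```

The check treats an empty list and whitespace-only pages the same way, so it does not depend on how the parser reports pages without text.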

Prerequisites

  • Python 3.10 or newer.
  • PyPDF2 (installed automatically as a dependency). Encrypted PDFs must be decrypted with their password before parsing.

Installation

# pip
pip install swarmauri_parser_pypdf2

# poetry
poetry add swarmauri_parser_pypdf2

# uv (pyproject-based projects)
uv add swarmauri_parser_pypdf2

Quickstart

from swarmauri_parser_pypdf2 import PyPDF2Parser

parser = PyPDF2Parser()
documents = parser.parse("manuals/device.pdf")

for doc in documents:
    print(doc.metadata["page_number"], doc.content[:120])
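
Since the parser returns one Document per page, a common next step is to stitch the pages back together in order. The sketch below is illustrative only: join_pages is a hypothetical helper, the SimpleNamespace objects stand in for real Documents, and page_number is assumed to be a sortable value such as an int.

```python
from types import SimpleNamespace


def join_pages(documents, separator="\n\n"):
    """Concatenate per-page documents into one string, in page order.

    Each page is prefixed with a "[page N]" marker so the origin of a
    passage can still be traced after joining.
    """
    ordered = sorted(documents, key=lambda doc: doc.metadata["page_number"])
    return separator.join(
        f"[page {doc.metadata['page_number']}]\n{doc.content}" for doc in ordered
    )


# Stand-in objects exposing the two attributes the Quickstart reads.
pages = [
    SimpleNamespace(content="Second page text", metadata={"page_number": 2}),
    SimpleNamespace(content="First page text", metadata={"page_number": 1}),
]

combined = join_pages(pages)
print(combined)
```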

Parsing PDF Bytes

from swarmauri_parser_pypdf2 import PyPDF2Parser

with open("statements/bank.pdf", "rb") as f:
    pdf_bytes = f.read()

parser = PyPDF2Parser()
pages = parser.parse(pdf_bytes)
print(len(pages), "pages parsed from bytes")
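
The parser is documented as accepting file paths or raw bytes. If your code also deals with pathlib.Path objects or open binary files, a small adapter can normalize them first. This is a sketch under that assumption; to_parser_input is a hypothetical name and not part of the package.

```python
import io
from pathlib import Path


def to_parser_input(source):
    """Normalize `source` to a str path or a bytes object.

    Hypothetical adapter: accepts str, pathlib.Path, bytes-like objects,
    or a file object opened in binary mode.
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        return bytes(source)
    if isinstance(source, Path):
        return str(source)
    if isinstance(source, str):
        return source
    if hasattr(source, "read"):
        data = source.read()
        if not isinstance(data, bytes):
            raise TypeError("file object must be opened in binary mode")
        return data
    raise TypeError(f"unsupported PDF source: {type(source).__name__}")


# Each call yields something in one of the two documented input forms.
print(type(to_parser_input(Path("manuals/device.pdf"))).__name__)
print(type(to_parser_input(bytearray(b"%PDF-1.7"))).__name__)
print(type(to_parser_input(io.BytesIO(b"%PDF-1.7"))).__name__)
```

The returned value can then be handed to parser.parse(...) in the same way as in the examples above.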

Tips

  • PyPDF2 extracts text only when the PDF contains accessible text objects. For scanned documents, run OCR first (e.g., with swarmauri_ocr_pytesseract).
  • Remove or handle password protection before parsing; PyPDF2 cannot decrypt files without the password.
  • Combine this parser with Swarmauri chunkers or summarizers to process large documents efficiently.
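
For the password tip, one option is to decrypt the file with PyPDF2 itself and pass the resulting bytes to the parser. The sketch below assumes a recent PyPDF2 release that provides PdfReader, PdfWriter, is_encrypted, decrypt, and add_page; decrypt_pdf_to_bytes is a hypothetical helper, not part of this package, and AES-encrypted files may need an additional cryptography backend installed.

```python
from io import BytesIO


def decrypt_pdf_to_bytes(path, password):
    """Return an unencrypted copy of the PDF at `path` as bytes.

    Hypothetical helper. Copies pages only, so document-level details
    such as outlines may not be carried over.
    """
    # Imported lazily so the helper can be defined without PyPDF2 present.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password does not match.
        if not reader.decrypt(password):
            raise ValueError(f"wrong password for {path}")

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    buffer = BytesIO()
    writer.write(buffer)
    return buffer.getvalue()
```

The returned bytes can be passed to parser.parse(...) as in the Parsing PDF Bytes example.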

Want to help?

If you want to contribute to swarmauri-sdk, read our contributing guidelines to get started.
