PyPDF2 Parser for Swarmauri.

Swarmauri Parser PyPDF2

Lightweight PDF parser for Swarmauri that uses PyPDF2 to extract text from each page. Returns a Document per page with metadata describing the source file and page number.

Features

  • Handles PDF input from file paths or raw bytes.
  • Produces one Document per page, storing the page text in content and the page_number and source in metadata.
  • Gracefully returns an empty list when PyPDF2 cannot extract any text (e.g., scanned PDFs without OCR).
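
Because a PDF without extractable text produces no error, it can help to check the result explicitly. The following is a minimal sketch of such a check, not part of the package: `PageDoc` is a stand-in dataclass that only mirrors the `content` and `metadata["page_number"]` fields described above, `pages_without_text` and `total_pages` are hypothetical names, and page numbers are assumed to be 1-based. Whether textless pages are omitted or returned with empty content is treated as unknown, so both cases are covered.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class PageDoc:
    """Stand-in for a parsed page (assumption: not the real Swarmauri Document)."""
    content: str
    metadata: dict = field(default_factory=dict)


def pages_without_text(documents: Iterable, total_pages: int) -> list[int]:
    """Return page numbers (assumed 1-based) that are missing or blank."""
    with_text = {
        doc.metadata["page_number"]
        for doc in documents
        if doc.content and doc.content.strip()
    }
    return [n for n in range(1, total_pages + 1) if n not in with_text]


docs = [
    PageDoc("Intro text", {"page_number": 1}),
    PageDoc("   ", {"page_number": 3}),  # whitespace only counts as blank
]
print(pages_without_text(docs, total_pages=3))  # [2, 3]
```

Pages reported by a check like this are candidates for an OCR pass before parsing again.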

Prerequisites

  • Python 3.10 or newer.
  • PyPDF2 (installed automatically). For encrypted PDFs, remove or unlock the password protection before parsing.
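
One way to unlock a protected file before parsing is to decrypt it with PyPDF2 directly and hand the resulting bytes to the parser. This is a sketch under assumptions rather than a feature of this package: `decrypt_pdf_to_bytes` is a hypothetical helper, it relies on the PyPDF2 3.x names `PdfReader`, `PdfWriter`, `is_encrypted`, `decrypt`, and `add_page`, and AES-encrypted files may need an additional crypto dependency installed.

```python
from io import BytesIO
from pathlib import Path


def decrypt_pdf_to_bytes(path: str, password: str) -> bytes:
    """Return an unencrypted copy of the PDF at `path` as raw bytes."""
    if not Path(path).is_file():
        raise FileNotFoundError(path)

    # Imported lazily so the helper can be defined without PyPDF2 present
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    # decrypt() returns a falsy value when the password does not match
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError("wrong password for " + path)

    # Copy every page into a fresh, unprotected document
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    buffer = BytesIO()
    writer.write(buffer)
    return buffer.getvalue()
```

The returned bytes can then be passed to `PyPDF2Parser().parse(...)` in the same way as any other in-memory PDF.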

Installation

# pip
pip install swarmauri_parser_pypdf2

# poetry
poetry add swarmauri_parser_pypdf2

# uv (pyproject-based projects)
uv add swarmauri_parser_pypdf2

Quickstart

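from swarmauri_parser_pypdf2 import PyPDF2Parser

# Parse a PDF from a file path; one Document is returned per page
parser = PyPDF2Parser()
documents = parser.parse("manuals/device.pdf")

# Print each page number with the first 120 characters of its text
for doc in documents:
    print(doc.metadata["page_number"], doc.content[:120])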
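
When the whole document is needed as one string, the per-page results can be stitched back together. A minimal sketch, assuming only the `content` and `metadata["page_number"]` fields used in the Quickstart: `PageDoc` is a stand-in for the returned documents and `join_pages` is a hypothetical helper, not something the package provides.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a parsed page (assumption: not the real Swarmauri Document)."""
    content: str
    metadata: dict = field(default_factory=dict)


def join_pages(documents, separator: str = "\n\n") -> str:
    """Concatenate page texts in page order, each prefixed with a page marker."""
    ordered = sorted(documents, key=lambda d: d.metadata["page_number"])
    return separator.join(
        f"[page {d.metadata['page_number']}]\n{d.content.strip()}" for d in ordered
    )


docs = [
    PageDoc("Second page.", {"page_number": 2}),
    PageDoc("First page.", {"page_number": 1}),
]
print(join_pages(docs))
```

The page markers keep a reference back to the source page, which is useful when the joined text is later searched or summarized.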

Parsing PDF Bytes

from swarmauri_parser_pypdf2 import PyPDF2Parser

# Read the whole PDF into memory as raw bytes
with open("statements/bank.pdf", "rb") as f:
    pdf_bytes = f.read()

# The same parse() call accepts bytes instead of a file path
parser = PyPDF2Parser()
pages = parser.parse(pdf_bytes)
print(len(pages), "pages parsed from bytes")
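
Bytes often arrive from uploads or network responses, where the payload may not be a PDF at all. The sketch below shows one possible guard that runs before the parser is called. `load_pdf_bytes` is a hypothetical helper; it relies only on the fact that a PDF file begins with the `%PDF-` header marker.

```python
from pathlib import Path
from typing import Union


def load_pdf_bytes(source: Union[str, bytes]) -> bytes:
    """Normalize a file path or raw payload to bytes and reject non-PDF input."""
    # A str is treated as a file path, anything else as an in-memory payload
    data = Path(source).read_bytes() if isinstance(source, str) else bytes(source)
    if not data.startswith(b"%PDF-"):
        raise ValueError("input does not start with a PDF header")
    return data


payload = b"%PDF-1.7\n% minimal example payload\n"
print(len(load_pdf_bytes(payload)), "bytes accepted")
```

A check like this fails fast with a clear message instead of surfacing a lower-level PyPDF2 read error.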

Tips

  • PyPDF2 extracts text only when the PDF contains accessible text objects. For scanned documents, run OCR first (e.g., with swarmauri_ocr_pytesseract).
  • Remove or handle password protection before parsing; PyPDF2 cannot decrypt files without the password.
  • Combine this parser with Swarmauri chunkers or summarizers to process large documents efficiently.
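
To illustrate the chunking tip, here is a plain-Python sketch that splits each page into overlapping character windows while keeping the page metadata on every chunk. It is not a Swarmauri chunker: `PageDoc` is a stand-in for the parsed documents, and `chunk_pages`, `max_chars`, `overlap`, and the `chunk_index` metadata key are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a parsed page (assumption: not the real Swarmauri Document)."""
    content: str
    metadata: dict = field(default_factory=dict)


def chunk_pages(documents, max_chars: int = 1000, overlap: int = 100):
    """Split each page into overlapping character windows, keeping page metadata."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for doc in documents:
        text = doc.content
        for index, start in enumerate(range(0, len(text), step)):
            chunks.append(
                PageDoc(
                    content=text[start:start + max_chars],
                    # Copy the page metadata and record the chunk position
                    metadata={**doc.metadata, "chunk_index": index},
                )
            )
            # Stop once a window has reached the end of the page text
            if start + max_chars >= len(text):
                break
    return chunks


page = PageDoc("abcdefghij", {"page_number": 1, "source": "x.pdf"})
for chunk in chunk_pages([page], max_chars=4, overlap=1):
    print(chunk.metadata["chunk_index"], chunk.content)
```

Carrying `page_number` and `source` through to each chunk lets downstream summarizers or retrievers cite the page a passage came from.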

Want to help?

If you want to contribute to swarmauri-sdk, read our contribution guidelines to get started.
