PyPDF2 Parser for Swarmauri.

Swarmauri Parser PyPDF2

Lightweight PDF parser for Swarmauri that uses PyPDF2 to extract text from each page. Returns a Document per page with metadata describing the source file and page number.

Features

  • Handles PDF input from file paths or raw bytes.
  • Produces one Document per page, with the extracted text in content and the page_number and source fields in metadata.
  • Gracefully returns an empty list if PyPDF2 cannot extract text from a page (e.g., scanned PDFs without OCR).
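
Because an empty result is a signal rather than an error, callers may want to check for it before moving on. The sketch below is one way to do that. PageDoc is a hypothetical stand-in that only mirrors the content and metadata attributes described above, and needs_ocr and pages_without_text are illustrative helper names, not part of this package.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a parsed page (not the Swarmauri Document class)."""
    content: str
    metadata: dict = field(default_factory=dict)


def needs_ocr(documents) -> bool:
    """Return True when the list is empty or every page has only whitespace."""
    return not any(doc.content.strip() for doc in documents)


def pages_without_text(documents) -> list:
    """Collect the page_number of each page whose content is blank."""
    return [
        doc.metadata.get("page_number")
        for doc in documents
        if not doc.content.strip()
    ]
```

With real parser output, a True result from needs_ocr(documents) would suggest routing the file through an OCR step instead.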

Prerequisites

  • Python 3.10 or newer.
  • PyPDF2 (installed automatically as a dependency). For encrypted PDFs, remove or otherwise handle password protection before parsing.

Installation

# pip
pip install swarmauri_parser_pypdf2

# poetry
poetry add swarmauri_parser_pypdf2

# uv (pyproject-based projects)
uv add swarmauri_parser_pypdf2

Quickstart

from swarmauri_parser_pypdf2 import PyPDF2Parser

parser = PyPDF2Parser()
documents = parser.parse("manuals/device.pdf")

for doc in documents:
    print(doc.metadata["page_number"], doc.content[:120])

Parsing PDF Bytes

from swarmauri_parser_pypdf2 import PyPDF2Parser

with open("statements/bank.pdf", "rb") as f:
    pdf_bytes = f.read()

parser = PyPDF2Parser()
pages = parser.parse(pdf_bytes)
print(len(pages), "pages parsed from bytes")
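
When bytes come from an upload or a network response, it can help to confirm they look like a PDF before handing them to the parser. A minimal sketch, assuming a lenient check is acceptable: PDF files begin with a "%PDF-" header, and some producers put a few stray bytes in front of it, so the helper scans the first 1024 bytes. looks_like_pdf is an illustrative name, not part of this package.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Heuristic check: is the "%PDF-" header present near the start of the data?"""
    # Scanning a short prefix tolerates leading junk bytes before the header.
    return b"%PDF-" in data[:1024]
```

A caller could skip or log inputs for which this returns False rather than passing them to parser.parse.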

Tips

  • PyPDF2 extracts text only when the PDF contains accessible text objects. For scanned documents, run OCR first (e.g., with swarmauri_ocr_pytesseract).
  • Remove or handle password protection before parsing; PyPDF2 cannot decrypt files without the password.
  • Combine this parser with Swarmauri chunkers or summarizers to process large documents efficiently.
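
As a rough illustration of the last tip, per-page results can be grouped into batches that stay under a character budget before being passed to a downstream step. This is a sketch under stated assumptions, not a Swarmauri chunker: PageDoc is a hypothetical stand-in exposing only content and metadata, and batch_pages and max_chars are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a parsed page (not the Swarmauri Document class)."""
    content: str
    metadata: dict = field(default_factory=dict)


def batch_pages(documents, max_chars: int = 4000) -> list:
    """Group consecutive pages so each batch holds at most max_chars characters.

    A single page longer than max_chars is kept as its own batch, not split.
    """
    batches, current, size = [], [], 0
    for doc in documents:
        length = len(doc.content)
        # Close the current batch when adding this page would exceed the budget.
        if current and size + length > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(doc)
        size += length
    if current:
        batches.append(current)
    return batches
```

Page order is preserved, so the page_number metadata in each batch still maps back to the source file.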

Want to help?

If you want to contribute to swarmauri-sdk, read our contributing guidelines to get started.
