Page-by-page PDF text parser for Swarmauri using PyPDF2 with path and bytes input support.

Swarmauri Parser PyPDF2

swarmauri_parser_pypdf2 is the Swarmauri PDF text parser for page-by-page extraction using PyPDF2. It converts a PDF file path or PDF bytes into a list of Swarmauri Document objects, preserving page numbers and source metadata for downstream chunking, indexing, and agent workflows.

Why Use Swarmauri Parser PyPDF2

  • Extract embedded PDF text without introducing OCR when the document already contains readable text objects.
  • Preserve page boundaries by returning one Document per page.
  • Accept either on-disk PDFs or PDF bytes already loaded in memory.
  • Keep document ingestion aligned with the Swarmauri parser interface used across other loaders and converters.

FAQ

What does this parser return?
A list of Swarmauri Document objects, usually one for each page that contains extractable text.

Can it parse PDF bytes instead of a file path?
Yes. parse() accepts either a string path or bytes.

Does it perform OCR on scanned PDFs?
No. PyPDF2 extracts embedded text. Scanned PDFs without text objects should be handled with OCR first.

What metadata is attached to each page?
Each returned document includes page_number and source.
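
As an illustration of how the page_number and source metadata might be used, the sketch below groups parsed pages by source and orders them by page number. The helper name group_pages_by_source is hypothetical and not part of the package, and the sample objects are stand-ins that only mimic the content and metadata attributes described above, so the snippet runs without a PDF on disk.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_pages_by_source(documents):
    """Map each source to its documents, ordered by page_number.

    Hypothetical helper: it only assumes each document exposes a
    ``metadata`` dict with "source" and "page_number" keys.
    """
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata["source"]].append(doc)
    for docs in grouped.values():
        docs.sort(key=lambda d: d.metadata["page_number"])
    return dict(grouped)


# Stand-in objects (not real parser output) with the same attribute names.
sample = [
    SimpleNamespace(content="second", metadata={"page_number": 2, "source": "a.pdf"}),
    SimpleNamespace(content="first", metadata={"page_number": 1, "source": "a.pdf"}),
    SimpleNamespace(content="other", metadata={"page_number": 1, "source": "b.pdf"}),
]
grouped = group_pages_by_source(sample)
```

With real parser output, the same helper could be called on the concatenated results of several parse() calls.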

Features

  • Page-by-page PDF text extraction through PyPDF2.
  • Supports parsing from local file paths and raw PDF bytes.
  • Returns Swarmauri Document objects with content, page_number, and source.
  • Clean fit for ingestion pipelines, retrieval systems, and document analysis workflows.
  • Supports Python 3.10, 3.11, 3.12, 3.13, and 3.14.

Installation

# Using uv
uv add swarmauri_parser_pypdf2

# Or using pip
pip install swarmauri_parser_pypdf2

Usage

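from swarmauri_parser_pypdf2 import PyPDF2Parser

# Parse a PDF from a file path; the result is a list of Document objects.
parser = PyPDF2Parser()
documents = parser.parse("manuals/device.pdf")

for document in documents:
    # Print the page number, then a 160-character preview of the page text.
    print(document.metadata["page_number"])
    print(document.content[:160])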

Examples

Parse a PDF from bytes

from pathlib import Path
from swarmauri_parser_pypdf2 import PyPDF2Parser

pdf_bytes = Path("reports/statement.pdf").read_bytes()
parser = PyPDF2Parser()
pages = parser.parse(pdf_bytes)

for page in pages:
    print(page.metadata)

Filter out empty pages before downstream chunking

from swarmauri_parser_pypdf2 import PyPDF2Parser

parser = PyPDF2Parser()
pages = parser.parse("contracts/master-service-agreement.pdf")

for page in pages:
    if page.content.strip():
        print(page.metadata["page_number"], len(page.content))
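
The loop above only reports page sizes. A minimal sketch of an actual chunking step, assuming fixed-size character windows are acceptable, might look like the following. The chunk_pages helper and its max_chars parameter are hypothetical and not provided by the package; the sample pages are stand-ins with the same content and metadata attributes.

```python
from types import SimpleNamespace


def chunk_pages(pages, max_chars=500):
    """Yield (page_number, chunk_text) pairs, skipping blank pages.

    Hypothetical helper: splits each page's stripped text into
    consecutive windows of at most ``max_chars`` characters.
    """
    for page in pages:
        text = page.content.strip()
        if not text:
            continue
        for start in range(0, len(text), max_chars):
            yield page.metadata["page_number"], text[start:start + max_chars]


# Stand-in pages (not real parser output) so the sketch is runnable.
sample_pages = [
    SimpleNamespace(content="abcdefghij", metadata={"page_number": 1}),
    SimpleNamespace(content="   ", metadata={"page_number": 2}),
    SimpleNamespace(content="xyz", metadata={"page_number": 3}),
]
chunks = list(chunk_pages(sample_pages, max_chars=4))
```

Keeping the page number next to every chunk means a retrieval hit can still be traced back to its page. A production pipeline would likely split on sentence or token boundaries instead of raw character offsets.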

Related Packages

Swarmauri Foundations

More Documentation

Best Practices

  • Use this parser when the PDF already contains embedded text.
  • Use OCR first for scan-only PDFs, then parse or post-process the OCR output.
  • Validate encrypted, malformed, or image-only PDFs earlier in the ingestion pipeline so downstream processing can route them correctly.
  • Keep in mind that page text extraction quality depends on the PDF's internal structure and reading order.
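
One way to approach the early-validation advice is a cheap byte-level pre-check before the parser is called. The sketch below is a heuristic under stated assumptions, not a full PDF validation: the function name classify_pdf_bytes and its labels are hypothetical, it only looks for the "%PDF-" header and an "/Encrypt" entry, and it cannot detect image-only PDFs.

```python
def classify_pdf_bytes(data: bytes) -> str:
    """Return a coarse routing label for raw bytes (heuristic only)."""
    # PDF files start with a "%PDF-" header; tolerate some leading junk.
    if b"%PDF-" not in data[:1024]:
        return "not_pdf"
    # Encrypted PDFs reference an /Encrypt dictionary in their trailer.
    # A plain substring search may give false positives on unusual files.
    if b"/Encrypt" in data:
        return "encrypted"
    return "candidate"


labels = [
    classify_pdf_bytes(b"hello world"),
    classify_pdf_bytes(b"%PDF-1.7\n1 0 obj\nendobj\n"),
    classify_pdf_bytes(b"%PDF-1.7\ntrailer << /Encrypt 5 0 R >>"),
]
```

Files labelled "candidate" can go on to parse(); the other labels can be routed to error handling or a decryption step, and pages that come back with no text can be sent to OCR.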

License

This project is licensed under the Apache-2.0 License.
