Page-by-page PDF text parser for Swarmauri using slate3k over local file-path inputs.

Swarmauri Parser Slate

swarmauri_parser_slate is the Swarmauri PDF parser for page-by-page text extraction using slate3k, a lightweight wrapper around PDFMiner. It reads a local PDF path, extracts text for each page, and returns Swarmauri Document objects with source and page metadata.

Why Use Swarmauri Parser Slate

  • Parse text-based PDFs into page-scoped Document objects for chunking, retrieval, and downstream agent workflows.
  • Keep document ingestion aligned with the Swarmauri parser interface.
  • Use a small PDF extraction dependency when slate3k is sufficient for the target document set.
  • Preserve page numbers so later indexing, annotation, or citation workflows can map text back to the source file.

FAQ

What input does this parser accept?
A local PDF file path as a string.

Does it support raw PDF bytes?
No. The current implementation is path-only and raises TypeError for other input types.
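When a PDF arrives as bytes (for example from an upload or an HTTP response), one workaround is to write it to a temporary file and pass that path to the parser. The sketch below is an assumption about how a caller might do this, not part of the package; `write_pdf_bytes_to_temp` is a hypothetical helper name.

```python
import tempfile


def write_pdf_bytes_to_temp(data: bytes) -> str:
    """Write raw PDF bytes to a temporary .pdf file and return its path.

    Hypothetical helper, not part of swarmauri_parser_slate.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("data must be bytes or bytearray")
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
        handle.write(data)
        return handle.name


# Possible usage (requires swarmauri_parser_slate and a real PDF payload):
#
# import os
# from swarmauri_parser_slate import SlateParser
#
# path = write_pdf_bytes_to_temp(pdf_bytes)
# try:
#     documents = SlateParser().parse(path)
# finally:
#     os.remove(path)
```

The temporary file is not cleaned up automatically, so the caller should delete it once parsing is done.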

What does it return?
A list of Swarmauri Document objects, usually one per extracted page.

Does it perform OCR on scanned PDFs?
No. It is intended for PDFs that already contain extractable text.

Features

  • Page-by-page PDF text extraction through slate3k.
  • Returns Document objects with page_number and source metadata.
  • Provides a clear TypeError for unsupported input types.
  • Fits Swarmauri ingestion, parsing, and retrieval pipelines.
  • Supports Python 3.10, 3.11, 3.12, 3.13, and 3.14.

Installation

# with uv
uv add swarmauri_parser_slate

# or with pip
pip install swarmauri_parser_slate

Usage

from swarmauri_parser_slate import SlateParser

parser = SlateParser()
documents = parser.parse("pdfs/handbook.pdf")

for document in documents:
    print(document.metadata["page_number"], document.content[:120])

Examples

Parse a handbook PDF

from swarmauri_parser_slate import SlateParser

parser = SlateParser()
pages = parser.parse("manuals/employee-handbook.pdf")

for page in pages:
    print(page.metadata["page_number"], len(page.content))
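Because each page carries `source` and `page_number` metadata, the parsed pages can be turned into citation labels for later retrieval or annotation steps. The following is a minimal sketch: `PageStub` is a hypothetical stand-in for a Swarmauri Document so the example runs without the package installed, and `citation_labels` is an illustrative helper, not part of the library. Whether page numbers start at 0 or 1 is not stated here, so the label uses the value as stored.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class PageStub:
    """Hypothetical stand-in for a Swarmauri Document, used only for this demo."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def citation_labels(pages: Iterable[Any]) -> list[str]:
    """Build one 'source, p. N' label per page from its metadata."""
    labels = []
    for page in pages:
        # Fall back to placeholders if a metadata key is absent.
        source = page.metadata.get("source", "unknown source")
        number = page.metadata.get("page_number", "?")
        labels.append(f"{source}, p. {number}")
    return labels


demo = [
    PageStub("Welcome", {"source": "manuals/employee-handbook.pdf", "page_number": 1}),
    PageStub("Policies", {"source": "manuals/employee-handbook.pdf", "page_number": 2}),
]
print(citation_labels(demo))
```

With the real parser, the list returned by `parser.parse(...)` would be passed in place of `demo`.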

Handle missing files and invalid inputs

from swarmauri_parser_slate import SlateParser

parser = SlateParser()

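# A path that does not exist: print whatever the parser returns.
print(parser.parse("missing.pdf"))

# Raw bytes are not a supported input type, so parse() raises TypeError.
try:
    parser.parse(b"%PDF-1.7 ...")
except TypeError as exc:
    print(exc)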
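Callers who prefer an explicit failure before parsing can validate the input themselves. The sketch below is one possible pre-check, under the assumption that the caller wants exceptions for non-string input, missing files, and non-PDF extensions; `check_pdf_path` is a hypothetical helper and does not reflect how the parser itself handles these cases.

```python
from pathlib import Path


def check_pdf_path(candidate: object) -> Path:
    """Validate a parser input before calling parse().

    Hypothetical helper, not part of swarmauri_parser_slate.
    """
    # The parser accepts only string paths, so reject anything else early.
    if not isinstance(candidate, str):
        raise TypeError(
            f"expected a file path string, got {type(candidate).__name__}"
        )
    path = Path(candidate)
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {candidate}")
    # Extension check only; it does not verify the file is a valid PDF.
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a .pdf file: {candidate}")
    return path


# Possible usage with the real parser:
#
# from swarmauri_parser_slate import SlateParser
# path = check_pdf_path("pdfs/handbook.pdf")
# documents = SlateParser().parse(str(path))
```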

Related Packages

Swarmauri Foundations

More Documentation

Best Practices

  • Use this parser for PDFs that already contain selectable text.
  • Route scan-only or image-based PDFs through OCR before parsing.
  • Keep page-granular output when later stages need per-page provenance.
  • Validate representative PDFs first because extraction quality depends on the original PDF structure.
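One way to act on these practices is to flag pages whose extracted text is empty or very short, which often indicates a scanned or image-based page that needs OCR. This is a rough sketch: `pages_without_text` and the `min_chars` threshold are illustrative assumptions, and `SimpleNamespace` objects stand in for Swarmauri Documents so the example runs on its own.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def pages_without_text(pages: Iterable[Any], min_chars: int = 20) -> list[Any]:
    """Return the page_number of every page with fewer than min_chars characters.

    Hypothetical helper; the threshold is an arbitrary starting point.
    """
    flagged = []
    for page in pages:
        # Treat missing content as empty and ignore surrounding whitespace.
        text = (page.content or "").strip()
        if len(text) < min_chars:
            flagged.append(page.metadata.get("page_number"))
    return flagged


# Stand-in pages: one with real text, one with whitespace only.
sample_pages = [
    SimpleNamespace(
        content="A full paragraph of extractable text on page one.",
        metadata={"page_number": 1},
    ),
    SimpleNamespace(content="  \n", metadata={"page_number": 2}),
]
print(pages_without_text(sample_pages))
```

Running this check on a few representative PDFs before a large ingestion run gives an early signal of which files should be routed through OCR first.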

License

This project is licensed under the Apache-2.0 License.
