A parser for extracting text from PDFs using Slate.
Swarmauri Parser Slate
PDF text parser for Swarmauri using Slate3k (a lightweight PDFMiner wrapper). Extracts text from each PDF page and returns Document instances with page metadata.
Features
- Opens PDFs with Slate3k and returns a `Document` per page (`content` = text, `metadata` includes `page_number` and `source`).
- Accepts file paths (string). Raises a `TypeError` when given anything else to prevent silent failures.
- Returns an empty list if Slate encounters parsing errors, logging the exception to stdout.
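The contract described in the list above can be sketched without Slate3k itself. This is an illustrative stand-in, not the package's implementation: `PageDocument`, `parse_pages`, and the `extract_pages` callable are hypothetical names, and 1-based page numbering is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class PageDocument:
    """Hypothetical stand-in for Swarmauri's Document class."""

    content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def parse_pages(
    source: object, extract_pages: Callable[[str], Iterable[str]]
) -> List[PageDocument]:
    """Mirror the documented contract: a string path in, one document per page out."""
    if not isinstance(source, str):
        # Reject non-path input loudly instead of failing silently.
        raise TypeError(
            f"source must be a file path string, got {type(source).__name__}"
        )
    try:
        pages = list(extract_pages(source))
    except Exception as exc:
        # Parsing problems are reported on stdout and yield an empty list.
        print(f"An error occurred: {exc}")
        return []
    return [
        PageDocument(content=text, metadata={"page_number": number, "source": source})
        for number, text in enumerate(pages, start=1)  # assumption: 1-based pages
    ]
```

Passing the page extractor as a callable keeps the sketch runnable without a PDF on disk; in the real parser that role is played by Slate3k.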
Prerequisites
- Python 3.10 or newer.
- Slate3k depends on `pdfminer.six`; make sure operating-system libraries required by PDFMiner (e.g., `libxml2`, `libxslt` on Linux) are installed.
- Read access to the PDF path you pass in.
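A quick preflight check along these lines can surface a missing prerequisite before parsing. It is only a sketch: `preflight` is a hypothetical helper, and the default module name `slate3k` is an assumption about how the Slate3k distribution is imported.

```python
import importlib.util
import os
import sys
from typing import List


def preflight(pdf_path: str, module_name: str = "slate3k") -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: List[str] = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    # Assumption: Slate3k is importable under `module_name`.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module {module_name!r} is not importable")
    if not os.path.isfile(pdf_path):
        problems.append(f"{pdf_path!r} is not an existing file")
    elif not os.access(pdf_path, os.R_OK):
        problems.append(f"{pdf_path!r} is not readable")
    return problems
```

The check does not verify operating-system libraries; those typically show up as import or runtime errors from PDFMiner itself.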
Installation
# pip
pip install swarmauri_parser_slate
# poetry
poetry add swarmauri_parser_slate
# uv (pyproject-based projects)
uv add swarmauri_parser_slate
Quickstart
from swarmauri_parser_slate import SlateParser

parser = SlateParser()
documents = parser.parse("pdfs/handbook.pdf")
for doc in documents:
    # Show each page number with the first 120 characters of its text.
    print(doc.metadata["page_number"], doc.content[:120])
Handling Errors
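from swarmauri_parser_slate import SlateParser

parser = SlateParser()
try:
    docs = parser.parse("missing.pdf")
    if not docs:
        # Parsing errors do not raise; they produce an empty list.
        print("No pages parsed or Slate returned no text.")
except TypeError as exc:
    # Raised only when the argument is not a string path.
    print(f"Bad input: {exc}")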
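Because parsing errors are printed to stdout rather than raised, an empty list alone does not say what went wrong. One way to recover the message is to capture stdout around the call. The helper below is a sketch under that assumption; `parse_with_log` is a hypothetical name and takes any parse callable, so it does not depend on the package.

```python
import contextlib
import io
from typing import Callable, List, Tuple


def parse_with_log(parse: Callable[[str], List], path: str) -> Tuple[List, str]:
    """Call `parse(path)` and return (documents, text printed to stdout)."""
    buffer = io.StringIO()
    # Anything the parser prints while running ends up in `buffer`.
    with contextlib.redirect_stdout(buffer):
        documents = parse(path)
    return documents, buffer.getvalue().strip()
```

With the real parser this would be called as `parse_with_log(SlateParser().parse, "missing.pdf")`. A `TypeError` from bad input still propagates to the caller unchanged.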
Tips
- Slate3k works best on text-based PDFs. For scanned/bitmap PDFs, run OCR first (e.g., `swarmauri_ocr_pytesseract`).
- Large PDFs can consume memory; consider chunking results or streaming pages to downstream processors.
- Combine with token counting or summarization measurements in Swarmauri to further process the extracted content.
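The chunking tip can be sketched with a small batching generator. `batched` here is a hypothetical helper, not part of the package, and it only limits how many documents a downstream step handles at once; it does not reduce the memory used while the PDF is being parsed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty batch ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

For example, `for group in batched(documents, 20): ...` would hand twenty pages at a time to a summarizer or token counter.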
Want to help?
If you want to contribute to swarmauri-sdk, read up on our guidelines for contributing that will help you get started.