A parser for extracting form-field data from PDFs using PyPDFTK.
Swarmauri Parser PyPDFTK
Form-field parser for Swarmauri built on PyPDFTK. Extracts PDF AcroForm field metadata and returns it as Swarmauri Document content.
Features
- Calls `pypdftk.dump_data_fields` to extract field key/value pairs.
- Emits a single `Document` with newline-delimited `key: value` text and `metadata['source']` set to the PDF path.
- Returns an empty list when no form fields exist or when parsing fails (logs the error).
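The formatting step in the list above can be sketched in isolation. This is a minimal illustration, not the package's actual implementation: it assumes each field arrives as a dict with a `FieldName` key and an optional `FieldValue` key (the labels that `pdftk dump_data_fields` prints), and the helper name `fields_to_text` is hypothetical.

```python
from typing import Iterable, Mapping


def fields_to_text(fields: Iterable[Mapping[str, object]]) -> str:
    """Join form fields into newline-delimited 'key: value' text."""
    lines = []
    for field in fields:
        name = field.get("FieldName")
        if name is None:
            continue  # skip entries that carry no field name
        # Assumption: unfilled fields have no "FieldValue" key at all.
        value = field.get("FieldValue", "")
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Hypothetical sample data shaped like the assumed field dicts.
sample = [
    {"FieldType": "Text", "FieldName": "GivenName", "FieldValue": "John"},
    {"FieldType": "Text", "FieldName": "FamilyName", "FieldValue": "Doe"},
    {"FieldType": "Text", "FieldName": "MiddleName"},  # unfilled field
]
print(fields_to_text(sample))
```

An empty input yields an empty string, which lines up with the "no form fields" case where the parser returns no documents.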
Prerequisites
- Python 3.10 or newer.
- PyPDFTK plus the `pdftk`/`pdftk-java` binary available on the system path. Install it from operating-system packages (e.g., `apt install pdftk-java`) or download `pdftk` for macOS/Windows.
- Read access to the PDF file path you provide.
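Because a missing binary surfaces only as a failed parse, it can help to check for it up front. The sketch below uses only the standard library; the helper name `find_pdftk` and the list of candidate executable names are assumptions for illustration, and the actual executable name may differ by platform or package.

```python
import shutil
from typing import Optional, Sequence


def find_pdftk(candidates: Sequence[str] = ("pdftk", "pdftk-java")) -> Optional[str]:
    """Return the path of the first candidate executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_pdftk()
if location is None:
    print("pdftk not found on PATH; install it before parsing.")
else:
    print(f"Using pdftk at {location}")
```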
Installation
# pip
pip install swarmauri_parser_pypdftk
# poetry
poetry add swarmauri_parser_pypdftk
# uv (pyproject-based projects)
uv add swarmauri_parser_pypdftk
Quickstart
from swarmauri_parser_pypdftk import PyPDFTKParser
parser = PyPDFTKParser()
documents = parser.parse("forms/enrollment.pdf")
for doc in documents:
    print(doc.metadata["source"])
    print(doc.content)
Example output:
source: forms/enrollment.pdf
GivenName: John
FamilyName: Doe
BirthDate: 1990-01-01
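Since the content is newline-delimited `key: value` text, it can be turned back into a dictionary for downstream use. The following is a minimal sketch, not part of the package: the helper name `content_to_dict` is hypothetical, and it assumes field names contain no colon and values contain no newlines.

```python
def content_to_dict(content: str) -> dict[str, str]:
    """Split newline-delimited 'key: value' text into a dictionary."""
    result: dict[str, str] = {}
    for line in content.splitlines():
        # Split on the first colon only, so values such as "12:30" survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines without a separator
        result[key.strip()] = value.strip()
    return result


# Sample text shaped like the field lines in the example output.
content = "GivenName: John\nFamilyName: Doe\nBirthDate: 1990-01-01"
fields = content_to_dict(content)
print(fields["GivenName"])
```

If two fields share a name, the later one overwrites the earlier one in this sketch; collect values into lists instead if duplicates matter for your forms.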
Handling Missing Fields
parser = PyPDFTKParser()
docs = parser.parse("forms/plain.pdf")
if not docs:
    print("No form fields detected or parsing failed.")
Tips
- Ensure `pdftk` is installed and available on `PATH`; PyPDFTK delegates to the binary.
- For encrypted PDFs, remove or provide the password before parsing; `pdftk` cannot dump fields from password-protected documents without credentials.
- Combine with other Swarmauri parsers to extract both structured form data (`PyPDFTKParser`) and free-form text (`PyPDF2Parser` or `FitzPdfParser`).
Want to help?
If you want to contribute to swarmauri-sdk, read up on our guidelines for contributing that will help you get started.