
A parser for extracting form-field data from PDFs using PyPDFTK.


Swarmauri Parser PyPDFTK

Form-field parser for Swarmauri built on PyPDFTK. Extracts PDF AcroForm field metadata and returns it as Swarmauri Document content.

Features

  • Calls pypdftk.dump_data_fields to extract field key/value pairs.
  • Emits a single Document with newline-delimited key: value text and metadata['source'] set to the PDF path.
  • Returns an empty list when no form fields exist or when parsing fails; in the failure case the error is logged.
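
The formatting step described above can be approximated in a few lines. This is a minimal sketch, not the package's actual implementation: it assumes field data arrives as a list of dictionaries with FieldName and FieldValue keys (the shape that pdftk's dump_data_fields report suggests), and the helper name fields_to_text is hypothetical.

```python
from typing import Iterable, Mapping


def fields_to_text(fields: Iterable[Mapping[str, str]]) -> str:
    """Join form fields into newline-delimited "key: value" text.

    Assumption: each entry looks like {"FieldName": ..., "FieldValue": ...}.
    Fields without a value are rendered with an empty value.
    """
    lines = []
    for field in fields:
        name = field.get("FieldName")
        if name is None:
            continue  # skip entries that carry no field name
        value = field.get("FieldValue", "")
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Hand-written sample data standing in for real pdftk output.
sample = [
    {"FieldType": "Text", "FieldName": "GivenName", "FieldValue": "John"},
    {"FieldType": "Text", "FieldName": "FamilyName", "FieldValue": "Doe"},
]
print(fields_to_text(sample))
```

Keeping the formatting in a pure function like this makes it testable without the pdftk binary installed.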

Prerequisites

  • Python 3.10 or newer.
  • PyPDFTK, plus the pdftk (or pdftk-java) binary available on the system PATH. Install it through your operating system's package manager (for example, apt install pdftk-java), or download pdftk for macOS or Windows.
  • Read access to the PDF file path you provide.
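
Because a missing binary and an unreadable file both surface as an empty result, it can help to check the prerequisites up front. The following is a sketch using only the standard library; the function name preflight and its binary parameter are illustrative and not part of this package.

```python
import os
import shutil
from typing import List


def preflight(pdf_path: str, binary: str = "pdftk") -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # shutil.which searches PATH the same way a shell would.
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' was not found on PATH")
    if not os.path.isfile(pdf_path):
        problems.append(f"'{pdf_path}' does not exist or is not a file")
    elif not os.access(pdf_path, os.R_OK):
        problems.append(f"'{pdf_path}' is not readable")
    return problems


for problem in preflight("forms/enrollment.pdf"):
    print("Preflight:", problem)
```

Running this before calling the parser separates environment problems from PDFs that simply contain no form fields.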

Installation

# pip
pip install swarmauri_parser_pypdftk

# poetry
poetry add swarmauri_parser_pypdftk

# uv (pyproject-based projects)
uv add swarmauri_parser_pypdftk

Quickstart

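from swarmauri_parser_pypdftk import PyPDFTKParser

parser = PyPDFTKParser()
# parse() returns a list of Documents; the list is empty if no fields are found
documents = parser.parse("forms/enrollment.pdf")

for doc in documents:
    # metadata["source"] holds the PDF path that was passed to parse()
    print(f"source: {doc.metadata['source']}")
    # content is newline-delimited "key: value" text
    print(doc.content)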

Example output:

source: forms/enrollment.pdf
GivenName: John
FamilyName: Doe
BirthDate: 1990-01-01
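
Since the content is newline-delimited key: value text, it can be turned back into a dictionary for downstream use. This is a minimal sketch under the assumption that field names contain no colon and values do not span lines; content_to_dict is a hypothetical helper, not something this package provides.

```python
from typing import Dict


def content_to_dict(content: str) -> Dict[str, str]:
    """Parse newline-delimited "key: value" text into a dictionary."""
    result = {}
    for line in content.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first colon only, so values such as "12:30" survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        result[key.strip()] = value.strip()
    return result


content = "GivenName: John\nFamilyName: Doe\nBirthDate: 1990-01-01"
fields = content_to_dict(content)
print(fields["BirthDate"])  # prints 1990-01-01
```

If a form uses colons in field names or multi-line values, this round trip is lossy and a structured source would be the safer choice.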

Handling Missing Fields

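from swarmauri_parser_pypdftk import PyPDFTKParser

parser = PyPDFTKParser()
docs = parser.parse("forms/plain.pdf")

# An empty list covers both cases: no form fields, or a logged parsing error
if not docs:
    print("No form fields detected or parsing failed.")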

Tips

  • Ensure pdftk is installed and available on PATH; PyPDFTK delegates to the binary.
  • For encrypted PDFs, remove the password protection or supply the password before parsing; pdftk cannot dump fields from password-protected documents without credentials.
  • Combine with other Swarmauri parsers to extract both structured form data (PyPDFTKParser) and free-form text (PyPDF2Parser or FitzPdfParser).

Want to help?

If you want to contribute to swarmauri-sdk, read our contributing guidelines to get started.


