
Agent-safe PDF extraction SDK with structured output and quality reports.


Psyduck

Psyduck is a small Python SDK for turning PDFs into agent-ready structured documents.

It is intentionally SDK-only: import it from Python, run Psyduck().process(...), and inspect the returned document, profile, quality report, and exported artifacts.

Install

pip install psyduck

Development install:

pip install -e ".[dev]"

Python API

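from psyduck import Psyduck

duck = Psyduck()
# return_content=True includes the extracted document in the returned result.
result = duck.process("report.pdf", goal="rag", mode="balanced", return_content=True)

# Surface extraction warnings before trusting the content.
if result.quality.needs_review:
    for warning in result.quality.warnings:
        print(warning.code, warning.message)

# Each block exposes its page and extracted text.
for block in result.document.blocks:
    print(block.page, block.text)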

Output Directory

output/report-<timestamp>-<id>/
  run.json
  profile.json
  document.md
  document.json
  quality.json
  tables/
  assets/
  pages/
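
Because every run lands in its own directory, the artifacts can also be inspected without the SDK. The following is a minimal sketch using only the standard library. The helper names latest_run_dir and read_quality are hypothetical and not part of Psyduck, the "newest by modification time" rule is an assumption, and the sketch makes no claim about the keys stored inside quality.json.

```python
import json
from pathlib import Path


def latest_run_dir(output_root, stem):
    """Return the most recently modified run directory for `stem`, or None.

    Hypothetical helper: assumes run directories are named
    "<stem>-<timestamp>-<id>" directly under `output_root`.
    """
    root = Path(output_root)
    candidates = [p for p in root.glob(f"{stem}-*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def read_quality(run_dir):
    """Load quality.json from a run directory as a plain dict."""
    with open(Path(run_dir) / "quality.json", encoding="utf-8") as fh:
        return json.load(fh)
```

For example, latest_run_dir("output", "report") would pick the newest report-* directory, and read_quality(...) on that path returns whatever the run wrote to quality.json.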

SDK Contract

  • Default extraction uses PyMuPDF.
  • process() always writes Markdown (document.md), JSON (document.json), profile (profile.json), quality (quality.json), and run metadata (run.json) to the output directory.
  • return_content=False keeps large document content out of the immediate result.
  • load_result(output_dir) reloads a previous SDK run.
  • Custom extractors can be supplied through extractor_registry and requested with process(..., extractors=[...]).
  • tables="force" and ocr="force" report needs_table_adapter / needs_ocr_adapter warnings unless a caller-provided extractor handles that work.

Custom Extractors

from psyduck import Psyduck
from psyduck.extractors.base import ExtractorOutput
from psyduck.extractors.pymupdf import PyMuPDFExtractor


class MyExtractor:
    def extract(self, file_path, pages=None):
        # Return an ExtractorOutput tagged with this extractor's name.
        return ExtractorOutput(source="my_extractor")


duck = Psyduck(
    extractor_registry={
        "pymupdf": PyMuPDFExtractor,
        "my_extractor": lambda: MyExtractor(),
    }
)

result = duck.process("report.pdf", extractors=["pymupdf", "my_extractor"])
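
The registry above maps names to zero-argument factories: a class and a lambda both work because calling either one yields an extractor instance. The sketch below illustrates that pattern in isolation. It is not Psyduck's implementation; resolve_extractors and EchoExtractor are hypothetical names, and the dict returned by extract stands in for a real ExtractorOutput.

```python
def resolve_extractors(registry, names):
    """Instantiate the requested extractors, in order, from a name -> factory map."""
    missing = [name for name in names if name not in registry]
    if missing:
        raise KeyError(f"unknown extractors: {missing}")
    # Each registry value is called with no arguments to build an instance.
    return [registry[name]() for name in names]


class EchoExtractor:
    """Hypothetical stand-in extractor used only for this illustration."""

    def extract(self, file_path, pages=None):
        return {"source": "echo", "file": file_path, "pages": pages}


registry = {
    "echo_class": EchoExtractor,             # a class is a zero-argument factory
    "echo_lambda": lambda: EchoExtractor(),  # so is a lambda that builds one
}
extractors = resolve_extractors(registry, ["echo_class", "echo_lambda"])
```

Storing factories rather than instances means each requested name gets a fresh extractor, so extractors that hold state are not shared between requests.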

Agent Policy

  1. Always call process() before answering questions about PDF contents.
  2. Check result.quality before using extracted content.
  3. Use Markdown for summaries and JSON for page-aware answers.
  4. Do not claim complete extraction when needs_review is true.
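
Rules 2 and 4 can be enforced with a small gate in the agent's own code. This is a minimal sketch, assuming only the attributes shown in the Python API example (quality.needs_review, quality.warnings, warning.code, warning.message). The function name extraction_caveats is hypothetical, and the SimpleNamespace object stands in for a real result.quality.

```python
from types import SimpleNamespace


def extraction_caveats(quality):
    """Return caveats an agent should attach to its answer.

    An empty list means the quality report did not ask for review.
    """
    if not quality.needs_review:
        return []
    return [f"{w.code}: {w.message}" for w in quality.warnings]


# Stand-in for result.quality, built by hand for illustration only.
fake_quality = SimpleNamespace(
    needs_review=True,
    warnings=[SimpleNamespace(code="needs_ocr_adapter", message="no OCR extractor supplied")],
)
caveats = extraction_caveats(fake_quality)
```

An agent could prepend these caveats to any answer drawn from the document and avoid describing the extraction as complete whenever the list is non-empty.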
