A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.

Smart PDF Plumber

A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts not just the text of each page, but also descriptions of the images embedded in it.

It is designed for two common cases:

  • Extract plain text from PDFs page by page.
  • Optionally describe embedded images and include those descriptions in the page text.

Features

  • Page-level PDF parsing with pdfplumber.
  • Optional character deduplication for PDFs with repeated text.
  • Optional image description support using either Google Gemini or Hugging Face vision-language models.
  • LangChain-friendly output: a list of Document objects with metadata such as source path, page number, and total pages.
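Because each page comes back as its own Document, downstream code can treat pages independently. The sketch below prints a one-line summary per page. The metadata keys source, page, and total_pages are assumptions made for illustration (inspect document.metadata for the real names), and SimpleNamespace objects stand in for real Document instances so the example runs without a PDF.

```python
from types import SimpleNamespace


def summarize_pages(documents):
    """Return one 'source [page/total]: N chars' line per document.

    Assumes each item has `.metadata` (a dict) and `.page_content` (a str),
    as LangChain Document objects do. The metadata keys used here are
    hypothetical; inspect `document.metadata` to see the real ones.
    """
    lines = []
    for doc in documents:
        meta = doc.metadata
        source = meta.get("source", "?")
        page = meta.get("page", "?")
        total = meta.get("total_pages", "?")
        lines.append(f"{source} [{page}/{total}]: {len(doc.page_content)} chars")
    return lines


# Stand-in objects so the sketch runs without a PDF on disk.
fake_docs = [
    SimpleNamespace(
        metadata={"source": "a.pdf", "page": 1, "total_pages": 2},
        page_content="Hello",
    ),
    SimpleNamespace(
        metadata={"source": "a.pdf", "page": 2, "total_pages": 2},
        page_content="World!",
    ),
]
print("\n".join(summarize_pages(fake_docs)))
```

With a real loader, the list returned by load() could be passed to summarize_pages in place of fake_docs.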

Installation

Install from PyPI:

pip install smartpdfplumber

The project currently targets Python 3.13 or newer.

Quick Start

from smartpdfplumber.loader import SmartPDFLoader

loader = SmartPDFLoader("path/to/file.pdf", describe_image=True, inference="groq_ai")
documents = loader.load()

for document in documents:
	print(document.metadata)
	print(document.page_content)

text_kwargs

Extra keyword arguments passed to pdfplumber.Page.extract_text().
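As an illustration, the sketch below builds a text_kwargs dictionary from options that pdfplumber's extract_text() accepts (layout, x_tolerance, y_tolerance). The helper names build_text_kwargs and load_with_layout are hypothetical, and passing the dictionary to SmartPDFLoader through a text_kwargs keyword is an assumption based on the parameter name above.

```python
def build_text_kwargs(layout=False, x_tolerance=3, y_tolerance=3):
    """Collect extract_text() options into a dict (hypothetical helper)."""
    return {
        "layout": layout,
        "x_tolerance": x_tolerance,
        "y_tolerance": y_tolerance,
    }


def load_with_layout(path):
    """Load a PDF while asking pdfplumber to preserve the visual layout."""
    # Imported lazily so this sketch can be imported without the package installed.
    from smartpdfplumber.loader import SmartPDFLoader

    # Assumption: the loader accepts the dict via a `text_kwargs` keyword.
    loader = SmartPDFLoader(path, text_kwargs=build_text_kwargs(layout=True))
    return loader.load()
```

Calling load_with_layout("path/to/file.pdf") would then return the usual list of documents, with text extracted under those settings.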

dedupe

Set dedupe=True to call page.dedupe_chars() before extracting text. This can help with PDFs that repeat characters in the output.
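To show the idea behind character deduplication, here is a simplified sketch that drops a character when an identical one has already been seen at nearly the same position. It is an illustration only, not pdfplumber's implementation; the function name drop_overlapping_chars and the reduced character dictionaries are hypothetical.

```python
def drop_overlapping_chars(chars, tolerance=1.0):
    """Drop characters that repeat an earlier one at (almost) the same spot.

    `chars` is a list of dicts with "text", "x0", and "top" keys. This is a
    simplified illustration of the idea, not pdfplumber's implementation.
    """
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == seen["text"]
            and abs(char["x0"] - seen["x0"]) <= tolerance
            and abs(char["top"] - seen["top"]) <= tolerance
            for seen in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# A PDF that draws each glyph twice, slightly offset (for example, to fake bold text).
chars = [
    {"text": "H", "x0": 10.0, "top": 5.0},
    {"text": "H", "x0": 10.3, "top": 5.0},
    {"text": "i", "x0": 18.0, "top": 5.0},
    {"text": "i", "x0": 18.3, "top": 5.0},
]
print("".join(c["text"] for c in drop_overlapping_chars(chars)))
```

Without deduplication, the same input would read as "HHii", which is the kind of repeated output that dedupe=True is meant to avoid.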

describe_image

Set describe_image=True to include image descriptions inline in the page text.

When this is enabled, you must also provide inference.

inference

Supported values:

  • gemini
  • hf_transformers
  • groq_ai
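The rules above (describe_image requires inference, and inference must be one of the listed values) can be checked before a loader is built. The following is a minimal sketch of such a check; check_image_options is a hypothetical helper, not part of the package, and the library's own validation may behave differently.

```python
SUPPORTED_INFERENCE = {"gemini", "hf_transformers", "groq_ai"}


def check_image_options(describe_image=False, inference=None):
    """Validate the option combination before constructing a loader."""
    if not describe_image:
        return  # inference is irrelevant when image descriptions are off
    if inference is None:
        raise ValueError("describe_image=True requires an inference backend")
    if inference not in SUPPORTED_INFERENCE:
        raise ValueError(
            f"unsupported inference {inference!r}; "
            f"expected one of {sorted(SUPPORTED_INFERENCE)}"
        )


check_image_options(describe_image=True, inference="gemini")  # passes silently
```

Running the check early surfaces a misspelled backend name before any model is loaded or any API call is made.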

Image Descriptions

from smartpdfplumber.loader import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	dedupe=True,
	describe_image=True,
	inference="hf_transformers",
	model="Qwen/Qwen3.5-0.8B",  # optional; defaults to "Qwen/Qwen3.5-0.8B"
)
documents = loader.load()

Notes

  • If you enable image descriptions without passing inference, the parser raises a ValueError.
  • If you use inference="hf_transformers", the model is loaded lazily and cached for reuse.

License

MIT
