A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.

Smart PDF Plumber

A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts the text of each page and can also add descriptions of embedded images.

It is designed for two common cases:

  • Extract plain text from PDFs page by page.
  • Optionally describe embedded images and include those descriptions in the page text.

Features

  • Page-level PDF parsing with pdfplumber.
  • Optional character deduplication for PDFs with repeated text.
  • Optional image description support using either Google Gemini or Hugging Face vision-language models.
  • LangChain-friendly output: a list of Document objects with metadata such as source path, page number, and total pages.

Installation

Install from PyPI:

pip install smartpdfplumber

The project currently targets Python 3.13 or newer.

Quick Start

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader("path/to/file.pdf")
documents = loader.load()

for document in documents:
	print(document.metadata)
	print(document.page_content)
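
Because the loader returns one Document per page, a common follow-up step is to merge the pages into a single string. The sketch below is illustrative only: `join_pages` is a hypothetical helper, not part of the package, and the `SimpleNamespace` objects stand in for real Document objects so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def join_pages(documents, separator="\n\n"):
    """Concatenate page texts in order, skipping pages with no visible text."""
    return separator.join(
        doc.page_content for doc in documents if doc.page_content.strip()
    )


# Stand-ins for the Document objects returned by loader.load().
pages = [
    SimpleNamespace(page_content="First page.", metadata={"page": 0}),
    SimpleNamespace(page_content="   ", metadata={"page": 1}),
    SimpleNamespace(page_content="Third page.", metadata={"page": 2}),
]
full_text = join_pages(pages)
print(full_text)
```

Anything that exposes a `page_content` attribute works here, so the same helper can be applied to the list returned by `loader.load()`.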

Options

SmartPDFLoader forwards keyword arguments to PDFPlumberParser:

SmartPDFLoader(
	file_path="path/to/file.pdf",
	text_kwargs={"x_tolerance": 2},
	dedupe=True,
	describe_image=False,
	model=None,
)

text_kwargs

Extra keyword arguments passed to pdfplumber.Page.extract_text().

dedupe

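Set dedupe=True to call page.dedupe_chars() before extracting text. This helps with PDFs that draw the same character more than once at nearly the same position (for example, to fake bold text), which otherwise shows up as doubled letters in the extracted text.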
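
To show what character deduplication means, here is a simplified, pure-Python sketch of the idea. It is not pdfplumber's implementation: the function below is a stand-in written for this example, and it compares only the character text and position. The dictionary keys (`text`, `x0`, `top`) mirror the ones pdfplumber uses for characters.

```python
def dedupe_chars(chars, tolerance=1.0):
    """Drop characters that repeat an already-kept character at nearly the same position."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == other["text"]
            and abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["top"] - other["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# "H" is drawn twice at almost the same spot, as some PDFs do to fake bold text.
chars = [
    {"text": "H", "x0": 10.0, "top": 5.0},
    {"text": "H", "x0": 10.3, "top": 5.0},
    {"text": "i", "x0": 16.0, "top": 5.0},
]
text = "".join(char["text"] for char in dedupe_chars(chars))
print(text)  # Hi
```

Without the deduplication step the same input would produce "HHi".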

describe_image

Set describe_image=True to include image descriptions inline in the page text.

When this is enabled, you must also provide model.

model

Supported values:

  • gemini
  • huggingface
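
The rules for describe_image and model can be checked before a loader is built. The helper below is a hypothetical pre-flight check, not part of the package. Rejecting values outside the two listed above is an assumption of this sketch; the README only states that a missing model raises a ValueError.

```python
SUPPORTED_MODELS = ("gemini", "huggingface")  # the values listed in this README


def check_image_options(describe_image, model):
    """Hypothetical pre-flight check; raises ValueError on an invalid combination."""
    if not describe_image:
        return  # nothing to validate when image descriptions are off
    if model is None:
        raise ValueError("describe_image=True requires a model")
    if model not in SUPPORTED_MODELS:
        raise ValueError(
            f"unsupported model {model!r}; expected one of {SUPPORTED_MODELS}"
        )


check_image_options(describe_image=True, model="gemini")  # passes silently
```

Calling it right before `SmartPDFLoader(...)` surfaces a configuration mistake before any PDF is opened.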

Image Descriptions

Google Gemini

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	describe_image=True,
	model="gemini",
)
documents = loader.load()

This path uses google-genai. Make sure your Google authentication is configured before running it.
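
A quick way to confirm credentials are present is to check the environment before loading. This sketch assumes API-key authentication through the `GOOGLE_API_KEY` or `GEMINI_API_KEY` environment variables, which the google-genai client reads by default. The README does not say which authentication method the package relies on, so treat the helper (a hypothetical name) as a convenience check only; other setups such as Vertex AI credentials would not be detected.

```python
import os


def has_gemini_api_key(env=None):
    """Return True if an API key variable read by google-genai is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in ("GOOGLE_API_KEY", "GEMINI_API_KEY"))


if not has_gemini_api_key():
    print("No Gemini API key found in the environment.")
```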

Hugging Face

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	dedupe=True,
	describe_image=True,
	model="huggingface",
)
documents = loader.load()

This path uses transformers, torch, and torchvision to run a vision-language model locally.

Output Format

Each page becomes a LangChain Document with metadata similar to:

{
	"source": "path/to/file.pdf",
	"file_path": "path/to/file.pdf",
	"page": 0,
	"total_pages": 12,
}

The page value is zero-based.
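
Since page numbers shown to readers usually start at 1, it helps to convert the metadata when displaying it. A minimal sketch, with `page_label` as a hypothetical helper name:

```python
def page_label(metadata):
    """Turn zero-based page metadata into a 1-based label for display."""
    return f"page {metadata['page'] + 1} of {metadata['total_pages']}"


metadata = {
    "source": "path/to/file.pdf",
    "file_path": "path/to/file.pdf",
    "page": 0,
    "total_pages": 12,
}
print(page_label(metadata))  # page 1 of 12
```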

Example

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"assets/sample.pdf",
	dedupe=True,
	describe_image=True,
	model="huggingface",
)

documents = loader.load()

for document in documents:
	print(document.page_content[:200])

Notes

  • If you enable image descriptions without passing model, the parser raises a ValueError.
  • If you use model="gemini", google-genai must be installed and authenticated.
  • If you use model="huggingface", the model is loaded lazily and cached for reuse.

License

MIT

