A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.

Project description

Smart PDF Plumber

A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts the text of each page and can also describe the images embedded in it.

It is designed for two common cases:

Extract plain text from PDFs page by page.
Optionally describe embedded images and include those descriptions in the page text.

Features

Page-level PDF parsing with pdfplumber.
Optional character deduplication for PDFs with repeated text.
Optional image description support using either Google Gemini or Hugging Face vision-language models.
LangChain-friendly output: a list of Document objects with metadata such as source path, page number, and total pages.

Installation

Install from PyPI:

pip install smartpdfplumber

The project currently targets Python 3.13 or newer.

Quick Start

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader("path/to/file.pdf")
documents = loader.load()

for document in documents:
	print(document.metadata)
	print(document.page_content)

Options

SmartPDFLoader forwards keyword arguments to PDFPlumberParser:

SmartPDFLoader(
	file_path="path/to/file.pdf",
	text_kwargs={"x_tolerance": 2},
	dedupe=True,
	describe_image=False,
	model=None,
)

`text_kwargs`

Extra keyword arguments passed to pdfplumber.Page.extract_text().

`dedupe`

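Set dedupe=True to call page.dedupe_chars() before extracting text. This helps with PDFs that draw the same character more than once at nearly the same position (for example, to simulate bold text), which otherwise shows up as doubled characters in the extracted text.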
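
To make the idea of character deduplication concrete, here is a simplified sketch in plain Python. It is not pdfplumber's implementation: the function name dedupe_chars_sketch and the sample data are made up for illustration, and the real method may compare additional attributes such as font name and size. The keys text, x0, and top follow the character dictionaries pdfplumber exposes.

```python
def dedupe_chars_sketch(chars, tolerance=1.0):
    """Drop characters that repeat an already-kept character at nearly the same position."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == seen["text"]
            and abs(char["x0"] - seen["x0"]) <= tolerance
            and abs(char["top"] - seen["top"]) <= tolerance
            for seen in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# Hypothetical input: "Hi" where the "H" was drawn twice, slightly offset.
chars = [
    {"text": "H", "x0": 10.0, "top": 20.0},
    {"text": "H", "x0": 10.3, "top": 20.0},
    {"text": "i", "x0": 18.0, "top": 20.0},
]
deduped = dedupe_chars_sketch(chars)
print("".join(c["text"] for c in deduped))
```

Without the deduplication step, the same input would read as "HHi".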

`describe_image`

Set describe_image=True to include image descriptions inline in the page text.

When this is enabled, you must also provide model.

`model`

Supported values:

gemini
huggingface
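
If you build loader arguments from user input or a config file, it can be useful to check the describe_image and model combination before constructing the loader. The helper below is a hypothetical caller-side check, not part of the package. It assumes that model is simply ignored when describe_image is off, and it chooses to reject unknown model names with a ValueError; the package's own behavior for unknown names is not documented here.

```python
SUPPORTED_MODELS = ("gemini", "huggingface")


def check_image_options(describe_image, model):
    """Raise ValueError when the describe_image/model combination is unusable."""
    if not describe_image:
        return  # assumption: model is ignored when image descriptions are off
    if model is None:
        raise ValueError("describe_image=True requires a model")
    if model not in SUPPORTED_MODELS:
        raise ValueError(
            f"unsupported model {model!r}; expected one of {SUPPORTED_MODELS}"
        )


check_image_options(describe_image=False, model=None)     # passes
check_image_options(describe_image=True, model="gemini")  # passes
```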

Image Descriptions

Google Gemini

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	describe_image=True,
	model="gemini",
)
documents = loader.load()

This path uses google-genai. Make sure your Google authentication is configured before running it.
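
One possible pre-flight check is sketched below. The google-genai client can read an API key from the GOOGLE_API_KEY or GEMINI_API_KEY environment variable; whether SmartPDFLoader relies on that mechanism is an assumption here, and other setups (such as Vertex AI credentials) are not covered. The helper name gemini_key_present is hypothetical.

```python
import os


def gemini_key_present(env=None):
    """Return True if an API key variable that google-genai can read is set."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in ("GOOGLE_API_KEY", "GEMINI_API_KEY"))


if not gemini_key_present():
    print("No Gemini API key found in the environment.")
```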

Hugging Face

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	dedupe=True,
	describe_image=True,
	model="huggingface",
)
documents = loader.load()

This path uses transformers, torch, and torchvision to run a vision-language model locally.
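
Because these dependencies are large, it may help to confirm they are importable before loading a PDF. The sketch below uses only the standard library; the helper name missing_packages is made up for illustration, and it only checks that each package can be found, not that its version is compatible.

```python
from importlib.util import find_spec


def missing_packages(names=("transformers", "torch", "torchvision")):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
```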

Output Format

Each page becomes a LangChain Document with metadata similar to:

{
	"source": "path/to/file.pdf",
	"file_path": "path/to/file.pdf",
	"page": 0,
	"total_pages": 12,
}

The page value is zero-based.
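
Since page is zero-based, add one when showing it to readers. A minimal sketch, using a metadata dictionary shaped like the one above (the helper name page_label is hypothetical):

```python
def page_label(metadata):
    """Format zero-based page metadata as a human-readable label."""
    return f"page {metadata['page'] + 1} of {metadata['total_pages']}"


metadata = {
    "source": "path/to/file.pdf",
    "file_path": "path/to/file.pdf",
    "page": 0,
    "total_pages": 12,
}
print(page_label(metadata))  # page 1 of 12
```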

Example

from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"assets/sample.pdf",
	dedupe=True,
	describe_image=True,
	model="huggingface",
)

documents = loader.load()

for document in documents:
	print(document.page_content[:200])

Notes

If you enable image descriptions without passing model, the parser raises a ValueError.
If you use model="gemini", google-genai must be installed and authenticated.
If you use model="huggingface", the model is loaded lazily and cached for reuse.
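
Lazy loading with caching is a general pattern: nothing is built until the first request, and later requests reuse the same object. The sketch below shows the pattern with functools.lru_cache and a stand-in object; it is not the package's actual code, and get_model and load_count are names invented for the example.

```python
from functools import lru_cache

load_count = 0


@lru_cache(maxsize=None)
def get_model(name):
    """Build the model on first request; later calls return the cached object."""
    global load_count
    load_count += 1  # stands in for an expensive model download or initialization
    return {"name": name}


first = get_model("huggingface")
second = get_model("huggingface")
print(first is second, load_count)  # True 1
```

The practical effect is that the first page with an image pays the loading cost, and later pages do not.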

License

MIT

Project details

Release history

0.1.7: May 9, 2026
0.1.6: May 9, 2026
0.1.5: May 6, 2026
0.1.4: May 5, 2026
0.1.3: May 3, 2026
0.1.2: May 3, 2026 (this version)

Download files

Source Distribution

smartpdfplumber-0.1.2.tar.gz (3.6 kB view details)

Uploaded May 3, 2026 Source

Built Distribution

smartpdfplumber-0.1.2-py3-none-any.whl (3.3 kB view details)

Uploaded May 3, 2026 Python 3

File details

Details for the file smartpdfplumber-0.1.2.tar.gz.

File metadata

Download URL: smartpdfplumber-0.1.2.tar.gz
Upload date: May 3, 2026
Size: 3.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for smartpdfplumber-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`aa9af46344ebde2ed9525236ae25bbd31f6b35462e61e1b9f3174b1d11af52e5`
MD5	`1c39de4c0d0fc1341d316c9d834aaf76`
BLAKE2b-256	`0f7efffe918adf963adce4e35f6600232902f5b2d49e4b8e53ca10dff506e4b8`

File details

Details for the file smartpdfplumber-0.1.2-py3-none-any.whl.

File metadata

Download URL: smartpdfplumber-0.1.2-py3-none-any.whl
Upload date: May 3, 2026
Size: 3.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for smartpdfplumber-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8f303c8d87b35ac0c9d35468eb0e930ed0d71e9a105db1d5dc707bca960b5fba`
MD5	`a95c33011d9d39830859b52ca5a3e86c`
BLAKE2b-256	`7da9c6194071abd44ee58e9a6375f87f7fa73d96c5a0b3e746374800de6e0a38`
