A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.
Smart PDF Plumber
A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts the page text and can also add descriptions of embedded images.
It is designed for two common cases:
- Extract plain text from PDFs page by page.
- Optionally describe embedded images and include those descriptions in the page text.
Features
- Page-level PDF parsing with pdfplumber.
- Optional character deduplication for PDFs with repeated text.
- Optional image description support using either Google Gemini or Hugging Face vision-language models.
- LangChain-friendly output: a list of Document objects with metadata such as source path, page number, and total pages.
Installation
Install from PyPI:
pip install smartpdfplumber
The project currently targets Python 3.13 or newer.
Quick Start
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader
loader = SmartPDFLoader("path/to/file.pdf")
documents = loader.load()
for document in documents:
    print(document.metadata)
    print(document.page_content)
Options
SmartPDFLoader forwards keyword arguments to PDFPlumberParser:
SmartPDFLoader(
    file_path="path/to/file.pdf",
    text_kwargs={"x_tolerance": 2},
    dedupe=True,
    describe_image=False,
    model=None,
)
text_kwargs
Extra keyword arguments passed to pdfplumber.Page.extract_text().
dedupe
Set dedupe=True to call page.dedupe_chars() before extracting text. This can help with PDFs that repeat characters in the output.
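To make the idea of character deduplication concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not pdfplumber's actual implementation: the `Char` record, the `dedupe_chars_sketch` function, and the `tolerance` default are all hypothetical names chosen for this example. It assumes that a duplicate is a character with the same text as an already-kept character at almost the same position.

```python
from typing import TypedDict


class Char(TypedDict):
    """Hypothetical, simplified stand-in for a character record."""
    text: str
    x0: float   # left edge of the character
    top: float  # distance from the top of the page


def dedupe_chars_sketch(chars: list[Char], tolerance: float = 1.0) -> list[Char]:
    """Drop characters that repeat an already-kept character at nearly the same position."""
    kept: list[Char] = []
    for char in chars:
        is_duplicate = any(
            char["text"] == other["text"]
            and abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["top"] - other["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# "Hi" drawn twice with a small horizontal offset, as some PDFs do to fake bold text.
chars: list[Char] = [
    {"text": "H", "x0": 10.0, "top": 50.0},
    {"text": "H", "x0": 10.3, "top": 50.0},
    {"text": "i", "x0": 18.0, "top": 50.0},
    {"text": "i", "x0": 18.3, "top": 50.0},
]
deduped_text = "".join(c["text"] for c in dedupe_chars_sketch(chars))
print(deduped_text)  # prints "Hi" instead of "HHii"
```

Without deduplication, a naive join of these characters would give "HHii", which is the kind of repeated output dedupe=True is meant to avoid.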
describe_image
Set describe_image=True to include image descriptions inline in the page text.
When this is enabled, you must also provide model.
model
Supported values:
- gemini
- huggingface
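The rules for describe_image and model can be checked before constructing a loader. The sketch below is a hypothetical helper, not part of the package: `check_image_options` and `SUPPORTED_MODELS` are names invented for this example. The ValueError for a missing model follows this README; rejecting an unknown model name is an assumption about reasonable behavior.

```python
from typing import Optional

# The two values this README lists as supported.
SUPPORTED_MODELS = ("gemini", "huggingface")


def check_image_options(describe_image: bool, model: Optional[str]) -> None:
    """Hypothetical pre-flight check mirroring the documented option rules."""
    if not describe_image:
        return  # model is not needed when image descriptions are off
    if model is None:
        raise ValueError("describe_image=True requires a model")
    if model not in SUPPORTED_MODELS:
        # Assumption: the README does not say how unknown names are handled.
        raise ValueError(f"unsupported model: {model!r}")


check_image_options(describe_image=False, model=None)      # passes
check_image_options(describe_image=True, model="gemini")   # passes
```

Calling `check_image_options(describe_image=True, model=None)` raises ValueError, which lets a script fail early rather than partway through parsing.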
Image Descriptions
Google Gemini
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader
loader = SmartPDFLoader(
    "path/to/file.pdf",
    describe_image=True,
    model="gemini",
)
documents = loader.load()
This path uses google-genai. Make sure your Google authentication is configured before running it.
Hugging Face
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader
loader = SmartPDFLoader(
    "path/to/file.pdf",
    dedupe=True,
    describe_image=True,
    model="huggingface",
)
documents = loader.load()
This path uses transformers, torch, and torchvision to run a vision-language model locally.
Output Format
Each page becomes a LangChain Document with metadata similar to:
{
    "source": "path/to/file.pdf",
    "file_path": "path/to/file.pdf",
    "page": 0,
    "total_pages": 12,
}
The page value is zero-based.
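Because page is zero-based, it usually needs a +1 before being shown to a reader. A small sketch, using a plain dict shaped like the metadata above; `page_label` is a hypothetical helper, not part of the package:

```python
def page_label(metadata: dict) -> str:
    """Turn zero-based page metadata into a one-based label for display."""
    page_number = metadata["page"] + 1  # convert zero-based index to a page number
    return f"{metadata['source']}, page {page_number} of {metadata['total_pages']}"


metadata = {
    "source": "path/to/file.pdf",
    "file_path": "path/to/file.pdf",
    "page": 0,
    "total_pages": 12,
}
label = page_label(metadata)
print(label)  # path/to/file.pdf, page 1 of 12
```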
Example
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader
loader = SmartPDFLoader(
    "assets/sample.pdf",
    dedupe=True,
    describe_image=True,
    model="huggingface",
)
documents = loader.load()
for document in documents:
    print(document.page_content[:200])
Notes
- If you enable image descriptions without passing model, the parser raises a ValueError.
- If you use model="gemini", google-genai must be installed and authenticated.
- If you use model="huggingface", the model is loaded lazily and cached for reuse.
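The "loaded lazily and cached" behavior in the last note can be pictured with the standard library alone. This sketch is one common way to get that pattern and is not taken from the package's source: `get_model`, `load_count`, and the placeholder dict are hypothetical, and using `functools.lru_cache` is an assumption.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Hypothetical loader: builds a placeholder 'model' once per name."""
    global load_count
    load_count += 1
    return {"name": name}  # stand-in for a real vision-language model object


# Nothing is loaded until the first call (lazy); later calls reuse the result (cached).
first = get_model("huggingface")
second = get_model("huggingface")
print(first is second, load_count)  # True 1
```

The practical effect is that the first page with an image pays the model-loading cost, and later pages do not.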
License
MIT