A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.
Smart PDF Plumber
A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts not just text but also contextual insights from embedded images.
It is designed for two common cases:
- Extract plain text from PDFs page by page.
- Describe embedded images and include those descriptions in the page text.
Features
- Page-level PDF parsing with pdfplumber.
- Optional character deduplication for PDFs with repeated text.
- Optional image description support using Google Gemini, Groq, or Hugging Face vision-language models.
- LangChain-friendly output: a list of Document objects with metadata such as source path, page number, and total pages.
Installation
Install from PyPI:
pip install smartpdfplumber
The project currently targets Python 3.13 or newer.
Quick Start
from smartpdfplumber.loader import SmartPDFLoader

loader = SmartPDFLoader("path/to/file.pdf", describe_image=True, inference="groq_ai")
documents = loader.load()

for document in documents:
    print(document.metadata)
    print(document.page_content)
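Because the loader returns one Document per page, a common next step is to stitch the pages back together. The following is a minimal sketch that runs without the package installed: the `Document` dataclass is a stand-in mirroring LangChain's `page_content` and `metadata` attributes, `join_pages` is a hypothetical helper, and the metadata key names (`source`, `page`, `total_pages`) are assumptions that may differ from what the loader actually emits.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def join_pages(documents):
    """Concatenate page texts, prefixing each with a small page header."""
    parts = []
    for doc in documents:
        # Assumed key names; fall back to "?" if a key is absent.
        page = doc.metadata.get("page", "?")
        total = doc.metadata.get("total_pages", "?")
        parts.append(f"--- page {page} of {total} ---\n{doc.page_content}")
    return "\n\n".join(parts)


# Hand-built sample data standing in for loader.load() output.
docs = [
    Document("First page text.", {"source": "report.pdf", "page": 1, "total_pages": 2}),
    Document("Second page text.", {"source": "report.pdf", "page": 2, "total_pages": 2}),
]
combined = join_pages(docs)
print(combined)
```

With real loader output, inspect `document.metadata` first and adjust the key names accordingly.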
text_kwargs
Extra keyword arguments passed to pdfplumber.Page.extract_text().
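As an illustration, the sketch below forwards a few options that pdfplumber's `extract_text()` accepts (`x_tolerance`, `y_tolerance`, `layout`). It assumes `text_kwargs` is passed to `SmartPDFLoader` as a dict keyword argument, which this README does not spell out; `load_with_text_options` is a hypothetical helper name.

```python
# Options understood by pdfplumber.Page.extract_text().
TEXT_KWARGS = {
    "x_tolerance": 2,  # horizontal gap (in points) above which a space is inserted
    "y_tolerance": 2,  # vertical distance (in points) within which chars share a line
    "layout": True,    # try to mimic the page's visual layout in the output text
}


def load_with_text_options(path):
    """Load a PDF while forwarding TEXT_KWARGS to text extraction."""
    # Imported lazily so the sketch can be read without the package installed.
    from smartpdfplumber.loader import SmartPDFLoader

    # Assumption: text_kwargs is accepted as a dict keyword argument.
    loader = SmartPDFLoader(path, text_kwargs=TEXT_KWARGS)
    return loader.load()
```

Tighter tolerances tend to help with dense tables, while `layout=True` is more useful when column alignment matters.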
dedupe
Set dedupe=True to call page.dedupe_chars() before extracting text. This can help with PDFs that repeat characters in the output.
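To show the kind of problem deduplication addresses, here is a simplified, standard-library-only sketch of the idea. It is not pdfplumber's implementation (which also compares properties such as font name and size); the function below and its sample data are purely illustrative.

```python
def dedupe_chars(chars, tolerance=1.0):
    """Drop characters that repeat an already-kept character at nearly the same spot."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == other["text"]
            and abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["top"] - other["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# "Hi" drawn twice with a tiny horizontal offset, as some PDFs do to fake bold text.
chars = [
    {"text": "H", "x0": 10.0, "top": 50.0},
    {"text": "H", "x0": 10.3, "top": 50.0},
    {"text": "i", "x0": 18.0, "top": 50.0},
    {"text": "i", "x0": 18.3, "top": 50.0},
]

raw_text = "".join(c["text"] for c in chars)
clean_text = "".join(c["text"] for c in dedupe_chars(chars))
print(raw_text)    # HHii
print(clean_text)  # Hi
```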
describe_image
Set describe_image=True to include image descriptions inline in the page text.
When this is enabled, you must also provide inference.
inference
Supported values:
- gemini
- hf_transformers
- groq_ai
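The relationship between describe_image and inference can be sketched as a small validation step. This is a hypothetical helper, not the package's code: the function name, the error messages, and the choice to reject unknown backend names with a ValueError are all assumptions.

```python
SUPPORTED_INFERENCE = {"gemini", "hf_transformers", "groq_ai"}


def check_image_options(describe_image=False, inference=None):
    """Return the backend to use, or None when image description is off."""
    if not describe_image:
        return None  # inference is irrelevant when descriptions are disabled
    if inference is None:
        raise ValueError("describe_image=True requires an inference backend")
    if inference not in SUPPORTED_INFERENCE:
        # Assumed behavior: unknown names are rejected up front.
        raise ValueError(
            f"unsupported inference {inference!r}; "
            f"expected one of {sorted(SUPPORTED_INFERENCE)}"
        )
    return inference


print(check_image_options(describe_image=True, inference="gemini"))
```

Checking the options before opening the PDF means a typo in the backend name fails fast rather than after pages have already been parsed.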
Image Descriptions
from smartpdfplumber.loader import SmartPDFLoader

loader = SmartPDFLoader(
    "path/to/file.pdf",
    dedupe=True,
    describe_image=True,
    inference="hf_transformers",
    model="Qwen/Qwen3.5-0.8B",  # Optional; defaults to "Qwen/Qwen3.5-0.8B"
)
documents = loader.load()
Notes
- If you enable image descriptions without passing inference, the parser raises a ValueError.
- If you use inference="hf_transformers", the model is loaded lazily and cached for reuse.
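Lazy loading with caching, as mentioned in the notes, can be sketched with the standard library alone. The names below (`get_model`, `ImageDescriber`, `example/vision-model`) are hypothetical and the "model" is a plain dict standing in for real weights; the package's actual mechanism may differ.

```python
from functools import lru_cache

load_calls = []  # records each real (expensive) load, for demonstration only


@lru_cache(maxsize=None)
def get_model(model_name):
    """Load a model once per name; later calls return the cached object."""
    load_calls.append(model_name)
    return {"name": model_name}  # stand-in for a loaded vision-language model


class ImageDescriber:
    def __init__(self, model_name):
        # Only the name is stored here; nothing is loaded yet (lazy).
        self.model_name = model_name

    def describe(self, image_bytes):
        model = get_model(self.model_name)  # loads on first use, then cached
        return f"[{model['name']}] image of {len(image_bytes)} bytes"


describer = ImageDescriber("example/vision-model")
calls_before = len(load_calls)       # 0: constructing the object loads nothing
first = describer.describe(b"\x89PNG")
second = describer.describe(b"\x89PNG....")
calls_after = len(load_calls)        # 1: the second call reused the cached model
print(calls_before, calls_after)
```

Deferring the load keeps construction cheap for PDFs that turn out to contain no images, and the cache avoids reloading weights for every page.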
License
MIT