Agent for extracting structured content from PDFs using LangGraph
PDFMind
An agent for extracting structured content from PDFs using LangGraph and OpenAI.
Features
- Extract and format text content from PDFs
- Convert tables to markdown format
- Extract images with AI-generated descriptions
- Use LangGraph for agent-based orchestration
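As an illustration of the table-to-markdown feature listed above, here is a minimal standard-library sketch of rendering already-extracted rows as a markdown table. The `table_to_markdown` helper is hypothetical and is not part of the pdf_mind API; how PDFMind performs this step internally is not documented here.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a markdown table.

    Hypothetical helper for illustration only; not part of pdf_mind.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [str(cell).replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(row) for row in body)])


print(table_to_markdown([["Name", "Qty"], ["Widget", 3], ["Gadget", 12]]))
```

The first row is treated as the header, and rows with fewer cells than the widest row are padded with empty cells so every line has the same number of columns.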
Setup
# Install Poetry if you don't have it
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies
poetry install
# Or install from PyPI
pip install pdf_mind
# Install system dependencies (macOS, using Homebrew)
brew install ghostscript
brew install poppler
# On Debian/Ubuntu instead: sudo apt install ghostscript poppler-utils
N.B.: on macOS, the Ghostscript library may not be found. You can fix that by symlinking it into ~/lib:
mkdir -p ~/lib
ln -s "$(brew --prefix gs)/lib/libgs.dylib" ~/lib
See the Camelot docs for more details on installing this dependency. Ghostscript is optional: the package will still work without it.
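To check whether the system tools above are visible from Python before running an extraction, something like the following sketch may help. It assumes the usual binary names (`gs` for Ghostscript, `pdftoppm` from Poppler) and the library name `gs`; which of these PDFMind or its dependencies actually look up is an assumption, and `check_pdf_dependencies` is a hypothetical helper, not part of the package.

```python
import ctypes.util
import shutil


def check_pdf_dependencies():
    """Report which system tools can be located (hypothetical helper)."""
    return {
        # Ghostscript command-line binary, usually installed as `gs`.
        "ghostscript_cli": shutil.which("gs") is not None,
        # Ghostscript shared library (libgs); the name "gs" is an assumption.
        "ghostscript_lib": ctypes.util.find_library("gs") is not None,
        # `pdftoppm` is one of the command-line tools shipped with Poppler.
        "poppler": shutil.which("pdftoppm") is not None,
    }


for name, found in check_pdf_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If `ghostscript_cli` is found but `ghostscript_lib` is missing on macOS, the symlink step above is the likely fix.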
Usage
from pdf_mind import PDFExtractionAgent
agent = PDFExtractionAgent()
result = agent.process("path/to/document.pdf")
print(result)
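To run the same call over a folder of PDFs, one option is a small wrapper like the sketch below. `process_directory` is a hypothetical helper, not part of pdf_mind; it accepts any callable that takes a file path, so the demo uses a stand-in function rather than a real agent. Which exceptions `agent.process` can raise is not documented here, so the sketch catches `Exception` broadly.

```python
import tempfile
from pathlib import Path


def process_directory(process, directory):
    """Apply `process` to every *.pdf file in `directory`.

    Returns (results, errors), two dicts keyed by file name.
    Hypothetical helper for illustration only.
    """
    results, errors = {}, {}
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        try:
            results[pdf_path.name] = process(str(pdf_path))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            errors[pdf_path.name] = exc
    return results, errors


# Demo with a stand-in callable instead of a real extraction agent.
def fake_process(path):
    if path.endswith("bad.pdf"):
        raise ValueError("unreadable")
    return f"processed {Path(path).name}"


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "bad.pdf", "notes.txt"):
        Path(tmp, name).touch()
    results, errors = process_directory(fake_process, tmp)

print(results)
print(sorted(errors))
```

With the real package, the call would presumably look like `process_directory(PDFExtractionAgent().process, "path/to/pdfs")`, using the `process` method shown above.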
Alternatively, see example.py for an example that outputs metadata on the extracted items and token usage.
Development
# Run tests
poetry run pytest
# Lint code
poetry run ruff check .
poetry run black .