Agent for extracting structured content from PDFs using LangGraph

PDFMind

An agent for extracting structured content from PDFs using LangGraph and OpenAI.

Features

  • Extract and format text content from PDFs
  • Convert tables to markdown format
  • Extract images with AI-generated descriptions
  • Use LangGraph for agent-based orchestration
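
To make the table feature concrete, here is a minimal sketch of turning extracted rows into a markdown table. It is not PDFMind's implementation: the function name rows_to_markdown and the list-of-rows input (first row treated as the header) are assumptions made for illustration.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a markdown table; the first row is the header."""
    if not rows:
        return ""

    # Ragged rows are padded so every line has the same number of cells.
    width = max(len(row) for row in rows)

    def clean(cell):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(rows_to_markdown([["Name", "Qty"], ["Widget", 3]]))
```

Cells are escaped rather than dropped, so a value containing a pipe character still lands in the right column.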

Setup

# Install Poetry if you don't have it
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies
poetry install
# Or install from PyPI
pip install pdf_mind

# Install other dependencies
brew install ghostscript
brew install poppler
# On Debian/Ubuntu: sudo apt install ghostscript poppler-utils

N.B.: On macOS, the Ghostscript library may not be found even after it is installed. You can fix that by symlinking it into ~/lib:

mkdir -p ~/lib
ln -s "$(brew --prefix gs)/lib/libgs.dylib" ~/lib

See the Camelot docs for more details on installing this dependency. PDFMind will still work without Ghostscript.
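
A quick way to check whether the system dependencies are visible from Python is sketched below. It uses only the standard library. The function name check_system_dependencies is hypothetical, and treating pdftoppm as the marker for a Poppler install is an assumption (it is one of the command-line tools Poppler ships).

```python
import shutil
from ctypes.util import find_library


def check_system_dependencies():
    """Report which optional system dependencies can be located."""
    return {
        # Shared Ghostscript library, looked up the way ctypes would load it.
        "ghostscript_library": find_library("gs") is not None,
        # Ghostscript command-line binary on PATH.
        "ghostscript_cli": shutil.which("gs") is not None,
        # Poppler command-line tool on PATH (assumed marker for Poppler).
        "poppler_pdftoppm": shutil.which("pdftoppm") is not None,
    }


for name, found in check_system_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If ghostscript_cli is found but ghostscript_library is missing on macOS, that is the situation the symlink above is meant to address.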

Usage

from pdf_mind import PDFExtractionAgent

agent = PDFExtractionAgent()
result = agent.process("path/to/document.pdf")
print(result)
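
Because the agent calls OpenAI, a run can fail late if the API key or the input file is missing. The sketch below checks both up front. It assumes PDFMind picks up credentials from the OPENAI_API_KEY environment variable, which is the default for the OpenAI Python client; the README does not say how PDFMind is configured. The helper names check_prerequisites and extract are hypothetical.

```python
import os
from pathlib import Path


def check_prerequisites(pdf_path, env=None):
    """Return a list of problems; an empty list means the run can start."""
    env = os.environ if env is None else env
    problems = []

    path = Path(pdf_path)
    if not path.is_file():
        problems.append(f"PDF not found: {path}")

    # Assumption: the key is read from the environment by the OpenAI client.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    return problems


def extract(pdf_path):
    """Validate inputs, then hand the file to PDFMind."""
    problems = check_prerequisites(pdf_path)
    if problems:
        raise RuntimeError("; ".join(problems))

    # Imported lazily so the checks above work even before pdf_mind is installed.
    from pdf_mind import PDFExtractionAgent

    return PDFExtractionAgent().process(pdf_path)


# Usage (requires pdf_mind and a valid key):
# print(extract("path/to/document.pdf"))
```

Passing env explicitly makes the check easy to exercise without touching real credentials.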

Alternatively, see example.py, which also outputs metadata on the extracted items and token usage.

Development

# Run tests
poetry run pytest

# Lint and format code
poetry run ruff check .
poetry run black .
