An integration package connecting PyMuPDF Layout to LangChain. Load PDF content as Markdown using AI-based, CPU-only layout analysis.

Project description

langchain-pymupdf-layout

License: PolyForm Noncommercial

Features

  • 📚 Structured data extraction from your documents
  • 🧐 Advanced document page layout understanding, including semantic markup for titles, headings, headers, footers, tables, images and text styling
  • 🔍 Detect and isolate header and footer patterns on each page

For more detailed information, visit the official PyMuPDF Layout documentation webpage.

Requirements

  • Python 3.11 or higher
  • LangChain Core v1.0.0 or higher
  • PyMuPDF v0.26.6 or higher
  • PyMuPDF4LLM v0.2.0 or higher
  • PyMuPDF Layout v0.26.6 or higher
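
The requirements above can be checked before installing the loader. The following is a minimal sketch using only the standard library. The distribution names in `MINIMUMS` are assumptions (the list above gives project names, not pip names), and the helpers `parse_version`, `meets_minimum` and `check_requirements` are hypothetical names introduced for illustration. Only two of the listed packages are included; add the others once their distribution names are confirmed.

```python
import sys
from importlib import metadata

# Minimums copied from the requirements list. The keys are assumed
# distribution names; confirm them against the package metadata.
MINIMUMS = {
    "langchain-core": (1, 0, 0),
    "pymupdf4llm": (0, 2, 0),
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at a component with a suffix such as "0rc1"
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions after padding both tuples to the same length."""
    found = parse_version(installed)
    width = max(len(found), len(minimum))
    pad = lambda v: tuple(v) + (0,) * (width - len(v))
    return pad(found) >= pad(minimum)


def check_requirements(minimums=MINIMUMS, lookup=metadata.version):
    """Map each distribution name to 'ok', 'too old (...)' or 'not installed'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    print("Python 3.11+:", sys.version_info >= (3, 11))
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

The version parser is deliberately simple and ignores pre-release ordering; a full implementation would more likely rely on a dedicated version-parsing library.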

Installation

Install the package using pip to start using the Document Loader:

pip install -U langchain-pymupdf-layout

Usage

Use PyMuPDFLayoutLoader in a Python application to load and parse PDFs.

The example below prints the installed package version, loads a PDF from a URL, and splits the resulting documents into chunks:

from langchain_pymupdf_layout import version

print(version())  # Output: version number

# Text splitters live in the separate langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_pymupdf_layout import PyMuPDFLayoutLoader

loader = PyMuPDFLayoutLoader(
    file_path="https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    show_progress=False,  
    # See other loader options on https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html#pymupdf-layout-and-parameter-caveats
)

documents = loader.load()

# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

print(f"Loaded {len(documents)} document(s)")
print(f"Created {len(chunks)} chunk(s)")

content = chunks[0].page_content
print(f"\ncontent:\n{content}")
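
To make the `chunk_size=1000` and `chunk_overlap=200` settings in the example concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the RecursiveCharacterTextSplitter algorithm, which also tries to break on separators such as paragraphs and sentences; `sliding_chunks` is a hypothetical helper that only illustrates how the two numbers interact.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    pieces = sliding_chunks("a" * 2500)
    print([len(piece) for piece in pieces])
```

With the default settings each window advances by 800 characters, so a larger overlap produces more chunks for the same input.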

Download files

Source Distribution

langchain_pymupdf_layout-0.1.0.tar.gz (10.3 kB)

Uploaded Source

Built Distribution

langchain_pymupdf_layout-0.1.0-py3-none-any.whl (11.0 kB)

Uploaded Python 3

File details

Details for the file langchain_pymupdf_layout-0.1.0.tar.gz.

File metadata

  • Download URL: langchain_pymupdf_layout-0.1.0.tar.gz
  • Upload date:
  • Size: 10.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.7

File hashes

Hashes for langchain_pymupdf_layout-0.1.0.tar.gz

Algorithm    Hash digest
SHA256       7e5bbe2ad7f7c88878f924c77973f4bd1fa346bffcaaa8f413bb803cf3530188
MD5          7d1aa1d59d8dc18eca692a3da04272e0
BLAKE2b-256  28b4d860183afc56da3af286bd4d4093c6a960f1648168b09afc8eadf474f4e2
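
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library; `sha256_of` and `matches_digest` are hypothetical helper names, and the local file path in the usage section is an assumption about where the archive was saved.

```python
import hashlib
import hmac


def sha256_of(path, block_size=65536):
    """Hash a file in blocks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True when the file's SHA256 equals the published digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    # Assumed local path; the digest is the SHA256 value from the listing.
    archive = "langchain_pymupdf_layout-0.1.0.tar.gz"
    published = "7e5bbe2ad7f7c88878f924c77973f4bd1fa346bffcaaa8f413bb803cf3530188"
    print("digest matches:", matches_digest(archive, published))
```

The same check works for the wheel by substituting its file name and SHA256 value.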

File details

Details for the file langchain_pymupdf_layout-0.1.0-py3-none-any.whl.

File hashes

Hashes for langchain_pymupdf_layout-0.1.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       58313a3329899b8326fecc98c1da6eb4e64bb8c9abd52fe87701732b29a9bc68
MD5          b03e5ee28ec9a74202064b3cea06b3f9
BLAKE2b-256  96671544f39a9d1f32ada109a6e99315beca58f3b7362cb0507306f6b4b8c9bd
