langchain-pymupdf-layout

An integration package connecting PyMuPDF Layout to LangChain.

Load PDF content to Markdown using AI-based, CPU-only layout analysis.

License: PolyForm Noncommercial

Features

  • 📚 Structured data extraction from your documents
  • 🧐 Advanced document page layout understanding, including semantic markup for titles, headings, headers, footers, tables, images and text styling
  • 🔍 Detect and isolate header and footer patterns on each page

For more detailed information, see the official PyMuPDF Layout documentation.

Requirements

  • Python 3.11 or higher
  • LangChain Core v1.0.0 or higher
  • PyMuPDF v1.26.6 or higher
  • PyMuPDF4LLM v0.2.0 or higher
  • PyMuPDF Layout v1.26.6 or higher
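The minimums above can be checked against the current environment before importing the loader. The following is a minimal sketch using only the standard library. The distribution names (`langchain-core`, `PyMuPDF`, `pymupdf4llm`, `pymupdf-layout`) are assumptions about how these packages are published, and `parse_version` and `check_requirements` are hypothetical helpers with deliberately naive version parsing (only the leading numeric components are compared).

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the Requirements list. The distribution names
# are assumptions and may need adjusting for your environment.
MINIMUMS = {
    "langchain-core": "1.0.0",
    "PyMuPDF": "1.26.6",
    "pymupdf4llm": "0.2.0",
    "pymupdf-layout": "1.26.6",
}


def parse_version(text):
    """Return the leading numeric components, padded to at least three parts."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    parts = tuple(int(p) for p in match.group().split(".")) if match else ()
    return parts + (0,) * max(0, 3 - len(parts))


def check_requirements(minimums=MINIMUMS, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python 3.11 or higher is required")
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Passing `get_version` as a parameter keeps the check easy to exercise without touching the real environment. For pre-release or local version strings, a full version parser would be a safer choice than this sketch.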

Installation

Install the package using pip to start using the Document Loader:

pip install -U langchain-pymupdf-layout

Usage

Use the PyMuPDF Layout loader in your Python application to load and parse PDFs.

The example below prints the package version, loads a PDF from a URL, and splits the result into chunks:

from langchain_pymupdf_layout import version

print(version())  # Output: version number

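# Provided by the langchain-text-splitters package; the legacy
# langchain.text_splitter import path may be unavailable in LangChain 1.x.
from langchain_text_splitters import RecursiveCharacterTextSplitter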

from langchain_pymupdf_layout import PyMuPDFLayoutLoader

loader = PyMuPDFLayoutLoader(
    file_path="https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    show_progress=False,  
    # See other loader options on https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html#pymupdf-layout-and-parameter-caveats
)

documents = loader.load()

# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

print(f"Loaded {len(documents)} document(s)")
print(f"Created {len(chunks)} chunk(s)")

content = chunks[0].page_content
print(f"\ncontent:\n{content}")
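To make the `chunk_size` and `chunk_overlap` settings above concrete, here is a simplified sketch of fixed-width chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks). The `sliding_chunks` helper is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2500 characters of sample text, split with the same settings as above.
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample)
print([len(piece) for piece in pieces])  # prints [1000, 1000, 900]
```

With a step of 800 characters, the last 200 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.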


Download files

Source Distribution

langchain_pymupdf_layout-0.1.3.tar.gz (10.3 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.7
  • SHA256: 8c5af2315c20a1b6dea5d79d684747f8288d803c3caf274bdf96aff1f2bb8681
  • MD5: fc51745ab97fd0920775736597544321
  • BLAKE2b-256: 403ad2df8725975cbd35ff038ea90d5bde7145f866b3d8554e51823ab59ccaa2

Built Distribution

langchain_pymupdf_layout-0.1.3-py3-none-any.whl (11.0 kB)

  • Tags: Python 3
  • SHA256: 68499352327de664b7fa70800ac51362bd11f35447e736847585542cf3547bbd
  • MD5: d58cc174fcac7d7d9616f0a888424502
  • BLAKE2b-256: 35197c1d081e94295b7bd8f9eefd5983d75ba5bca74a23c26843306f40039156
