langchain-pymupdf-layout

An integration package connecting PyMuPDF Layout to LangChain.

Load PDF content to Markdown using AI-based, CPU only, layout analysis.

License: PolyForm Noncommercial

Features

  • 📚 Structured data extraction from your documents
  • 🧐 Advanced document page layout understanding, including semantic markup for titles, headings, headers, footers, tables, images and text styling
  • 🔍 Detect and isolate header and footer patterns on each page

For more details, see the official PyMuPDF Layout documentation.
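To make "header and footer patterns" concrete, here is a minimal sketch of a naive frequency heuristic: a line counts as a header or footer if it is the first or last non-empty line on most pages. This is not how PyMuPDF Layout works (it uses AI-based layout analysis), and the function names `find_repeated_edges` and `strip_edges` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def find_repeated_edges(pages, min_fraction=0.6):
    """Return (headers, footers): lines that open or close at least
    min_fraction of the given page texts. Naive illustration only."""
    first, last = Counter(), Counter()
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            first[lines[0]] += 1
            last[lines[-1]] += 1
    threshold = min_fraction * len(pages)
    headers = {ln for ln, n in first.items() if n >= threshold}
    footers = {ln for ln, n in last.items() if n >= threshold}
    return headers, footers


def strip_edges(text, headers, footers):
    """Drop a detected header/footer line from one page.
    Blank lines are also dropped to keep the sketch short."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and lines[0] in headers:
        lines = lines[1:]
    if lines and lines[-1] in footers:
        lines = lines[:-1]
    return "\n".join(lines)


if __name__ == "__main__":
    pages = [
        "ACME Report\nIntro text\nConfidential",
        "ACME Report\nMore text\nConfidential",
        "Appendix\nTables\nConfidential",
    ]
    headers, footers = find_repeated_edges(pages)
    print(headers, footers)
    print(strip_edges(pages[0], headers, footers))
```

An exact-match heuristic like this misses footers that change per page, such as page numbers, which is one reason a layout model is the more robust choice.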

Requirements

  • Python 3.11 or higher
  • LangChain Core v1.0.0 or higher
  • PyMuPDF v1.26.6 or higher
  • PyMuPDF4LLM v0.2.0 or higher
  • PyMuPDF Layout v1.26.6 or higher
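One way to check an environment against the minimums listed above is a short script built on the standard library's `importlib.metadata`. This is a sketch, not part of the package. The distribution names in `MINIMUMS` are assumptions about how each project is published, and the version parser compares only the leading numeric components, so pre-release suffixes are ignored.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the requirements list.
# The distribution names are assumptions; adjust them if yours differ.
MINIMUMS = {
    "langchain-core": "1.0.0",
    "pymupdf": "1.26.6",
    "pymupdf4llm": "0.2.0",
    "pymupdf-layout": "1.26.6",
}


def parse_version(text):
    """Return the leading numeric components, e.g. '1.26.6' -> (1, 26, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(p) for p in match.group(0).split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS):
    """Map each distribution name to 'ok', 'missing', or a 'too old' note."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    print("python:", "ok" if sys.version_info >= (3, 11) else "too old")
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```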

Installation

Install the package using pip to start using the Document Loader:

pip install -U langchain-pymupdf-layout

Usage

The PyMuPDF Layout loader loads and parses PDFs from within a Python application.

The example below prints the package version, loads a PDF, and splits the resulting documents into chunks:

from langchain_pymupdf_layout import version

print(version())  # Output: version number

# Requires the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_pymupdf_layout import PyMuPDFLayoutLoader

loader = PyMuPDFLayoutLoader(
    file_path="https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    show_progress=False,
    # See other loader options at https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html#pymupdf-layout-and-parameter-caveats
)

documents = loader.load()

# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

print(f"Loaded {len(documents)} document(s)")
print(f"Created {len(chunks)} chunk(s)")

content = chunks[0].page_content
print(f"\ncontent:\n{content}")
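To show what `chunk_size=1000` and `chunk_overlap=200` mean, here is a minimal sliding-window sketch in plain Python. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one.

    Simplified illustration only: it ignores separators entirely.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk is
        # fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    parts = sliding_window_chunks(sample)
    print([len(p) for p in parts])
    print(parts[0][-200:] == parts[1][:200])
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, so a passage that straddles the boundary is less likely to be cut off in retrieval.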
