An integration package connecting PyMuPDF Layout to LangChain. Load PDF content to Markdown using AI-based, CPU only, layout analysis.

langchain-pymupdf-layout

License: PolyForm Noncommercial

Features

  • 📚 Structured data extraction from your documents
  • 🧐 Advanced document page layout understanding, including semantic markup for titles, headings, headers, footers, tables, images and text styling
  • 🔍 Detect and isolate header and footer patterns on each page

For more detailed information, see the official PyMuPDF Layout documentation: https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html

Requirements

  • Python 3.11 or higher
  • LangChain Core v1.0.0 or higher
  • PyMuPDF v1.26.6 or higher
  • PyMuPDF4LLM v0.2.0 or higher
  • PyMuPDF Layout v1.26.6 or higher
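
The minimums above can be checked against the current environment with the standard library alone. The sketch below is an illustration, not part of the package. It assumes the distribution names (langchain-core, PyMuPDF, pymupdf4llm, pymupdf-layout) match how each project is published on PyPI, and its version parser is deliberately simple: it reads only the leading numeric components, so pre-release suffixes are ignored.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the Requirements list. The distribution names are
# assumptions about how each project is published on PyPI.
MINIMUMS = {
    "langchain-core": "1.0.0",
    "PyMuPDF": "1.26.6",
    "pymupdf4llm": "0.2.0",
    "pymupdf-layout": "1.26.6",
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.26.6' into (1, 26, 6), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements(minimums: dict[str, str]) -> dict[str, str]:
    """Return a status string per distribution: 'ok', 'too old (...)' or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    print("Python:", "ok" if sys.version_info >= (3, 11) else "too old")
    for name, status in check_requirements(MINIMUMS).items():
        print(f"{name}: {status}")
```

Any entry reported as missing or too old can usually be resolved by rerunning the pip command from the Installation section.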

Installation

Install the package using pip to start using the Document Loader:

pip install -U langchain-pymupdf-layout

Usage

Import PyMuPDFLayoutLoader to load and parse PDFs in your Python application.

The example below prints the package version, loads a PDF from a URL, and splits the resulting documents into chunks:

from langchain_pymupdf_layout import version

print(version())  # Output: version number

# With LangChain 1.x, the splitters are imported from the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_pymupdf_layout import PyMuPDFLayoutLoader

loader = PyMuPDFLayoutLoader(
    file_path="https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    show_progress=False,  
    # See other loader options on https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html#pymupdf-layout-and-parameter-caveats
)

documents = loader.load()

# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

print(f"Loaded {len(documents)} document(s)")
print(f"Created {len(chunks)} chunk(s)")

content = chunks[0].page_content
print(f"\ncontent:\n{content}")
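
To make chunk_size=1000 and chunk_overlap=200 concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification for illustration only: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line boundaries, so its chunks will generally differ from these. The function name sliding_window_chunks is hypothetical and is not part of LangChain or this package.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 250  # 2,500 characters of stand-in text
    chunks = sliding_window_chunks(sample, chunk_size=1000, chunk_overlap=200)
    print([len(c) for c in chunks])
    # The tail of one chunk equals the head of the next.
    print(chunks[0][-200:] == chunks[1][:200])
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.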
