langchain-pymupdf-layout
An integration package connecting PyMuPDF Layout to LangChain.
Load PDF content as Markdown using AI-based, CPU-only layout analysis.
Features
- 📚 Structured data extraction from your documents
- 🧐 Advanced document page layout understanding, including semantic markup for titles, headings, headers, footers, tables, images and text styling
- 🔍 Detect and isolate header and footer patterns on each page
For more detailed information, visit the official PyMuPDF Layout documentation.
Requirements
- Python 3.11 or higher
- LangChain Core v1.0.0 or higher
- PyMuPDF v1.26.6 or higher
- PyMuPDF4LLM v0.2.0 or higher
- PyMuPDF Layout v1.26.6 or higher
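The minimum versions listed above can be checked against the current environment before installing. The following is a minimal sketch using only the standard library; the distribution names in `MINIMUMS` are assumed to match the PyPI package names, and `parse_version` and `check_requirements` are hypothetical helpers, not part of this package. The version comparison is deliberately simplified and ignores pre-release suffixes.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the Requirements list.
# Assumption: these distribution names match the names used on PyPI.
MINIMUMS = {
    "langchain-core": "1.0.0",
    "PyMuPDF": "1.26.6",
    "pymupdf4llm": "0.2.0",
    "pymupdf-layout": "1.26.6",
}


def parse_version(text):
    """Turn the leading numeric part of a version string into a 3-int tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    parts = [int(n) for n in match.group(0).split(".")[:3]] if match else []
    # Pad short versions such as "1.0" so that they compare as (1, 0, 0).
    return tuple(parts + [0] * (3 - len(parts)))


def check_requirements(minimums=MINIMUMS):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["All requirements satisfied."]:
        print(line)
```

In practice pip resolves these dependencies automatically, so a check like this is mainly useful for diagnosing an existing environment.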
Installation
Install the package using pip to start using the Document Loader:
pip install -U langchain-pymupdf-layout
Usage
Use the PyMuPDF Layout loader in a Python application to load and parse PDFs.
The example below prints the package version, loads a PDF from a URL, and splits the resulting documents into chunks:
from langchain_pymupdf_layout import version

print(version())  # Output: version number

# With LangChain v1, the splitters are imported from the separate
# langchain-text-splitters package (pip install langchain-text-splitters).
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pymupdf_layout import PyMuPDFLayoutLoader

# Load the PDF and convert its content to Markdown documents
loader = PyMuPDFLayoutLoader(
    file_path="https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    show_progress=False,
    # See other loader options on https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html#pymupdf-layout-and-parameter-caveats
)
documents = loader.load()

# Chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = text_splitter.split_documents(documents)

print(f"Loaded {len(documents)} document(s)")
print(f"Created {len(chunks)} chunk(s)")

# Show the content of the first chunk
content = chunks[0].page_content
print(f"\ncontent:\n{content}")
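To make the `chunk_size=1000` and `chunk_overlap=200` settings in the example concrete, here is a simplified character-window sketch in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries, so its chunks are usually shorter and the overlap is approximate. The helper `sliding_chunks` is hypothetical and exists only for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    pieces = sliding_chunks(sample)
    print([len(piece) for piece in pieces])
    # The last 200 characters of one chunk repeat at the start of the next.
    print(pieces[0][-200:] == pieces[1][:200])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of some duplicated text.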
Download files
File details
Details for the file langchain_pymupdf_layout-0.1.3.tar.gz.
File metadata
- Download URL: langchain_pymupdf_layout-0.1.3.tar.gz
- Upload date:
- Size: 10.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8c5af2315c20a1b6dea5d79d684747f8288d803c3caf274bdf96aff1f2bb8681 |
| MD5 | fc51745ab97fd0920775736597544321 |
| BLAKE2b-256 | 403ad2df8725975cbd35ff038ea90d5bde7145f866b3d8554e51823ab59ccaa2 |
File details
Details for the file langchain_pymupdf_layout-0.1.3-py3-none-any.whl.
File metadata
- Download URL: langchain_pymupdf_layout-0.1.3-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68499352327de664b7fa70800ac51362bd11f35447e736847585542cf3547bbd |
| MD5 | d58cc174fcac7d7d9616f0a888424502 |
| BLAKE2b-256 | 35197c1d081e94295b7bd8f9eefd5983d75ba5bca74a23c26843306f40039156 |
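A manually downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using the standard `hashlib` module; the helpers `sha256_of` and `verify_download` are hypothetical names, and the file is assumed to have been saved under its original file name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables for release 0.1.3.
EXPECTED_SHA256 = {
    "langchain_pymupdf_layout-0.1.3.tar.gz":
        "8c5af2315c20a1b6dea5d79d684747f8288d803c3caf274bdf96aff1f2bb8681",
    "langchain_pymupdf_layout-0.1.3-py3-none-any.whl":
        "68499352327de664b7fa70800ac51362bd11f35447e736847585542cf3547bbd",
}


def sha256_of(path, block_size=65536):
    """Hash a file in blocks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 matches the digest listed for its name."""
    path = Path(path)
    wanted = expected.get(path.name)
    if wanted is None:
        raise KeyError(f"no known digest for {path.name}")
    return sha256_of(path) == wanted
```

For example, `verify_download("langchain_pymupdf_layout-0.1.3.tar.gz")` should return `True` for an intact copy of the source distribution in the current directory. Installing with pip performs its own integrity checks, so this is only needed for files fetched by hand.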