LangChain document loader powered by MinerU — turn PDFs into Markdown
langchain-mineru
LangChain document loader powered by MinerU — turn PDFs and documents into Markdown with one line of code.
What is langchain-mineru?
langchain-mineru is a LangChain Document Loader. It uses MinerU's document parsing to convert PDFs, office documents, and images into LangChain-compatible Document objects, ready to plug into RAG pipelines. It supports both single-document and multi-document input, and its output can be passed directly to downstream Text Splitter, Embedding, and Vector Store workflows.
- ✅ precision mode supports: .pdf, images, .DOC, .DOCX, .PPT, .PPTX, html
- ✅ flash mode supports: .pdf, images, DOCX, PPTX, XLS, XLSX
- ✅ Supports single and multi-document input with lazy_load streaming
- ✅ Optional split_pages mode for PDFs: splits into one Document per page
- ✅ Two parsing modes: flash (no token) and precision (token required)
- ✅ Compatible with LangChain RAG pipelines: ready for chunking, embedding, and retrieval
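The lazy_load streaming mentioned above can be consumed one document at a time instead of collecting everything with load(). The following is a minimal sketch, assuming MinerULoader follows the usual LangChain loader convention in which lazy_load() returns an iterator of Document objects. The helper name iter_previews is hypothetical and not part of the package.

```python
from typing import Any, Iterator


def iter_previews(loader: Any, limit: int = 200) -> Iterator[tuple[str, str]]:
    """Yield (source, preview) pairs one document at a time.

    `loader` is any object exposing lazy_load(), for example
    MinerULoader(source=[...]); nothing is held in memory beyond
    the document currently being processed.
    """
    for doc in loader.lazy_load():
        source = doc.metadata.get("source", "<unknown>")
        yield source, doc.page_content[:limit]
```

With the real loader this would be called as `for source, preview in iter_previews(MinerULoader(source="demo.pdf")): ...`.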
What is MinerU?
MinerU is an open-source tool that converts complex documents (PDFs, Word, PPT, images, etc.) into machine-readable formats like Markdown and JSON. It is designed to extract high-quality content for LLM pre-training, RAG, and agentic workflows.
For more details, visit the MinerU GitHub repository.
Installation
Prerequisites
- Python >= 3.10
Installation Steps
pip install langchain-mineru
Verify
python -c "from langchain_mineru import MinerULoader; print('OK')"
Quick Start
from langchain_mineru import MinerULoader
loader = MinerULoader(source="demo.pdf")
docs = loader.load()
print(docs[0].page_content[:500])
print(docs[0].metadata)
The default is mode="flash", which requires no API token.
Mode Selection
- precision: Calls the MinerU standard extract API. Token required. Supported formats: .pdf, images, .DOC, .DOCX, .PPT, .PPTX, html.
- flash: Calls the MinerU flash API, optimized for speed, no token required. Supported formats: .pdf, images, DOCX, PPTX, XLS, XLSX.
Apply for a precision mode token here: https://mineru.net/apiManage/token.
You can provide the token in two ways:
# Option 1 (shell): environment variable (recommended)
export MINERU_TOKEN="your-token"
# Option 2 (Python): pass the token directly
loader = MinerULoader(source="demo.pdf", mode="precision", token="your-token")
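The two options above imply a lookup order: an explicit token first, then the MINERU_TOKEN environment variable. The sketch below illustrates that order as a pre-flight check in your own code. The function resolve_token is hypothetical, and raising ValueError when precision mode has no token is an assumption made for illustration, not a description of what the library does internally.

```python
import os
from typing import Optional


def resolve_token(token: Optional[str] = None, mode: str = "flash") -> Optional[str]:
    """Return the token to use, preferring the explicit argument over MINERU_TOKEN."""
    resolved = token or os.environ.get("MINERU_TOKEN")
    # Assumption: fail early in precision mode rather than waiting for the API call.
    if mode == "precision" and not resolved:
        raise ValueError(
            "precision mode needs a token: pass token=... or set MINERU_TOKEN"
        )
    return resolved
```

Running such a check before constructing the loader gives a clear error message when a deployment is missing the environment variable.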
Usage Examples
Basic Usage
from langchain_mineru import MinerULoader
loader = MinerULoader(
    source="demo.pdf",
    split_pages=True,
)
docs = loader.load()
for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:200]}")
With Parameters
Flash Mode (Token Free)
from langchain_mineru import MinerULoader
loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="flash",
    language="en",
    timeout=300,
)
docs = loader.load()
print(docs[0].page_content[:500])
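The flash example above leaves formula and table at their defaults. According to the Parameters table, both options only take effect in precision mode, and passing a non-default value in flash mode raises an error. A small sketch of a check you could run on your own configuration before building the loader follows. The helper precision_only_overrides is hypothetical, and the defaults it compares against (True for both) are taken from that table.

```python
def precision_only_overrides(
    mode: str, formula: bool = True, table: bool = True
) -> list[str]:
    """List precision-only options set to non-default values while mode is "flash"."""
    if mode != "flash":
        return []
    overrides = []
    if formula is not True:
        overrides.append("formula")
    if table is not True:
        overrides.append("table")
    return overrides
```

An empty list means the settings are safe to pass in flash mode. Otherwise, either drop the listed options or switch to mode="precision".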
Precision Mode (Token Required)
from langchain_mineru import MinerULoader
loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="precision",
    token="your-token",  # or set MINERU_TOKEN
    language="en",
    split_pages=True,
    pages="1-5",
    timeout=300,
    ocr=True,
    formula=True,
    table=True,
)
docs = loader.load()
for doc in docs:
    print("-" * 100)
    print(f"Page {doc.metadata['page']}: \n {doc.page_content[:200]}")
Or run the dedicated example script directly:
export MINERU_TOKEN="your-token"
uv run python mineru_example/example_precision.py
Multiple Sources
from langchain_mineru import MinerULoader
loader = MinerULoader(
    source=[
        "/path/to/demo_a.pdf",
        "/path/to/demo_b.pdf",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ],
)
docs = loader.load()
for doc in docs:
    print(doc.metadata["source"], "-", doc.page_content[:100])
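When several sources are loaded at once, the returned list mixes documents from all of them. One way to separate them again is to group on metadata["source"], as in the sketch below. The helper group_by_source is hypothetical and works on any objects that carry a metadata dict, so it does not depend on the loader itself.

```python
from collections import defaultdict
from typing import Any, Iterable


def group_by_source(docs: Iterable[Any]) -> dict[str, list[Any]]:
    """Group documents by metadata["source"], keeping first-seen order of sources."""
    grouped: defaultdict[str, list[Any]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata["source"]].append(doc)
    return dict(grouped)
```

For example, `group_by_source(loader.load())` would give one entry per input path or URL, which is convenient when each file should go to its own index or collection.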
RAG Pipeline
RAG (flash mode, no token)
from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
loader = MinerULoader(source="demo.pdf", mode="flash")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])
RAG (precision mode, token required)
from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
loader = MinerULoader(
    source="manual.pdf",
    mode="precision",
    token="your-token",  # or set MINERU_TOKEN
    ocr=True,
    formula=True,
    table=True,
)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | str \| list[str] | required | Local file path(s) or URL(s). Supported formats depend on mode: precision supports .pdf, images, .DOC, .DOCX, .PPT, .PPTX, html; flash supports .pdf, images, DOCX, PPTX, XLS, XLSX. |
| mode | str | "flash" | Parsing mode. "flash" is speed-first and token-free; "precision" uses the standard API and requires a token. |
| token | str \| None | None | MinerU API token. Required for mode="precision". Apply at https://mineru.net/apiManage/token. If omitted, the MINERU_TOKEN environment variable is used. |
| language | str | "ch" | Document language code for OCR. Common values: "ch" (Chinese), "en" (English). For the complete list, refer to the standard API documentation. |
| pages | str \| None | None | Page range to extract, e.g. "1-5" or "3". Only applies to PDF files. When split_pages=False, the range is forwarded to the API. When split_pages=True, only the specified pages are split and parsed locally, reducing API calls and processing time. |
| timeout | int | 1200 | Maximum seconds to wait for extraction per file. |
| split_pages | bool | False | PDF only. When True, splits the PDF into one Document per page. Each page is parsed independently, so metadata["page"] is available. Non-PDF files are unaffected: they always produce one Document. |
| ocr | bool | False | Effective when mode="precision". In mode="flash", OCR is built in and this parameter is ignored. |
| formula | bool | True | Effective only when mode="precision". Enables formula recognition. Passing a non-default value in mode="flash" raises an error. |
| table | bool | True | Effective only when mode="precision". Enables table recognition. Passing a non-default value in mode="flash" raises an error. |
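The pages parameter takes a string such as "1-5" or "3". To make the two documented forms concrete, here is a sketch of how such a string could be expanded into page numbers. The function parse_pages is hypothetical and not part of the package. It assumes pages are 1-based and ranges are inclusive, and it handles only the two forms shown in the table.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand "3" to [3] and "1-5" to [1, 2, 3, 4, 5] (1-based, inclusive)."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)
    # A single page has no "-" separator, so the range ends where it starts.
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(start, end + 1))
```

A check like this can be useful for validating user input before passing pages to the loader.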
Document Metadata
Each returned Document includes the following metadata:
{
"source": "report.pdf", # original source path or URL
"loader": "mineru",
"output_format": "markdown",
"mode": "flash", # flash / precision
"language": "ch",
"pages": None,
"split_pages": True,
"filename": "report.pdf",
"page": 1, # only present when split_pages=True
"page_source": "report.pdf", # only present when split_pages=True
}
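The page field makes it possible to put per-page documents back together in order. The sketch below joins the Markdown of all pages that belong to one source. The helper join_pages is hypothetical, and it only assumes each document exposes page_content and the metadata keys listed above.

```python
from typing import Any, Iterable


def join_pages(docs: Iterable[Any], source: str) -> str:
    """Concatenate per-page Markdown for one source, ordered by metadata["page"]."""
    pages = [
        d for d in docs
        if d.metadata.get("source") == source and "page" in d.metadata
    ]
    pages.sort(key=lambda d: d.metadata["page"])
    return "\n\n".join(d.page_content for d in pages)
```

Documents loaded without split_pages=True carry no page key and are skipped by this helper.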
Supported File Formats
- precision mode: .pdf, images, .DOC, .DOCX, .PPT, .PPTX, html
- flash mode: .pdf, images, DOCX, PPTX, XLS, XLSX
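The two format lists can be turned into a lookup that tells you which mode accepts a given file. This is only a sketch: the function modes_for and the SUPPORTED mapping are hypothetical, the image extensions are an assumption (the list above only says "images"), and "html" is assumed to mean files ending in .html.

```python
from pathlib import PurePosixPath

# Assumption: the README only says "images"; these extensions are illustrative.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

SUPPORTED = {
    "flash": {".pdf", ".docx", ".pptx", ".xls", ".xlsx"} | IMAGE_EXTS,
    "precision": {".pdf", ".doc", ".docx", ".ppt", ".pptx", ".html"} | IMAGE_EXTS,
}


def modes_for(source: str) -> list[str]:
    """Return the modes whose documented format list includes this file's extension."""
    # Drop any URL query string before reading the extension.
    ext = PurePosixPath(source.split("?", 1)[0]).suffix.lower()
    return [mode for mode in ("flash", "precision") if ext in SUPPORTED[mode]]
```

A caller could use the result to fall back to precision mode for .DOC or .PPT inputs, which flash mode does not list.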
Limitations
- Output format is Markdown only
- flash mode follows flash API limits (such as page/file constraints)
- precision mode requires a valid token and available account quota
License
Apache-2.0