langchain-mineru

LangChain document loader powered by MinerU — turn PDFs and documents into Markdown with one line of code.

What is langchain-mineru?

langchain-mineru is a LangChain document loader. It uses MinerU's document parsing to convert PDFs, Office documents, images, and online URLs into LangChain-compatible Document objects that are ready to plug into RAG pipelines. It accepts a single source or a list of sources, and its output works with downstream text splitter, embedding, and vector store components.

✅ Supports PDF / Image / DOCX / PPTX / XLS / XLSX / online URL
✅ Supports single and multi-document input with lazy_load streaming
✅ Optional split_pages mode for PDFs — splits into one Document per page
✅ Two parsing modes: fast (no token) and accurate (token required)
✅ Compatible with LangChain RAG Pipelines — ready for chunking, embedding, and retrieval
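
As a rough illustration of the lazy_load streaming mentioned in the list above, the sketch below consumes documents one at a time and stops early. preview_stream and fake_lazy_load are hypothetical names used only for this example; the stand-in generator replaces the real loader so the snippet runs offline.

```python
from itertools import islice
from types import SimpleNamespace


def preview_stream(docs, limit=3, width=80):
    """Yield one-line previews for at most `limit` documents from any iterator."""
    for doc in islice(docs, limit):
        source = doc.metadata.get("source", "unknown")
        text = " ".join(doc.page_content.split())  # collapse newlines and space runs
        yield f"{source}: {text[:width]}"


# Stand-in for loader.lazy_load() (hypothetical, offline only).
produced = []


def fake_lazy_load():
    for i in range(1, 6):
        produced.append(i)  # records how many documents were actually produced
        yield SimpleNamespace(
            page_content=f"# Title {i}\n\nBody   text {i}",
            metadata={"source": f"demo_{i}.pdf"},
        )


previews = list(preview_stream(fake_lazy_load(), limit=2))
print(previews)
print(produced)  # only the documents that were consumed
```

With the real package, the same function would presumably be called as preview_stream(loader.lazy_load()), so documents beyond the limit are never requested from the iterator.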

What is MinerU?

MinerU is an open-source tool that converts complex documents (PDFs, Word, PPT, images, etc.) into machine-readable formats like Markdown and JSON. It is designed to extract high-quality content for LLM pre-training, RAG, and agentic workflows.

For more details, visit the MinerU GitHub repository.

Installation

Prerequisites

Python >= 3.10

Installation Steps

pip install langchain-mineru

Verify

python -c "from langchain_mineru import MinerULoader; print('OK')"

Quick Start

from langchain_mineru import MinerULoader

loader = MinerULoader(source="demo.pdf")
docs = loader.load()

print(docs[0].page_content[:500])
print(docs[0].metadata)

The default is mode="fast", which requires no API token.
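
Since the loader returns Markdown in page_content, a common next step is writing it to disk. The following is a minimal sketch, not part of langchain-mineru: save_markdown is a hypothetical helper, and it assumes the filename and page metadata keys described under Document Metadata. The stand-in document lets it run without the package.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_markdown(docs, out_dir):
    """Write each document to <out_dir>/<stem>[_pN].md and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, doc in enumerate(docs):
        meta = doc.metadata
        # Fall back to a generic name if the metadata carries no filename.
        stem = Path(meta.get("filename") or f"document_{index}").stem
        if "page" in meta:  # assumed present only when split_pages=True
            stem = f"{stem}_p{meta['page']}"
        path = out / f"{stem}.md"
        path.write_text(doc.page_content, encoding="utf-8")
        written.append(path)
    return written


# Stand-in for the result of loader.load() (hypothetical data).
docs = [SimpleNamespace(page_content="# Demo\n\nHello", metadata={"filename": "demo.pdf"})]
paths = save_markdown(docs, tempfile.mkdtemp())
print(paths[0].name)
```

Two sources with the same filename would overwrite each other in this sketch; add a prefix if that matters for your inputs.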

Mode Selection

fast: Calls the MinerU flash API. It is optimized for speed, requires no token, and supports both document and HTML input.
accurate: Calls the MinerU standard extract API. A token is required.

You can provide the token in two ways:

# Option 1 (shell): environment variable (recommended)
export MINERU_TOKEN="your-token"

# Option 2 (Python): pass token directly
loader = MinerULoader(source="demo.pdf", mode="accurate", token="your-token")
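
The lookup order described here (explicit argument first, then MINERU_TOKEN) can be sketched as follows. resolve_token is a hypothetical helper written for illustration, and raising ValueError is an assumption; the library's actual implementation and error type may differ.

```python
import os


def resolve_token(mode, token=None, env=None):
    """Return the explicit token, else MINERU_TOKEN; fail only in accurate mode."""
    env = os.environ if env is None else env
    resolved = token or env.get("MINERU_TOKEN")
    if mode == "accurate" and not resolved:
        # Assumed error type; the source only says a token is required.
        raise ValueError('mode="accurate" needs token=... or MINERU_TOKEN')
    return resolved


print(resolve_token("accurate", token="abc", env={}))
print(resolve_token("accurate", env={"MINERU_TOKEN": "from-env"}))
print(resolve_token("fast", env={}))
```

Passing env explicitly keeps the function easy to test without touching the real environment.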

Usage Examples

Basic Usage

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="demo.pdf",
    split_pages=True,
)

docs = loader.load()
for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:200]}")
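
If you split by page but later want the whole document back, the per-page results can be reassembled in page order. This is a small sketch under the assumption that metadata["page"] is an integer, as in the metadata sample in this README; join_pages is a hypothetical helper and the documents are stand-ins.

```python
from types import SimpleNamespace


def join_pages(docs, separator="\n\n"):
    """Concatenate per-page documents in ascending page order."""
    ordered = sorted(docs, key=lambda doc: doc.metadata["page"])
    return separator.join(doc.page_content for doc in ordered)


# Stand-ins for documents returned with split_pages=True (hypothetical data).
pages = [
    SimpleNamespace(page_content="Second page", metadata={"page": 2}),
    SimpleNamespace(page_content="First page", metadata={"page": 1}),
]
merged = join_pages(pages)
print(merged)
```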

With Parameters

Fast Mode (Token Free)

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="fast",
    language="en",
    timeout=300,
)

docs = loader.load()
print(docs[0].page_content[:500])

Accurate Mode (Token Required)

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="accurate",
    token="your-token",  # or set MINERU_TOKEN
    language="en",
    split_pages=True,
    pages="1-5",
    timeout=300,
    ocr=True,
    formula=True,
    table=True,
)

docs = loader.load()
for doc in docs:
    print("-"*100)
    print(f"Page {doc.metadata['page']}: \n {doc.page_content[:200]}")

Or run the dedicated example script directly:

export MINERU_TOKEN="your-token"
uv run python mineru_example/example_accurate.py
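
The pages argument in the accurate-mode example takes a string such as "1-5" or "3". One way such a string could be expanded is sketched below, assuming 1-based page numbers and inclusive ranges. parse_pages is a hypothetical helper, not the library's parser, and the library may accept forms this sketch rejects.

```python
def parse_pages(pages):
    """Expand "3" or "1-5" into a list of 1-based page numbers; None passes through."""
    if pages is None:
        return None
    start_text, dash, end_text = pages.strip().partition("-")
    start = int(start_text)  # raises ValueError on empty or non-numeric input
    end = int(end_text) if dash else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {pages!r}")
    return list(range(start, end + 1))


print(parse_pages("1-5"))
print(parse_pages("3"))
```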

Multiple Sources

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source=[
        "/path/to/demo_a.pdf",
        "/path/to/demo_b.pdf",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ],
)

docs = loader.load()
for doc in docs:
    print(doc.metadata["source"], "-", doc.page_content[:100])
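
When several sources are loaded at once, the returned list mixes documents from all of them. A minimal sketch for grouping them by metadata["source"] follows; group_by_source is a hypothetical helper and the documents are stand-ins so the snippet runs without the package.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_source(docs):
    """Map each source path or URL to the documents that came from it."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata["source"]].append(doc)
    return dict(grouped)


# Stand-ins for loader.load() output (hypothetical data).
docs = [
    SimpleNamespace(page_content="A1", metadata={"source": "/path/to/demo_a.pdf"}),
    SimpleNamespace(page_content="B1", metadata={"source": "/path/to/demo_b.pdf"}),
    SimpleNamespace(page_content="A2", metadata={"source": "/path/to/demo_a.pdf"}),
]
grouped = group_by_source(docs)
for source, items in grouped.items():
    print(source, len(items))
```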

RAG Pipeline

RAG (fast mode, no token)

from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = MinerULoader(source="demo.pdf", mode="fast")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])

RAG (accurate mode, token required)

from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = MinerULoader(
    source="manual.pdf",
    mode="accurate",
    token="your-token",  # or set MINERU_TOKEN
    ocr=True,
    formula=True,
    table=True,
)
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])
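
To get a feel for how chunk_size=1200 and chunk_overlap=200 interact, here is a deliberately simplified fixed-window chunker. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers natural separators such as paragraph breaks); window_chunks is a hypothetical helper that only shows the size and overlap arithmetic.

```python
def window_chunks(text, chunk_size=1200, chunk_overlap=200):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(3000))
chunks = window_chunks(text)
print([len(chunk) for chunk in chunks])
```

In this simplified model each new chunk starts 1000 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.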

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `source` | `str \| list[str]` | required | Local file path(s) or URL(s). Supports PDF, DOCX, PPTX, images, and online URLs. |
| `mode` | `str` | `"fast"` | Parsing mode. `"fast"` is speed-first and token-free; `"accurate"` uses the standard API and requires a token. |
| `token` | `str \| None` | `None` | MinerU API token. Required for `mode="accurate"`. If omitted, the `MINERU_TOKEN` environment variable is used. |
| `language` | `str` | `"ch"` | Document language code for OCR. Common values: `"ch"` (Chinese), `"en"` (English), `"auto"` (auto-detect). For the complete list, refer to the standard API documentation. |
| `pages` | `str \| None` | `None` | Page range to extract, e.g. `"1-5"` or `"3"`. Applies to PDF files only. When `split_pages=False`, the range is forwarded to the API. When `split_pages=True`, only the specified pages are split and parsed locally, reducing API calls and processing time. |
| `timeout` | `int` | `1200` | Maximum number of seconds to wait for extraction, per file. |
| `split_pages` | `bool` | `False` | PDF only. When `True`, splits the PDF into one `Document` per page. Each page is parsed independently, so `metadata["page"]` is available. Non-PDF files are unaffected and always produce one `Document`. |
| `ocr` | `bool` | `False` | Effective only when `mode="accurate"`. Enables OCR. In `mode="fast"`, OCR is built in and this parameter is ignored. |
| `formula` | `bool` | `True` | Effective only when `mode="accurate"`. Enables formula recognition. Passing a non-default value in `mode="fast"` raises an error. |
| `table` | `bool` | `True` | Effective only when `mode="accurate"`. Enables table recognition. Passing a non-default value in `mode="fast"` raises an error. |
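
The mode-dependent rules in the table can be summarized in a short sketch. effective_options is a hypothetical helper, not the library's validation code, and ValueError is an assumed error type (the table only says that an error is raised).

```python
def effective_options(mode="fast", ocr=False, formula=True, table=True):
    """Return the options that take effect for a mode, per the parameter table."""
    if mode not in ("fast", "accurate"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "fast":
        # formula/table must stay at their defaults in fast mode.
        if formula is not True or table is not True:
            raise ValueError('formula and table can only be changed when mode="accurate"')
        return {"mode": "fast"}  # ocr is ignored: OCR is built in
    return {"mode": "accurate", "ocr": ocr, "formula": formula, "table": table}


print(effective_options())
print(effective_options(mode="accurate", ocr=True, table=False))
```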

Document Metadata

Each returned Document includes the following metadata:

{
    "source": "report.pdf",          # original source path or URL
    "loader": "mineru",
    "output_format": "markdown",
    "mode": "fast",                  # fast / accurate
    "language": "ch",
    "pages": None,
    "split_pages": False,
    "filename": "report.pdf",
    "page": 1,                       # only present when split_pages=True
    "page_source": "report.pdf",     # only present when split_pages=True
}
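
Because page is present only when split_pages=True, code that reads the metadata should treat it as optional. The sketch below builds a short citation label from the keys shown above; citation is a hypothetical helper for illustration.

```python
def citation(metadata):
    """Build a label such as "report.pdf, p. 1", omitting the page when absent."""
    label = metadata.get("filename") or metadata["source"]
    page = metadata.get("page")  # missing unless split_pages=True
    return f"{label}, p. {page}" if page is not None else label


print(citation({"source": "report.pdf", "filename": "report.pdf", "page": 1}))
print(citation({"source": "report.pdf", "filename": "report.pdf"}))
```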

Supported File Formats

PDF, DOC, DOCX, PPT, PPTX, PNG, JPG, JPEG

Limitations

Output format is Markdown only
fast mode follows flash API limits (such as page/file constraints)
accurate mode requires a valid token and available account quota

License

Apache-2.0
