LangChain document loader powered by MinerU — turn PDFs into Markdown

langchain-mineru

LangChain document loader powered by MinerU — turn PDFs and documents into Markdown with one line of code.

What is langchain-mineru?

langchain-mineru is a LangChain document loader. It uses MinerU's document parsing to convert local files and URLs into LangChain-compatible Document objects, ready to plug into RAG pipelines. It accepts a single source or a list of sources, and its output works directly with downstream text splitters, embedding models, and vector stores.

  • ✅ Supports PDF / Image / DOCX / PPTX / XLS / XLSX / online URL
  • ✅ Supports single and multi-document input with lazy_load streaming
  • ✅ Optional split_pages mode for PDFs — splits into one Document per page
  • ✅ Two parsing modes: fast (no token) and accurate (token required)
  • ✅ Compatible with LangChain RAG Pipelines — ready for chunking, embedding, and retrieval

What is MinerU?

MinerU is an open-source tool that converts complex documents (PDFs, Word, PPT, images, etc.) into machine-readable formats like Markdown and JSON. It is designed to extract high-quality content for LLM pre-training, RAG, and agentic workflows.

For more details, visit the MinerU GitHub repository.

Installation

Prerequisites

  • Python >= 3.10

Installation Steps

pip install langchain-mineru

Verify

python -c "from langchain_mineru import MinerULoader; print('OK')"

Quick Start

from langchain_mineru import MinerULoader

loader = MinerULoader(source="demo.pdf")
docs = loader.load()

print(docs[0].page_content[:500])
print(docs[0].metadata)

The default is mode="fast", which requires no API token.

Mode Selection

  • fast: Calls MinerU flash API, optimized for speed, no token required, and supports both document and HTML input.
  • accurate: Calls MinerU standard extract API, token required.

You can provide the token in two ways:

# Option 1: environment variable (recommended); run this in your shell
export MINERU_TOKEN="your-token"

# Option 2: pass the token directly in Python
loader = MinerULoader(source="demo.pdf", mode="accurate", token="your-token")
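
The lookup order implied by the two options (an explicit argument first, then the MINERU_TOKEN environment variable) can be sketched as follows. This is an illustration only, not the library's internal code: `resolve_token` is a hypothetical helper, and raising ValueError when accurate mode has no token is an assumption.

```python
from __future__ import annotations

import os


def resolve_token(mode: str, token: str | None = None) -> str | None:
    """Return the token to use for the given mode (hypothetical helper)."""
    if mode == "fast":
        # fast mode needs no token, so nothing is looked up.
        return None
    # An explicit argument wins; otherwise fall back to the environment.
    resolved = token or os.environ.get("MINERU_TOKEN")
    if not resolved:
        # Assumption: a missing token is treated as a configuration error.
        raise ValueError('mode="accurate" needs a token or MINERU_TOKEN')
    return resolved
```

Keeping the token in the environment avoids committing it to source control, which is presumably why that option is the recommended one.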

Usage Examples

Basic Usage

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="demo.pdf",
    split_pages=True,
)

docs = loader.load()
for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:200]}")
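
When split_pages=True returns one Document per page, it can be useful to rebuild the full text in page order. The sketch below uses a stand-in `Page` dataclass instead of real Document objects so that it runs without the library. `join_pages` is a hypothetical helper that relies only on the page_content and metadata["page"] fields used in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def join_pages(docs, separator="\n\n"):
    """Concatenate per-page documents in ascending page order."""
    ordered = sorted(docs, key=lambda d: d.metadata["page"])
    return separator.join(d.page_content for d in ordered)


pages = [
    Page("second page text", {"page": 2}),
    Page("first page text", {"page": 1}),
]
full_text = join_pages(pages)
```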

With Parameters

Fast Mode (Token Free)

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="fast",
    language="en",
    timeout=300,
)

docs = loader.load()
print(docs[0].page_content[:500])

Accurate Mode (Token Required)

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="accurate",
    token="your-token",  # or set MINERU_TOKEN
    language="en",
    split_pages=True,
    pages="1-5",
    timeout=300,
    ocr=True,
    formula=True,
    table=True,
)

docs = loader.load()
for doc in docs:
    print("-"*100)
    print(f"Page {doc.metadata['page']}: \n {doc.page_content[:200]}")

Or run the dedicated example script directly:

export MINERU_TOKEN="your-token"
uv run python mineru_example/example_accurate.py
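
The pages argument in the accurate-mode example takes a range string. As a rough sketch of how the two documented forms ("1-5" and "3") map to page numbers, assuming pages are 1-based and the range is inclusive: `parse_pages` is a hypothetical helper, not part of the library, and whether the library accepts other forms is not covered here.

```python
def parse_pages(pages):
    """Expand "3" or "1-5" into a list of 1-based page numbers."""
    if pages is None:
        return None  # no restriction: every page is processed
    first, sep, last = pages.partition("-")
    start = int(first)
    end = int(last) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {pages!r}")
    return list(range(start, end + 1))
```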

Multiple Sources

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source=[
        "/path/to/demo_a.pdf",
        "/path/to/demo_b.pdf",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ],
)

docs = loader.load()
for doc in docs:
    print(doc.metadata["source"], "-", doc.page_content[:100])
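
With several sources, load() returns one complete list. The lazy_load streaming listed among the features yields documents one at a time instead, so they can be processed in small batches. The sketch below shows the batching pattern with a stand-in generator so that it runs without the library; `batched` and `fake_lazy_load` are hypothetical names.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def fake_lazy_load():
    """Stand-in generator; a real loader would yield Document objects."""
    for i in range(5):
        yield f"doc-{i}"


batches = list(batched(fake_lazy_load(), 2))
```

With a real loader, the equivalent call would be `batched(loader.lazy_load(), 2)`, assuming lazy_load yields Document objects as other LangChain loaders do.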

RAG Pipeline

RAG (fast mode, no token)

from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = MinerULoader(source="demo.pdf", mode="fast")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])

RAG (accurate mode, token required)

from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = MinerULoader(
    source="manual.pdf",
    mode="accurate",
    token="your-token",  # or set MINERU_TOKEN
    ocr=True,
    formula=True,
    table=True,
)
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])
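
To make chunk_size and chunk_overlap concrete: each chunk holds up to chunk_size characters and repeats up to chunk_overlap characters from the end of the previous one. The sketch below is a plain fixed-size sliding window, which is deliberately simpler than RecursiveCharacterTextSplitter (that splitter also prefers to break on separators such as paragraphs and lines). `window_chunks` is a hypothetical name.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end
    return chunks


chunks = window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
```

In this simplified model, chunk_size=1200 and chunk_overlap=200 as in the examples above advance the window 1000 characters at a time.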

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| source | str \| list[str] | required | Local file path(s) or URL(s). Supports PDF, DOCX, PPTX, images, and online URLs. |
| mode | str | "fast" | Parsing mode. "fast" is speed-first and token-free; "accurate" uses the standard API and requires a token. |
| token | str \| None | None | MinerU API token. Required for mode="accurate". If omitted, the MINERU_TOKEN environment variable is used. |
| language | str | "ch" | Document language code for OCR. Common values: "ch" (Chinese), "en" (English), "auto" (auto-detect). For the complete list, refer to the standard API documentation. |
| pages | str \| None | None | Page range to extract, e.g. "1-5" or "3". Applies to PDF files only. When split_pages=False, the range is forwarded to the API. When split_pages=True, only the specified pages are split and parsed locally, which reduces API calls and processing time. |
| timeout | int | 1200 | Maximum number of seconds to wait for extraction, per file. |
| split_pages | bool | False | PDF only. When True, splits the PDF into one Document per page. Each page is parsed independently, so metadata["page"] is available. Non-PDF files are unaffected; they always produce one Document. |
| ocr | bool | False | Effective only when mode="accurate". In mode="fast", OCR is built in and this parameter is ignored. |
| formula | bool | True | Effective only when mode="accurate". Enables formula recognition. Passing a non-default value in mode="fast" raises an error. |
| table | bool | True | Effective only when mode="accurate". Enables table recognition. Passing a non-default value in mode="fast" raises an error. |
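
The mode-dependent rules for ocr, formula, and table can be summarized in a small check. This is a sketch of the documented behavior, not the library's own validation: `check_mode_options` is a hypothetical helper, and ValueError is an assumed exception type, since the parameter descriptions only say that an error is raised.

```python
def check_mode_options(mode, ocr=False, formula=True, table=True):
    """Return the options that take effect for the given mode."""
    if mode not in ("fast", "accurate"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "fast":
        # formula/table must stay at their defaults in fast mode.
        if formula is not True or table is not True:
            raise ValueError("formula/table can only be changed in accurate mode")
        # ocr is ignored in fast mode because OCR is built in.
        return {}
    return {"ocr": ocr, "formula": formula, "table": table}
```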

Document Metadata

Each returned Document includes the following metadata:

{
    "source": "report.pdf",          # original source path or URL
    "loader": "mineru",
    "output_format": "markdown",
    "mode": "fast",                  # fast / accurate
    "language": "ch",
    "pages": None,
    "split_pages": False,
    "filename": "report.pdf",
    "page": 1,                       # only present when split_pages=True
    "page_source": "report.pdf",     # only present when split_pages=True
}
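
Because "page" is only present when split_pages=True, code that reads the metadata should treat it as optional. A minimal sketch that builds a source label for citing retrieved chunks: `citation` is a hypothetical helper, and it operates on plain dicts shaped like the metadata above.

```python
def citation(metadata):
    """Build a short source label such as 'report.pdf, p. 3'."""
    label = metadata.get("filename") or metadata["source"]
    page = metadata.get("page")  # absent unless split_pages=True
    return f"{label}, p. {page}" if page is not None else label
```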

Supported File Formats

PDF, DOC, DOCX, PPT, PPTX, PNG, JPG, JPEG

Limitations

  • Output format is Markdown only
  • fast mode follows flash API limits (such as page/file constraints)
  • accurate mode requires a valid token and available account quota

License

Apache-2.0
