LangChain document loader powered by MinerU — turn PDFs into Markdown

langchain-mineru

LangChain document loader powered by MinerU — turn PDFs and documents into Markdown with one line of code.

What is langchain-mineru?

langchain-mineru is a LangChain Document Loader. It uses MinerU's document parsing to convert PDFs, Office documents, images, and HTML into LangChain-compatible Document objects that are ready for RAG pipelines. It accepts a single source or a list of sources, and its output feeds directly into downstream Text Splitter, Embedding, and Vector Store workflows.

  • accurate mode supports: .pdf, images, .doc, .docx, .ppt, .pptx, .html
  • fast mode supports: .pdf, images, .docx, .pptx, .xls, .xlsx
  • ✅ Supports single and multi-document input with lazy_load streaming
  • ✅ Optional split_pages mode for PDFs — splits into one Document per page
  • ✅ Two parsing modes: fast (no token) and accurate (token required)
  • ✅ Compatible with LangChain RAG Pipelines — ready for chunking, embedding, and retrieval
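
As a rough illustration of the lazy_load streaming mentioned above, the sketch below consumes documents one at a time instead of building the full list first. It assumes only that the loader exposes a lazy_load() method yielding objects with page_content and metadata, as LangChain loaders generally do. The helper preview_stream and the stand-in classes FakeDocument and FakeLoader are hypothetical and exist only so the snippet runs without network access.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Any, Iterable, Iterator


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class FakeLoader:
    """Stand-in for MinerULoader so the sketch runs offline (hypothetical)."""

    def __init__(self, pages: Iterable[str]) -> None:
        self._pages = list(pages)

    def lazy_load(self) -> Iterator[FakeDocument]:
        # Yield one document at a time, the way a streaming loader would.
        for number, text in enumerate(self._pages, start=1):
            yield FakeDocument(text, {"source": "demo.pdf", "page": number})


def preview_stream(loader: Any, limit: int = 2, width: int = 40) -> list[str]:
    """Return short previews of the first `limit` documents only."""
    previews = []
    # islice stops pulling from the generator once `limit` items were seen.
    for doc in islice(loader.lazy_load(), limit):
        previews.append(f"{doc.metadata.get('page')}: {doc.page_content[:width]}")
    return previews


previews = preview_stream(FakeLoader(["first page", "second page", "third page"]))
print(previews)
```

With a real loader, the same function could be called as preview_stream(MinerULoader(source="demo.pdf", split_pages=True)), assuming the package is installed and the service is reachable.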

What is MinerU?

MinerU is an open-source tool that converts complex documents (PDFs, Word, PPT, images, etc.) into machine-readable formats like Markdown and JSON. It is designed to extract high-quality content for LLM pre-training, RAG, and agentic workflows.

For more details, visit the MinerU GitHub repository.

Installation

Prerequisites

  • Python >= 3.10

Installation Steps

pip install langchain-mineru

Verify

python -c "from langchain_mineru import MinerULoader; print('OK')"

Quick Start

from langchain_mineru import MinerULoader

loader = MinerULoader(source="demo.pdf")
docs = loader.load()

print(docs[0].page_content[:500])
print(docs[0].metadata)

The default is mode="fast", which requires no API token.

Mode Selection

  • accurate: Calls MinerU's standard extract API. A token is required. Supported formats: .pdf, images, .doc, .docx, .ppt, .pptx, .html.
  • fast: Calls MinerU's flash API, which is optimized for speed and needs no token. Supported formats: .pdf, images, .docx, .pptx, .xls, .xlsx.

Apply for an accurate mode token here: https://mineru.net/apiManage/token.

You can provide the token in two ways:

# Option 1 (recommended): set an environment variable in your shell
export MINERU_TOKEN="your-token"
# Option 2: pass the token directly in Python
loader = MinerULoader(source="demo.pdf", mode="accurate", token="your-token")
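
The precedence between the two options can be sketched as follows. This is not the package's own code: resolve_token is a hypothetical helper that mirrors the documented behavior (an explicit token argument, otherwise the MINERU_TOKEN environment variable), and failing early with ValueError when accurate mode has no token is an assumption made for the example.

```python
from __future__ import annotations

import os


def resolve_token(mode: str, token: str | None = None) -> str | None:
    """Pick the token to use: explicit argument first, then MINERU_TOKEN."""
    if mode not in {"fast", "accurate"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "fast":
        return None  # fast mode is token-free
    resolved = token or os.environ.get("MINERU_TOKEN")
    if not resolved:
        raise ValueError("mode='accurate' needs a token argument or MINERU_TOKEN")
    return resolved


print(resolve_token("accurate", token="your-token"))
```

Checking this before constructing the loader surfaces a missing token immediately rather than after a request has been attempted.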

Usage Examples

Basic Usage

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="demo.pdf",
    split_pages=True,
)

docs = loader.load()
for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:200]}")

With Parameters

Fast Mode (Token Free)

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="fast",
    language="en",
    timeout=300,
)

docs = loader.load()
print(docs[0].page_content[:500])

Accurate Mode (Token Required)

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source="/path/to/demo.pdf",
    mode="accurate",
    token="your-token",  # or set MINERU_TOKEN
    language="en",
    split_pages=True,
    pages="1-5",
    timeout=300,
    ocr=True,
    formula=True,
    table=True,
)

docs = loader.load()
for doc in docs:
    print("-"*100)
    print(f"Page {doc.metadata['page']}: \n {doc.page_content[:200]}")

Or run the dedicated example script directly:

export MINERU_TOKEN="your-token"
uv run python mineru_example/example_accurate.py
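
The pages argument used in the accurate-mode example takes either a range such as "1-5" or a single page such as "3". The sketch below shows one way such a string could be expanded into page numbers. The function parse_pages is hypothetical and is not taken from the package; treating pages as 1-based and rejecting reversed ranges are assumptions of this example.

```python
from __future__ import annotations


def parse_pages(pages: str | None) -> list[int] | None:
    """Expand "1-5" or "3" into a list of 1-based page numbers."""
    if pages is None:
        return None  # no restriction: process every page
    first, dash, last = pages.strip().partition("-")
    start = int(first)
    end = int(last) if dash else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {pages!r}")
    return list(range(start, end + 1))


print(parse_pages("1-5"))
print(parse_pages("3"))
```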

Multiple Sources

from langchain_mineru import MinerULoader

loader = MinerULoader(
    source=[
        "/path/to/demo_a.pdf",
        "/path/to/demo_b.pdf",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ],
)

docs = loader.load()
for doc in docs:
    print(doc.metadata["source"], "-", doc.page_content[:100])
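
When several sources are loaded together, and especially with split_pages=True, the returned list can contain more than one Document per source. A minimal sketch for regrouping them by metadata["source"] follows. The helper group_by_source is hypothetical, and SimpleNamespace objects stand in for real Document instances so the snippet runs without the package installed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Iterable


def group_by_source(docs: Iterable[Any]) -> dict[str, list[Any]]:
    """Bucket documents by their metadata["source"], keeping input order."""
    grouped: dict[str, list[Any]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata["source"]].append(doc)
    return dict(grouped)


# Stand-ins for loader output (illustrative data only).
docs = [
    SimpleNamespace(page_content="A1", metadata={"source": "/path/to/demo_a.pdf"}),
    SimpleNamespace(page_content="B1", metadata={"source": "/path/to/demo_b.pdf"}),
    SimpleNamespace(page_content="A2", metadata={"source": "/path/to/demo_a.pdf"}),
]

for source, items in group_by_source(docs).items():
    print(source, len(items))
```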

RAG Pipeline

RAG (fast mode, no token)

from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = MinerULoader(source="demo.pdf", mode="fast")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])

RAG (accurate mode, token required)

from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = MinerULoader(
    source="manual.pdf",
    mode="accurate",
    token="your-token",  # or set MINERU_TOKEN
    ocr=True,
    formula=True,
    table=True,
)
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vs = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vs.similarity_search("what are the core setup steps in this document?", k=3)
for r in results:
    print(r.page_content[:200])
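
To make the chunk_size and chunk_overlap values in the pipelines above more concrete, here is a deliberately simplified fixed-window chunker. It is not how RecursiveCharacterTextSplitter works (that splitter prefers natural separators such as paragraph breaks); the function sliding_chunks is a hypothetical stand-in that only shows how consecutive chunks share chunk_overlap characters.

```python
def sliding_chunks(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 3000
print([len(chunk) for chunk in sliding_chunks(sample)])
```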

Parameters

Each entry below reads: parameter (type, default): description.

  • source (str | list[str], required): Local file path(s) or URL(s). Supported formats depend on mode: accurate supports .pdf, images, .doc, .docx, .ppt, .pptx, .html; fast supports .pdf, images, .docx, .pptx, .xls, .xlsx.
  • mode (str, default "fast"): Parsing mode. "fast" is speed-first and token-free; "accurate" uses the standard API and requires a token.
  • token (str | None, default None): MinerU API token. Required for mode="accurate". Apply at https://mineru.net/apiManage/token. If omitted, the MINERU_TOKEN environment variable is used.
  • language (str, default "ch"): Document language code for OCR. Common values: "ch" (Chinese), "en" (English). For the complete list, refer to the standard API documentation.
  • pages (str | None, default None): Page range to extract, e.g. "1-5" or "3". Only applies to PDF files. When split_pages=False, the range is forwarded to the API. When split_pages=True, only the specified pages are split locally and parsed, which reduces API calls and processing time.
  • timeout (int, default 1200): Maximum number of seconds to wait for extraction, per file.
  • split_pages (bool, default False): PDF only. When True, the PDF is split into one Document per page. Each page is parsed independently, so metadata["page"] is available. Non-PDF files are unaffected and always produce one Document.
  • ocr (bool, default False): Effective only when mode="accurate". In mode="fast", OCR is built in and this parameter is ignored.
  • formula (bool, default True): Effective only when mode="accurate". Enables formula recognition. Passing a non-default value in mode="fast" raises an error.
  • table (bool, default True): Effective only when mode="accurate". Enables table recognition. Passing a non-default value in mode="fast" raises an error.
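
The defaults and the fast-mode restriction on formula and table can be summarized in a small validation sketch. LoaderOptions is a hypothetical dataclass written for this illustration, not a class exported by langchain-mineru. The default values are taken from the parameter descriptions above; the exception type (ValueError) and the positive-timeout check are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LoaderOptions:
    """Hypothetical mirror of the documented defaults, with basic validation."""
    mode: str = "fast"
    language: str = "ch"
    timeout: int = 1200
    split_pages: bool = False
    ocr: bool = False
    formula: bool = True
    table: bool = True

    def __post_init__(self) -> None:
        if self.mode not in ("fast", "accurate"):
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        # Documented rule: formula/table may only be changed in accurate mode.
        if self.mode == "fast" and not (self.formula and self.table):
            raise ValueError("formula and table are only configurable in accurate mode")


print(LoaderOptions())
print(LoaderOptions(mode="accurate", table=False))
```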

Document Metadata

Each returned Document includes the following metadata:

{
    "source": "report.pdf",          # original source path or URL
    "loader": "mineru",
    "output_format": "markdown",
    "mode": "fast",                  # fast / accurate
    "language": "ch",
    "pages": None,
    "split_pages": True,
    "filename": "report.pdf",
    "page": 1,                       # only present when split_pages=True
    "page_source": "report.pdf",     # only present when split_pages=True
}
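
Because "page" and "page_source" exist only when split_pages=True, code that reads metadata should not assume they are present. The sketch below builds a short citation label from a metadata dict shaped like the one above. The function describe_origin is hypothetical and the label format is an arbitrary choice for the example.

```python
from typing import Any


def describe_origin(metadata: dict[str, Any]) -> str:
    """Build a label such as "report.pdf, page 1" from loader metadata."""
    label = metadata.get("filename") or metadata["source"]
    page = metadata.get("page")  # absent unless split_pages=True
    return f"{label}, page {page}" if page is not None else label


print(describe_origin({"source": "report.pdf", "filename": "report.pdf", "page": 1}))
print(describe_origin({"source": "report.pdf", "filename": "report.pdf"}))
```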

Supported File Formats

  • accurate mode: .pdf, images, .doc, .docx, .ppt, .pptx, .html
  • fast mode: .pdf, images, .docx, .pptx, .xls, .xlsx
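
A pre-flight check against these lists might look like the sketch below. The helper is_supported is hypothetical, and the set of image extensions (.png, .jpg, .jpeg) is an assumption, since the lists above only say "images" without naming specific file types.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: the source only says "images"; these extensions are a guess.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}

SUPPORTED = {
    "accurate": {".pdf", ".doc", ".docx", ".ppt", ".pptx", ".html"} | IMAGE_EXTENSIONS,
    "fast": {".pdf", ".docx", ".pptx", ".xls", ".xlsx"} | IMAGE_EXTENSIONS,
}


def source_suffix(source: str) -> str:
    """Return the lowercase file extension of a local path or URL."""
    path = urlparse(source).path if "://" in source else source
    return PurePosixPath(path).suffix.lower()


def is_supported(source: str, mode: str = "fast") -> bool:
    """Check a source's extension against the list for the given mode."""
    if mode not in SUPPORTED:
        raise ValueError(f"unknown mode: {mode!r}")
    return source_suffix(source) in SUPPORTED[mode]


print(is_supported("report.xlsx", "fast"))
print(is_supported("report.xlsx", "accurate"))
```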

Limitations

  • Output format is Markdown only
  • fast mode follows flash API limits (such as page/file constraints)
  • accurate mode requires a valid token and available account quota

License

Apache-2.0
