
MinerU Reader

pip install llama-index-readers-mineru

This reader uses the MinerU document parsing API to extract high-quality Markdown from PDF, DOC/DOCX, PPT/PPTX, image, and Excel files. It supports two parsing modes:

Feature                 Flash (default)      Precision
Auth                    No token required    Token required
Speed                   Blazing fast         Standard
File size               Max 10 MB            Max 200 MB
Page limit              Max 20 pages         Max 600 pages
Formula / Table         Disabled             Configurable
Output in this Reader   Markdown only        Markdown only

Note: MinerU Python SDK precision mode supports extra output formats (images/JSON/docx/html/latex), but MinerUReader in this integration currently returns only result.markdown as Document.text.
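The limits in the table above can guide mode selection programmatically. The sketch below is a hypothetical helper (not part of this integration) that picks a mode from a file's size and page count, using only the documented limits:

```python
# Hypothetical helper: choose a MinerU parsing mode from the limits
# documented above. Not part of llama-index-readers-mineru itself.

FLASH_MAX_BYTES = 10 * 1024 * 1024        # flash: max 10 MB
FLASH_MAX_PAGES = 20                      # flash: max 20 pages
PRECISION_MAX_BYTES = 200 * 1024 * 1024   # precision: max 200 MB
PRECISION_MAX_PAGES = 600                 # precision: max 600 pages


def pick_mode(size_bytes: int, page_count: int) -> str:
    """Return "flash" when a file fits the flash limits, else "precision".

    Raises ValueError when the file exceeds even the precision limits.
    """
    if size_bytes <= FLASH_MAX_BYTES and page_count <= FLASH_MAX_PAGES:
        return "flash"
    if size_bytes <= PRECISION_MAX_BYTES and page_count <= PRECISION_MAX_PAGES:
        return "precision"
    raise ValueError("file exceeds MinerU API limits")


print(pick_mode(5 * 1024 * 1024, 12))     # flash
print(pick_mode(50 * 1024 * 1024, 100))   # precision
```

Note that precision mode also requires a token, so a real selection routine would additionally check whether one is available.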

Usage

Flash Mode (default, no token needed)

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()

# Parse a single PDF from URL
documents = reader.load_data(
    "https://cdn-mineru.openxlab.org.cn/demo/example.pdf"
)
print(documents[0].text)

# Parse a local file
documents = reader.load_data("/path/to/local.pdf")

# Parse multiple files at once
documents = reader.load_data(
    [
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
        "/path/to/local.pdf",
    ]
)

Precision Mode (token required)

Get your free token from MinerU API Management.

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader(
    mode="precision",
    token="your-api-token",  # or set MINERU_TOKEN env var
    ocr=True,
    formula=True,
    table=True,
    language="en",
    pages="1-20",
)

documents = reader.load_data("/path/to/scanned_paper.pdf")

Mixed Sources (local path + URL)

You can parse local files and remote URLs in one call:

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()
documents = reader.load_data(
    [
        "/path/to/local_a.pdf",
        "/path/to/local_b.docx",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ]
)

for doc in documents:
    print(doc.metadata["source"], "-", doc.text[:100])
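When mixing sources, it can help to group results back by origin. A minimal sketch, assuming only that each Document carries metadata["source"] and a text attribute (modeled here with a stand-in class so the snippet runs without calling the API):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for llama_index's Document: just metadata + text."""
    metadata: dict
    text: str


def group_by_source(documents):
    """Map each source path/URL to the list of texts parsed from it."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata["source"]].append(doc.text)
    return dict(grouped)


docs = [
    Doc({"source": "/path/to/local_a.pdf"}, "# Page 1"),
    Doc({"source": "/path/to/local_a.pdf"}, "# Page 2"),
    Doc({"source": "https://example.com/b.pdf"}, "# Intro"),
]
print(group_by_source(docs))
```

The same function works on the real reader output, since it only reads the "source" metadata key shown in the loop above.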

Attach custom metadata with extra_info

Use extra_info to merge custom metadata fields (for example, project, tenant, or tag identifiers) into every returned Document.metadata. It does not change parsing behavior or the output format.

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()
documents = reader.load_data(
    "/path/to/paper.pdf",
    extra_info={
        "project": "paper-rag",
        "tenant": "team-a",
        "source_type": "research_pdf",
    },
)

print(documents[0].metadata["project"])      # paper-rag
print(documents[0].metadata["source_type"])  # research_pdf
print(documents[0].text[:120])               # still Markdown text

Per-Page Splitting

When split_pages=True, each PDF page becomes a separate Document — ideal for RAG pipelines that need page-level granularity.

reader = MinerUReader(split_pages=True, pages="1-5")
documents = reader.load_data("/path/to/paper.pdf")

for doc in documents:
    print(f"Page {doc.metadata['page']}: {doc.text[:100]}...")

Page Range + Split Pages (PDF only)

Use pages together with split_pages=True to parse only selected PDF pages. In this mode, each selected page becomes one Document.

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader(
    mode="precision",
    token="your-api-token",  # or set MINERU_TOKEN
    pages="2-4",
    split_pages=True,
    language="en",
)

documents = reader.load_data("/path/to/paper.pdf")
for doc in documents:
    print(doc.metadata.get("page"), doc.metadata.get("source"))
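The pages value above is a simple hyphen range string. A hypothetical parser (the real API's grammar may accept richer forms) makes the assumed format concrete:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page-range string like "2-4" (or a single page like "7")
    into 1-based page numbers.

    Hypothetical helper for illustration only; the MinerU API's actual
    pages grammar may support additional syntax.
    """
    if "-" in spec:
        start, end = spec.split("-", 1)
        return list(range(int(start), int(end) + 1))
    return [int(spec)]


print(parse_pages("2-4"))  # [2, 3, 4]
print(parse_pages("7"))    # [7]
```

With split_pages=True and pages="2-4", you would therefore expect three Documents, one per selected page.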

Use with LlamaIndex Pipeline

from llama_index.core import VectorStoreIndex
from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()
documents = reader.load_data("/path/to/paper.pdf")

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
print(response)

Parameters

Reader initialization (MinerUReader(...))

Parameter     Type         Default    Description
mode          str          "flash"    Parsing mode: "flash" or "precision"
token         str | None   None       MinerU API token (precision mode). Falls back to the MINERU_TOKEN env var. Apply here: https://mineru.net/apiManage/token
language      str          "ch"       Document language code
pages         str | None   None       Page range, e.g. "1-10"
timeout       int          600        Max seconds to wait for task completion
split_pages   bool         False      Split PDF into per-page Documents
ocr           bool         False      Enable OCR (precision mode only)
formula       bool         True       Enable formula recognition (precision mode only)
table         bool         True       Enable table recognition (precision mode only)

load_data(...) arguments

Parameter    Type                            Default    Description
sources      str | Path | list[str | Path]   required   Single file path/URL, or a list of file paths/URLs
extra_info   dict | None                     None       Custom metadata merged into each returned Document.metadata
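The token fallback described above (explicit argument first, then the MINERU_TOKEN environment variable) can be sketched as follows; this mimics the documented behavior rather than reproducing the reader's internals:

```python
import os


def resolve_token(explicit=None):
    """Return the explicit token if given, else fall back to the
    MINERU_TOKEN environment variable (None if neither is set).

    Sketch of the documented fallback, not the reader's actual code.
    """
    return explicit or os.environ.get("MINERU_TOKEN")


os.environ["MINERU_TOKEN"] = "env-token"
print(resolve_token("explicit-token"))  # explicit-token
print(resolve_token())                  # env-token
```

In practice this means MinerUReader(mode="precision") works without a token argument as long as MINERU_TOKEN is set in the environment.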
