
MinerU Reader

pip install llama-index-readers-mineru

This reader uses the MinerU document parsing API to extract high-quality Markdown from PDF, DOC/DOCX, PPT/PPTX, image, and Excel files. It supports two parsing modes:

Feature                 Flash (default)      Precision
Auth                    No token required    Token required
Speed                   Blazing fast         Standard
File size               Max 10 MB            Max 200 MB
Page limit              Max 20 pages         Max 600 pages
Formula / Table         Disabled             Configurable
Output in this Reader   Markdown only        Markdown only

Note: MinerU Python SDK precision mode supports extra output formats (images/JSON/docx/html/latex), but MinerUReader in this integration currently returns only result.markdown as Document.text.
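The limits in the table above can guide mode selection programmatically. The sketch below is a hypothetical helper (not part of this integration) that picks a mode from a file's size and page count, using only the documented limits:

```python
# Hypothetical helper: choose a MinerU parsing mode from the limits
# documented above. Not part of llama-index-readers-mineru itself.

FLASH_MAX_BYTES = 10 * 1024 * 1024        # flash: max 10 MB
FLASH_MAX_PAGES = 20                      # flash: max 20 pages
PRECISION_MAX_BYTES = 200 * 1024 * 1024   # precision: max 200 MB
PRECISION_MAX_PAGES = 600                 # precision: max 600 pages


def pick_mode(size_bytes: int, page_count: int) -> str:
    """Return "flash" when a file fits the flash limits, else "precision".

    Raises ValueError when the file exceeds even the precision limits.
    """
    if size_bytes <= FLASH_MAX_BYTES and page_count <= FLASH_MAX_PAGES:
        return "flash"
    if size_bytes <= PRECISION_MAX_BYTES and page_count <= PRECISION_MAX_PAGES:
        return "precision"
    raise ValueError("file exceeds MinerU API limits")


print(pick_mode(5 * 1024 * 1024, 12))     # flash
print(pick_mode(50 * 1024 * 1024, 100))   # precision
```

Note that precision mode also requires a token, so a real selection routine would additionally check whether one is available.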

Usage

Flash Mode (default, no token needed)

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()

# Parse a single PDF from URL
documents = reader.load_data(
    "https://cdn-mineru.openxlab.org.cn/demo/example.pdf"
)
print(documents[0].text)

# Parse a local file
documents = reader.load_data("/path/to/local.pdf")

# Parse multiple files at once
documents = reader.load_data(
    [
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
        "/path/to/local.pdf",
    ]
)

Precision Mode (token required)

Get your free token from MinerU API Management.

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader(
    mode="precision",
    token="your-api-token",  # or set MINERU_TOKEN env var
    ocr=True,
    formula=True,
    table=True,
    language="en",
    pages="1-20",
)

documents = reader.load_data("/path/to/scanned_paper.pdf")

Mixed Sources (local path + URL)

You can parse local files and remote URLs in one call:

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()
documents = reader.load_data(
    [
        "/path/to/local_a.pdf",
        "/path/to/local_b.docx",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ]
)

for doc in documents:
    print(doc.metadata["source"], "-", doc.text[:100])
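When mixing sources, it can help to group results back by origin. A minimal sketch, assuming only that each Document carries metadata["source"] and a text attribute (modeled here with a stand-in class so the snippet runs without calling the API):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for llama_index's Document: just metadata + text."""
    metadata: dict
    text: str


def group_by_source(documents):
    """Map each source path/URL to the list of texts parsed from it."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata["source"]].append(doc.text)
    return dict(grouped)


docs = [
    Doc({"source": "/path/to/local_a.pdf"}, "# Page 1"),
    Doc({"source": "/path/to/local_a.pdf"}, "# Page 2"),
    Doc({"source": "https://example.com/b.pdf"}, "# Intro"),
]
print(group_by_source(docs))
```

The same function works on the real reader output, since it only reads the "source" metadata key shown in the loop above.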

Attach custom metadata with extra_info

Use extra_info to merge custom metadata fields (for example, project, tenant, or tag identifiers) into every returned Document.metadata. It does not change parsing behavior or the output format.

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()
documents = reader.load_data(
    "/path/to/paper.pdf",
    extra_info={
        "project": "paper-rag",
        "tenant": "team-a",
        "source_type": "research_pdf",
    },
)

print(documents[0].metadata["project"])      # paper-rag
print(documents[0].metadata["source_type"])  # research_pdf
print(documents[0].text[:120])               # still Markdown text

Per-Page Splitting

When split_pages=True, each PDF page becomes a separate Document — ideal for RAG pipelines that need page-level granularity.

reader = MinerUReader(split_pages=True, pages="1-5")
documents = reader.load_data("/path/to/paper.pdf")

for doc in documents:
    print(f"Page {doc.metadata['page']}: {doc.text[:100]}...")

Page Range + Split Pages (PDF only)

Use pages together with split_pages=True to parse only selected PDF pages. In this mode, each selected page becomes one Document.

from llama_index.readers.mineru import MinerUReader

reader = MinerUReader(
    mode="precision",
    token="your-api-token",  # or set MINERU_TOKEN
    pages="2-4",
    split_pages=True,
    language="en",
)

documents = reader.load_data("/path/to/paper.pdf")
for doc in documents:
    print(doc.metadata.get("page"), doc.metadata.get("source"))
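The pages value above is a simple hyphen range string. A hypothetical parser (the real API's grammar may accept richer forms) makes the assumed format concrete:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page-range string like "2-4" (or a single page like "7")
    into 1-based page numbers.

    Hypothetical helper for illustration only; the MinerU API's actual
    pages grammar may support additional syntax.
    """
    if "-" in spec:
        start, end = spec.split("-", 1)
        return list(range(int(start), int(end) + 1))
    return [int(spec)]


print(parse_pages("2-4"))  # [2, 3, 4]
print(parse_pages("7"))    # [7]
```

With split_pages=True and pages="2-4", you would therefore expect three Documents, one per selected page.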

Use with LlamaIndex Pipeline

from llama_index.core import VectorStoreIndex
from llama_index.readers.mineru import MinerUReader

reader = MinerUReader()
documents = reader.load_data("/path/to/paper.pdf")

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
print(response)

Parameters

Reader initialization (MinerUReader(...))

Parameter     Type         Default    Description
mode          str          "flash"    Parsing mode: "flash" or "precision"
token         str | None   None       MinerU API token (precision mode). Falls back to the MINERU_TOKEN env var. Apply here: https://mineru.net/apiManage/token
language      str          "ch"       Document language code
pages         str | None   None       Page range, e.g. "1-10"
timeout       int          600        Max seconds to wait for task completion
split_pages   bool         False      Split PDF into per-page Documents
ocr           bool         False      Enable OCR (precision mode only)
formula       bool         True       Enable formula recognition (precision mode only)
table         bool         True       Enable table recognition (precision mode only)

load_data(...) arguments

Parameter    Type                            Default    Description
sources      str | Path | list[str | Path]   required   Single file path/URL, or a list of file paths/URLs
extra_info   dict | None                     None       Custom metadata merged into each returned Document.metadata
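The token fallback described above (explicit argument first, then the MINERU_TOKEN environment variable) can be sketched as follows; this mimics the documented behavior rather than reproducing the reader's internals:

```python
import os


def resolve_token(explicit=None):
    """Return the explicit token if given, else fall back to the
    MINERU_TOKEN environment variable (None if neither is set).

    Sketch of the documented fallback, not the reader's actual code.
    """
    return explicit or os.environ.get("MINERU_TOKEN")


os.environ["MINERU_TOKEN"] = "env-token"
print(resolve_token("explicit-token"))  # explicit-token
print(resolve_token())                  # env-token
```

In practice this means MinerUReader(mode="precision") works without a token argument as long as MINERU_TOKEN is set in the environment.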
