llama-index readers MinerU integration — parse PDF/Doc/PPT/images into Markdown via MinerU API
MinerU Reader
pip install llama-index-readers-mineru
This reader uses the MinerU document parsing API to extract high-quality Markdown from PDF, Doc/Docx, PPT/PPTx, images, and Excel files. It supports two parsing modes:
| Feature | Flash (default) | Precision |
|---|---|---|
| Auth | No token required | Token required |
| Speed | Blazing fast | Standard |
| File size | Max 10 MB | Max 200 MB |
| Page limit | Max 20 pages | Max 600 pages |
| Formula / Table | Disabled | Configurable |
| Output in this Reader | Markdown only | Markdown only |
Note: The MinerU Python SDK's precision mode supports extra output formats (images/JSON/docx/html/latex), but
MinerUReader in this integration currently returns only result.markdown as Document.text.
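The limits in the table above can be checked before a file is submitted. The following is a minimal sketch, not part of the reader: `choose_mode`, `ModeLimits`, and `LIMITS` are hypothetical names, the caller is assumed to already know the file size and page count, and "MB" is assumed to mean 2**20 bytes.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the table's "MB" is treated as 2**20 bytes


@dataclass(frozen=True)
class ModeLimits:
    max_bytes: int
    max_pages: int


# Limits copied from the mode comparison table.
LIMITS = {
    "flash": ModeLimits(max_bytes=10 * MB, max_pages=20),
    "precision": ModeLimits(max_bytes=200 * MB, max_pages=600),
}


def choose_mode(size_bytes: int, page_count: int, has_token: bool) -> str:
    """Return "flash" or "precision" for a file, or raise ValueError."""
    for mode in ("flash", "precision"):
        limits = LIMITS[mode]
        fits = size_bytes <= limits.max_bytes and page_count <= limits.max_pages
        if not fits:
            continue
        if mode == "precision" and not has_token:
            raise ValueError("file exceeds flash limits and precision mode needs a token")
        return mode
    raise ValueError("file exceeds the limits of both modes")
```

The result can be passed as the mode argument of MinerUReader. The helper prefers flash whenever the file fits, since flash needs no token.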
Usage
Flash Mode (default, no token needed)
from llama_index.readers.mineru import MinerUReader
reader = MinerUReader()
# Parse a single PDF from URL
documents = reader.load_data(
    "https://cdn-mineru.openxlab.org.cn/demo/example.pdf"
)
print(documents[0].text)
# Parse a local file
documents = reader.load_data("/path/to/local.pdf")
# Parse multiple files at once
documents = reader.load_data(
    [
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
        "/path/to/local.pdf",
    ]
)
Precision Mode (token required)
Get your free token from MinerU API Management (https://mineru.net/apiManage/token).
from llama_index.readers.mineru import MinerUReader
reader = MinerUReader(
    mode="precision",
    token="your-api-token",  # or set MINERU_TOKEN env var
    ocr=True,
    formula=True,
    table=True,
    language="en",
    pages="1-20",
)
documents = reader.load_data("/path/to/scanned_paper.pdf")
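Because precision mode needs a token, it can help to fail early when none is configured. The sketch below mirrors the documented fallback (explicit token first, then the MINERU_TOKEN environment variable); `resolve_token` is a hypothetical helper and does not claim to reproduce how the reader resolves the token internally.

```python
import os
from typing import Optional


def resolve_token(explicit: Optional[str] = None) -> str:
    """Return the explicit token, else MINERU_TOKEN, else raise."""
    token = explicit or os.environ.get("MINERU_TOKEN", "")
    token = token.strip()
    if not token:
        raise RuntimeError(
            "precision mode needs a token: pass token=... or set MINERU_TOKEN"
        )
    return token
```

Calling `resolve_token()` at startup and passing the result as token surfaces a missing credential before any file is uploaded.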
Mixed Sources (local path + URL)
You can parse local files and remote URLs in one call:
from llama_index.readers.mineru import MinerUReader
reader = MinerUReader()
documents = reader.load_data(
    [
        "/path/to/local_a.pdf",
        "/path/to/local_b.docx",
        "https://cdn-mineru.openxlab.org.cn/demo/example.pdf",
    ]
)
for doc in documents:
    print(doc.metadata["source"], "-", doc.text[:100])
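When a list mixes local paths and URLs, a quick pre-check can catch missing local files before the call. This is a sketch under the assumption that anything with an http or https scheme is a URL and everything else is a local path; `is_url` and `partition_sources` are hypothetical helpers, and the reader's own source detection may differ.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union
from urllib.parse import urlparse

Source = Union[str, Path]


def is_url(source: Source) -> bool:
    """True for http(s) URLs; everything else is treated as a local path."""
    if isinstance(source, Path):
        return False
    return urlparse(source).scheme in ("http", "https")


def partition_sources(
    sources: Iterable[Source],
) -> Tuple[List[str], List[Path], List[Path]]:
    """Return (urls, existing local files, missing local files)."""
    urls: List[str] = []
    found: List[Path] = []
    missing: List[Path] = []
    for source in sources:
        if is_url(source):
            urls.append(str(source))
            continue
        path = Path(source)
        (found if path.is_file() else missing).append(path)
    return urls, found, missing
```

If the third list is non-empty, the caller can report the missing paths and pass only the URLs and existing files to load_data.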
Attach custom metadata with extra_info
Use extra_info when you want to merge custom metadata fields into every
returned Document.metadata (for example project/tenant/tag). It does not
change parsing behavior or output format.
from llama_index.readers.mineru import MinerUReader
reader = MinerUReader()
documents = reader.load_data(
    "/path/to/paper.pdf",
    extra_info={
        "project": "paper-rag",
        "tenant": "team-a",
        "source_type": "research_pdf",
    },
)
print(documents[0].metadata["project"]) # paper-rag
print(documents[0].metadata["source_type"]) # research_pdf
print(documents[0].text[:120]) # still Markdown text
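Since extra_info is merged into every Document returned by a call, sources that need different tags have to go through separate calls. A minimal sketch of that pattern follows; `load_with_tags` is a hypothetical helper that accepts any object exposing `load_data(sources, extra_info=...)`, which is the signature documented for MinerUReader.

```python
from typing import Any, Dict, Iterable, List, Tuple


def load_with_tags(
    reader: Any,
    groups: Iterable[Tuple[List[str], Dict[str, Any]]],
) -> List[Any]:
    """Call reader.load_data once per (sources, extra_info) group."""
    documents: List[Any] = []
    for sources, extra_info in groups:
        # Copy the dict so a reader that mutates extra_info cannot leak
        # fields from one group into the next.
        documents.extend(reader.load_data(sources, extra_info=dict(extra_info)))
    return documents
```

For example, one group could carry `{"tenant": "team-a"}` and another `{"tenant": "team-b"}`; the returned list keeps the group order.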
Per-Page Splitting
When split_pages=True, each PDF page becomes a separate Document — ideal for RAG pipelines that need page-level granularity.
reader = MinerUReader(split_pages=True, pages="1-5")
documents = reader.load_data("/path/to/paper.pdf")
for doc in documents:
    print(f"Page {doc.metadata['page']}: {doc.text[:100]}...")
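With per-page Documents from several files in one list, it is often convenient to regroup them by file. The sketch below relies only on the `source` and `page` metadata keys used in the examples here and assumes `page` is an integer; `pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def pages_by_source(documents: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group per-page documents by metadata["source"], ordered by page."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata.get("source", "")].append(doc)
    for docs in grouped.values():
        # Assumption: metadata["page"] is an int; missing values sort first.
        docs.sort(key=lambda d: d.metadata.get("page", 0))
    return dict(grouped)
```

The grouped result makes it straightforward to build page-level citations such as "paper.pdf, page 3" in a RAG answer.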
Page Range + Split Pages (PDF only)
Use pages together with split_pages=True to parse only selected PDF pages.
In this mode, each selected page becomes one Document.
from llama_index.readers.mineru import MinerUReader
reader = MinerUReader(
    mode="precision",
    token="your-api-token",  # or set MINERU_TOKEN
    pages="2-4",
    split_pages=True,
    language="en",
)
documents = reader.load_data("/path/to/paper.pdf")
for doc in documents:
    print(doc.metadata.get("page"), doc.metadata.get("source"))
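To sanity-check how many Documents a page range should produce, the range string can be expanded locally. This sketch handles only the "A-B" form shown in this README plus a single page number "N", and assumes page numbers are 1-based and inclusive; whether the API accepts other forms is not stated here. `expand_page_range` is a hypothetical helper.

```python
from typing import List


def expand_page_range(pages: str) -> List[int]:
    """Expand "N" or "A-B" (1-based, inclusive) into a list of page numbers."""
    text = pages.strip()
    start_text, sep, end_text = text.partition("-")
    start = int(start_text)  # raises ValueError on non-numeric input
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {pages!r}")
    return list(range(start, end + 1))
```

Under those assumptions, pages="2-4" with split_pages=True should yield `len(expand_page_range("2-4"))` Documents for a PDF with at least four pages.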
Use with LlamaIndex Pipeline
from llama_index.core import VectorStoreIndex
from llama_index.readers.mineru import MinerUReader
reader = MinerUReader()
documents = reader.load_data("/path/to/paper.pdf")
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
print(response)
Parameters
Reader initialization (MinerUReader(...))
| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | str | "flash" | Parsing mode: "flash" or "precision" |
| token | str \| None | None | MinerU API token (precision mode). Falls back to MINERU_TOKEN env var. Apply here: https://mineru.net/apiManage/token |
| language | str | "ch" | Document language code |
| pages | str \| None | None | Page range, e.g. "1-10" |
| timeout | int | 600 | Max seconds to wait for task completion |
| split_pages | bool | False | Split PDF into per-page Documents |
| ocr | bool | False | Enable OCR (precision mode only) |
| formula | bool | True | Enable formula recognition (precision mode only) |
| table | bool | True | Enable table recognition (precision mode only) |
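Three of these parameters apply only in precision mode, so a small guard in calling code can flag options that would have no effect. The sketch below is one conservative choice (raise instead of silently dropping); `reader_kwargs` and `PRECISION_ONLY` are hypothetical names, and how the reader itself treats these options in flash mode is not specified here.

```python
from typing import Any, Dict

# Options the parameter table marks as "precision mode only".
PRECISION_ONLY = ("ocr", "formula", "table")


def reader_kwargs(mode: str = "flash", **options: Any) -> Dict[str, Any]:
    """Build keyword arguments for MinerUReader, rejecting inconsistent input."""
    if mode not in ("flash", "precision"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "flash":
        ignored = [name for name in PRECISION_ONLY if name in options]
        if ignored:
            raise ValueError(f"{', '.join(ignored)} only apply in precision mode")
    return {"mode": mode, **options}
```

The returned dict can be unpacked into the constructor, for example `MinerUReader(**reader_kwargs("precision", ocr=True, pages="1-3"))`.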
load_data(...) arguments
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | str \| Path \| list[str \| Path] | (required) | Single file path/URL, or a list of file paths/URLs |
| extra_info | dict \| None | None | Custom metadata merged into each returned Document.metadata |