MinerU API Python SDK — one line to turn documents into Markdown
MinerU Open API SDK (Python)
MinerU Open API SDK is a completely free Python library for the MinerU document extraction service. Turn documents (PDF, images, Word, PPT, Excel) or web pages into high-quality Markdown with just one line of code.
🚀 Key Features
- Completely Free: No hidden costs for document extraction.
- Flash Extract (No Auth): Extract text instantly without an API token.
- Precision Extract: Comprehensive extraction with layout preservation, images, and formula support.
- Batch & Polling Primitives: Blocking methods for simple flows plus submit/query methods for asynchronous workflows.
- Simple Save Helpers: Save Markdown, HTML, LaTeX, DOCX, or the full extracted zip with built-in helpers.
📦 Install
pip install mineru-open-sdk
🛠️ Quick Start
1. Flash Extract (Fast, No Auth, Markdown-only)
Ideal for quick previews. No token required.
from mineru import MinerU
# No token needed for Flash Extract
client = MinerU()
result = client.flash_extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")
print(result.markdown)
2. Precision Extract (Auth Required)
Supports large files, rich assets (images/tables), and multiple formats.
from mineru import MinerU
# Get your free token from https://mineru.net/apiManage/token
client = MinerU("your-api-token")
result = client.extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")
print(result.markdown)
print(result.images) # Access extracted images
🧩 Supported Public API
Client lifecycle
- MinerU(token: str | None = None, base_url: str = ..., flash_base_url: str | None = None)
- client.close()
- client.set_source("your-app")
- context manager support: with MinerU(...) as client:
Blocking extraction methods
- client.extract(...) -> ExtractResult
- client.extract_batch(...) -> Iterator[ExtractResult]
- client.crawl(...) -> ExtractResult
- client.crawl_batch(...) -> Iterator[ExtractResult]
- client.flash_extract(...) -> ExtractResult
Submit/query methods
- client.submit(...) -> str
- client.submit_batch(...) -> str
- client.get_batch(batch_id) -> list[ExtractResult]
- client.get_task(task_id) -> ExtractResult
Result helpers
- result.save_markdown(path, with_images=True)
- result.save_docx(path)
- result.save_html(path)
- result.save_latex(path)
- result.save_all(dir)
- image.save(path)
Result fields you will usually use
- result.state
- result.progress
- result.markdown
- result.images
- result.content_list
- result.docx
- result.html
- result.latex
- result.task_id
📊 Mode Comparison
| Feature | Flash Extract | Precision Extract |
|---|---|---|
| Auth | No Auth Required | Auth Required (Token) |
| Speed | Blazing Fast | Standard |
| File Limit | Max 10 MB | Max 200 MB |
| Page Limit | Max 20 Pages | Max 600 Pages |
| Formats | PDF, Images, Docx, PPTx, Excel | PDF, Images, Doc/x, Ppt/x, Html |
| Content | Markdown only (Placeholders) | Full assets (Images, Tables, Formulas) |
| Output | Markdown | MD, Docx, LaTeX, HTML, JSON |
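The size and page limits in the table can be checked locally before picking a mode. The sketch below is not part of the SDK: `pick_mode` is a hypothetical helper, it assumes 1 MB means 1024 * 1024 bytes, and it assumes you already know the page count. It looks only at size and pages, not at the format differences between the two modes.

```python
# Hypothetical helper, not part of the SDK: choose a mode from the size and
# page limits listed in the comparison table.
MB = 1024 * 1024  # assumption: the service counts 1 MB as 1024 * 1024 bytes

FLASH_LIMITS = {"max_bytes": 10 * MB, "max_pages": 20}
PRECISION_LIMITS = {"max_bytes": 200 * MB, "max_pages": 600}


def pick_mode(size_bytes: int, page_count: int) -> str:
    """Return "flash", "precision", or "too_large" for a document."""
    # Try the cheaper mode first, then fall back to the larger limits.
    for mode, limits in (("flash", FLASH_LIMITS), ("precision", PRECISION_LIMITS)):
        if size_bytes <= limits["max_bytes"] and page_count <= limits["max_pages"]:
            return mode
    return "too_large"


print(pick_mode(2 * MB, 12))    # small file, few pages
print(pick_mode(50 * MB, 300))  # over the flash limits
print(pick_mode(500 * MB, 10))  # over both size limits
```

A result of "precision" means a token is needed, since only Flash Extract works without authentication.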
⚙️ Defaults And Option Behavior
MinerU(...)
| Argument | Default | Behavior |
|---|---|---|
| token | None | If omitted, the SDK reads MINERU_TOKEN from the environment |
| base_url | https://mineru.net/api/v4 | Standard API base URL |
| flash_base_url | SDK default flash URL | Override flash API endpoint for testing/private deployments |
If neither token nor MINERU_TOKEN is set, the client works in flash-only mode: flash_extract() works, while auth-required methods raise NoAuthClientError.
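The lookup order described above (explicit argument first, then MINERU_TOKEN, otherwise flash-only) can be sketched as follows. This is an illustration of the documented behavior, not the SDK's own code: `resolve_token` and `describe_mode` are hypothetical names, and treating an empty string as "no token" is an assumption.

```python
import os


def resolve_token(token=None):
    """Return the explicit token, else MINERU_TOKEN, else None."""
    # Assumption: an empty string counts as "not set".
    return token or os.environ.get("MINERU_TOKEN") or None


def describe_mode(token=None):
    """Report which methods a client built with this token could use."""
    # Illustration only: the SDK performs the equivalent check internally.
    return "precision+flash" if resolve_token(token) else "flash-only"


# The result depends on whether MINERU_TOKEN is set in your environment.
print(describe_mode())
```

Running a check like this at startup makes it clear in advance whether calls such as extract() would raise NoAuthClientError.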
Precision methods
These defaults apply to extract(), extract_batch(), submit(), submit_batch(), and indirectly to crawl() / crawl_batch() unless noted otherwise.
| Option | Default | Behavior when omitted |
|---|---|---|
| model | None | Auto-infers model: .html/.htm uses "html", everything else uses "vlm" |
| ocr | not set | OCR is disabled (API default) |
| formula | not set | Formula recognition is enabled (API default) |
| table | not set | Table recognition is enabled (API default) |
| language | not set | Chinese "ch" (API default) |
| pages | None | Full document is processed |
| extra_formats | None | Only the default Markdown/JSON payload is returned |
| file_params | None | Per-file overrides for batch methods. A dict[str, FileParam] keyed by path/URL, where FileParam has fields pages, ocr, data_id |
| timeout | 300 seconds for single-item methods | Max total polling time for extract() / crawl() |
| timeout | 1800 seconds for batch methods | Max total polling time for extract_batch() / crawl_batch() |
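The model auto-inference rule in the table can be illustrated with a small stand-alone function. This is a sketch of the documented rule only: `infer_model` is a hypothetical name, and ignoring the query string and fragment of a URL is an assumption about how the suffix is read.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def infer_model(source: str) -> str:
    """Return "html" for .html/.htm sources and "vlm" for everything else."""
    # Assumption: query string and fragment are ignored, so that
    # "page.html?x=1" is still treated as an HTML source.
    path = urlparse(source).path or source
    suffix = PurePosixPath(path).suffix.lower()
    return "html" if suffix in {".html", ".htm"} else "vlm"


for src in ("./paper.pdf", "index.html", "https://www.baidu.com"):
    print(src, "->", infer_model(src))
```

Under this rule a bare site URL with no .html suffix resolves to "vlm", so web pages need model="html" passed explicitly or one of the crawl methods.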
Flash Extract
| Option | Default | Behavior when omitted |
|---|---|---|
| language | "ch" | Default language is Chinese |
| page_range | None | Full page range allowed by the flash API |
| timeout | 300 seconds | Max total polling time |
crawl() / crawl_batch()
- crawl() is shorthand for extract(url, model="html", ...)
- crawl_batch() is shorthand for extract_batch(urls, model="html", ...)
📖 Detailed Usage
Precision Extraction Options
result = client.extract(
    "./paper.pdf",
    model="vlm",              # "vlm" | "pipeline" | "html"
    ocr=True,                 # Enable OCR for scanned documents
    formula=True,             # Formula recognition
    table=True,               # Table recognition
    language="en",            # "ch" | "en" | etc.
    pages="1-20",             # Page range
    extra_formats=["docx"],   # Export as docx, html, or latex
    timeout=600,
)
result.save_all("./output/")  # Save markdown and all assets
Context Manager
from mineru import MinerU
with MinerU("your-api-token") as client:
    result = client.extract("./paper.pdf")
    print(result.markdown)
Batch Processing
# Yields results as they complete
for result in client.extract_batch(["a.pdf", "b.pdf", "c.pdf"]):
    print(f"{result.filename}: Done")
Batch With Per-File Pages
from mineru import FileParam
batch_id = client.submit_batch(
    ["a.pdf", "b.pdf"],
    file_params={
        "a.pdf": FileParam(pages="1-5"),
        "b.pdf": FileParam(pages="10-20"),
    },
)
Web Crawling
result = client.crawl("https://www.baidu.com")
print(result.markdown)
🔄 submit() / get_batch() Semantics
This is the part most people get wrong at first:
- submit() returns a batch ID
- submit_batch() also returns a batch ID
- the common async flow is therefore submit(...) -> get_batch(batch_id)
- staying on the batch-based flow is recommended for async polling
Recommended async flow
import time

batch_id = client.submit("large-report.pdf")
# poll the batch until the first item is done or failed
while True:
    results = client.get_batch(batch_id)
    result = results[0]
    if result.state in ("done", "failed"):
        break
    time.sleep(2)  # pause between polls; the 2-second interval is an arbitrary choice
if result.state == "done":
    do_something(result.markdown)
🤖 Integration for AI Agents
The SDK is designed to be easily integrated into LLM workflows. For status updates, you can check result.state and result.progress.
batch_id = client.submit("large-report.pdf")
# ... later ...
result = client.get_batch(batch_id)[0]
if result.state == "done":
    do_something(result.markdown)
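For an agent, it can help to reduce a result to a small status payload built from result.state and result.progress. The following is a sketch under assumptions: `status_for_agent` is a hypothetical helper, the shape of progress is not specified here so it is passed through unchanged, and a SimpleNamespace stands in for a real ExtractResult.

```python
from types import SimpleNamespace


def status_for_agent(result, preview_chars=200):
    """Build a small JSON-friendly status dict from a result-like object."""
    status = {
        "state": result.state,
        "progress": getattr(result, "progress", None),  # passed through as-is
    }
    if result.state == "done":
        # Only include a short preview so the payload stays small.
        status["preview"] = (result.markdown or "")[:preview_chars]
    return status


# Stand-in object for demonstration; a real result comes from the SDK.
demo = SimpleNamespace(state="done", progress=None, markdown="# Title\n\nBody text")
print(status_for_agent(demo, preview_chars=7))
```

Keeping the preview short avoids pushing a whole document into the model context when only a status check is needed.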
📄 License
This project is licensed under the Apache-2.0 License.