MinerU API Python SDK — one line to turn documents into Markdown

MinerU Open API SDK (Python)

MinerU Open API SDK is a completely free Python library for the MinerU document extraction service. It turns documents (PDF, images, Word, PPT, Excel) and web pages into high-quality Markdown with a single line of code.

🚀 Key Features

Completely Free: No hidden costs for document extraction.
Flash Extract (No Auth): Extract text instantly without an API token.
Precision Extract: Comprehensive extraction with layout preservation, images, and formula support.
Batch & Polling Primitives: Blocking methods for simple flows plus submit/query methods for asynchronous workflows.
Simple Save Helpers: Save Markdown, HTML, LaTeX, DOCX, or the full extracted zip with built-in helpers.

📦 Install

pip install mineru-open-sdk

🛠️ Quick Start

1. Flash Extract (Fast, No Auth, Markdown-only)

Ideal for quick previews. No token required.

from mineru import MinerU

# No token needed for Flash Extract
client = MinerU()
result = client.flash_extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)

2. Precision Extract (Auth Required)

Supports large files, rich assets (images/tables), and multiple formats.

from mineru import MinerU

# Get your free token from https://mineru.net/apiManage/token
client = MinerU("your-api-token")
result = client.extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)
print(result.images) # Access extracted images

🧩 Supported Public API

Client lifecycle

MinerU(token: str | None = None, base_url: str = ..., flash_base_url: str | None = None)
client.close()
client.set_source("your-app")
context manager support: with MinerU(...) as client:

Blocking extraction methods

client.extract(...) -> ExtractResult
client.extract_batch(...) -> Iterator[ExtractResult]
client.crawl(...) -> ExtractResult
client.crawl_batch(...) -> Iterator[ExtractResult]
client.flash_extract(...) -> ExtractResult

Submit/query methods

client.submit(...) -> str
client.submit_batch(...) -> str
client.get_batch(batch_id) -> list[ExtractResult]
client.get_task(task_id) -> ExtractResult

Result helpers

result.save_markdown(path, with_images=True)
result.save_docx(path)
result.save_html(path)
result.save_latex(path)
result.save_all(dir)
image.save(path)

Result fields you will usually use

result.state
result.progress
result.markdown
result.images
result.content_list
result.docx
result.html
result.latex
result.task_id

📊 Mode Comparison

| Feature | Flash Extract | Precision Extract |
| --- | --- | --- |
| Auth | No auth required | Auth required (token) |
| Speed | Blazing fast | Standard |
| File limit | Max 10 MB | Max 200 MB |
| Page limit | Max 20 pages | Max 600 pages |
| Formats | PDF, images, DOCX, PPTX, Excel | PDF, images, DOC/DOCX, PPT/PPTX, HTML |
| Content | Markdown only (placeholders) | Full assets (images, tables, formulas) |
| Output | Markdown | Markdown, DOCX, LaTeX, HTML, JSON |
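
The limits in the table can be turned into a small pre-flight check. The following is a minimal sketch, not part of the SDK: `pick_mode` is a hypothetical helper, the page count is assumed to be known in advance, and "MB" is assumed to mean 2**20 bytes. It prefers Flash Extract whenever the file fits, which only makes sense if Markdown-only output is acceptable.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: the table's "MB" is treated as 2**20 bytes

# Limits copied from the Mode Comparison table.
FLASH_MAX_BYTES = 10 * MB
FLASH_MAX_PAGES = 20
PRECISION_MAX_BYTES = 200 * MB
PRECISION_MAX_PAGES = 600


def pick_mode(size_bytes: int, pages: int, has_token: bool) -> Optional[str]:
    """Return "flash", "precision", or None if neither mode can take the file."""
    if size_bytes <= FLASH_MAX_BYTES and pages <= FLASH_MAX_PAGES:
        return "flash"
    if has_token and size_bytes <= PRECISION_MAX_BYTES and pages <= PRECISION_MAX_PAGES:
        return "precision"
    return None


print(pick_mode(2 * MB, 12, has_token=False))    # flash
print(pick_mode(50 * MB, 300, has_token=True))   # precision
print(pick_mode(50 * MB, 300, has_token=False))  # None
```

A caller could then dispatch to `client.flash_extract(...)` or `client.extract(...)` based on the returned string.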

⚙️ Defaults And Option Behavior

`MinerU(...)`

| Argument | Default | Behavior |
| --- | --- | --- |
| `token` | `None` | If omitted, the SDK reads `MINERU_TOKEN` from the environment |
| `base_url` | `https://mineru.net/api/v4` | Standard API base URL |
| `flash_base_url` | SDK default flash URL | Override flash API endpoint for testing/private deployments |

If neither token nor MINERU_TOKEN is set, the client works in flash-only mode: flash_extract() works, while auth-required methods raise NoAuthClientError.
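
The token rule above can be restated in a few lines. This is a sketch of the documented behavior, not the SDK's own code: `resolve_token` and `client_mode` are hypothetical names, and treating an empty `MINERU_TOKEN` as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_token(token: Optional[str] = None,
                  environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """An explicit token wins; otherwise fall back to MINERU_TOKEN; else None."""
    # Assumption: an empty string counts as "not set".
    return token or environ.get("MINERU_TOKEN") or None


def client_mode(token: Optional[str] = None,
                environ: Mapping[str, str] = os.environ) -> str:
    """Describe which methods a client built from these inputs could use."""
    return "precision" if resolve_token(token, environ) else "flash-only"


print(client_mode(None, {}))                          # flash-only
print(client_mode(None, {"MINERU_TOKEN": "secret"}))  # precision
print(client_mode("explicit", {}))                    # precision
```

Checking the mode up front lets an application fall back to `flash_extract()` instead of catching `NoAuthClientError` after the fact.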

Precision methods

These defaults apply to extract(), extract_batch(), submit(), submit_batch(), and indirectly to crawl() / crawl_batch() unless noted otherwise.

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| `model` | `None` | Auto-infers model: `.html`/`.htm` uses `"html"`, everything else uses `"vlm"` |
| `ocr` | not set | OCR is disabled (API default) |
| `formula` | not set | Formula recognition is enabled (API default) |
| `table` | not set | Table recognition is enabled (API default) |
| `language` | not set | Chinese `"ch"` (API default) |
| `pages` | `None` | Full document is processed |
| `extra_formats` | `None` | Only the default Markdown/JSON payload is returned |
| `file_params` | `None` | Per-file overrides for batch methods. A `dict[str, FileParam]` keyed by path/URL, where `FileParam` has fields `pages`, `ocr`, `data_id` |
| `timeout` | `300` seconds for single-item methods | Max total polling time for `extract()` / `crawl()` |
| `timeout` | `1800` seconds for batch methods | Max total polling time for `extract_batch()` / `crawl_batch()` |
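
The `model` auto-inference rule in the table can be mirrored locally, for example to log which model a source will use. The sketch below is an approximation of the documented rule, not the SDK's implementation: `infer_model` is a hypothetical helper, and ignoring the query string and matching the extension case-insensitively are assumptions.

```python
from urllib.parse import urlparse


def infer_model(source: str) -> str:
    """Mirror the documented rule: .html/.htm -> "html", anything else -> "vlm"."""
    # Assumption: for URLs, only the path matters, so ?query and #fragment are dropped.
    path = urlparse(source).path.lower()
    return "html" if path.endswith((".html", ".htm")) else "vlm"


print(infer_model("./paper.pdf"))                       # vlm
print(infer_model("notes.htm"))                         # html
print(infer_model("https://example.com/a.HTML?ref=1"))  # html
```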

Flash Extract

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| `language` | `"ch"` | Default language is Chinese |
| `page_range` | `None` | Full page range allowed by the flash API |
| `timeout` | `300` seconds | Max total polling time |

`crawl()` / `crawl_batch()`

crawl() is shorthand for extract(url, model="html", ...)
crawl_batch() is shorthand for extract_batch(urls, model="html", ...)

📖 Detailed Usage

Precision Extraction Options

result = client.extract(
    "./paper.pdf",
    model="vlm",             # "vlm" | "pipeline" | "html"
    ocr=True,                # Enable OCR for scanned documents
    formula=True,            # Formula recognition
    table=True,              # Table recognition
    language="en",           # "ch" | "en" | etc.
    pages="1-20",            # Page range
    extra_formats=["docx"],  # Export as docx, html, or latex
    timeout=600,
)

result.save_all("./output/") # Save markdown and all assets
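
Since several options fall back to API defaults when they are not set, one way to keep those defaults in effect is to pass only the options that were explicitly chosen. This is a minimal sketch: `build_options` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict


def build_options(**options: Any) -> Dict[str, Any]:
    """Keep only the options that were explicitly set (anything not None)."""
    return {name: value for name, value in options.items() if value is not None}


opts = build_options(ocr=True, language="en", pages=None, formula=None)
print(opts)  # {'ocr': True, 'language': 'en'}
```

The resulting dict can be unpacked into a call such as `client.extract("./paper.pdf", **opts)`, so unset options never reach the API.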

Context Manager

from mineru import MinerU

with MinerU("your-api-token") as client:
    result = client.extract("./paper.pdf")
    print(result.markdown)

Batch Processing

# Yields results as they complete
for result in client.extract_batch(["a.pdf", "b.pdf", "c.pdf"]):
    print(f"{result.filename}: Done")
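
When a batch is large, it can help to separate finished results from the rest before saving anything. The sketch below assumes each yielded result exposes `state`, and that unsuccessful items are yielded rather than raised, which this document does not confirm. `partition_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without the SDK.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def partition_results(results: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Split results into (done, not_done) based on result.state."""
    done: List[Any] = []
    not_done: List[Any] = []
    for result in results:
        (done if result.state == "done" else not_done).append(result)
    return done, not_done


# Stand-in objects in place of real ExtractResult instances.
fake_results = [
    SimpleNamespace(filename="a.pdf", state="done"),
    SimpleNamespace(filename="b.pdf", state="failed"),
    SimpleNamespace(filename="c.pdf", state="done"),
]
done, not_done = partition_results(fake_results)
print([r.filename for r in done])      # ['a.pdf', 'c.pdf']
print([r.filename for r in not_done])  # ['b.pdf']
```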

Batch With Per-File Pages

from mineru import FileParam

batch_id = client.submit_batch(
    ["a.pdf", "b.pdf"],
    file_params={
        "a.pdf": FileParam(pages="1-5"),
        "b.pdf": FileParam(pages="10-20"),
    },
)

Web Crawling

result = client.crawl("https://www.baidu.com")
print(result.markdown)

🔄 `submit()` / `get_batch()` Semantics

This is the part most people get wrong at first:

submit() returns a batch ID
submit_batch() also returns a batch ID
the common async flow is therefore submit(...) -> get_batch(batch_id)
staying on the batch-based flow is recommended for async polling

Recommended async flow

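import time

batch_id = client.submit("large-report.pdf")

# Poll the batch until its first item reaches a terminal state
while True:
    results = client.get_batch(batch_id)
    result = results[0]
    if result.state in ("done", "failed"):
        break
    time.sleep(2)  # pause between polls; the 2-second interval is an arbitrary choice

if result.state == "done":
    do_something(result.markdown)  # do_something is a placeholder for your own handler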
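
The polling loop has no upper bound on how long it waits. A reusable variant with a pause between polls and an overall timeout might look like the sketch below. `poll_until_terminal` is a hypothetical helper, not an SDK function. It takes any zero-argument `fetch` callable, so in real use it would wrap `client.get_batch(batch_id)[0]`. The `"pending"` and `"running"` state names in the demo are made up; only `"done"` and `"failed"` come from this document.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable

TERMINAL_STATES = ("done", "failed")  # the two states checked in the loop above


def poll_until_terminal(
    fetch: Callable[[], Any],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call fetch() until result.state is terminal or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result.state in TERMINAL_STATES:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"still {result.state!r} after {timeout} seconds")
        sleep(interval)


# Stand-in for client.get_batch(batch_id)[0] so the sketch runs offline.
states = iter(["pending", "running", "done"])
final = poll_until_terminal(
    lambda: SimpleNamespace(state=next(states)),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final.state)  # done
```

Injecting `sleep` and `clock` keeps the helper easy to test without real delays.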

🤖 Integration for AI Agents

The SDK is designed to be easily integrated into LLM workflows. For status updates, you can check result.state and result.progress.

batch_id = client.submit("large-report.pdf")
# ... later ...
result = client.get_batch(batch_id)[0]
if result.state == "done":
    do_something(result.markdown)
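
For an agent tool, it is often convenient to return a small plain dict rather than the full result object. The following sketch uses only the result fields listed earlier (`task_id`, `state`, `progress`, `markdown`). `to_tool_response` and its `max_chars` cap are hypothetical, and `progress` is passed through unchanged because its type is not specified here.

```python
from types import SimpleNamespace
from typing import Any, Dict


def to_tool_response(result: Any, max_chars: int = 4000) -> Dict[str, Any]:
    """Summarize a result as a plain dict that an agent can inspect."""
    response: Dict[str, Any] = {
        "task_id": result.task_id,
        "state": result.state,
        "progress": result.progress,  # passed through as-is
    }
    if result.state == "done":
        markdown = result.markdown or ""
        response["markdown"] = markdown[:max_chars]  # cap what goes into the prompt
        response["truncated"] = len(markdown) > max_chars
    return response


# Stand-in for a real ExtractResult so the sketch runs without the SDK.
fake = SimpleNamespace(task_id="t-1", state="done", progress=None, markdown="# Title")
summary = to_tool_response(fake, max_chars=5)
print(summary["markdown"], summary["truncated"])  # # Tit True
```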

📄 License

This project is licensed under the Apache-2.0 License.

Release history

0.2.5: Apr 10, 2026
0.2.0: Apr 10, 2026 (this version)
0.1.3: Mar 20, 2026
0.1.1: Mar 19, 2026
