MinerU API Python SDK — one line to turn documents into Markdown

MinerU Open API SDK (Python)


MinerU Open API SDK is a completely free Python library for the MinerU document extraction service. Turn any document (PDF, Images, Word, PPT, Excel) or Web Page into high-quality Markdown with just one line of code.


🚀 Key Features

  • Completely Free: No hidden costs for document extraction.
  • Flash Extract (No Auth): Extract text instantly without an API token.
  • Precision Extract: Comprehensive extraction with layout preservation, images, and formula support.
  • Batch & Polling Primitives: Blocking methods for simple flows plus submit/query methods for asynchronous workflows.
  • Simple Save Helpers: Save Markdown, HTML, LaTeX, DOCX, or the full extracted zip with built-in helpers.

📦 Install

pip install mineru-open-sdk

🛠️ Quick Start

1. Flash Extract (Fast, No Auth, Markdown-only)

Ideal for quick previews. No token required.

from mineru import MinerU

# No token needed for Flash Extract
client = MinerU()
result = client.flash_extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)

2. Precision Extract (Auth Required)

Supports large files, rich assets (images/tables), and multiple formats.

from mineru import MinerU

# Get your free token from https://mineru.net/apiManage/token
client = MinerU("your-api-token")
result = client.extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)
print(result.images) # Access extracted images

🧩 Supported Public API

Client lifecycle

  • MinerU(token: str | None = None, base_url: str = ..., flash_base_url: str | None = None)
  • client.close()
  • client.set_source("your-app")
  • context manager support: with MinerU(...) as client:

Blocking extraction methods

  • client.extract(...) -> ExtractResult
  • client.extract_batch(...) -> Iterator[ExtractResult]
  • client.crawl(...) -> ExtractResult
  • client.crawl_batch(...) -> Iterator[ExtractResult]
  • client.flash_extract(...) -> ExtractResult

Submit/query methods

  • client.submit(...) -> str
  • client.submit_batch(...) -> str
  • client.get_batch(batch_id) -> list[ExtractResult]
  • client.get_task(task_id) -> ExtractResult

Result helpers

  • result.save_markdown(path, with_images=True)
  • result.save_docx(path)
  • result.save_html(path)
  • result.save_latex(path)
  • result.save_all(dir)
  • image.save(path)

Result fields you will usually use

  • result.state
  • result.progress
  • result.markdown
  • result.images
  • result.content_list
  • result.docx
  • result.html
  • result.latex
  • result.task_id

📊 Mode Comparison

| Feature | Flash Extract | Precision Extract |
| --- | --- | --- |
| Auth | No Auth Required | Auth Required (Token) |
| Speed | Blazing Fast | Standard |
| File Limit | Max 10 MB | Max 200 MB |
| Page Limit | Max 20 Pages | Max 600 Pages |
| Formats | PDF, Images, Docx, PPTx, Excel | PDF, Images, Doc/x, Ppt/x, Html |
| Content | Markdown only (Placeholders) | Full assets (Images, Tables, Formulas) |
| Output | Markdown | MD, Docx, LaTeX, HTML, JSON |
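The size and page limits in the comparison can be checked locally before choosing a mode. The sketch below is a hypothetical pre-flight helper, not part of the SDK: the names `LIMITS` and `fits_limits` are illustrative, and it assumes "MB" means 1024 * 1024 bytes, which the comparison does not specify.

```python
MIB = 1024 * 1024  # assumption: the documented "MB" is treated as MiB

# Limits copied from the mode comparison: (max bytes, max pages).
LIMITS = {
    "flash": (10 * MIB, 20),
    "precision": (200 * MIB, 600),
}


def fits_limits(mode, size_bytes, page_count):
    """Return True if a document is within the documented limits for `mode`."""
    max_bytes, max_pages = LIMITS[mode]
    return size_bytes <= max_bytes and page_count <= max_pages
```

A caller could try `fits_limits("flash", ...)` first and fall back to Precision Extract when the document is too large or too long.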

⚙️ Defaults And Option Behavior

MinerU(...)

| Argument | Default | Behavior |
| --- | --- | --- |
| token | None | If omitted, the SDK reads MINERU_TOKEN from the environment |
| base_url | https://mineru.net/api/v4 | Standard API base URL |
| flash_base_url | SDK default flash URL | Override flash API endpoint for testing/private deployments |

If neither token nor MINERU_TOKEN is set, the client works in flash-only mode: flash_extract() works, while auth-required methods raise NoAuthClientError.
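The token lookup described above can be mirrored in application code, for example to decide up front whether auth-required methods are available. This is a sketch of the documented rule, not the SDK's implementation: `resolve_token` and `is_flash_only` are hypothetical names, and treating an empty string as "not set" is an assumption.

```python
import os


def resolve_token(explicit_token=None, env=None):
    """Return the token that would be used, or None if there is none.

    An explicit token takes priority; otherwise MINERU_TOKEN is read from
    the environment. Assumption: an empty string counts as unset.
    """
    env = os.environ if env is None else env
    return explicit_token or env.get("MINERU_TOKEN") or None


def is_flash_only(explicit_token=None, env=None):
    """True when only flash_extract() would be usable."""
    return resolve_token(explicit_token, env) is None
```

Checking `is_flash_only()` before calling `extract()` lets an application report a missing token early instead of handling NoAuthClientError later.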

Precision methods

These defaults apply to extract(), extract_batch(), submit(), submit_batch(), and indirectly to crawl() / crawl_batch() unless noted otherwise.

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| model | None | Auto-infers model: .html/.htm uses "html", everything else uses "vlm" |
| ocr | not set | OCR is disabled (API default) |
| formula | not set | Formula recognition is enabled (API default) |
| table | not set | Table recognition is enabled (API default) |
| language | not set | Chinese "ch" (API default) |
| pages | None | Full document is processed |
| extra_formats | None | Only the default Markdown/JSON payload is returned |
| file_params | None | Per-file overrides for batch methods. A dict[str, FileParam] keyed by path/URL, where FileParam has fields pages, ocr, data_id |
| timeout | 300 seconds for single-item methods | Max total polling time for extract() / crawl() |
| timeout | 1800 seconds for batch methods | Max total polling time for extract_batch() / crawl_batch() |
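The model auto-inference rule (.html/.htm uses "html", everything else uses "vlm") can be illustrated with a few lines of Python. This is a sketch of the documented behavior, not the SDK's own code: `infer_model` is a hypothetical name, and both the case-insensitive match and the decision to ignore URL query strings are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def infer_model(source):
    """Guess the model for a local path or URL from its file extension."""
    # urlparse leaves plain paths untouched and strips ?query from URLs.
    path = urlparse(source).path
    suffix = PurePosixPath(path).suffix.lower()
    return "html" if suffix in {".html", ".htm"} else "vlm"
```

Passing `model=` explicitly to `extract()` avoids relying on inference when a source has no extension or a misleading one.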

Flash Extract

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| language | "ch" | Default language is Chinese |
| page_range | None | Full page range allowed by the flash API |
| timeout | 300 seconds | Max total polling time |

crawl() / crawl_batch()

  • crawl() is shorthand for extract(url, model="html", ...)
  • crawl_batch() is shorthand for extract_batch(urls, model="html", ...)

📖 Detailed Usage

Precision Extraction Options

result = client.extract(
    "./paper.pdf",
    model="vlm",             # "vlm" | "pipeline" | "html"
    ocr=True,                # Enable OCR for scanned documents
    formula=True,            # Formula recognition
    table=True,              # Table recognition
    language="en",           # "ch" | "en" | etc.
    pages="1-20",            # Page range
    extra_formats=["docx"],  # Export as docx, html, or latex
    timeout=600,
)

result.save_all("./output/") # Save markdown and all assets

Context Manager

from mineru import MinerU

with MinerU("your-api-token") as client:
    result = client.extract("./paper.pdf")
    print(result.markdown)

Batch Processing

# Yields results as they complete
for result in client.extract_batch(["a.pdf", "b.pdf", "c.pdf"]):
    print(f"{result.filename}: Done")
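When a batch mixes good and bad inputs, it can help to separate the results by `result.state`. The sketch below is a hypothetical helper, not part of the SDK; it assumes that `extract_batch()` yields failed items with `state == "failed"` rather than raising, which this README does not confirm.

```python
def partition_results(results):
    """Split an iterable of results into (done, not_done) lists.

    Anything whose state is not "done" lands in the second list, so the
    caller can log or retry those items.
    """
    done, not_done = [], []
    for result in results:
        if result.state == "done":
            done.append(result)
        else:
            not_done.append(result)
    return done, not_done
```

Typical usage would be `done, not_done = partition_results(client.extract_batch(["a.pdf", "b.pdf"]))`.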

Batch With Per-File Pages

from mineru import FileParam

batch_id = client.submit_batch(
    ["a.pdf", "b.pdf"],
    file_params={
        "a.pdf": FileParam(pages="1-5"),
        "b.pdf": FileParam(pages="10-20"),
    },
)

Web Crawling

result = client.crawl("https://www.baidu.com")
print(result.markdown)

🔄 submit() / get_batch() Semantics

This is the part most people get wrong at first:

  • submit() returns a batch ID
  • submit_batch() also returns a batch ID
  • the common async flow is therefore submit(...) -> get_batch(batch_id)
  • staying on the batch-based flow is recommended for async polling

Recommended async flow

import time

batch_id = client.submit("large-report.pdf")

# Poll the batch until the first item reaches a terminal state
while True:
    results = client.get_batch(batch_id)
    result = results[0]
    if result.state in ("done", "failed"):
        break
    time.sleep(5)  # pause between polls; the interval is illustrative

if result.state == "done":
    do_something(result.markdown)
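For longer-running jobs, the same flow can be wrapped in a helper that waits for every item in the batch and gives up after a deadline. This is a minimal sketch, not an SDK function: `wait_for_batch`, its parameters, and the default intervals are hypothetical, and it relies only on `client.get_batch()` and the "done" / "failed" states shown above.

```python
import time

TERMINAL_STATES = ("done", "failed")


def wait_for_batch(client, batch_id, poll_interval=5.0, timeout=1800.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_batch() until every item is done or failed.

    Raises TimeoutError if the batch is still running after `timeout`
    seconds. `sleep` and `clock` are injectable to make testing easy.
    """
    deadline = clock() + timeout
    while True:
        results = client.get_batch(batch_id)
        if results and all(r.state in TERMINAL_STATES for r in results):
            return results
        if clock() >= deadline:
            raise TimeoutError(f"batch {batch_id} not finished after {timeout}s")
        sleep(poll_interval)
```

Because `submit()` and `submit_batch()` both return a batch ID, the same helper covers single files and multi-file batches.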

🤖 Integration for AI Agents

The SDK is designed to be easily integrated into LLM workflows. For status updates, you can check result.state and result.progress.

batch_id = client.submit("large-report.pdf")
# ... later ...
result = client.get_batch(batch_id)[0]
if result.state == "done":
    do_something(result.markdown)
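An agent tool usually needs a compact, serializable status rather than the result object itself. The sketch below is one hypothetical way to build that from `result.state`, `result.progress`, and `result.task_id`; `describe_status` is not an SDK function, and because the type of `result.progress` is not documented here, anything that is not JSON-native is converted with `str`.

```python
import json


def describe_status(result):
    """Return a JSON string summarizing a result for an LLM tool response."""
    status = {
        "task_id": result.task_id,
        "state": result.state,
        "progress": result.progress,
        "finished": result.state in ("done", "failed"),
    }
    # default=str keeps this working if progress is a non-JSON object.
    return json.dumps(status, default=str)
```

The agent can call this after each `get_batch()` poll and only request `result.markdown` once `finished` is true and the state is "done".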

📄 License

This project is licensed under the Apache-2.0 License.

