
MinerU API Python SDK — one line to turn documents into Markdown

MinerU Open API SDK (Python)

Chinese documentation

MinerU Open API SDK is a completely free Python library for the MinerU document extraction service. Turn any document (PDF, images, Word, PPT, Excel) or web page into high-quality Markdown with just one line of code.


🚀 Key Features

  • Completely Free: No hidden costs for document extraction.
  • Flash Extract (No Auth): Extract text instantly without an API token.
  • Precision Extract: Comprehensive extraction with layout preservation, images, and formula support.
  • Batch & Polling Primitives: Blocking methods for simple flows plus submit/query methods for asynchronous workflows.
  • Simple Save Helpers: Save Markdown, HTML, LaTeX, DOCX, or the full extracted zip with built-in helpers.

📦 Install

pip install mineru-open-sdk

🛠️ Quick Start

1. Flash Extract (Fast, No Auth, Markdown-only)

Ideal for quick previews. No token required.

from mineru import MinerU

# No token needed for Flash Extract
client = MinerU()
result = client.flash_extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)

2. Precision Extract (Auth Required)

Supports large files, rich assets (images/tables), and multiple formats.

from mineru import MinerU

# Get your free token from https://mineru.net/apiManage/token
client = MinerU("your-api-token")
result = client.extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)
print(result.images) # Access extracted images

🧩 Supported Public API

Client lifecycle

  • MinerU(token: str | None = None, base_url: str = ..., flash_base_url: str | None = None)
  • client.close()
  • client.set_source("your-app")
  • context manager support: with MinerU(...) as client:

Blocking extraction methods

  • client.extract(...) -> ExtractResult
  • client.extract_batch(...) -> Iterator[ExtractResult]
  • client.crawl(...) -> ExtractResult
  • client.crawl_batch(...) -> Iterator[ExtractResult]
  • client.flash_extract(...) -> ExtractResult

Submit/query methods

  • client.submit(...) -> str
  • client.submit_batch(...) -> str
  • client.get_batch(batch_id) -> list[ExtractResult]
  • client.get_task(task_id) -> ExtractResult

Result helpers

  • result.save_markdown(path, with_images=True)
  • result.save_docx(path)
  • result.save_html(path)
  • result.save_latex(path)
  • result.save_all(dir)
  • image.save(path)

Result fields you will usually use

  • result.state
  • result.progress
  • result.markdown
  • result.images
  • result.content_list
  • result.docx
  • result.html
  • result.latex
  • result.task_id

📊 Mode Comparison

| Feature | Flash Extract | Precision Extract |
| --- | --- | --- |
| Auth | No Auth Required | Auth Required (Token) |
| Speed | Blazing Fast | Standard |
| File Limit | Max 10 MB | Max 200 MB |
| Page Limit | Max 20 Pages | Max 600 Pages |
| Formats | PDF, Images, Docx, PPTx, Excel | PDF, Images, Doc/x, Ppt/x, Html |
| Content | Markdown only (Placeholders) | Full assets (Images, Tables, Formulas) |
| Output | Markdown | MD, Docx, LaTeX, HTML, JSON |
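
The limits in the comparison above can be turned into a small pre-flight check before picking a method. The following is a minimal sketch, not part of the SDK: `choose_mode` and its parameters are hypothetical names, and it assumes "MB" means 1024 * 1024 bytes, which the table does not specify.

```python
from typing import Optional

# Limits copied from the mode comparison.
# Assumption: "MB" is interpreted as 1024 * 1024 bytes.
MB = 1024 * 1024
FLASH_MAX_BYTES = 10 * MB
FLASH_MAX_PAGES = 20
PRECISION_MAX_BYTES = 200 * MB
PRECISION_MAX_PAGES = 600


def choose_mode(size_bytes: int, pages: int, need_assets: bool = False,
                has_token: bool = True) -> Optional[str]:
    """Return "flash", "precision", or None if neither mode fits (hypothetical helper)."""
    fits_flash = size_bytes <= FLASH_MAX_BYTES and pages <= FLASH_MAX_PAGES
    fits_precision = size_bytes <= PRECISION_MAX_BYTES and pages <= PRECISION_MAX_PAGES
    # Flash returns Markdown only, so skip it when images/tables/formulas are needed.
    if fits_flash and not need_assets:
        return "flash"
    # Precision needs a token in addition to fitting the larger limits.
    if fits_precision and has_token:
        return "precision"
    return None
```

A caller could then dispatch to `client.flash_extract(...)` or `client.extract(...)` depending on the returned string.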

⚙️ Defaults And Option Behavior

MinerU(...)

| Argument | Default | Behavior |
| --- | --- | --- |
| token | None | If omitted, the SDK reads MINERU_TOKEN from the environment |
| base_url | https://mineru.net/api/v4 | Standard API base URL |
| flash_base_url | SDK default flash URL | Override flash API endpoint for testing/private deployments |

If neither token nor MINERU_TOKEN is set, the client works in flash-only mode: flash_extract() works, while auth-required methods raise NoAuthClientError.
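
To make the lookup order concrete, here is a minimal sketch of the documented behavior. It is an illustration only, not the SDK's internal code: `resolve_token` is a hypothetical helper, and treating an empty string as "no token" is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_token(explicit: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> Tuple[Optional[str], str]:
    """Mimic the documented order: explicit argument first, then MINERU_TOKEN."""
    env = os.environ if env is None else env
    # Assumption: an empty string counts as "not set".
    token = explicit or env.get("MINERU_TOKEN") or None
    mode = "full" if token else "flash-only"
    return token, mode
```

In "flash-only" mode only `flash_extract()` is expected to work; the auth-required methods raise `NoAuthClientError` as described above.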

Precision methods

These defaults apply to extract(), extract_batch(), submit(), submit_batch(), and indirectly to crawl() / crawl_batch() unless noted otherwise.

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| model | None | Auto-infers model: .html/.htm uses "html", everything else uses "vlm" |
| ocr | not set | OCR is disabled (API default) |
| formula | not set | Formula recognition is enabled (API default) |
| table | not set | Table recognition is enabled (API default) |
| language | not set | Chinese "ch" (API default) |
| pages | None | Full document is processed |
| extra_formats | None | Only the default Markdown/JSON payload is returned |
| file_params | None | Per-file overrides for batch methods. A dict[str, FileParam] keyed by path/URL, where FileParam has fields pages, ocr, data_id |
| timeout | 300 seconds for single-item methods | Max total polling time for extract() / crawl() |
| timeout | 1800 seconds for batch methods | Max total polling time for extract_batch() / crawl_batch() |
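
The model auto-inference rule for an omitted `model` can be sketched as follows. This is an approximation for illustration, not the SDK's implementation: `infer_model` is a hypothetical name, and ignoring URL query strings and letter case when checking the extension is an assumption.

```python
from urllib.parse import urlparse


def infer_model(source: str) -> str:
    """Return "html" for .html/.htm sources and "vlm" for everything else."""
    # urlparse separates the path from any query string or fragment in a URL;
    # for a plain local path without "?" or "#" the whole string ends up in .path.
    path = urlparse(source).path.lower()
    return "html" if path.endswith((".html", ".htm")) else "vlm"
```

Passing `model=` explicitly, as in the detailed usage examples, bypasses any inference of this kind.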

Flash Extract

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| language | "ch" | Default language is Chinese |
| page_range | None | Full page range allowed by the flash API |
| timeout | 300 seconds | Max total polling time |

crawl() / crawl_batch()

  • crawl() is shorthand for extract(url, model="html", ...)
  • crawl_batch() is shorthand for extract_batch(urls, model="html", ...)
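
The shorthand relationship can be written out as a thin wrapper. This is only a sketch of the equivalence stated above: `crawl_via_extract` is a hypothetical function, and `client` is any object exposing an `extract()` method.

```python
from typing import Any


def crawl_via_extract(client: Any, url: str, **options: Any) -> Any:
    """Call extract() with model="html", mirroring the documented shorthand."""
    # Other keyword options (for example timeout) are forwarded unchanged.
    return client.extract(url, model="html", **options)
```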

📖 Detailed Usage

Precision Extraction Options

result = client.extract(
    "./paper.pdf",
    model="vlm",             # "vlm" | "pipeline" | "html"
    ocr=True,                # Enable OCR for scanned documents
    formula=True,            # Formula recognition
    table=True,              # Table recognition
    language="en",           # "ch" | "en" | etc.
    pages="1-20",            # Page range
    extra_formats=["docx"],  # Export as docx, html, or latex
    timeout=600,
)

result.save_all("./output/") # Save markdown and all assets
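
When only some outputs are wanted instead of `save_all()`, the individual save helpers can be driven from the same list passed as `extra_formats`. The sketch below is an assumption-laden illustration: `save_outputs` and its `stem` parameter are hypothetical, the `.tex` suffix for LaTeX is a guess, and it assumes a format can only be saved if it was requested during extraction.

```python
from pathlib import Path
from typing import Any, Iterable, List


def save_outputs(result: Any, out_dir: str, stem: str,
                 formats: Iterable[str] = ()) -> List[Path]:
    """Write Markdown plus the requested extra formats; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Map each format name to (file suffix, name of the documented save helper).
    savers = {
        "docx": (".docx", "save_docx"),
        "html": (".html", "save_html"),
        "latex": (".tex", "save_latex"),  # suffix is an assumption
    }

    written = [out / f"{stem}.md"]
    result.save_markdown(str(written[0]), with_images=True)
    for fmt in formats:
        suffix, method = savers[fmt]  # KeyError on an unsupported format name
        path = out / f"{stem}{suffix}"
        getattr(result, method)(str(path))
        written.append(path)
    return written
```

For the call above, `save_outputs(result, "./output", "paper", ["docx"])` would write `paper.md` and `paper.docx`.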

Context Manager

from mineru import MinerU

with MinerU("your-api-token") as client:
    result = client.extract("./paper.pdf")
    print(result.markdown)

Batch Processing

# Yields results as they complete
for result in client.extract_batch(["a.pdf", "b.pdf", "c.pdf"]):
    print(f"{result.filename}: Done")
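
In a real batch some files may not succeed, so it can help to separate results by `result.state` while iterating. A minimal sketch, assuming that `extract_batch()` yields unsuccessful items with a state other than "done" rather than raising (the source does not say); `split_batch` is a hypothetical helper.

```python
from typing import Any, Iterable, List, Tuple


def split_batch(results: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Consume results as they arrive and separate successes from the rest."""
    done: List[Any] = []
    not_done: List[Any] = []
    for result in results:
        # "done" is the success state used in the polling examples of this README.
        (done if result.state == "done" else not_done).append(result)
    return done, not_done
```

Usage would look like `done, failed = split_batch(client.extract_batch(["a.pdf", "b.pdf"]))`.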

Batch With Per-File Pages

from mineru import FileParam

batch_id = client.submit_batch(
    ["a.pdf", "b.pdf"],
    file_params={
        "a.pdf": FileParam(pages="1-5"),
        "b.pdf": FileParam(pages="10-20"),
    },
)

Web Crawling

result = client.crawl("https://www.baidu.com")
print(result.markdown)

🔄 submit() / get_batch() Semantics

This is the part most people get wrong at first:

  • submit() returns a batch ID
  • submit_batch() also returns a batch ID
  • the common async flow is therefore submit(...) -> get_batch(batch_id)
  • the recommended approach is to stay on the batch-based flow for async polling

Recommended async flow

import time

batch_id = client.submit("large-report.pdf")

# Poll the batch until its first item is done or failed
while True:
    results = client.get_batch(batch_id)
    result = results[0]
    if result.state in ("done", "failed"):
        break
    time.sleep(2)  # pause between polls instead of busy-looping

if result.state == "done":
    do_something(result.markdown)  # placeholder for your own handler
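
For longer jobs the loop above is usually wrapped in a helper with an overall deadline. The following is a minimal sketch under stated assumptions: `wait_for_batch`, `interval`, and the injectable `sleep` are hypothetical names, and "done" / "failed" are taken to be the only terminal states because they are the only ones this README shows.

```python
import time
from typing import Any, Callable, List


def wait_for_batch(client: Any, batch_id: str, interval: float = 2.0,
                   timeout: float = 300.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[Any]:
    """Poll get_batch() until every item is "done" or "failed", or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        results = client.get_batch(batch_id)
        if results and all(r.state in ("done", "failed") for r in results):
            return results
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} not finished after {timeout}s")
        sleep(interval)
```

Using `time.monotonic()` keeps the deadline stable even if the system clock is adjusted, and the `sleep` parameter makes the helper easy to exercise in tests without real waiting.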

🤖 Integration for AI Agents

The SDK is designed to be easily integrated into LLM workflows. For status updates, you can check result.state and result.progress.

batch_id = client.submit("large-report.pdf")
# ... later ...
result = client.get_batch(batch_id)[0]
if result.state == "done":
    do_something(result.markdown)
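
An agent tool typically wants a compact status object rather than the full result. This sketch shows one way to build it from `result.state` and `result.progress`; `task_status` and `preview_chars` are hypothetical names, and the value of `progress` is passed through untouched because its type is not documented here.

```python
from typing import Any, Dict


def task_status(result: Any, preview_chars: int = 500) -> Dict[str, Any]:
    """Summarise a result as a small dict that an agent can read."""
    status: Dict[str, Any] = {"state": result.state, "progress": result.progress}
    if result.state == "done":
        # Truncate so a long document does not flood the model's context.
        status["markdown_preview"] = (result.markdown or "")[:preview_chars]
    return status
```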

📄 License

This project is licensed under the Apache-2.0 License.
