MinerU API Python SDK — one line to turn documents into Markdown

MinerU Open API SDK (Python)

Chinese documentation

MinerU Open API SDK is a completely free Python library for the MinerU document extraction service. It turns a document (PDF, image, Word, PPT, Excel) or a web page into high-quality Markdown with one line of code.


🚀 Key Features

  • Completely Free: No hidden costs for document extraction.
  • Flash Mode (No Auth): Extract text instantly without an API token.
  • Full Feature Mode: Comprehensive extraction with layout preservation, images, and formula support.
  • Batch & Polling Primitives: Blocking methods for simple flows plus submit/query methods for asynchronous workflows.
  • Simple Save Helpers: Save Markdown, HTML, LaTeX, DOCX, or the full extracted zip with built-in helpers.

📦 Install

pip install mineru-open-sdk

🛠️ Quick Start

1. Flash Extract (Fast, No Auth, Markdown-only)

Ideal for quick previews. No token required.

from mineru import MinerU

# No token needed for Flash Mode
client = MinerU()
result = client.flash_extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)

2. Full Feature Extract (Auth Required)

Supports large files, rich assets (images/tables), and multiple formats.

from mineru import MinerU

# Get your free token from https://mineru.net
client = MinerU("your-api-token")
result = client.extract("https://cdn-mineru.openxlab.org.cn/demo/example.pdf")

print(result.markdown)
print(result.images) # Access extracted images

🧩 Supported Public API

Client lifecycle

  • MinerU(token: str | None = None, base_url: str = ..., flash_base_url: str | None = None)
  • client.close()
  • client.set_source("your-app")
  • context manager support: with MinerU(...) as client:

Blocking extraction methods

  • client.extract(...) -> ExtractResult
  • client.extract_batch(...) -> Iterator[ExtractResult]
  • client.crawl(...) -> ExtractResult
  • client.crawl_batch(...) -> Iterator[ExtractResult]
  • client.flash_extract(...) -> ExtractResult

Submit/query methods

  • client.submit(...) -> str
  • client.submit_batch(...) -> str
  • client.get_batch(batch_id) -> list[ExtractResult]
  • client.get_task(task_id) -> ExtractResult

Result helpers

  • result.save_markdown(path, with_images=True)
  • result.save_docx(path)
  • result.save_html(path)
  • result.save_latex(path)
  • result.save_all(dir)
  • image.save(path)

Result fields you will usually use

  • result.state
  • result.progress
  • result.markdown
  • result.images
  • result.content_list
  • result.docx
  • result.html
  • result.latex
  • result.task_id

📊 Mode Comparison

| Feature | Flash Extract | Full Feature Extract |
| --- | --- | --- |
| Auth | No Auth Required | Auth Required (Token) |
| Speed | Blazing Fast | Standard |
| File Limit | Max 10 MB | Max 200 MB |
| Page Limit | Max 20 Pages | Max 600 Pages |
| Formats | PDF, Images, Docx, PPTx, Excel | PDF, Images, Doc/x, Ppt/x, Html |
| Content | Markdown only (Placeholders) | Full assets (Images, Tables, Formulas) |
| Output | Markdown | MD, Docx, LaTeX, HTML, JSON |
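
The flash limits in the comparison above can be checked locally before choosing a mode. The following is a minimal sketch, not part of the SDK: `fits_flash_limits` and both constants are hypothetical names, and it assumes "10 MB" means 10 × 1024 × 1024 bytes. The page count is optional because obtaining it requires a separate PDF library.

```python
# Hypothetical pre-flight check; these names are not part of the SDK.
FLASH_MAX_BYTES = 10 * 1024 * 1024  # assumption: "Max 10 MB" read as 10 MiB
FLASH_MAX_PAGES = 20                # "Max 20 Pages" from the comparison table


def fits_flash_limits(size_bytes, page_count=None):
    """Return True if a document looks small enough for flash mode."""
    if size_bytes > FLASH_MAX_BYTES:
        return False
    # Only check pages when the caller actually knows the page count.
    if page_count is not None and page_count > FLASH_MAX_PAGES:
        return False
    return True
```

A caller could pass `os.path.getsize(path)` as `size_bytes` and fall back to `client.extract(...)` when the check fails.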

⚙️ Defaults And Option Behavior

MinerU(...)

| Argument | Default | Behavior |
| --- | --- | --- |
| token | None | If omitted, the SDK reads MINERU_TOKEN from the environment |
| base_url | https://mineru.net/api/v4 | Standard API base URL |
| flash_base_url | SDK default flash URL | Override flash API endpoint for testing/private deployments |

If neither token nor MINERU_TOKEN is set, the client works in flash-only mode: flash_extract() works, while auth-required methods raise NoAuthClientError.
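
One way to act on this rule is to decide up front which method to call, rather than waiting for NoAuthClientError. The sketch below is an illustration under assumptions: `token_available` and `to_markdown` are hypothetical helpers, and `client` is any object exposing the `extract()` and `flash_extract()` methods listed above.

```python
import os


def token_available(explicit_token=None, env=None):
    """Mirror the documented lookup: explicit token first, then MINERU_TOKEN."""
    env = os.environ if env is None else env
    return bool(explicit_token or env.get("MINERU_TOKEN"))


def to_markdown(client, source, has_token):
    """Use full extraction when a token exists, flash mode otherwise."""
    if has_token:
        return client.extract(source).markdown
    # Flash-only mode: Markdown-only output, no token needed.
    return client.flash_extract(source).markdown
```

Typical use would be `to_markdown(client, "./paper.pdf", token_available())`, with `client` created as `MinerU()`.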

Full-feature methods

These defaults apply to extract(), extract_batch(), submit(), submit_batch(), and indirectly to crawl() / crawl_batch() unless noted otherwise.

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| model | None | Auto-infers model: .html/.htm uses "html", everything else uses "vlm" |
| ocr | not set | OCR is disabled (API default) |
| formula | not set | Formula recognition is enabled (API default) |
| table | not set | Table recognition is enabled (API default) |
| language | not set | Chinese "ch" (API default) |
| pages | None | Full document is processed |
| extra_formats | None | Only the default Markdown/JSON payload is returned |
| file_params | None | Per-file overrides for batch methods. A dict[str, FileParam] keyed by path/URL, where FileParam has fields pages, ocr, data_id |
| timeout | 300 seconds for single-item methods | Max total polling time for extract() / crawl() |
| timeout | 1800 seconds for batch methods | Max total polling time for extract_batch() / crawl_batch() |
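
The model auto-inference rule in the first row can be read as a simple suffix check. This is a sketch of the documented rule, not the SDK's actual implementation: `infer_model` is a hypothetical name, and the case-insensitive match is an assumption.

```python
def infer_model(source):
    """Pick a model name from a path or URL, following the documented rule."""
    # Assumption: the suffix comparison ignores case.
    if source.lower().endswith((".html", ".htm")):
        return "html"
    return "vlm"
```

Passing `model=` explicitly to `extract()` bypasses any inference, which is the safer choice for URLs that do not end in a file extension.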

Flash mode

| Option | Default | Behavior when omitted |
| --- | --- | --- |
| language | "ch" | Default language is Chinese |
| page_range | None | Full page range allowed by the flash API |
| timeout | 300 seconds | Max total polling time |

crawl() / crawl_batch()

  • crawl() is shorthand for extract(url, model="html", ...)
  • crawl_batch() is shorthand for extract_batch(urls, model="html", ...)

📖 Detailed Usage

Full Feature Extraction Options

result = client.extract(
    "./paper.pdf",
    model="vlm",             # "vlm" | "pipeline" | "html"
    ocr=True,                # Enable OCR for scanned documents
    formula=True,            # Formula recognition
    table=True,              # Table recognition
    language="en",           # "ch" | "en" | etc.
    pages="1-20",            # Page range
    extra_formats=["docx"],  # Export as docx, html, or latex
    timeout=600,
)

result.save_all("./output/") # Save markdown and all assets
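
When only some formats were requested through `extra_formats`, the individual save helpers can be called selectively. The sketch below assumes that `result.docx`, `result.html`, and `result.latex` are `None` when a format was not requested, which the source does not confirm. `save_available` and the output file names are hypothetical.

```python
from pathlib import Path

# (field on the result, save helper to call, output file name)
_OPTIONAL_EXPORTS = (
    ("docx", "save_docx", "document.docx"),
    ("html", "save_html", "document.html"),
    ("latex", "save_latex", "document.tex"),
)


def save_available(result, out_dir):
    """Save Markdown plus whichever extra formats the result carries."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    result.save_markdown(str(out / "document.md"), with_images=True)
    written = ["document.md"]
    for field, saver, name in _OPTIONAL_EXPORTS:
        # Assumption: a field that was not requested is None.
        if getattr(result, field, None) is not None:
            getattr(result, saver)(str(out / name))
            written.append(name)
    return written
```

`result.save_all(dir)` remains the simpler choice when the full extracted package is wanted.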

Context Manager

from mineru import MinerU

with MinerU("your-api-token") as client:
    result = client.extract("./paper.pdf")
    print(result.markdown)

Batch Processing

# Yields results as they complete
for result in client.extract_batch(["a.pdf", "b.pdf", "c.pdf"]):
    print(f"{result.filename}: Done")
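
A batch can contain items that did not finish successfully, so it may help to separate results by `result.state` before using them. This sketch assumes failed items are yielded with a non-"done" state rather than raised as exceptions, which the source does not spell out. `split_by_state` is a hypothetical helper.

```python
def split_by_state(results):
    """Partition results into (done, not_done) based on result.state."""
    done, not_done = [], []
    for result in results:
        if result.state == "done":
            done.append(result)
        else:
            not_done.append(result)
    return done, not_done
```

For example, `done, not_done = split_by_state(client.extract_batch(["a.pdf", "b.pdf"]))` consumes the iterator and leaves only successful results in `done`.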

Batch With Per-File Pages

from mineru import FileParam

batch_id = client.submit_batch(
    ["a.pdf", "b.pdf"],
    file_params={
        "a.pdf": FileParam(pages="1-5"),
        "b.pdf": FileParam(pages="10-20"),
    },
)

Web Crawling

result = client.crawl("https://www.baidu.com")
print(result.markdown)

🔄 submit() / get_batch() Semantics

This is the part most people get wrong at first:

  • submit() returns a batch ID
  • submit_batch() also returns a batch ID
  • the common async flow is therefore submit(...) -> get_batch(batch_id)
  • staying on the batch-based flow is recommended for async polling

Recommended async flow

import time

batch_id = client.submit("large-report.pdf")

# Poll the batch until its first (and only) item reaches a terminal state
while True:
    results = client.get_batch(batch_id)
    result = results[0]
    if result.state in ("done", "failed"):
        break
    time.sleep(2)  # pause between polls instead of busy-looping

if result.state == "done":
    do_something(result.markdown)
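
For longer-running jobs, the same flow can be wrapped in a helper that waits for every item in the batch and gives up after a deadline. This is a minimal sketch rather than an SDK feature: `wait_for_batch`, its parameters, and the default values are hypothetical, and it relies only on `client.get_batch()` and the "done" / "failed" states shown above.

```python
import time


def wait_for_batch(client, batch_id, interval=2.0, timeout=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_batch() until every item is done or failed, or time runs out."""
    deadline = clock() + timeout
    while True:
        results = client.get_batch(batch_id)
        # An empty list is treated as "not ready yet".
        if results and all(r.state in ("done", "failed") for r in results):
            return results
        if clock() >= deadline:
            raise TimeoutError(f"batch {batch_id} not finished after {timeout} s")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.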

🤖 Integration for AI Agents

The SDK is designed to be easily integrated into LLM workflows. For status updates, you can check result.state and result.progress.

batch_id = client.submit("large-report.pdf")
# ... later ...
result = client.get_batch(batch_id)[0]
if result.state == "done":
    do_something(result.markdown)
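
An agent that only needs a status report can take a single non-blocking snapshot instead of waiting. The sketch below uses just the `task_id`, `state`, and `progress` fields listed earlier. `batch_status` is a hypothetical helper, and the type of `progress` is not specified in the source, so it is passed through unchanged.

```python
def batch_status(client, batch_id):
    """Return one plain dict per batch item, suitable for a tool response."""
    return [
        {"task_id": r.task_id, "state": r.state, "progress": r.progress}
        for r in client.get_batch(batch_id)
    ]
```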

📄 License

This project is licensed under the Apache-2.0 License.
