MinerU online-API converter plugin for dikw client import — PDF + DOCX + PPTX + XLSX → markdown.

dikw-converter-mineru

MinerU online-API converter plugin for dikw-core's dikw client import. Once installed alongside dikw-core, running

dikw client import paper.pdf

uploads the PDF to MinerU, waits for it to finish parsing, downloads the result ZIP, and commits the converted markdown + assets into <base>/sources/paper/.
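The "waits for it to finish parsing" step is a polling loop. The sketch below is illustrative only and is not the plugin's code: `fetch_state` is a hypothetical callable standing in for whatever HTTP call returns the task status, and the `state`, `zip_url` and `error` field names are assumptions, not MinerU's documented response shape.

```python
import time
from typing import Callable


def wait_for_result(
    fetch_state: Callable[[], dict],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a task until it finishes and return the result ZIP URL.

    `fetch_state` is a hypothetical stand-in for the status request; the
    dict keys used here ("state", "zip_url", "error") are assumed names.
    """
    waited = 0.0
    while waited <= timeout:
        task = fetch_state()
        if task["state"] == "done":
            return task["zip_url"]
        if task["state"] == "failed":
            raise RuntimeError(f"parse failed: {task.get('error', 'unknown')}")
        # Any other state is treated as "still working": wait and ask again.
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"task not finished after {timeout:.0f}s")
```

Passing `sleep` as a parameter keeps the loop testable without real delays.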

Supported formats

The MinerU API claims many formats; the plugin's extensions tuple is deliberately the subset that fits dikw cleanly:

  • .pdf — primary use case
  • .docx, .doc
  • .pptx, .ppt
  • .xlsx, .xls

Not enabled in v0.1: image inputs (.png / .jpg etc. — would collide with dikw-core's asset semantics) and .html (overkill for HTML; use a lighter local converter). Both may land in v0.2 behind an env flag.

Install

# Once published:
pip install dikw-converter-mineru

# Upgrade later:
pip install --upgrade dikw-converter-mineru

# Pin a specific version:
pip install 'dikw-converter-mineru==0.1.0'

# Uninstall — the entry-point disappears on next discovery.
pip uninstall dikw-converter-mineru

# For local development from this monorepo:
pip install -e packages/dikw-converter-mineru

Changelog

See CHANGELOG.md for the per-release history. Each GitHub Release also carries the same notes; published wheels and sdists are attached there for offline / air-gapped installs.

Auth — MinerUAPIKey env var

The plugin resolves the MinerU API token in this order, with the first match winning:

  1. Explicit constructor param wins: MineruConverter(api_key="…"). Useful for programmatic use, smoke tests, or scripts where you don't want to rely on shell-level env.
  2. Otherwise MinerUAPIKey — matches the literal key name on MinerU's user dashboard, so users can paste-and-go.
  3. Otherwise DIKW_MINERU_API_KEY — dikw-convention fallback for environments that want all plugin secrets to share a single prefix.
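The precedence above can be sketched as a small lookup function. This is a minimal illustration of the documented order, not the plugin's actual implementation; `resolve_api_key` is a hypothetical name, and treating an empty or whitespace-only value as unset is an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variable names, in the documented fallback order.
ENV_VARS = ("MinerUAPIKey", "DIKW_MINERU_API_KEY")


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first token found: explicit param, then each env var."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    for name in ENV_VARS:
        value = env.get(name, "").strip()
        if value:  # assumption: blank values count as "not set"
            return value
    raise RuntimeError(
        "No MinerU API token found; set MinerUAPIKey or DIKW_MINERU_API_KEY"
    )
```

For example, `resolve_api_key(env={"DIKW_MINERU_API_KEY": "tok"})` returns `"tok"` only because `MinerUAPIKey` is absent from that mapping.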

The plugin does not auto-load .env — that would force a python-dotenv dep and surprise users about which file gets loaded when. Load .env into your shell yourself, e.g.

# uv (cross-platform)
uv run --env-file .env dikw client import paper.pdf

# PowerShell
$env:MinerUAPIKey = ((Get-Content .env | Select-String "^MinerUAPIKey=") -split "=", 2)[1]
dikw client import paper.pdf

# direnv / shell rc are also fine.
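For a Python script rather than a shell, a stdlib-only reader is enough to pull one key out of .env and hand it to the constructor. This is a rough sketch, assuming plain `KEY=value` lines; `read_env_value` is a hypothetical helper and it ignores `export` prefixes, escapes and multi-line values.

```python
from pathlib import Path
from typing import Optional


def read_env_value(path: str, name: str) -> Optional[str]:
    """Return the value for `name` from a simple KEY=value file, or None."""
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop surrounding whitespace and quote characters.
            return value.strip().strip('"').strip("'")
    return None
```

The returned string can then be passed as `MineruConverter(api_key=...)`, which takes precedence over any environment variable.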

Get a token at mineru.net → user menu → API manage. Tokens are JWTs and last roughly 90 days; rotate at expiry.
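Because the token is a JWT, its expiry can usually be read locally before a long import run. The sketch below assumes the token carries the standard `exp` claim (not confirmed for MinerU tokens) and only decodes the payload; it does not verify the signature. `jwt_expiry` is a hypothetical helper name.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Optional


def jwt_expiry(token: str) -> Optional[datetime]:
    """Decode the JWT payload and return its `exp` claim as a UTC datetime.

    Returns None when the payload has no `exp` claim. No signature check.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload_b64 = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = payload.get("exp")
    if exp is None:
        return None
    return datetime.fromtimestamp(exp, tz=timezone.utc)
```

Comparing the result with `datetime.now(timezone.utc)` gives the remaining lifetime, which is a cheap way to notice an expired token before the API rejects it.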

What it produces

For paper.pdf:

<output_dir>/
├── paper.md                    # MinerU's full.md, renamed, with image refs rewritten
└── assets/
    ├── paper.pdf               # original input, kept as provenance
    └── …                       # images extracted by MinerU (png/jpg/…)

Image references in the markdown use the wikilink form (![[assets/figure-1.png|caption]]) for the same reason as dikw-converter-epub: it survives filenames containing ( or ), and alt text containing ].
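The rewrite from standard markdown image syntax to the wikilink form can be sketched with a regular expression. This is an illustration under assumptions, not the plugin's code: it assumes the incoming markdown uses `![alt](path)` references, that a reference with empty alt text gets no `|caption` part, and that images land directly under `assets/`. The simple pattern also does not handle the hard inputs (parentheses in the path, `]` in the alt text).

```python
import posixpath
import re

# Matches ![alt](target) where the target has no spaces or ")".
_MD_IMAGE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)\)")


def to_wikilinks(markdown: str, assets_dir: str = "assets") -> str:
    """Rewrite ![alt](dir/name.png) into ![[assets/name.png|alt]]."""

    def repl(match: re.Match) -> str:
        name = posixpath.basename(match.group("target"))
        alt = match.group("alt").strip()
        link = f"{assets_dir}/{name}"
        # Assumption: omit the caption separator when there is no alt text.
        return f"![[{link}|{alt}]]" if alt else f"![[{link}]]"

    return _MD_IMAGE.sub(repl, markdown)
```

Once in wikilink form, the target is delimited by `[[` and `]]` rather than parentheses, which is why a filename such as `fig (2).png` no longer terminates the link early.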

MinerU's internal byproducts (layout.json, *_content_list.json, *_model.json) are dropped — they're useful to MinerU developers, not to a dikw user. If you want them in v0.2, file an issue.
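Filtering those byproducts out of the result ZIP amounts to matching member names against the three patterns above. A minimal sketch, assuming the patterns apply to the file's base name at any depth in the archive; `wanted_members` is a hypothetical helper, not the plugin's API.

```python
import fnmatch
import zipfile

# Byproduct name patterns listed in this README.
DROP_PATTERNS = ("layout.json", "*_content_list.json", "*_model.json")


def wanted_members(zf: zipfile.ZipFile) -> list[str]:
    """Return archive member names that are not MinerU byproducts."""
    keep = []
    for info in zf.infolist():
        if info.is_dir():
            continue
        # Match on the base name so nested byproducts are dropped too.
        base = info.filename.rsplit("/", 1)[-1]
        if any(fnmatch.fnmatchcase(base, pat) for pat in DROP_PATTERNS):
            continue
        keep.append(info.filename)
    return keep
```

`fnmatchcase` is used instead of `fnmatch` so that matching behaves the same on every platform.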

Privacy

The MinerU API is hosted. Your file is uploaded to OpenXLab's CDN and processed in the cloud. Don't import documents that aren't allowed to leave your machine. If you need local processing, install one of the local-engine plugins instead (currently in the planning stage: dikw-converter-pymupdf, dikw-converter-docling).

Quota & limits

  • Each MinerU account gets ~1000 pages/day at high priority; beyond that you're downgraded (slower, not failed).
  • Hard caps: 200 MB per file, 200 pages per file. The plugin pre-checks the file size and, if the file is over the cap, fails with a clear error before any upload.
  • HTTP 5xx is retried with exponential backoff (up to 3 retries).
  • Auth errors (A0202, A0211) and quota exhaustion (-60018) fast-fail with an actionable message.
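The size pre-check and retry policy above can be sketched as follows. This is illustrative rather than the plugin's code: `ServerError` and `FatalApiError` are hypothetical exception types, the 1-second base delay is an assumption, and "200 MB" is taken to mean 200 × 1024 × 1024 bytes, which may not match the server's definition.

```python
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as MiB
FATAL_CODES = {"A0202", "A0211", "-60018"}  # auth and quota codes from the list above


class ServerError(Exception):
    """Hypothetical: raised by the caller on an HTTP 5xx response."""


class FatalApiError(Exception):
    """Hypothetical: raised by the caller for a code in FATAL_CODES."""


def check_size(path: str, limit: int = MAX_BYTES) -> int:
    """Fail before any upload if the file is larger than the cap."""
    size = Path(path).stat().st_size
    if size > limit:
        raise ValueError(f"{path}: {size} bytes exceeds the {limit}-byte cap")
    return size


def with_retries(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying ServerError with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(retries + 1):
        try:
            return call()
        except ServerError:
            if attempt == retries:
                raise  # retries exhausted: surface the last 5xx
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

`FatalApiError` is deliberately not caught, so auth and quota failures propagate on the first attempt instead of being retried.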

Determinism

VLM-based document parsing is not byte-deterministic on the server side. The plugin compensates by setting cache_tolerance to the maximum allowed value when submitting; the same input within the cache window returns the same cached result, so back-to-back imports of the same file produce identical bytes.

After the cache window lapses, the same file may produce subtly different markdown on a re-run. dikw-core's content-hash skip would then treat it as a new revision and re-chunk + re-embed. This is a documented trade-off of the hosted route; the local-engine plugins (when they land) will be fully deterministic.
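To see why a "subtly different" result triggers a full re-index, consider a content-hash comparison. The sketch assumes a SHA-256 digest over the UTF-8 markdown; dikw-core's actual hashing scheme is not described here, so both function names and the algorithm are stand-ins.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Hex digest of the markdown bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def needs_reindex(previous_hash: str, markdown: str) -> bool:
    """True when the new conversion differs from the stored revision."""
    return previous_hash != content_hash(markdown)
```

Any byte-level change, even a single added space from a fresh server-side parse, changes the digest, so the skip no longer applies.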

Known limitations (v0.1)

  • Cannot run offline.
  • No --language override yet (uses MinerU's "ch" default — Chinese+English bilingual). v0.2 will add an env-var knob.
  • No is_ocr / enable_table / enable_formula overrides yet.
  • No structured-output (extra_formats) support.

Tests

uv run pytest packages/dikw-converter-mineru

All tests are unit-level; they mock HTTP via pytest-httpx. No tests call the real MinerU API (would burn your quota + leak your token into CI artifacts).

For a real-API smoke test, place a small PDF in the workspace's scratch/ directory (gitignored) and run a one-off conversion yourself; see the plugin's plan note if one exists, or just construct MineruConverter() directly with your token.
