MinerU online-API converter plugin for dikw client import — PDF + DOCX + PPTX + XLSX → markdown.
dikw-converter-mineru
MinerU online-API converter plugin
for dikw-core's dikw client import. Once installed alongside
dikw-core, running
dikw client import paper.pdf
uploads the PDF to MinerU, waits for it to finish parsing, downloads the
result ZIP, and commits the converted markdown + assets into
<base>/sources/paper/.
Supported formats
The MinerU API claims many formats; the plugin's extensions tuple is
deliberately the subset that fits dikw cleanly:
- .pdf (primary use case)
- .docx, .doc
- .pptx, .ppt
- .xlsx, .xls
Not enabled in v0.1: image inputs (.png / .jpg etc. — would collide
with dikw-core's asset semantics) and .html (overkill for HTML; use a
lighter local converter). Both may land in v0.2 behind an env flag.
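As a rough illustration of how such an allow-list behaves, here is a minimal sketch. The names `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical, and the case-insensitive match is an assumption; the plugin's real attribute may be named and compared differently.

```python
from pathlib import Path

# Hypothetical mirror of the plugin's extensions tuple (v0.1 subset from the
# list above). The real attribute name lives in the plugin source.
SUPPORTED_EXTENSIONS = (".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls")


def is_supported(path: str) -> bool:
    """Return True if the file suffix is in the enabled subset.

    Assumption: suffixes are compared case-insensitively.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

With this sketch, `is_supported("paper.PDF")` is true while `is_supported("figure.png")` and `is_supported("page.html")` are false, matching the v0.1 exclusions.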
Install
# Once published:
pip install dikw-converter-mineru
# Upgrade later:
pip install --upgrade dikw-converter-mineru
# Pin a specific version:
pip install 'dikw-converter-mineru==0.1.0'
# Uninstall — the entry-point disappears on next discovery.
pip uninstall dikw-converter-mineru
# For local development from this monorepo:
pip install -e packages/dikw-converter-mineru
Changelog
See CHANGELOG.md for the per-release history. Each
GitHub Release also carries the same notes; published wheels and
sdists are attached there for offline / air-gapped installs.
Auth — MinerUAPIKey env var
The plugin resolves the MinerU API token in this order:
- Explicit constructor param wins: MineruConverter(api_key="…"). Useful for programmatic use, smoke tests, or scripts where you don't want to rely on shell-level env.
- Otherwise MinerUAPIKey: matches the literal key name on MinerU's user dashboard, so users can paste-and-go.
- Otherwise DIKW_MINERU_API_KEY: dikw-convention fallback for environments that want all plugin secrets to share a single prefix.
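The lookup order can be sketched as a small function. `resolve_api_key` is a hypothetical name, not the plugin's API, and treating an empty string as "unset" is an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None,
                    environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the token: explicit argument, then MinerUAPIKey, then DIKW_MINERU_API_KEY."""
    env = os.environ if environ is None else environ
    candidates = (
        explicit,
        env.get("MinerUAPIKey"),
        env.get("DIKW_MINERU_API_KEY"),
    )
    for value in candidates:
        if value:  # assumption: empty strings count as unset
            return value
    raise RuntimeError(
        "No MinerU API token found: pass api_key=..., "
        "or set MinerUAPIKey or DIKW_MINERU_API_KEY"
    )
```

Passing `environ` explicitly keeps the function testable without touching the real process environment.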
The plugin does not auto-load .env — that would force a
python-dotenv dep and surprise users about which file gets loaded
when. Load .env into your shell yourself, e.g.
# uv (cross-platform)
uv run --env-file .env dikw client import paper.pdf
# PowerShell
$env:MinerUAPIKey = ((Get-Content .env | Select-String "^MinerUAPIKey=") -split "=", 2)[1]
dikw client import paper.pdf
# direnv / shell rc are also fine.
Get a token at mineru.net → user menu → API manage. Tokens are JWTs and last roughly 90 days; rotate at expiry.
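Since tokens are JWTs, the expiry can usually be read from the payload without contacting the server. The sketch below assumes the token carries the standard `exp` claim, which is not confirmed for MinerU tokens; `jwt_expiry` is a hypothetical helper and does not verify the signature.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Optional


def jwt_expiry(token: str) -> Optional[datetime]:
    """Return the `exp` claim as a UTC datetime, or None if the claim is absent.

    The signature is NOT verified; this only inspects the payload segment.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-segment JWT")
    payload_b64 = parts[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = payload.get("exp")
    if exp is None:
        return None
    return datetime.fromtimestamp(exp, tz=timezone.utc)
```

Comparing the result with `datetime.now(timezone.utc)` gives a quick local check before a long import run.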
What it produces
For paper.pdf:
<output_dir>/
├── paper.md          # MinerU's full.md, renamed, with image refs rewritten
└── assets/
    ├── paper.pdf     # original input, kept as provenance
    └── …             # images extracted by MinerU (png/jpg/…)
Image references in the markdown use the wikilink form
(![[assets/figure-1.png|caption]]) for the same reason as
dikw-converter-epub: it survives
filenames containing ( or ), and alt text containing ].
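To make the rewrite concrete, here is a minimal sketch that converts simple inline image references into the wikilink form. It is not the plugin's implementation: `rewrite_image_refs` is a hypothetical name, and flattening every target to `assets/<basename>` is an assumption.

```python
import posixpath
import re

# Matches simple inline images: ![alt](target). Targets containing ")" and alt
# text containing "]" are exactly the inputs this pattern cannot parse, which
# is the ambiguity the wikilink form avoids on the output side.
_MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_refs(markdown: str, assets_dir: str = "assets") -> str:
    """Rewrite ![alt](dir/name.png) to ![[assets/name.png|alt]]."""
    def _repl(match):
        alt, target = match.group(1), match.group(2)
        link = f"{assets_dir}/{posixpath.basename(target)}"
        # Omit the "|caption" part when there is no alt text.
        return f"![[{link}|{alt}]]" if alt else f"![[{link}]]"

    return _MD_IMAGE.sub(_repl, markdown)
```

For example, `![Figure 1](images/fig1.png)` becomes `![[assets/fig1.png|Figure 1]]`.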
MinerU's internal byproducts (layout.json, *_content_list.json,
*_model.json) are dropped — they're useful to MinerU developers, not
to a dikw user. If you want them in v0.2, file an issue.
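The filtering step could look roughly like the following. `BYPRODUCT_PATTERNS` and `keep_members` are hypothetical names; only the three file patterns come from the description above.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Patterns for MinerU's internal byproducts, matched against the base name.
BYPRODUCT_PATTERNS = ("layout.json", "*_content_list.json", "*_model.json")


def keep_members(names: Iterable[str]) -> List[str]:
    """Return the archive member names that are not MinerU byproducts."""
    kept = []
    for name in names:
        base = name.rsplit("/", 1)[-1]  # ZIP member names use "/" separators
        if any(fnmatchcase(base, pattern) for pattern in BYPRODUCT_PATTERNS):
            continue
        kept.append(name)
    return kept
```

In practice the input would be something like `zipfile.ZipFile(path).namelist()`, with only the kept names extracted.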
Privacy
The MinerU API is hosted. Your file is uploaded to OpenXLab's CDN
and processed in the cloud. Don't import documents that aren't allowed
to leave your machine. If you need local processing, install one of the
local-engine plugins instead (currently in the planning stage:
dikw-converter-pymupdf, dikw-converter-docling).
Quota & limits
- Each MinerU account gets ~1000 pages/day at high priority; beyond that you're downgraded (slower, not failed).
- Hard caps: 200 MB per file, 200 pages per file. The plugin pre-checks the file size and fails with a clear error before any upload if the file exceeds the size cap.
- HTTP 5xx is retried with exponential backoff (up to 3 retries).
- Auth errors (A0202, A0211) and quota exhaustion (-60018) fast-fail with an actionable message.
Determinism
VLM-based document parsing is not byte-deterministic on the server
side. The plugin compensates by setting cache_tolerance to the
maximum allowed value when submitting; the same input within the cache
window returns the same cached result, so back-to-back imports
of the same file produce identical bytes.
After the cache window lapses, the same file may produce subtly different markdown on a re-run. dikw-core's content-hash skip would then treat it as a new revision and re-chunk + re-embed. This is a documented trade-off of the hosted route; the local-engine plugins (when they land) will be fully deterministic.
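The effect on a content-hash skip can be illustrated with a short sketch. The hash algorithm dikw-core actually uses is not stated here, so SHA-256 is an assumption, and both function names are hypothetical.

```python
import hashlib


def content_hash(markdown: bytes) -> str:
    """Hex digest of the converted markdown (assumption: SHA-256)."""
    return hashlib.sha256(markdown).hexdigest()


def is_new_revision(previous_hash: str, markdown: bytes) -> bool:
    """True if the bytes differ from the previously stored revision."""
    return content_hash(markdown) != previous_hash
```

Identical bytes from a cached result are skipped, while even a one-character difference in a later re-parse counts as a new revision and triggers re-chunking and re-embedding.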
Known limitations (v0.1)
- Cannot run offline.
- No --language override yet (uses MinerU's "ch" default: Chinese+English bilingual). v0.2 will add an env-var knob.
- No is_ocr / enable_table / enable_formula overrides yet.
- No structured-output (extra_formats) support.
Tests
uv run pytest packages/dikw-converter-mineru
All tests are unit-level; they mock HTTP via pytest-httpx. No tests
call the real MinerU API (would burn your quota + leak your token into
CI artifacts).
For a real-API smoke test, place a small PDF in the workspace's
scratch/ directory (gitignored) and run a one-off conversion
yourself; see the plugin's plan note if one exists, or just
construct MineruConverter() directly with your token.
File details
Details for the file dikw_converter_mineru-0.1.0.tar.gz.
File metadata
- Download URL: dikw_converter_mineru-0.1.0.tar.gz
- Upload date:
- Size: 29.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 17390aa194a79aa29c6a88228d442b8a10060f8182140201a3195de24a8306f1 |
| MD5 | c5a5afd1037bf0854bb414063abc0966 |
| BLAKE2b-256 | 9ad01f0a38cb2da5a68019ef057baf3599e71564f46436477bba89dcf6a90b08 |
Provenance
The following attestation bundles were made for dikw_converter_mineru-0.1.0.tar.gz:
Publisher: release.yml on OpenDIKW/dikw-plugins
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: dikw_converter_mineru-0.1.0.tar.gz
  - Subject digest: 17390aa194a79aa29c6a88228d442b8a10060f8182140201a3195de24a8306f1
- Sigstore transparency entry: 1553337811
- Sigstore integration time:
- Permalink: OpenDIKW/dikw-plugins@184ceb9552454096de264522e04dd3698768e8c2
- Branch / Tag: refs/tags/dikw-converter-mineru-v0.1.0
- Owner: https://github.com/OpenDIKW
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@184ceb9552454096de264522e04dd3698768e8c2
- Trigger Event: push
File details
Details for the file dikw_converter_mineru-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dikw_converter_mineru-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 166a69c96bd4be757aecf3e7368bbca4bd2fe5722b3423dc1abe55e3db92b22c |
| MD5 | 27af40569f1e27f4825150e490b17991 |
| BLAKE2b-256 | 4f02b623c0bad79eadf37cfc34c850be1576fa3b52f5fb01eb80e3d94f071f09 |
Provenance
The following attestation bundles were made for dikw_converter_mineru-0.1.0-py3-none-any.whl:
Publisher: release.yml on OpenDIKW/dikw-plugins
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: dikw_converter_mineru-0.1.0-py3-none-any.whl
  - Subject digest: 166a69c96bd4be757aecf3e7368bbca4bd2fe5722b3423dc1abe55e3db92b22c
- Sigstore transparency entry: 1553337831
- Sigstore integration time:
- Permalink: OpenDIKW/dikw-plugins@184ceb9552454096de264522e04dd3698768e8c2
- Branch / Tag: refs/tags/dikw-converter-mineru-v0.1.0
- Owner: https://github.com/OpenDIKW
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@184ceb9552454096de264522e04dd3698768e8c2
- Trigger Event: push