MinerU output linter/fixer — LLM tool-use loop that restructures (never generates) MinerU content_list. Machine-verified fidelity: C_out ⊆ C_in.
mineru-refine
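A post-processor (linter / fixer) for MinerU parsing output.
It takes MinerU's content_list (an array of item objects), fixes the high-frequency structural problems that parsing introduces (false headings, sentences broken across pages, tables split across pages, page headers and footers mixed into body text, leftover LaTeX and link fragments), and returns a content_list with the same schema, so downstream code needs no changes.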
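Two core guarantees:
- Never adds a single character: it only deletes and restructures. Every content character in the output comes from the input. This is machine-checked at every step, and any violation triggers an automatic rollback (it does not rely on prompt constraints on the LLM).
- fail-open: on any exception, or when the LLM is unavailable, the input is returned unchanged (report["failOpen"] == True), so it never crashes the upstream pipeline.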
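The first guarantee (C_out ⊆ C_in) can be illustrated with a small check. This is only a sketch of the idea, not the library's actual fidelity gate: it assumes that content lives in each item's `text` field and that "comes from the input" means multiset containment of non-whitespace characters. The names `content_chars` and `is_content_subset` are hypothetical.

```python
from collections import Counter

# Assumption: only the "text" field carries content characters.
TEXT_KEYS = ("text",)


def content_chars(items):
    """Count every non-whitespace character found in the text fields."""
    counter = Counter()
    for item in items:
        for key in TEXT_KEYS:
            value = item.get(key)
            if isinstance(value, str):
                counter.update(ch for ch in value if not ch.isspace())
    return counter


def is_content_subset(output_items, input_items):
    """True if the output uses no character more often than the input does."""
    out_counts = content_chars(output_items)
    in_counts = content_chars(input_items)
    return all(in_counts[ch] >= n for ch, n in out_counts.items())
```

A caller could run such a check on `result["items"]` against the original input as an independent sanity test; a stricter, order-aware check would be needed to rule out rearranged text.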
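This package is a native PyO3 binding to the core Rust implementation. Its options and return values are identical to those of the JS / Rust / HTTP versions.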
Installation
pip install mineru-refine
Requires Python ≥ 3.9.
Usage
import json
import mineru_refine
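with open("content_list.json", encoding="utf-8") as f:
    items = json.load(f)
result = mineru_refine.refine(
    items,                          # content_list (list[dict])
    sha256="...",                   # optional: SHA256 of the source file; enables the in-process cache when provided
    max_iterations=None,            # optional: hard cap on the repair loop; by default adapts to the number of suspects
    concurrency=8,                  # optional: number of suspects adjudicated in parallel; 1 = strictly serial
    image_dir="/abs/mineru/out",    # optional: MinerU output directory; enables visual adjudication of tables split across pages
    fix_ocr_confusion=False,        # optional: opt-in OCR character-confusion correction layer (CE0→CEO, etc.)
    extra_confusion_pairs=None,     # optional: extra pairs for the confusion allowlist, e.g. ["0D"]
    rewrite_garbled_tables=False,   # optional: opt-in visual re-transcription layer for heavily garbled tables (requires image_dir)
)
result["items"]   # cleaned content_list (same schema; unknown fields are passed through unchanged)
result["report"]  # audit report: iterations / opCounts / dismissed / removedSpans
                  # / violations / tokenUsage / failOpen
                  # (with fix_ocr_confusion enabled, also confusionFixes etc.; see the main README)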
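The `sha256` option expects the hash of the source file. A minimal way to compute it with the standard library is sketched below. It assumes that a lowercase hex digest is the expected format and that "source file" means the document MinerU parsed; neither is confirmed here. `file_sha256` is a hypothetical helper name.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result could then be passed as `sha256=file_sha256("source.pdf")`, where the file name is a placeholder.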
Every deleted piece of content is recorded in report["removedSpans"] (itemId / original text / reason), so each deletion can be audited individually.
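As an illustration of auditing the report, the sketch below checks the fail-open flag and groups removed spans by item. It assumes `removedSpans` is a list of dicts, each carrying an `itemId` key; the field names for the original text and the reason are not given here, so the sketch does not touch them. `audit_summary` is a hypothetical helper.

```python
from collections import defaultdict


def audit_summary(report):
    """Summarise a refine report: fail-open flag plus removed spans per item."""
    grouped = defaultdict(list)
    # Assumption: removedSpans is a list of dicts keyed by "itemId".
    for span in report.get("removedSpans") or []:
        grouped[span.get("itemId")].append(span)
    return {
        "fail_open": bool(report.get("failOpen")),
        "removed_per_item": dict(grouped),
    }
```

Checking `fail_open` first is worthwhile: when it is true the items came back unchanged, so an empty span list means "nothing was attempted", not "nothing needed fixing".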
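fix_ocr_confusion=True enables the confusion-correction layer (direct replacement: the LLM proposes, a mechanical gate admits). Once enabled, the output contract changes from "delete only, never add" to a dual contract. See the 「混淆修正层」 (confusion-correction layer) section of the main README for details.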
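rewrite_garbled_tables=True enables the visual re-transcription layer for heavily garbled tables: tables that mechanical detection writes off as wholly unusable are re-transcribed cell by cell by Qwen-VL against the screenshot, and every rewrite is recorded in report["tableRewrites"]. See the 「乱码表重转写层」 (garbled-table re-transcription layer) section of the main README for details.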
Standalone utility functions (neither calls the LLM):
mineru_refine.render_markdown(items)   # items → full.md text (deterministic re-rendering)
mineru_refine.detect_suspects(items)   # detection only; returns the list of suspects
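Environment variables
| Variable | Required | Purpose |
|---|---|---|
| DEEPSEEK_APIKEY | Yes | Text adjudication (DeepSeek). If missing, refine fails open immediately |
| QWEN_APIKEY | Required for visual adjudication | Qwen-VL adjudication of tables split across pages. If missing, suspects of this kind are skipped and the tables are kept as-is |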
The library itself does not read .env. Set the environment variables in the host program (or load .env yourself).
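For hosts that keep keys in a .env file, a minimal standard-library loader might look like the sketch below. It handles only plain `KEY=value` lines, skips comments, and never overrides variables that are already set; `load_env_file` is a hypothetical helper, and a dedicated package such as python-dotenv covers more of the format.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Load KEY=value pairs into os.environ; return the ones that were set."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already present in the environment take precedence.
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling it before `mineru_refine.refine(...)` would make `DEEPSEEK_APIKEY` and `QWEN_APIKEY` visible to the library if they are defined in the file.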
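Local build
just py-dev       # from the repository root: build the wheel and install it into bindings/python/.venv
just publish-py   # publish to PyPI: wheel for the current platform + sdist (requires MATURIN_PYPI_TOKEN)
The full design documentation for the detectors, the set of repair operations, and the fidelity gate is in the repository README.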
License
MIT