MinerU output linter/fixer — LLM tool-use loop that restructures (never generates) MinerU content_list. Machine-verified fidelity: C_out ⊆ C_in.

Project description

mineru-refine

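A post-processor (linter / fixer) for MinerU parsing output.

It takes MinerU's content_list (an array of item objects), fixes the high-frequency structural problems that parsing introduces (false headings, sentences broken across pages, tables split across pages, page headers and footers mixed into the body, LaTeX / link residue), and returns a content_list with the same schema, so downstream code needs no changes.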

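Two core guarantees:

  • Never adds a single character: it only trims and regroups. Every content character in the output comes from the input; this is verified mechanically at every step, and any violation is rolled back automatically (it does not rely on a prompt to constrain the LLM).
  • fail-open: on any exception, or if the LLM is unavailable, the input is returned unchanged (report["failOpen"] == True), so the upstream pipeline is never brought down.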
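The "output characters come from the input" guarantee can be pictured as a multiset inclusion check with rollback. The sketch below is illustrative only: the real gate lives in the Rust core and may be defined differently. The helper names (content_chars, is_subset, apply_with_rollback) are hypothetical, and it assumes that content lives under each item's "text" key and that whitespace is ignored.

```python
from collections import Counter


def content_chars(items):
    """Count every non-whitespace character found in the items' "text" fields."""
    counts = Counter()
    for item in items:
        # Assumption: textual content is stored under a "text" key.
        counts.update(ch for ch in item.get("text", "") if not ch.isspace())
    return counts


def is_subset(out_items, in_items):
    """True when the output uses no character more often than the input has it."""
    # Counter subtraction keeps only positive counts, i.e. characters that
    # appear in the output more often than in the input.
    extra = content_chars(out_items) - content_chars(in_items)
    return not extra


def apply_with_rollback(items, operation):
    """Run one edit step and keep its result only if the subset check passes."""
    candidate = operation(items)
    return candidate if is_subset(candidate, items) else items
```

A step that merges a sentence broken across pages passes the check, while a step that introduces a character not present in the input is discarded and the previous state is kept.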

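This package is the PyO3 native binding for the Rust core implementation; its options and return values have exactly the same shape as those of the JS / Rust / HTTP versions.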

Installation

pip install mineru-refine

Requires Python ≥ 3.9.

Usage

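import json
import mineru_refine

# Load MinerU's content_list; the with-block closes the file handle.
with open("content_list.json", encoding="utf-8") as f:
    items = json.load(f)

result = mineru_refine.refine(
    items,                              # content_list (list[dict])
    sha256="...",                       # optional: SHA256 of the source file; enables the in-process cache when given
    max_iterations=None,                # optional: hard cap on the repair loop; by default adapts to the number of suspects
    concurrency=8,                      # optional: number of suspects adjudicated in parallel; 1 = strictly serial
    image_dir="/abs/mineru/out",        # optional: MinerU output directory; enables visual adjudication of tables split across pages
    fix_ocr_confusion=False,            # optional: opt-in OCR character-confusion correction layer (CE0→CEO etc.)
    extra_confusion_pairs=None,         # optional: extra pairs for the confusion allowlist, e.g. ["0D"]
    rewrite_garbled_tables=False,       # optional: opt-in visual re-transcription layer for heavily garbled tables (requires image_dir)
)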

result["items"]    # cleaned content_list (same schema; unknown fields are passed through unchanged)
result["report"]   # audit report: iterations / opCounts / dismissed / removedSpans
                   #          / violations / tokenUsage / failOpen
                   #          (with fix_ocr_confusion on, also confusionFixes etc.; see the main README)
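The sha256 option takes the SHA256 of the source file. One way to compute it with the standard library is sketched below; file_sha256 is a hypothetical helper, and passing the lowercase hex digest is an assumption, since the expected string format is not stated here.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA256 of a file as a lowercase hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large PDFs are never held in memory whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result could then be passed as sha256=file_sha256("source.pdf"), where "source.pdf" stands in for the document MinerU parsed.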

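Every deleted piece of content leaves a trace in report["removedSpans"] (itemId / original text / reason), so each removal can be audited individually.

fix_ocr_confusion=True enables the confusion-correction layer (direct substitution: LLM proposals plus a mechanical gate). Once it is on, the output contract changes from "delete only, never add" to a dual contract; see the 「混淆修正层」 (confusion-correction layer) section of the main README.

rewrite_garbled_tables=True enables the visual re-transcription layer for heavily garbled tables: tables that are a total loss are detected mechanically, Qwen-VL re-transcribes them cell by cell against the screenshot, and every rewrite is recorded in report["tableRewrites"]. See the 「乱码表重转写层」 (garbled-table re-transcription layer) section of the main README.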
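A caller might turn the report into a short audit log along these lines. This is a sketch only: summarize_report is a hypothetical helper, and the "text" and "reason" key names inside each removed span are assumptions (only itemId is named above), so check the main README for the actual field names.

```python
def summarize_report(report):
    """Return human-readable audit lines for a refine() report dict."""
    if report.get("failOpen"):
        # fail-open means the input came back unchanged, so nothing was removed.
        return ["fail-open: input was returned unchanged"]
    lines = []
    for span in report.get("removedSpans", []):
        # Assumption: "text" and "reason" are hypothetical key names.
        lines.append(
            f'{span.get("itemId")}: removed {span.get("text")!r} ({span.get("reason")})'
        )
    return lines
```

Checking failOpen first matters because an untouched output is otherwise indistinguishable from a document that simply needed no fixes.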

Standalone utility functions (neither calls the LLM):

mineru_refine.render_markdown(items)    # items → full.md text (deterministic re-render)
mineru_refine.detect_suspects(items)    # detection only; returns the list of suspects

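Environment variables

Variable | Required | Purpose
DEEPSEEK_APIKEY | | Text adjudication (DeepSeek). If missing, refine fails open immediately
QWEN_APIKEY | Needed for visual adjudication | Qwen-VL adjudication of tables split across pages; if missing, suspects of this kind are skipped and the tables are left as they are

The library itself does not read .env; set the environment variables in the host program (or load .env yourself).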
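If the host program keeps its keys in a .env file, a small standard-library loader such as the sketch below can populate the environment before refine is called. load_env_file is a hypothetical helper, not part of this package, and it handles only plain KEY=VALUE lines (no export prefixes, multi-line values, or variable expansion).

```python
import os


def load_env_file(path, override=False):
    """Read KEY=VALUE lines from path into os.environ; return the keys that were set."""
    loaded = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip("'\"")  # drop surrounding quotes
            # Values already present in the environment win unless override=True.
            if override or key not in os.environ:
                os.environ[key] = value
                loaded.append(key)
    return loaded
```

Calling load_env_file(".env") at program start, before the first call into the library, is enough for DEEPSEEK_APIKEY and QWEN_APIKEY to be visible to it.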

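Local build

just py-dev        # from the repo root: build the wheel and install it into bindings/python/.venv
just publish-py    # publish to PyPI: wheel for the current platform + sdist (requires MATURIN_PYPI_TOKEN)

The full design documentation for the detectors, the set of repair operations, and the fidelity gate is in the repository README.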

License

MIT

Download files

Source Distribution

mineru_refine-0.8.0.tar.gz (444.7 kB view details)

Uploaded Source

Built Distributions

mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.8 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ ARM64

mineru_refine-0.8.0-cp39-abi3-macosx_11_0_arm64.whl (2.7 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

mineru_refine-0.8.0-cp39-abi3-macosx_10_12_x86_64.whl (2.8 MB view details)

Uploaded CPython 3.9+macOS 10.12+ x86-64

File details

Details for the file mineru_refine-0.8.0.tar.gz.

File metadata

  • Download URL: mineru_refine-0.8.0.tar.gz
  • Upload date:
  • Size: 444.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mineru_refine-0.8.0.tar.gz
Algorithm Hash digest
SHA256 dae955155479bdeddb9e5a893552939a1fb7ff0f0b500c4286877a53d53cb295
MD5 622ba8358b2d036fca28e247ba7ed29d
BLAKE2b-256 557edd47a63f2a2ea0fdcb37bfe58204e946a41fb9a2dd5740be14bd1622ad1a

See more details on using hashes here.

Provenance

The following attestation bundles were made for mineru_refine-0.8.0.tar.gz:

Publisher: py-release.yml on LcpMarvel/mineru-refine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6a37d4bed934cb0911d7787ac95fda8cd7956698c05304def80a2a19769b9295
MD5 6d8014ebef4fe3bb8ebe56d8f32b9a57
BLAKE2b-256 77222ea9be5afd7f954a3cee190824c3cd5a29d9ef5cce6c7abb8e72c82eab0f

Provenance

The following attestation bundles were made for mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d9e93b6cec078f405b0d3afdbbd4b992fb1a0b100ba85b8484e3ee7155c6b53d
MD5 d721c44369b5564f8faa5b132f251a77
BLAKE2b-256 0a18a9f6476aae93d10f1677b40f1ade1019137817c7dc6d2fedf6136c89e9b8

Provenance

The following attestation bundles were made for mineru_refine-0.8.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.8.0-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for mineru_refine-0.8.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 63dde68cf6ce87376bef9d1158a24ee03abe1a87bfb6e0a77a4a3d48b60c09aa
MD5 043adcb2606cc5dc9384fb598e69f460
BLAKE2b-256 67e55b6017aebd0ca93c977765864d18d06069695f15f4dd2745f71d86c0e0a8

Provenance

The following attestation bundles were made for mineru_refine-0.8.0-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.8.0-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for mineru_refine-0.8.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6150acf40cc8abc114df8b757300690a319bdf81b2e798416699ab6b58bb510f
MD5 2feea684fe52e13053c3131722249c34
BLAKE2b-256 9f5e7a45b6c018a05fdc4154ee336cfbdf95d20c89f6233138621dae9bce9cc7

Provenance

The following attestation bundles were made for mineru_refine-0.8.0-cp39-abi3-macosx_10_12_x86_64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine
