MinerU output linter/fixer — LLM tool-use loop that restructures (never generates) MinerU content_list. Machine-verified fidelity: C_out ⊆ C_in.

Project description

mineru-refine

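A post-processor (linter / fixer) for MinerU parsing output.

It takes MinerU's content_list (an array of item objects) and fixes the high-frequency structural problems that parsing introduces: false headings, sentences broken across pages, tables split across pages, page headers and footers mixed into body text, and LaTeX / link residue. It returns a content_list with the same schema, so downstream code needs no changes.

Two core guarantees:

  • Never adds a single character: it only deletes and reorganizes. Every content character in the output comes from the input; this is machine-verified at every step, and any violation is rolled back automatically (it does not rely on a prompt to constrain the LLM).
  • fail-open: on any exception, or when the LLM is unavailable, the input is returned unchanged (report["failOpen"] == True), so the upstream pipeline is never brought down.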
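To make the first guarantee concrete, here is a minimal sketch of a character-level subset check with rollback. It is not the package's actual gate (that lives in the Rust core and is not shown here); the function names, the choice of the "text" field, and the decision to ignore whitespace are all assumptions made for illustration.

```python
from collections import Counter


def content_chars(items, text_keys=("text",)):
    """Count non-whitespace characters across the given text fields of all items."""
    counts = Counter()
    for item in items:
        for key in text_keys:
            value = item.get(key)
            if isinstance(value, str):
                counts.update(ch for ch in value if not ch.isspace())
    return counts


def is_subset(items_in, items_out, text_keys=("text",)):
    """Return True if the output uses no character more often than the input does."""
    available = content_chars(items_in, text_keys)
    needed = content_chars(items_out, text_keys)
    return all(available[ch] >= n for ch, n in needed.items())


def apply_with_rollback(items, operation):
    """Apply one repair step; keep its result only if the subset check passes."""
    candidate = operation(items)
    return candidate if is_subset(items, candidate) else items
```

Merging two fragments of a sentence broken across pages passes this check, while any step that introduces a character absent from the input is discarded and the previous state is kept.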

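This package is a native PyO3 binding to the core Rust implementation; its options and return values have exactly the same structure as those of the JS / Rust / HTTP versions.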

Installation

pip install mineru-refine

Requires Python ≥ 3.9.

Usage

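import json
import mineru_refine

# Use a context manager so the file handle is closed after loading
with open("content_list.json", encoding="utf-8") as f:
    items = json.load(f)

result = mineru_refine.refine(
    items,                              # content_list (list[dict])
    sha256="...",                       # optional: SHA256 of the source file; enables the in-process cache when provided
    max_iterations=None,                # optional: hard cap on the repair loop; by default adapts to the number of suspects
    concurrency=8,                      # optional: number of suspects adjudicated in parallel; 1 = strictly serial
    image_dir="/abs/mineru/out",        # optional: MinerU output directory; enables visual adjudication of tables split across pages when provided
    fix_ocr_confusion=False,            # optional: opt-in OCR character-confusion correction layer (CE0→CEO, etc.)
    extra_confusion_pairs=None,         # optional: extra pairs for the confusion allowlist, e.g. ["0D"]
    rewrite_garbled_tables=False,       # optional: opt-in visual re-transcription layer for heavily garbled tables (requires image_dir)
)

result["items"]    # cleaned content_list (same schema; unknown fields are passed through unchanged)
result["report"]   # audit report: iterations / opCounts / dismissed / removedSpans
                   #               / violations / tokenUsage / failOpen
                   #               (with fix_ocr_confusion on, also confusionFixes etc.; see the main README)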
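A caller will usually want to check whether the run failed open before trusting the cleaned items. The helper below is a small sketch that works on the returned dict only; the name summarize is hypothetical, and the value types of the report fields (for example that removedSpans is a list) are assumptions not confirmed by this page.

```python
def summarize(result):
    """Build a one-line summary from a refine() result dict."""
    report = result["report"]
    # failOpen means the input came back unchanged, so there is nothing to report.
    if report.get("failOpen"):
        return "fail-open: input returned unchanged"
    removed = len(report.get("removedSpans", []))
    return (
        f"{report.get('iterations')} iterations, "
        f"{removed} spans removed, "
        f"{len(result['items'])} items"
    )
```

Calling summarize(result) right after mineru_refine.refine(...) gives a line suitable for logging; a fail-open result is a signal to check API keys or retry later rather than an error.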

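Every piece of deleted content is recorded in report["removedSpans"] (itemId / original text / reason), so each removal can be audited individually.

fix_ocr_confusion=True enables the confusion-correction layer (direct replacement: LLM proposals plus a mechanical gate). Once enabled, the output contract changes from "delete only, never add" to a dual contract; see the "Confusion correction layer" (「混淆修正层」) section of the main README.

rewrite_garbled_tables=True enables the visual re-transcription layer for heavily garbled tables: tables that mechanical detection judges to be unusable as a whole are re-transcribed cell by cell by Qwen-VL against a screenshot, and every rewrite is recorded in report["tableRewrites"]. See the "Garbled table re-transcription layer" (「乱码表重转写层」) section of the main README.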
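As an illustration of auditing removals, the sketch below groups removed spans by reason. Only the itemId field name appears in the text above; the "reason" key and the helper name group_removed_spans are assumptions, so check the main README for the actual entry layout.

```python
from collections import defaultdict


def group_removed_spans(report, reason_key="reason"):
    """Map each removal reason to the list of itemIds removed for that reason."""
    groups = defaultdict(list)
    for span in report.get("removedSpans", []):
        # reason_key is an assumed field name; spans without it land under "unknown".
        groups[span.get(reason_key, "unknown")].append(span.get("itemId"))
    return dict(groups)
```

Passing result["report"] to this helper gives a quick overview of which kinds of content were stripped and from which items.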

Standalone utility functions (neither calls the LLM):

mineru_refine.render_markdown(items)    # items → full.md text (deterministic re-rendering)
mineru_refine.detect_suspects(items)    # detection only; returns the list of suspects
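Because detection does not call the LLM, one possible pattern is to run it first and skip the repair loop when nothing is flagged. The wrapper below is a sketch of that idea, not something the package provides: refine_if_suspect is a hypothetical name, and the detector and refiner are passed in as callables so the logic can be exercised without the native module.

```python
def refine_if_suspect(items, detect, refine, **options):
    """Run the repair loop only when the detector flags at least one suspect."""
    suspects = detect(items)
    if not suspects:
        # Nothing flagged: return the input untouched and skip the LLM entirely.
        return {"items": items, "report": None, "suspects": 0}
    result = refine(items, **options)
    return {
        "items": result["items"],
        "report": result["report"],
        "suspects": len(suspects),
    }
```

In real use the call would look like refine_if_suspect(items, mineru_refine.detect_suspects, mineru_refine.refine, concurrency=8).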

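Environment variables

Variable | Required | Purpose
DEEPSEEK_APIKEY | | Text adjudication (DeepSeek). If missing, refine fails open immediately
QWEN_APIKEY | Needed for visual adjudication | Qwen-VL adjudication of tables split across pages; if missing, suspects of this kind are skipped and the tables are kept as-is

The library itself does not read .env; set the environment variables in the host program (or load .env yourself).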
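Since loading .env is left to the host program, a dependency-free loader can be sketched as follows. The helper names load_env_file and missing_keys are hypothetical, and the parser handles only simple KEY=VALUE lines; a dedicated dotenv library is the safer choice for anything more involved.

```python
import os


def load_env_file(path, environ=os.environ):
    """Copy KEY=VALUE lines from a .env-style file into environ without overriding existing values."""
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            environ.setdefault(key.strip(), value.strip().strip("'\""))


def missing_keys(environ=os.environ, names=("DEEPSEEK_APIKEY", "QWEN_APIKEY")):
    """Return the names of the API-key variables that are unset or empty."""
    return [name for name in names if not environ.get(name)]
```

Calling load_env_file(".env") before mineru_refine.refine(...) and then checking missing_keys() makes a silent fail-open caused by an absent DEEPSEEK_APIKEY easier to diagnose.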

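Local build

just py-dev        # from the repo root: build the wheel and install it into bindings/python/.venv
just publish-py    # publish to PyPI: wheel for the current platform + sdist (requires MATURIN_PYPI_TOKEN)

For the full design documentation of the detectors, the set of repair operations, and the fidelity gate, see the repository README.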

License

MIT

Download files

Source Distribution

mineru_refine-0.9.0.tar.gz (454.1 kB)

Built Distributions

mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB): CPython 3.9+, manylinux (glibc 2.17+), x86-64

mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.8 MB): CPython 3.9+, manylinux (glibc 2.17+), ARM64

mineru_refine-0.9.0-cp39-abi3-macosx_11_0_arm64.whl (2.7 MB): CPython 3.9+, macOS 11.0+, ARM64

mineru_refine-0.9.0-cp39-abi3-macosx_10_12_x86_64.whl (2.9 MB): CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file mineru_refine-0.9.0.tar.gz.

File metadata

  • Download URL: mineru_refine-0.9.0.tar.gz
  • Upload date:
  • Size: 454.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mineru_refine-0.9.0.tar.gz
Algorithm Hash digest
SHA256 d4e87812069478b303f449e26e0adb5f5e73d1c349df3d23e7ab3bc6ad4944d1
MD5 6b0c257f0a1fd9d287f05ba5389beab2
BLAKE2b-256 e7fe637ac58bf0a455772a8280bf6fbbe19846ea9435a16f118ceb181343f830

Provenance

The following attestation bundles were made for mineru_refine-0.9.0.tar.gz:

Publisher: py-release.yml on LcpMarvel/mineru-refine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 cd3e7f6adaeb001405a40ae6d73cc71b80dd633ed6d6400afcb59d787faa2b6b
MD5 a11c3dd970b8bcbd33aee69c5fc52af5
BLAKE2b-256 d0ce076837fe61e0a9020b3b7835c087579d08f0b1c6b1c62ec8ce94fcad8fe8

Provenance

The following attestation bundles were made for mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 ced58f77f7254c530c861ed3848c40e082bfe575caa11c33f2dd1342e2a7b06b
MD5 528e4048f6d5cd9bc98e0cd8ed0f9ba6
BLAKE2b-256 af73fbac9e12f2c7a20dd3170412ab3c4903b11927fb6c71ca9bc76fe7aa3e60

Provenance

The following attestation bundles were made for mineru_refine-0.9.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.9.0-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for mineru_refine-0.9.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 aba92c294357f19f32219d2b5e36519de662b2ba0c6e1ff7ba6ad805fa546009
MD5 2b8e33cb1c42df6ef18a9dc4adad8c74
BLAKE2b-256 eacd040cad5c3d771198d0d10557f510f8118bd51a7045555d95c853758ac249

Provenance

The following attestation bundles were made for mineru_refine-0.9.0-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.9.0-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for mineru_refine-0.9.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 9e74784e9babb299864dbdf244c53ed27dbe161e29ca02ceb04278e7c124d353
MD5 efb56cb187ddddc3082d445b3d789dc2
BLAKE2b-256 79f8006a78c64d6aac512f05e14ad5bfdee327dc6ff9fb8c402504d535b53c45

Provenance

The following attestation bundles were made for mineru_refine-0.9.0-cp39-abi3-macosx_10_12_x86_64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine
