MinerU output linter/fixer — LLM tool-use loop that restructures (never generates) MinerU content_list. Machine-verified fidelity: C_out ⊆ C_in.

Project description

mineru-refine

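A post-processor (linter / fixer) for MinerU parsing output.

It takes MinerU's content_list (an array of item objects) and fixes the most common structural problems that parsing introduces: false headings, sentences broken across pages, tables split across pages, page headers and footers mixed into body text, and LaTeX / link residue. It returns a content_list with the same schema, so downstream code needs no changes.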

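Two core guarantees:

  • Never adds a single character: it only removes and reorganizes. Every content character in the output comes from the input. This is machine-verified at each step, and any violation triggers an automatic rollback (it does not rely on prompt constraints on the LLM).
  • fail-open: on any exception, or when the LLM is unavailable, the input is returned unchanged (report["failOpen"] == True), so the upstream pipeline never crashes.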
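
The two guarantees can be pictured as a small gate around each repair step. What follows is a minimal sketch, not the library's actual implementation (that lives in the Rust core and may differ in every detail). It assumes item content sits in a "text" field, compares characters as a multiset while ignoring whitespace, and uses hypothetical helper names (content_chars, apply_with_gate).

```python
from collections import Counter


def content_chars(items):
    """Count the non-whitespace characters of every item's "text" field (assumed field name)."""
    counts = Counter()
    for item in items:
        counts.update(ch for ch in item.get("text", "") if not ch.isspace())
    return counts


def apply_with_gate(items, op):
    """Apply one repair op; keep its result only if it adds no characters."""
    try:
        candidate = op(items)
    except Exception:
        return items, "failed-open"      # any error: hand back the input untouched
    # Counter subtraction keeps only characters the candidate has in excess of the input.
    extra = content_chars(candidate) - content_chars(items)
    if extra:
        return items, "rolled-back"      # fidelity violated: discard the candidate
    return candidate, "applied"


def merge_first_two(items):
    """Join a sentence that was split across two items (pure restructuring)."""
    merged = dict(items[0], text=items[0]["text"] + " " + items[1]["text"])
    return [merged] + items[2:]


def add_words(items):
    """A misbehaving op that invents text."""
    return [dict(items[0], text=items[0]["text"] + " (see appendix)")] + items[1:]


def broken(items):
    """An op that fails outright, e.g. because the model is unreachable."""
    raise RuntimeError("LLM unavailable")


items = [
    {"type": "text", "text": "The table below lists"},
    {"type": "text", "text": "all supported formats."},
]
print(apply_with_gate(items, merge_first_two)[1])  # applied
print(apply_with_gate(items, add_words)[1])        # rolled-back
print(apply_with_gate(items, broken)[1])           # failed-open
```

The point of the design is that the check runs on the data, after the fact, so a model that ignores its instructions still cannot get new text into the output.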

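This package is a native PyO3 binding to the Rust core implementation. It exposes exactly the same options and return values as the JS / Rust / HTTP versions.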

Installation

pip install mineru-refine

Requires Python ≥ 3.9.

Usage

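import json
import mineru_refine

with open("content_list.json", encoding="utf-8") as f:
    items = json.load(f)

result = mineru_refine.refine(
    items,                              # content_list (list[dict])
    sha256="...",                       # optional: SHA256 of the source file; enables the in-process cache when provided
    max_iterations=None,                # optional: hard cap on the repair loop; by default adapts to the number of suspects
    concurrency=8,                      # optional: number of suspects adjudicated in parallel; 1 = strictly serial
    image_dir="/abs/mineru/out",        # optional: MinerU output directory; enables vision adjudication of split tables when provided
)

result["items"]    # cleaned content_list (same schema; unknown fields passed through unchanged)
result["report"]   # audit report: iterations / opCounts / dismissed / removedSpans
                   #               / violations / tokenUsage / failOpen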

Every deleted span is recorded in report["removedSpans"] (itemId / original text / reason), so each removal can be audited individually.
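
As an illustration of consuming the audit report, here is a small sketch that condenses it into one line. It relies only on the report keys named above (failOpen, iterations, removedSpans) and on removedSpans entries carrying an itemId. The helper name summarize_report and the hand-built sample dict are hypothetical, and the exact value types in a real report are an assumption.

```python
from collections import Counter


def summarize_report(report):
    """Return a one-line summary of a refine() report dict."""
    if report.get("failOpen"):
        return "fail-open: input returned unchanged"
    spans = report.get("removedSpans", [])
    # Group removals by the item they came from (assumes each span has an "itemId").
    per_item = Counter(span.get("itemId") for span in spans)
    return (
        f"{report.get('iterations')} iterations, "
        f"{len(spans)} removed spans across {len(per_item)} items"
    )


# Hand-built sample; a real report comes from mineru_refine.refine(...)["report"].
sample = {
    "failOpen": False,
    "iterations": 3,
    "removedSpans": [{"itemId": 4}, {"itemId": 4}, {"itemId": 9}],
}
print(summarize_report(sample))  # 3 iterations, 3 removed spans across 2 items
```

Checking failOpen first matters: a fail-open result looks like a clean run with nothing removed unless the flag is inspected.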

Standalone utility functions (neither calls the LLM):

mineru_refine.render_markdown(items)    # items → full.md text (deterministic re-rendering)
mineru_refine.detect_suspects(items)    # detection only; returns the list of suspects

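Environment variables

Variable | Required | Purpose
DEEPSEEK_APIKEY | | Text adjudication (DeepSeek). If missing, refine fails open immediately
QWEN_APIKEY | For vision adjudication | Qwen-VL adjudication of tables split across pages; if missing, suspects of this kind are skipped and the tables are kept as-is

The library itself does not read .env. Set the environment variables in the host program (or load .env yourself).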
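
Because a missing key degrades behavior silently rather than raising, a host program may want a preflight check before calling refine. The sketch below is one way to do that. The function name check_refine_env is hypothetical, and treating an empty string as "missing" is an assumption about how the library reads these variables.

```python
import os


def check_refine_env(env=os.environ):
    """Return warnings for the API keys that mineru-refine looks for."""
    warnings = []
    if not env.get("DEEPSEEK_APIKEY"):
        warnings.append("DEEPSEEK_APIKEY is not set: refine() will fail open")
    if not env.get("QWEN_APIKEY"):
        warnings.append("QWEN_APIKEY is not set: split-table suspects will be skipped")
    return warnings


for message in check_refine_env():
    print("warning:", message)
```

Passing a plain dict instead of os.environ makes the check easy to exercise in tests.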

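Building locally

just py-dev        # from the repo root: build the wheel and install it into bindings/python/.venv
just publish-py    # publish to PyPI: wheel for the current platform + sdist (requires MATURIN_PYPI_TOKEN)

The full design documentation for the detectors, the set of repair operations, and the fidelity gate is in the repository README.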

License

MIT

Download files

Source Distribution

mineru_refine-0.7.0.tar.gz (83.1 kB), Source

Built Distributions

mineru_refine-0.7.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB), CPython 3.9+, manylinux (glibc 2.17+), x86-64

mineru_refine-0.7.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB), CPython 3.9+, manylinux (glibc 2.17+), ARM64

mineru_refine-0.7.0-cp39-abi3-macosx_11_0_arm64.whl (2.2 MB), CPython 3.9+, macOS 11.0+, ARM64

mineru_refine-0.7.0-cp39-abi3-macosx_10_12_x86_64.whl (2.3 MB), CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file mineru_refine-0.7.0.tar.gz.

File metadata

  • Download URL: mineru_refine-0.7.0.tar.gz
  • Size: 83.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mineru_refine-0.7.0.tar.gz
Algorithm Hash digest
SHA256 1e656d383577c64b8e4c27e3de14b6a1a0271dfa9313e311e12b06501387f4a9
MD5 b0c2da9c3b1e2f232e93dd52c6f1312d
BLAKE2b-256 148decd271ffb0a04557802b2f2525a0a8f9f0cce04529d71a5e91d78c46951c
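
A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a generic sketch using hashlib. The helper names sha256_of and matches_published are hypothetical, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the sdist was downloaded to the working directory:
# matches_published(
#     "mineru_refine-0.7.0.tar.gz",
#     "1e656d383577c64b8e4c27e3de14b6a1a0271dfa9313e311e12b06501387f4a9",
# )
```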

Provenance

The following attestation bundles were made for mineru_refine-0.7.0.tar.gz:

Publisher: py-release.yml on LcpMarvel/mineru-refine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mineru_refine-0.7.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for mineru_refine-0.7.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 dfadc11753f84fc722a4ac82b91b29511a8c59f6a09f82fc677843938305573c
MD5 42e77c1076e82a78c2aa07ba31d7e5d1
BLAKE2b-256 44e79c32516e5d9856d376f329b13477ad50b1a0c9769ff1e828646e2c060c89

File details

Details for the file mineru_refine-0.7.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for mineru_refine-0.7.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 5bd5e5f22f17183c05992a2133292cfa0dea0dd3da65f44871ac964c7a81ccff
MD5 6048056ca50e4e254eb86b2966d17bb9
BLAKE2b-256 5cb66222f28b3bb629c9c603fd7acfc0dec5097f61e33cccc632d1d979063550

File details

Details for the file mineru_refine-0.7.0-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for mineru_refine-0.7.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7ecb6b250485ca5dd58d827b7165ffc1ce334f32f1f15cc859d3306dbd931f93
MD5 b16efe6cc0a6b55c71b2e276d84c4dd4
BLAKE2b-256 e0e569d38b5cedc4602c018066a9bb6a8a2ef51bf612d205701bed2f571ab090

File details

Details for the file mineru_refine-0.7.0-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for mineru_refine-0.7.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 a8d47c36c45074793e2339ae30da8df25c24960313551974f62bff673bdda15b
MD5 d794253e247710f3ebb3331957b8f722
BLAKE2b-256 1335acb542ba8a00e88f7e06906e289e33e8d57771ccc1a2ccc5a25b8b731d9e
