MinerU output linter/fixer — LLM tool-use loop that restructures (never generates) MinerU content_list. Machine-verified fidelity: C_out ⊆ C_in.

Project description

mineru-refine

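A post-processor (linter / fixer) for MinerU parsing output.

It takes MinerU's content_list (an array of item objects) and repairs the common structural problems that parsing introduces: false headings, sentences broken across pages, tables split across pages, page headers and footers mixed into the body text, and leftover LaTeX or link fragments. It returns a content_list with the same schema, so downstream code needs no changes.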

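Two core guarantees:

  • Never adds a single character: it only removes and reorganizes. Every content character in the output comes from the input; this is machine-verified at each step, and any violation triggers an automatic rollback (it does not rely on prompt constraints on the LLM).
  • fail-open: on any exception, or if the LLM is unavailable, the input is returned unchanged (report["failOpen"] == True), so the upstream caller never crashes.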
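
The first guarantee (C_out ⊆ C_in) can be illustrated with a small character-budget check. This is only a sketch of the idea, not the package's actual fidelity gate: the helper names `content_chars` and `is_subset` are hypothetical, and it assumes each item keeps its content in a `text` field and that whitespace does not count as content.

```python
from collections import Counter


def content_chars(items):
    """Count every non-whitespace character in the `text` field of each item."""
    counts = Counter()
    for item in items:
        text = item.get("text", "")  # assumption: content lives under "text"
        counts.update(ch for ch in text if not ch.isspace())
    return counts


def is_subset(items_in, items_out):
    """True if the output uses no character more often than the input supplies it."""
    budget = content_chars(items_in)
    used = content_chars(items_out)
    return all(budget[ch] >= n for ch, n in used.items())
```

A caller could run such a check after every repair step and discard the step whenever it returns False, which mirrors the rollback behaviour described above.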

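This package is a native PyO3 binding to the Rust core implementation. Its options and return values are structurally identical to those of the JS / Rust / HTTP versions.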

Installation

pip install mineru-refine

Requires Python ≥ 3.9.

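Usage

import json
import mineru_refine

# Load the content_list produced by MinerU
with open("content_list.json", encoding="utf-8") as f:
    items = json.load(f)

result = mineru_refine.refine(
    items,                              # content_list (list[dict])
    sha256="...",                       # optional: SHA256 of the source file; enables the in-process cache when given
    max_iterations=None,                # optional: hard cap on the repair loop; by default adapts to the number of suspects
    concurrency=8,                      # optional: number of suspects adjudicated in parallel; 1 = strictly serial
    image_dir="/abs/mineru/out",        # optional: MinerU output directory; enables visual adjudication of tables split across pages when given
)

result["items"]    # cleaned content_list (same schema; unknown fields passed through unchanged)
result["report"]   # audit report: iterations / opCounts / dismissed / removedSpans
                   #               / violations / tokenUsage / failOpen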

Every deleted span is recorded in report["removedSpans"] (itemId / original text / reason), so each removal can be audited individually.
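
As a rough illustration of how a caller might inspect the report, the sketch below summarizes it using only the keys named above. The helper `summarize_report` is hypothetical, and it assumes that removedSpans is a list of dicts each carrying an `itemId` key; the exact key names for the original text and the reason are not given here, so the sketch does not touch them.

```python
from collections import Counter


def summarize_report(report):
    """Build a small summary from a refine() report (assumed shape, see lead-in)."""
    spans = report.get("removedSpans", [])
    return {
        # True means refine() returned the input unchanged
        "fail_open": bool(report.get("failOpen", False)),
        "removed_total": len(spans),
        # number of removed spans per source item
        "removed_per_item": dict(Counter(span.get("itemId") for span in spans)),
    }
```

Checking `fail_open` first is useful because an unchanged output may mean either that nothing needed fixing or that the LLM was unavailable.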

Standalone utility functions (neither calls the LLM):

mineru_refine.render_markdown(items)    # items → full.md text (deterministic re-rendering)
mineru_refine.detect_suspects(items)    # detection only; returns the list of suspects

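Environment variables

| Variable | Required | Purpose |
| DEEPSEEK_APIKEY | | Text adjudication (DeepSeek). If missing, refine fails open immediately |
| QWEN_APIKEY | For visual adjudication | Qwen-VL adjudication of tables split across pages; if missing, suspects of this kind are skipped and the tables are kept as-is |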

The library itself does not read .env; set the environment variables in the host program (or load .env yourself).
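
Because the keys must be supplied by the host program, a preflight check before calling refine can make a fail-open result easier to diagnose. The sketch below is an assumption about how a host might do this; `adjudication_modes` is a hypothetical helper, and only the two variable names come from the table above.

```python
import os


def adjudication_modes(env=None):
    """Report which adjudication modes have an API key configured."""
    env = os.environ if env is None else env
    return {
        # without this key, refine is documented to fail open
        "text": bool(env.get("DEEPSEEK_APIKEY")),
        # without this key, split-table suspects are skipped
        "vision": bool(env.get("QWEN_APIKEY")),
    }
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to exercise in tests.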

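Local build

just py-dev        # repo root: build the wheel and install it into bindings/python/.venv
just publish-py    # publish to PyPI: wheel for the current platform + sdist (requires MATURIN_PYPI_TOKEN)

The full design documentation for the detectors, the repair operation set, and the fidelity gate is in the repository README.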

License

MIT


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mineru_refine-0.7.1.tar.gz (83.4 kB)

Uploaded: Source

Built Distributions

mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded: CPython 3.9+, manylinux: glibc 2.17+ x86-64

mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB)

Uploaded: CPython 3.9+, manylinux: glibc 2.17+ ARM64

mineru_refine-0.7.1-cp39-abi3-macosx_11_0_arm64.whl (2.2 MB)

Uploaded: CPython 3.9+, macOS 11.0+ ARM64

mineru_refine-0.7.1-cp39-abi3-macosx_10_12_x86_64.whl (2.3 MB)

Uploaded: CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file mineru_refine-0.7.1.tar.gz.

File metadata

  • Download URL: mineru_refine-0.7.1.tar.gz
  • Size: 83.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mineru_refine-0.7.1.tar.gz
Algorithm Hash digest
SHA256 0f88e4d23e943f8aeea12f299ca231abc9f18e4c97028272b285924441c458e7
MD5 1fb2d5af9f4b121ee39fff04cbd1f39a
BLAKE2b-256 9ad43797fd23e4c1abca5fde413463833091884271aff3c26dfefe48c6da520d

See more details on using hashes here.

Provenance

The following attestation bundles were made for mineru_refine-0.7.1.tar.gz:

Publisher: py-release.yml on LcpMarvel/mineru-refine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8ff316c10e09b9a4b7516ffa4a3037fb68455ff771f5a3ac43ec56af4c5aab9d
MD5 ea1aac12117fd2d90d2c4797f3ccd1e6
BLAKE2b-256 146cb003c7fc9577cc51a1a518a07c84dff99ef80cd49e135024ee6979ba484b

Provenance

The following attestation bundles were made for mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 2c6c1866084372047b34e74d34d88ad3f78f04d77a8a99d2e5de1f4090aeae49
MD5 b62de160280fb3d6706be282b77cba01
BLAKE2b-256 25d033e14284c786f0f103f59c4803b32fa917e228b0557b768ca9a140be998a

Provenance

The following attestation bundles were made for mineru_refine-0.7.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.7.1-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for mineru_refine-0.7.1-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 aa2319d0db16397955e50096c5e6da241f95f3e08e086da0de7a53c779ff1b29
MD5 2758411a2d795edba4289099d6ac3b07
BLAKE2b-256 88e974954ac954818570fcff073f1e4ee93fbef8f475aeab4819c06cb71abd61

Provenance

The following attestation bundles were made for mineru_refine-0.7.1-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine

File details

Details for the file mineru_refine-0.7.1-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for mineru_refine-0.7.1-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 675c023f0af068b40e15d6a745461879533a88af4078c5ddabe98abdc756e5f0
MD5 9e053f88693abdfd4571f8402965c1d6
BLAKE2b-256 85730ad0a8c2eb923e334dcde87aac231d1561b524e2537c394fbdf6c8c1e6c3

Provenance

The following attestation bundles were made for mineru_refine-0.7.1-cp39-abi3-macosx_10_12_x86_64.whl:

Publisher: py-release.yml on LcpMarvel/mineru-refine
