
PDF layout-preserving translation based on MinerU Online API output


mineru-translate

Layout-preserving PDF translation powered by MinerU Online API.

Translates PDFs while keeping block positions, fonts, images, tables, and equations in place. Supports DeepL, OpenAI-compatible LLMs, and Anthropic as translation backends.

The layout engine is still being improved. If you run into poor rendering quality or obvious layout issues, please open an issue or send feedback with sample files/screenshots.



How It Works

PDF ──► [MinerU Online API] ──► layout.json ──► mineru-translate ──► translated.pdf

mineru-translate relies on MinerU Online API to parse PDFs. MinerU performs OCR, layout analysis, table detection, and equation extraction, producing a structured layout.json that mineru-translate then translates and re-renders.

You can either:

  • Pass a layout.json you already have (-l layout.json)
  • Let mineru-translate call the MinerU Online API automatically (--mineru-token $MINERU_TOKEN)
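The choice between the two modes can be sketched as a small wrapper that assembles the command line. This is a minimal illustration, not part of the package: build_command is a hypothetical helper, and it only uses flags documented in this README (-t, -l, --mineru-token).

```python
from typing import List, Optional


def build_command(pdf: str, target: str,
                  layout: Optional[str] = None,
                  mineru_token: Optional[str] = None) -> List[str]:
    """Assemble an argv list for the mineru-translate CLI (hypothetical helper)."""
    cmd = ["mineru-translate", pdf, "-t", target]
    if layout is not None:
        # A pre-parsed layout.json skips the MinerU Online API call.
        cmd += ["-l", layout]
    elif mineru_token:
        # Otherwise the CLI needs a token so it can call the MinerU Online API.
        cmd += ["--mineru-token", mineru_token]
    else:
        raise ValueError("provide either a layout.json path or a MinerU token")
    return cmd


if __name__ == "__main__":
    print(build_command("doc.pdf", "zh", layout="layout.json"))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to run the translation from a script.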


Preview

[Figure: side-by-side preview of an original PDF page and the translated PDF page]

Features

  • Layout fidelity — block positions, page sizes, font styles, and reading order are preserved
  • Adaptive font sizing — automatically shrinks fonts and expands into surrounding whitespace when translated text is longer
  • Image & equation preservation — images rendered at original positions; interline equations use source images; inline equations converted to Unicode (Δ, r², Q₅₀)
  • HTML table translation — table structure (rowspan / colspan) maintained via ReportLab
  • Rotation support — correctly handles 90° / 270° rotated blocks and tables
  • Overlap resolution — per-page cascade algorithm detects and resolves block collisions
  • Translation cache — file-based JSON cache avoids re-translating unchanged content
  • Multiple providers — DeepL, OpenAI-compatible LLMs, Anthropic
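To make the inline-equation feature concrete, here is a minimal sketch of how simple LaTeX such as \Delta, r^2, or Q_{50} could be mapped to Unicode. It is an illustration under assumptions, not the package's converter: the function name, the symbol table, and the rule of leaving unmappable scripts untouched are all hypothetical.

```python
import re

# Translation tables for characters that have Unicode super/subscript forms.
SUPERSCRIPTS = str.maketrans("0123456789+-=()n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ")
SUBSCRIPTS = str.maketrans("0123456789+-=()", "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎")
# A tiny, illustrative symbol table; a real converter would need many more.
SYMBOLS = {r"\Delta": "Δ", r"\alpha": "α", r"\beta": "β", r"\pi": "π"}


def _convert(match: "re.Match[str]", table: dict) -> str:
    body = match.group(1) if match.group(1) is not None else match.group(2)
    # Convert only when every character has a mapping; otherwise keep the markup.
    if all(ord(ch) in table for ch in body):
        return body.translate(table)
    return match.group(0)


def latex_to_unicode(expr: str) -> str:
    """Convert a simple inline LaTeX expression to plain Unicode text."""
    for name, char in SYMBOLS.items():
        expr = expr.replace(name, char)
    # Handle ^{...} / ^x and _{...} / _x forms.
    expr = re.sub(r"\^\{([^{}]*)\}|\^(\S)", lambda m: _convert(m, SUPERSCRIPTS), expr)
    expr = re.sub(r"_\{([^{}]*)\}|_(\S)", lambda m: _convert(m, SUBSCRIPTS), expr)
    return expr


if __name__ == "__main__":
    for sample in (r"\Delta", "r^2", "Q_{50}"):
        print(sample, "->", latex_to_unicode(sample))
```

Scripts with no Unicode equivalent (for example x^a) are left as written, which avoids silently changing the meaning of an expression.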

Installation

pip install mineru-translate

System fonts: CJK output requires Noto Sans CJK SC. On Windows, SimHei and SimSun are used automatically.

Quick Start

# Auto-parse via MinerU Online API, then translate (recommended)
mineru-translate doc.pdf -t zh --provider deepl --api-key $DEEPL_API_KEY \
  --mineru-token $MINERU_TOKEN

# Use a pre-parsed layout.json (skip MinerU Online API call)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl --api-key $DEEPL_API_KEY

# Output to a specific directory
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
  --api-key $DEEPL_API_KEY -o /output/dir/

# Export as Markdown
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
  --api-key $DEEPL_API_KEY --format md -o /output/dir/

# Export bilingual Markdown (original + translation)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
  --api-key $DEEPL_API_KEY --format md --md-mode bilingual

CLI Options

mineru-translate PDF [OPTIONS]
| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --layout | -l | | Path to pre-parsed MinerU layout.json (skips MinerU Online API) |
| --output | -o | same dir as input | Output directory or file path |
| --format | -f | pdf | Output format: pdf or md |
| --source | -s | auto | Source language code (auto = detect) |
| --target | -t | zh | Target language code (required; see table below) |
| --provider | -p | openai | Translation provider: openai / anthropic / deepl |
| --model | -m | provider default | Model name (LLM providers) |
| --api-key | -k | env var | API key |
| --base-url | | | Custom base URL for OpenAI-compatible endpoints |
| --concurrency | -c | 8 | Concurrent translation requests |
| --mineru-token | | $MINERU_TOKEN | MinerU Online API token (required when --layout is not provided) |
| --md-mode | | translated | Markdown mode: translated (translation only) / bilingual (original + translation) |
| --keep-cache | | | Keep translation cache and parsed directory (useful for debugging) |
| --verbose | -v | | Info-level logging |
| --debug | -d | | Debug-level logging |

Environment Variables

| Variable | Used by |
| --- | --- |
| OPENAI_API_KEY | --provider openai |
| ANTHROPIC_API_KEY | --provider anthropic |
| DEEPL_API_KEY | --provider deepl |
| MINERU_TOKEN | --mineru-token |
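The mapping between providers and environment variables can be sketched as a lookup with a fallback. This is an assumption-laden illustration: resolve_api_key is a hypothetical helper, and the rule that an explicit --api-key value wins over the environment is inferred from the "env var" default listed for that option rather than confirmed by the source.

```python
import os
from typing import Mapping, Optional

# Provider name -> environment variable, as listed in the table above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepl": "DEEPL_API_KEY",
}


def resolve_api_key(provider: str,
                    cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key for a provider (hypothetical helper)."""
    env = os.environ if env is None else env
    if cli_value:
        # Assumption: an explicit --api-key takes precedence over the environment.
        return cli_value
    if provider not in ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    value = env.get(ENV_VARS[provider])
    if not value:
        raise RuntimeError(f"set {ENV_VARS[provider]} or pass --api-key")
    return value


if __name__ == "__main__":
    print(resolve_api_key("deepl", env={"DEEPL_API_KEY": "example-key"}))
```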

Supported Languages

Both --source and --target accept BCP 47 language codes. --source additionally accepts auto (default) for automatic detection. --target does not accept auto.

| Code | Language | Code | Language |
| --- | --- | --- | --- |
| zh | Chinese (Simplified) | en | English |
| zh-hans | Chinese (Simplified) | zh-hant | Chinese (Traditional) |
| zh-cn | Chinese (Mainland) | zh-tw | Chinese (Taiwan) |
| zh-hk | Chinese (Hong Kong) | ja | Japanese |
| ko | Korean | ar | Arabic |
| de | German | de-de | German (Germany) |
| de-at | German (Austria) | de-ch | German (Switzerland) |
| fr | French | fr-fr | French (France) |
| fr-ca | French (Canada) | es | Spanish |
| es-es | Spanish (Spain) | es-mx | Spanish (Mexico) |
| pt | Portuguese | pt-br | Portuguese (Brazil) |
| pt-pt | Portuguese (Portugal) | ru | Russian |
| it | Italian | nl | Dutch |
| pl | Polish | sv | Swedish |
| da | Danish | nb | Norwegian Bokmål |
| nn | Norwegian Nynorsk | fi | Finnish |
| cs | Czech | sk | Slovak |
| hu | Hungarian | ro | Romanian |
| bg | Bulgarian | hr | Croatian |
| sl | Slovenian | sr | Serbian |
| el | Greek | tr | Turkish |
| uk | Ukrainian | he | Hebrew |
| fa | Persian | hi | Hindi |
| bn | Bengali | th | Thai |
| vi | Vietnamese | id | Indonesian |
| ms | Malay | ca | Catalan |
| af | Afrikaans | sw | Swahili |
| eo | Esperanto | ga | Irish |
| cy | Welsh | is | Icelandic |
| mk | Macedonian | sq | Albanian |
| hy | Armenian | ka | Georgian |
| et | Estonian | lv | Latvian |
| lt | Lithuanian | mt | Maltese |
| gl | Galician | ur | Urdu |
| ml | Malayalam | kn | Kannada |
| gu | Gujarati | mr | Marathi |
| ta | Tamil | te | Telugu |

Note: DeepL supports a subset of the above codes. For unsupported codes, use --provider openai or --provider anthropic. LLM providers accept any code in the table.
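The rules for --source and --target can be expressed as a small validator. The sketch below is illustrative only: validate_language is a hypothetical name, SUPPORTED holds just an excerpt of the table, and the case-insensitive normalization is an assumption rather than documented behavior.

```python
# Excerpt of the language table; a real check would include every listed code.
SUPPORTED = frozenset({
    "zh", "zh-hans", "zh-hant", "en", "ja", "ko",
    "de", "fr", "es", "pt-br", "ru",
})


def validate_language(code: str, *, is_target: bool) -> str:
    """Normalize a language code and enforce the 'auto' rule (hypothetical helper)."""
    normalized = code.strip().lower()  # assumption: codes are case-insensitive
    if normalized == "auto":
        if is_target:
            raise ValueError("--target does not accept 'auto'")
        return normalized
    if normalized not in SUPPORTED:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized


if __name__ == "__main__":
    print(validate_language("auto", is_target=False))
    print(validate_language("ZH-Hant", is_target=True))
```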

Pipeline

7-step pipeline:

  1. Parse layout.json (produced by MinerU Online API) → Pydantic LayoutDocument
  2. Extract fonts from source PDF via PyMuPDF (optional)
  3. Collect translation units with <eq> placeholders for inline equations
  4. Translate via pluggable engine (batch + cache)
  5. Fit text into bounding boxes — two-pass font sizing + whitespace expansion
  6. Resolve overlaps per page (shrink font → compress leading → push down)
  7. Render PDF with ReportLab Canvas
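Step 3 can be illustrated with a short sketch of placeholder protection: inline equations are swapped for tags before translation and restored afterwards. The numbered <eq>N</eq> form, the span representation, and both function names are assumptions made for illustration; the source only states that <eq> placeholders are used.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[str, str]  # (kind, content); kind is "text" or "equation"

_EQ_TAG = re.compile(r"<eq>(\d+)</eq>")


def collect_unit(spans: Sequence[Span]) -> Tuple[str, List[str]]:
    """Join spans into one translatable string, replacing equations with tags."""
    parts: List[str] = []
    equations: List[str] = []
    for kind, content in spans:
        if kind == "equation":
            parts.append(f"<eq>{len(equations)}</eq>")
            equations.append(content)
        else:
            parts.append(content)
    return "".join(parts), equations


def restore_equations(translated: str, equations: Sequence[str]) -> str:
    """Put the original equations back where the tags survived translation."""
    def _sub(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        # Leave unknown indices untouched rather than failing.
        return equations[index] if index < len(equations) else match.group(0)
    return _EQ_TAG.sub(_sub, translated)


if __name__ == "__main__":
    text, eqs = collect_unit([("text", "Energy "), ("equation", "E=mc^2"),
                              ("text", " is conserved.")])
    print(text)
    print(restore_equations("能量 <eq>0</eq> 守恒。", eqs))
```

Numbering the tags lets the translator reorder equations within a sentence without losing track of which is which.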

Translation Strategy

| Block Type | Decision |
| --- | --- |
| text, title, index | Translate |
| header, footer, page_number | Translate |
| table_body (HTML) | Translate (preserve HTML structure) |
| table_caption, image_caption | Translate |
| interline_equation, code | Preserve |
| image_body | Preserve (render image at position) |
| ref_text (references) | Preserve |
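The decision table above amounts to a lookup by block type. A minimal sketch follows; should_translate is a hypothetical helper, and raising an error for unlisted block types is an assumption, since the source does not say how unknown types are handled.

```python
# Block types taken from the table above.
TRANSLATE = frozenset({
    "text", "title", "index",
    "header", "footer", "page_number",
    "table_body", "table_caption", "image_caption",
})
PRESERVE = frozenset({"interline_equation", "code", "image_body", "ref_text"})


def should_translate(block_type: str) -> bool:
    """Return True if a block of this type is sent to the translation engine."""
    if block_type in TRANSLATE:
        return True
    if block_type in PRESERVE:
        return False
    # Assumption: fail loudly on types the table does not cover.
    raise ValueError(f"unknown block type: {block_type!r}")


if __name__ == "__main__":
    for name in ("title", "code", "ref_text"):
        print(name, should_translate(name))
```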

Testing

pytest tests/ -v
pytest tests/test_parser.py -v  # single file

License

Apache 2.0


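Chinese Documentation

A layout-preserving PDF translation tool built on the MinerU Online API.

It translates PDFs while keeping block positions, fonts, images, tables, and equations in place. DeepL, OpenAI-compatible LLMs, and Anthropic are supported as translation backends.

How It Works

PDF ──► [MinerU Online API] ──► layout.json ──► mineru-translate ──► translated.pdf

mineru-translate depends on the MinerU Online API to parse PDFs. MinerU handles OCR, layout analysis, table detection, and equation extraction, and produces a structured layout.json, which mineru-translate then translates and re-renders.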

Usage:

  • Pass a layout.json you already have (-l layout.json)
  • Let mineru-translate call the MinerU Online API automatically (--mineru-token $MINERU_TOKEN)


Preview

[Figure: side-by-side preview of an original PDF page and the translated PDF page]

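Features

  • Layout fidelity: block positions, page sizes, font styles, and reading order are preserved
  • Adaptive font sizing: when the translated text is longer, fonts shrink automatically and text expands into surrounding whitespace
  • Image and equation preservation: images are rendered at their original positions; interline equations use the source images; inline equations are converted to Unicode
  • HTML table translation: table structure (rowspan / colspan) is maintained via ReportLab
  • Rotation support: text and tables rotated by 90° / 270° are handled correctly
  • Overlap detection and resolution: a per-page cascade algorithm detects and resolves collisions between blocks
  • Translation cache: a JSON file-based cache avoids repeated translation
  • Multiple engines: DeepL, OpenAI-compatible LLMs, Anthropic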
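A file-based JSON translation cache of the kind listed above can be sketched in a few lines. This is not the package's cache: the class name, the key scheme (a SHA-256 hash over provider, target language, and source text), and the single-file layout are assumptions chosen for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Optional


class TranslationCache:
    """Minimal JSON-file cache keyed by provider, target language, and text."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.entries: Dict[str, str] = {}
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))

    @staticmethod
    def key(text: str, target: str, provider: str) -> str:
        # Join with a separator unlikely to appear in text, then hash.
        raw = "\x1f".join((provider, target, text))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, text: str, target: str, provider: str) -> Optional[str]:
        return self.entries.get(self.key(text, target, provider))

    def put(self, text: str, target: str, provider: str, translation: str) -> None:
        self.entries[self.key(text, target, provider)] = translation

    def save(self) -> None:
        self.path.write_text(
            json.dumps(self.entries, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
```

Because the key includes the target language and provider, changing either one results in a cache miss instead of returning a stale translation.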

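Installation

pip install mineru-translate

System fonts: Chinese output requires Noto Sans CJK SC. On Windows, SimHei and SimSun are used automatically.

Note: layout reproduction is still being improved. If you run into poor results or obvious layout problems, please open an issue or send feedback with sample files / screenshots.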

Quick Start

# Auto-parse via MinerU Online API, then translate (recommended)
mineru-translate doc.pdf -t zh --provider deepl --api-key $DEEPL_API_KEY \
  --mineru-token $MINERU_TOKEN

# Use an existing layout.json (skip the MinerU Online API call)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl --api-key $DEEPL_API_KEY

# Output to a specific directory
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
  --api-key $DEEPL_API_KEY -o /output/dir/

# Export as Markdown
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
  --api-key $DEEPL_API_KEY --format md -o /output/dir/

# Export bilingual Markdown (original + translation)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
  --api-key $DEEPL_API_KEY --format md --md-mode bilingual

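CLI Options

mineru-translate PDF [OPTIONS]

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --layout | -l | | Path to an existing MinerU layout.json (skips the API call) |
| --output | -o | <pdf>_translated.pdf | Output path: .pdf writes a PDF, .md writes Markdown; a directory is also accepted |
| --format | -f | pdf | Output format: pdf or md |
| --source | -s | auto | Source language code (auto = detect automatically) |
| --target | -t | zh | Target language code (required; see table below) |
| --provider | -p | openai | Translation provider: openai / anthropic / deepl |
| --model | -m | provider default | Model name (LLM providers) |
| --api-key | -k | env var | API key |
| --base-url | | | Custom base URL for OpenAI-compatible endpoints |
| --concurrency | -c | 8 | Number of concurrent translation requests |
| --mineru-token | | $MINERU_TOKEN | MinerU Online API token (required when --layout is not given) |
| --md-mode | | translated | Markdown mode: translated (translation only) / bilingual (original + translation) |
| --keep-cache | | | Keep the translation cache and parsed directory (useful for debugging) |
| --verbose | -v | | Info-level logging |
| --debug | -d | | Debug-level logging |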

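Supported Languages

Both --source and --target accept BCP 47 language codes. --source additionally accepts auto (the default, automatic detection); --target does not accept auto.

| Code | Language | Code | Language |
| --- | --- | --- | --- |
| zh | Chinese (Simplified) | zh-hans | Chinese (Simplified) |
| zh-hant | Chinese (Traditional) | zh-cn | Chinese (Mainland) |
| zh-tw | Chinese (Taiwan) | zh-hk | Chinese (Hong Kong) |
| en | English | en-us | English (United States) |
| en-gb | English (United Kingdom) | ja | Japanese |
| ko | Korean | ar | Arabic |
| de | German | fr | French |
| es | Spanish | pt | Portuguese |
| pt-br | Portuguese (Brazil) | ru | Russian |
| it | Italian | nl | Dutch |
| pl | Polish | sv | Swedish |
| tr | Turkish | uk | Ukrainian |
| vi | Vietnamese | th | Thai |
| id | Indonesian | hi | Hindi |

For the full list of codes, see the language table in the English section.

Note: DeepL supports only some of these codes. To translate into other languages, use --provider openai or --provider anthropic.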

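Pipeline

  1. Parse the layout.json produced by the MinerU Online API → Pydantic model
  2. Extract fonts: read font information from the source PDF (optional)
  3. Collect translation units: merge text and protect inline equations with <eq> tags
  4. Translate: batch translation through the translation engine (with caching)
  5. Fit layout: two-pass font sizing + whitespace analysis + bbox expansion
  6. Resolve overlaps: detect collisions per page and resolve them by shrinking the font → compressing leading → pushing blocks down
  7. Render PDF: produce the final PDF with ReportLab Canvas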
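As a rough illustration of step 5, the sketch below shrinks a font until an estimated text block fits its bounding box. It is a simplified single-pass stand-in, not the two-pass algorithm named above: the function names, the fixed character-width ratio, and the leading ratio are assumptions, and a real implementation would measure glyph widths with the rendering library.

```python
import math


def fits(text: str, width: float, height: float, font_size: float,
         char_width_ratio: float = 0.5, leading_ratio: float = 1.2) -> bool:
    """Estimate whether wrapped text fits a box at the given font size."""
    # Assumption: every character is about char_width_ratio * font_size wide.
    chars_per_line = max(1, int(width // (font_size * char_width_ratio)))
    lines = math.ceil(len(text) / chars_per_line)
    return lines * font_size * leading_ratio <= height


def fit_font_size(text: str, width: float, height: float, start_size: float,
                  min_size: float = 4.0, step: float = 0.5) -> float:
    """Shrink from start_size in fixed steps until the text fits or min_size is hit."""
    size = start_size
    while size > min_size and not fits(text, width, height, size):
        size -= step
    return max(size, min_size)


if __name__ == "__main__":
    print(fit_font_size("short caption", 200, 50, 12))
    print(fit_font_size("x" * 400, 200, 50, 12))
```

If the text still does not fit at min_size, the function returns min_size anyway; this is the point where whitespace expansion or overlap resolution would have to take over.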

Testing

pytest tests/ -v

License

Apache 2.0
