PDF layout-preserving translation based on MinerU Online API output
mineru-translate
Layout-preserving PDF translation powered by MinerU Online API.
Translates PDFs while keeping block positions, fonts, images, tables, and equations in place. Supports DeepL, OpenAI-compatible LLMs, and Anthropic as translation backends.
The layout engine is still being improved. If you run into poor rendering quality or obvious layout issues, please open an issue or send feedback with sample files/screenshots.
How It Works
PDF ──► [MinerU Online API] ──► layout.json ──► mineru-translate ──► translated.pdf
mineru-translate relies on MinerU Online API to parse PDFs.
MinerU performs OCR, layout analysis, table detection, and equation extraction, producing a structured layout.json that mineru-translate then translates and re-renders.
You can either:
- Pass a layout.json you already have (-l layout.json)
- Let mineru-translate call the MinerU Online API automatically (--mineru-token $MINERU_TOKEN)
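The two input modes can be illustrated with a small Python helper that assembles the command line. This is only a sketch: `build_command` is a hypothetical name, not part of mineru-translate, and it uses only the `-l`, `-t`, and `--mineru-token` flags documented here.

```python
import shlex
from typing import Optional


def build_command(pdf: str, target: str, layout: Optional[str] = None,
                  mineru_token: Optional[str] = None) -> list[str]:
    """Assemble a mineru-translate argument list for one of the two input modes."""
    cmd = ["mineru-translate", pdf, "-t", target]
    if layout is not None:
        # Mode 1: reuse an existing layout.json, so no MinerU call is needed.
        cmd += ["-l", layout]
    elif mineru_token:
        # Mode 2: let the tool call the MinerU Online API itself.
        cmd += ["--mineru-token", mineru_token]
    else:
        raise ValueError("provide either a layout.json path or a MinerU token")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to execute it for real.
    print(shlex.join(build_command("doc.pdf", "zh", layout="layout.json")))
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.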
MinerU resources:
- API documentation: https://mineru.net/apiManage/docs
- Python SDK: https://github.com/opendatalab/MinerU-Ecosystem/tree/main/sdk/python
Features
- Layout fidelity — block positions, page sizes, font styles, and reading order are preserved
- Adaptive font sizing — automatically shrinks fonts and expands into surrounding whitespace when translated text is longer
- Image & equation preservation — images rendered at original positions; interline equations use source images; inline equations converted to Unicode (Δ, r², Q₅₀)
- HTML table translation — table structure (rowspan / colspan) maintained via ReportLab
- Rotation support — correctly handles 90° / 270° rotated blocks and tables
- Overlap resolution — per-page cascade algorithm detects and resolves block collisions
- Translation cache — file-based JSON cache avoids re-translating unchanged content
- Multiple providers — DeepL, OpenAI-compatible LLMs, Anthropic
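To make the translation-cache idea concrete, here is a minimal sketch of a file-based JSON cache. It does not reflect the actual cache format used by mineru-translate: the class name `JsonTranslationCache`, the SHA-256 key scheme, and the `engine` callable are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class JsonTranslationCache:
    """File-backed cache: one JSON file mapping content hashes to translations."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load earlier results if the cache file already exists.
        self.entries: dict[str, str] = (
            json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        )

    @staticmethod
    def key(text: str, target: str) -> str:
        # Hash target language plus source text, so edited text gets a new key.
        return hashlib.sha256(f"{target}\n{text}".encode("utf-8")).hexdigest()

    def translate(self, text: str, target: str,
                  engine: Callable[[str, str], str]) -> str:
        k = self.key(text, target)
        if k not in self.entries:
            # Only unseen content reaches the (possibly paid) engine.
            self.entries[k] = engine(text, target)
            self.path.write_text(
                json.dumps(self.entries, ensure_ascii=False), encoding="utf-8"
            )
        return self.entries[k]


if __name__ == "__main__":
    calls = []

    def fake_engine(text: str, target: str) -> str:
        # Stand-in for a real provider call; records each invocation.
        calls.append(text)
        return f"[{target}] {text}"

    with tempfile.TemporaryDirectory() as tmp:
        cache = JsonTranslationCache(Path(tmp) / "cache.json")
        cache.translate("Hello", "zh", fake_engine)
        cache.translate("Hello", "zh", fake_engine)  # served from the cache
        print(len(calls))
```

Keying on a hash of the content means unchanged blocks are never re-sent, while any edit to a block naturally invalidates its entry.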
Installation
pip install mineru-translate
System fonts: CJK output requires Noto Sans CJK SC. On Windows, SimHei and SimSun are used automatically.
Quick Start
# Auto-parse via MinerU Online API, then translate (recommended)
mineru-translate doc.pdf -t zh --provider deepl --api-key $DEEPL_API_KEY \
--mineru-token $MINERU_TOKEN
# Use a pre-parsed layout.json (skip MinerU Online API call)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl --api-key $DEEPL_API_KEY
# Output to a specific directory
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
--api-key $DEEPL_API_KEY -o /output/dir/
# Export as Markdown
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
--api-key $DEEPL_API_KEY --format md -o /output/dir/
# Export bilingual Markdown (original + translation)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
--api-key $DEEPL_API_KEY --format md --md-mode bilingual
CLI Options
mineru-translate PDF [OPTIONS]
| Option | Short | Default | Description |
|---|---|---|---|
| --layout | -l | none | Path to pre-parsed MinerU layout.json (skips MinerU Online API) |
| --output | -o | same dir as input | Output directory or file path |
| --format | -f | pdf | Output format: pdf or md |
| --source | -s | auto | Source language code (auto = detect) |
| --target | -t | zh | Target language code (required; see table below) |
| --provider | -p | openai | Translation provider: openai / anthropic / deepl |
| --model | -m | provider default | Model name (LLM providers) |
| --api-key | -k | env var | API key |
| --base-url | | none | Custom base URL for OpenAI-compatible endpoints |
| --concurrency | -c | 8 | Concurrent translation requests |
| --mineru-token | | $MINERU_TOKEN | MinerU Online API token (required when --layout is not provided) |
| --md-mode | | translated | Markdown mode: translated (translation only) / bilingual (original + translation) |
| --keep-cache | | | Keep translation cache and parsed directory (useful for debugging) |
| --verbose | -v | | Info-level logging |
| --debug | -d | | Debug-level logging |
Environment Variables
| Variable | Used by |
|---|---|
| OPENAI_API_KEY | --provider openai |
| ANTHROPIC_API_KEY | --provider anthropic |
| DEEPL_API_KEY | --provider deepl |
| MINERU_TOKEN | --mineru-token |
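The mapping between providers and environment variables can be sketched as a lookup with a fallback. This is an illustration, not the tool's own code: `resolve_api_key` is a hypothetical helper, and the rule that an explicit `--api-key` value wins over the environment is an assumption inferred from the "env var" default in the options table.

```python
import os
from typing import Mapping, Optional

# Provider name -> environment variable, as listed in the table above.
ENV_VAR_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepl": "DEEPL_API_KEY",
}


def resolve_api_key(provider: str, cli_key: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the API key for a provider, preferring an explicit value."""
    if cli_key:
        # Assumption: a key passed on the command line takes precedence.
        return cli_key
    if provider not in ENV_VAR_BY_PROVIDER:
        raise ValueError(f"unknown provider: {provider!r}")
    var = ENV_VAR_BY_PROVIDER[provider]
    key = env.get(var)
    if not key:
        raise RuntimeError(f"no API key given and {var} is not set")
    return key


if __name__ == "__main__":
    # Use an explicit mapping so the demo does not depend on the real environment.
    print(resolve_api_key("deepl", env={"DEEPL_API_KEY": "demo-key"}))
```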
Supported Languages
Both --source and --target accept BCP 47 language codes. --source additionally accepts auto (default) for automatic detection. --target does not accept auto.
| Code | Language | Code | Language |
|---|---|---|---|
| zh | Chinese (Simplified) | en | English |
| zh-hans | Chinese (Simplified) | zh-hant | Chinese (Traditional) |
| zh-cn | Chinese (Mainland) | zh-tw | Chinese (Taiwan) |
| zh-hk | Chinese (Hong Kong) | ja | Japanese |
| ko | Korean | ar | Arabic |
| de | German | de-de | German (Germany) |
| de-at | German (Austria) | de-ch | German (Switzerland) |
| fr | French | fr-fr | French (France) |
| fr-ca | French (Canada) | es | Spanish |
| es-es | Spanish (Spain) | es-mx | Spanish (Mexico) |
| pt | Portuguese | pt-br | Portuguese (Brazil) |
| pt-pt | Portuguese (Portugal) | ru | Russian |
| it | Italian | nl | Dutch |
| pl | Polish | sv | Swedish |
| da | Danish | nb | Norwegian Bokmål |
| nn | Norwegian Nynorsk | fi | Finnish |
| cs | Czech | sk | Slovak |
| hu | Hungarian | ro | Romanian |
| bg | Bulgarian | hr | Croatian |
| sl | Slovenian | sr | Serbian |
| el | Greek | tr | Turkish |
| uk | Ukrainian | he | Hebrew |
| fa | Persian | hi | Hindi |
| bn | Bengali | th | Thai |
| vi | Vietnamese | id | Indonesian |
| ms | Malay | ca | Catalan |
| af | Afrikaans | sw | Swahili |
| eo | Esperanto | ga | Irish |
| cy | Welsh | is | Icelandic |
| mk | Macedonian | sq | Albanian |
| hy | Armenian | ka | Georgian |
| et | Estonian | lv | Latvian |
| lt | Lithuanian | mt | Maltese |
| gl | Galician | ur | Urdu |
| ml | Malayalam | kn | Kannada |
| gu | Gujarati | mr | Marathi |
| ta | Tamil | te | Telugu |
Note: DeepL supports only a subset of the codes above. For codes it does not support, use --provider openai or --provider anthropic. LLM providers accept any code in the table.
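The rule that --source accepts auto while --target does not can be sketched as a small validator. This is a hypothetical helper, not the tool's implementation: `SAMPLE_CODES` holds only a few codes from the table for brevity, and the case-insensitive normalization is an assumption.

```python
# A shortened sample of the language table; the real list is longer.
SAMPLE_CODES = {"zh", "zh-hans", "zh-hant", "en", "ja", "ko", "de", "fr", "es", "pt-br"}


def normalize_lang(code: str, *, allow_auto: bool) -> str:
    """Validate a language code; 'auto' is accepted only when allow_auto is True."""
    # Assumption: codes are compared case-insensitively with '-' separators.
    normalized = code.strip().lower().replace("_", "-")
    if normalized == "auto":
        if not allow_auto:
            raise ValueError("'auto' is only valid for --source")
        return normalized
    if normalized not in SAMPLE_CODES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized


if __name__ == "__main__":
    print(normalize_lang("auto", allow_auto=True))      # --source side
    print(normalize_lang("ZH-Hans", allow_auto=False))  # --target side
```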
Pipeline
7-step pipeline:
- Parse layout.json (produced by MinerU Online API) → Pydantic LayoutDocument
- Extract fonts from source PDF via PyMuPDF (optional)
- Collect translation units with <eq> placeholders for inline equations
- Translate via pluggable engine (batch + cache)
- Fit text into bounding boxes: two-pass font sizing + whitespace expansion
- Resolve overlaps per page (shrink font → compress leading → push down)
- Render PDF with ReportLab Canvas
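The placeholder step is the least obvious part of the pipeline, so here is a rough sketch of how inline equations might be shielded from the translation engine and restored afterwards. The exact placeholder syntax used by mineru-translate is not documented here; the numbered `<eq>N</eq>` form, the `(kind, content)` span tuples, and the function names `protect` and `restore` are assumptions for illustration.

```python
import re

# Hypothetical placeholder format: <eq>0</eq>, <eq>1</eq>, ...
EQ_TAG = re.compile(r"<eq>(\d+)</eq>")


def protect(spans: list[tuple[str, str]]) -> tuple[str, list[str]]:
    """Join spans into one translation unit, swapping equations for placeholders.

    Each span is (kind, content) with kind "text" or "inline_equation".
    """
    parts: list[str] = []
    equations: list[str] = []
    for kind, content in spans:
        if kind == "inline_equation":
            # The engine only ever sees the numbered tag, never the formula.
            parts.append(f"<eq>{len(equations)}</eq>")
            equations.append(content)
        else:
            parts.append(content)
    return "".join(parts), equations


def restore(translated: str, equations: list[str]) -> str:
    """Put the original equations back wherever their placeholders ended up."""
    return EQ_TAG.sub(lambda m: equations[int(m.group(1))], translated)


if __name__ == "__main__":
    unit, eqs = protect([
        ("text", "The energy "),
        ("inline_equation", "E = mc^2"),
        ("text", " is conserved."),
    ])
    print(unit)
    # Pretend the engine returned this translation with the tag intact.
    print(restore("能量 <eq>0</eq> 守恒。", eqs))
```

Numbering the placeholders lets the translation reorder equations within a sentence without losing track of which formula belongs where.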
Translation Strategy
| Block Type | Decision |
|---|---|
| text, title, index | Translate |
| header, footer, page_number | Translate |
| table_body (HTML) | Translate (preserve HTML structure) |
| table_caption, image_caption | Translate |
| interline_equation, code | Preserve |
| image_body | Preserve (render image at position) |
| ref_text (references) | Preserve |
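The decision table above can be expressed as a simple lookup. The sketch below is illustrative only: the block-type names come from the table, but `should_translate` and the choice to raise on unknown types are assumptions, not the tool's actual behavior.

```python
# Block types taken from the strategy table above.
TRANSLATE_TYPES = {
    "text", "title", "index",
    "header", "footer", "page_number",
    "table_body",            # HTML structure is kept, cell text is translated
    "table_caption", "image_caption",
}
PRESERVE_TYPES = {
    "interline_equation", "code",
    "image_body",            # rendered as an image at its original position
    "ref_text",              # reference entries stay in the source language
}


def should_translate(block_type: str) -> bool:
    """Return True if a block of this type is sent to the translation engine."""
    if block_type in TRANSLATE_TYPES:
        return True
    if block_type in PRESERVE_TYPES:
        return False
    # Assumption: fail loudly on types the table does not cover.
    raise ValueError(f"unknown block type: {block_type!r}")


if __name__ == "__main__":
    for name in ("title", "code", "ref_text"):
        print(name, should_translate(name))
```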
Testing
pytest tests/ -v
pytest tests/test_parser.py -v # single file
License
Apache 2.0
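Chinese Documentation (中文说明)
A layout-preserving PDF translation tool built on the MinerU Online API.
Translates PDFs while keeping block positions, fonts, images, tables, and equations unchanged. Supports DeepL, OpenAI-compatible LLMs, and Anthropic as translation backends.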
How It Works
PDF ──► [MinerU Online API] ──► layout.json ──► mineru-translate ──► translated.pdf
mineru-translate relies on the MinerU Online API to parse PDFs.
MinerU handles OCR, layout analysis, table detection, and equation extraction, and produces a structured layout.json, which mineru-translate then translates and re-renders.
Usage:
- Pass an existing layout.json (-l layout.json)
- Let mineru-translate call the MinerU Online API automatically (--mineru-token $MINERU_TOKEN)
MinerU resources:
- API documentation: https://mineru.net/apiManage/docs
- Python SDK: https://github.com/opendatalab/MinerU-Ecosystem/tree/main/sdk/python
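Features
- Layout fidelity: block positions, page sizes, font styles, and reading order are preserved
- Adaptive font sizing: when the translated text is longer, fonts shrink automatically and text expands into surrounding whitespace
- Image and equation preservation: images are rendered in place; interline equations use the original images; inline equations are converted to Unicode
- HTML table translation: table structure (rowspan / colspan) is maintained via ReportLab
- Rotated block support: correctly handles text and tables rotated by 90° / 270°
- Overlap detection and resolution: a per-page cascade algorithm detects and resolves collisions between blocks
- Translation cache: a JSON file-based cache avoids repeated translation
- Multiple engines: DeepL, OpenAI-compatible LLMs, Anthropic
Installation
pip install mineru-translate
System fonts: Chinese output requires Noto Sans CJK SC. On Windows, SimHei and SimSun are used automatically.
Note: Layout fidelity is still being improved. If you run into poor results or obvious layout problems, please open an issue or send feedback with sample files / screenshots.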
Quick Start
# Auto-parse via MinerU Online API, then translate (recommended)
mineru-translate doc.pdf -t zh --provider deepl --api-key $DEEPL_API_KEY \
--mineru-token $MINERU_TOKEN
# Use an existing layout.json (skip the MinerU Online API call)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl --api-key $DEEPL_API_KEY
# Output to a specific directory
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
--api-key $DEEPL_API_KEY -o /output/dir/
# Export as Markdown
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
--api-key $DEEPL_API_KEY --format md -o /output/dir/
# Export bilingual Markdown (original + translation)
mineru-translate doc.pdf -l layout.json -t zh --provider deepl \
--api-key $DEEPL_API_KEY --format md --md-mode bilingual
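CLI Options
mineru-translate PDF [OPTIONS]
| Option | Short | Default | Description |
|---|---|---|---|
| --layout | -l | none | Path to an existing MinerU layout.json (skips the API call) |
| --output | -o | <pdf>_translated.pdf | Output path: a .pdf path writes a PDF, a .md path writes Markdown; a directory is also accepted |
| --format | -f | pdf | Output format: pdf or md |
| --source | -s | auto | Source language code (auto = detect) |
| --target | -t | zh | Target language code (required; see table below) |
| --provider | -p | openai | Translation provider: openai / anthropic / deepl |
| --model | -m | provider default | Model name (LLM providers) |
| --api-key | -k | env var | API key |
| --base-url | | none | Custom base URL for OpenAI-compatible endpoints |
| --concurrency | -c | 8 | Concurrent translation requests |
| --mineru-token | | $MINERU_TOKEN | MinerU Online API token (required when --layout is not provided) |
| --md-mode | | translated | Markdown mode: translated (translation only) / bilingual (original + translation) |
| --keep-cache | | | Keep translation cache and parsed directory (useful for debugging) |
| --verbose | -v | | Info-level logging |
| --debug | -d | | Debug-level logging |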
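Supported Languages
Both --source and --target accept BCP 47 language codes. --source additionally accepts auto (default, automatic detection); --target does not accept auto.
| Code | Language | Code | Language |
|---|---|---|---|
| zh | Chinese (Simplified) | zh-hans | Chinese (Simplified) |
| zh-hant | Chinese (Traditional) | zh-cn | Chinese (Mainland) |
| zh-tw | Chinese (Taiwan) | zh-hk | Chinese (Hong Kong) |
| en | English | en-us | English (United States) |
| en-gb | English (United Kingdom) | ja | Japanese |
| ko | Korean | ar | Arabic |
| de | German | fr | French |
| es | Spanish | pt | Portuguese |
| pt-br | Portuguese (Brazil) | ru | Russian |
| it | Italian | nl | Dutch |
| pl | Polish | sv | Swedish |
| tr | Turkish | uk | Ukrainian |
| vi | Vietnamese | th | Thai |
| id | Indonesian | hi | Hindi |
See the language table in the English section for the full list of codes.
Note: DeepL supports only some of these codes. To translate into other languages, use --provider openai or --provider anthropic.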
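Pipeline
- Parse the layout.json produced by the MinerU Online API → Pydantic models
- Extract fonts: extract font information from the original PDF (optional)
- Collect translation units: merge text and protect inline equations with <eq> tags
- Translate: batch translation through the translation engine (with caching)
- Fit layout: two-pass font sizing + whitespace analysis + bbox expansion
- Resolve overlaps: detect collisions per page and resolve them by shrinking fonts → compressing leading → pushing blocks down
- Render PDF: output the final PDF with ReportLab Canvas
Testing
pytest tests/ -v
License
Apache 2.0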