A PDF to HTML converter focused on layout preservation and table extraction.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

PDF to HTML Engine

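A high-performance PDF-to-HTML engine that faithfully reproduces PDF layout and extracts tables, with optimizations for Chinese-language documents and complex forms.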

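Core Features

Accurate layout reproduction: computes text coordinates and uses padding-left with dynamic font sizes to reproduce the PDF's original visual layout.
Smart table extraction: uses pdfplumber to detect tables, with support for complex merged cells (rowspan/colspan) and text alignment.
Underline handling: automatically detects fill-in-the-blank underlines (both those typed in the text and those drawn as graphics) and preserves them in the HTML.
Chinese font support: built-in CSS tuning for common Chinese fonts (such as Microsoft YaHei, SimSun, and STXihei).
Modular design: consists of a loader, a layout engine, a table extractor, and a renderer, making it easy to extend and integrate.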
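
The underline handling described above can be illustrated with a small sketch. It is not the package's actual implementation: the function names, the three-underscore minimum, and the size thresholds are all assumptions chosen for illustration. It treats a run of underscores in text as a typed blank, and a very flat, reasonably wide rectangle as a drawn one.

```python
import re

# Assumption: three or more consecutive underscores count as a typed blank.
BLANK_RUN = re.compile(r"_{3,}")


def text_blanks(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of underscore runs in text."""
    return [match.span() for match in BLANK_RUN.finditer(text)]


def is_drawn_underline(rect, max_height: float = 1.5, min_width: float = 10.0) -> bool:
    """Heuristic: a drawn underline is a rectangle that is flat and wide.

    rect is (x0, top, x1, bottom) in PDF points; both thresholds are
    hypothetical values, not taken from the package.
    """
    x0, top, x1, bottom = rect
    return (bottom - top) <= max_height and (x1 - x0) >= min_width
```

A renderer could then wrap each detected range in an element styled with a bottom border so the blank survives in the HTML.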

Tech Stack

Language: Python 3.12+
PDF processing: PyMuPDF (fitz) & pdfplumber
Template engine: Jinja2
Layout algorithm: coordinate-based line grouping and spacing calculation

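Quick Start

1. Environment Setup

A virtual environment is recommended:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (macOS/Linux)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt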

2. Command-Line Usage (for debugging)

# Convert the whole document
python main.py path/to/your/input.pdf -o output.html

# Convert only the specified pages (accepts 1,2,3 or 1-5 formats)
python main.py path/to/your/input.pdf -p 1,3-5 -o output.html
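
To make the page-selection format concrete, here is a minimal sketch of how a spec such as 1,3-5 could be expanded into page numbers. The function name parse_page_spec is hypothetical, and the sketch assumes page numbers are 1-based and ranges are inclusive; the actual parsing in main.py may differ.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec like "1,3-5" into sorted, de-duplicated page numbers.

    Assumptions (not confirmed by the package): pages are 1-based and
    ranges such as "3-5" include both endpoints.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

With these assumptions, the spec from the command above expands to pages 1, 3, 4, and 5.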

3. Package Integration (API)

You can call the package directly from other Python projects to obtain HTML snippets:

from pdf2html_engine import parse_pdf

# Convert the specified pages and return HTML snippets
file_path = "example.pdf"
html_snippets = parse_pdf(file_path, pages=[1, 2])

print(html_snippets)

If you need more control, use the PDFConverter class:

from pdf2html_engine import PDFConverter

converter = PDFConverter("example.pdf")

# Get HTML snippets (default)
snippets = converter.convert(pages=[1, 2])

# Get a complete HTML document (including <html><body> and related tags)
full_doc = converter.convert(pages=[1, 2], full_html=True)

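Project Structure

pdf2html-engine/
├── main.py              # Program entry point, command-line tool
├── requirements.txt     # Project dependencies
├── pdf2html_engine/     # Core parsing engine package
│   ├── __init__.py      # Exposes the core API (PDFConverter, parse_pdf)
│   ├── converter.py     # Conversion pipeline coordinator, handles cross-page logic
│   ├── pdf_loader.py    # Loads PDF files with both engines (PyMuPDF + pdfplumber)
│   ├── layout_engine.py # Layout analysis: text, font size, and position
│   ├── table_extractor.py # Table detection and HTML conversion, handles merged cells
│   └── html_renderer.py # Jinja2-based HTML generator
└── venv/                # Virtual environment directory (ignored)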

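Architecture

1. Text and Layout (LayoutEngine)

The engine uses PyMuPDF to retrieve every text span in the PDF. Using each span's bbox coordinates, it groups spans with similar vertical positions into a single line. For each line, padding-left sets the horizontal position relative to the page width, and margin-top reproduces the line spacing.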
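
The grouping and spacing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the LayoutEngine code: the Span class, the function names, the 2-point tolerance, and the choice of percent units for padding-left are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a text span with bbox coordinates in points."""
    text: str
    x0: float
    top: float
    bottom: float


def group_into_lines(spans: list[Span], y_tolerance: float = 2.0) -> list[list[Span]]:
    """Group spans whose top is within y_tolerance of the current line's first span."""
    lines: list[list[Span]] = []
    for span in sorted(spans, key=lambda s: (s.top, s.x0)):
        if lines and abs(span.top - lines[-1][0].top) <= y_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    for line in lines:
        line.sort(key=lambda s: s.x0)  # left-to-right reading order
    return lines


def line_styles(lines: list[list[Span]], page_width: float) -> list[dict]:
    """Derive padding-left (percent of page width) and margin-top (points) per line."""
    styles = []
    previous_bottom = 0.0
    for line in lines:
        left = min(s.x0 for s in line)
        top = min(s.top for s in line)
        styles.append({
            "text": " ".join(s.text for s in line),
            "padding_left_pct": round(left / page_width * 100, 2),
            "margin_top_pt": round(max(top - previous_bottom, 0.0), 2),
        })
        previous_bottom = max(s.bottom for s in line)
    return styles
```

Measuring margin-top from the bottom of the previous line, rather than from the page top, keeps each line's offset independent of how the earlier lines render.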

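2. Table Processing (TableExtractor)

Table location relies on pdfplumber. By analyzing each table's cell structure, the algorithm identifies merged cells and converts them to HTML rowspan and colspan attributes. The engine also detects thin lines inside cells and treats them as fill-in-the-blank underlines.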
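
One way to derive rowspan and colspan from cell geometry is sketched below. It assumes each cell is available as an (x0, top, x1, bottom) bounding box and that a merged cell appears as a single larger box; the function names and the 1-point tolerance are hypothetical, and the package's own algorithm may work differently.

```python
def _unique_edges(values: list[float], tol: float) -> list[float]:
    """Collapse coordinates that differ by at most tol into one grid edge."""
    edges: list[float] = []
    for value in sorted(values):
        if not edges or value - edges[-1] > tol:
            edges.append(value)
    return edges


def _edge_index(edges: list[float], value: float, tol: float) -> int:
    """Return the index of the first grid edge within tol of value."""
    for index, edge in enumerate(edges):
        if abs(edge - value) <= tol:
            return index
    raise ValueError(f"no grid edge near {value}")


def cell_spans(cells: list[tuple[float, float, float, float]], tol: float = 1.0) -> list[dict]:
    """Map each cell bbox to its grid position plus rowspan and colspan."""
    xs = _unique_edges([c[0] for c in cells] + [c[2] for c in cells], tol)
    ys = _unique_edges([c[1] for c in cells] + [c[3] for c in cells], tol)
    result = []
    for x0, top, x1, bottom in cells:
        col = _edge_index(xs, x0, tol)
        row = _edge_index(ys, top, tol)
        result.append({
            "row": row,
            "col": col,
            # A cell covering several grid intervals is a merged cell.
            "rowspan": _edge_index(ys, bottom, tol) - row,
            "colspan": _edge_index(xs, x1, tol) - col,
        })
    return sorted(result, key=lambda cell: (cell["row"], cell["col"]))
```

The idea is that every distinct cell edge defines a grid line, so the number of grid intervals a cell covers in each direction gives its span directly.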

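3. Hybrid Ordering

For each page, the engine first extracts the table positions, then excludes the table regions when extracting the ordinary text layout. Finally, it re-sorts the text lines and table blocks by vertical (y-axis) coordinate so that the logical order of the generated HTML matches the original document.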
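
The filter-then-sort step can be sketched as follows. The block shape (a dict with "bbox" and "html" keys), the function names, and the center-point containment test are assumptions made for this example rather than details taken from the package.

```python
def _center_inside(bbox, area) -> bool:
    """True when the center of bbox lies within area; both are (x0, top, x1, bottom)."""
    center_x = (bbox[0] + bbox[2]) / 2
    center_y = (bbox[1] + bbox[3]) / 2
    return area[0] <= center_x <= area[2] and area[1] <= center_y <= area[3]


def merge_blocks(lines: list[dict], tables: list[dict]) -> list[dict]:
    """Drop text lines that fall inside a table, then order all blocks top to bottom."""
    kept = [
        line for line in lines
        if not any(_center_inside(line["bbox"], table["bbox"]) for table in tables)
    ]
    return sorted(kept + tables, key=lambda block: block["bbox"][1])
```

Filtering before sorting avoids emitting table text twice, once as a free-standing line and once inside the rendered table.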

License

MIT License

Release history

- 1.0.7 (Apr 30, 2026)
- 1.0.6 (Apr 30, 2026)
- 1.0.5 (Apr 27, 2026)
- 1.0.4 (Apr 24, 2026)
- 1.0.3 (Apr 24, 2026)
- 1.0.2 (Apr 22, 2026)
- 1.0.1 (Apr 22, 2026)
- 1.0.0 (Apr 22, 2026)
- 0.1.5 (Apr 22, 2026)
- 0.1.4 (Apr 20, 2026)
- 0.1.3 (Apr 20, 2026)
- 0.1.2 (Apr 20, 2026)
- 0.1.1 (Apr 17, 2026), this version
- 0.1.0 (Apr 17, 2026)

Download files

Source Distribution
- pdf2html_engine-0.1.1.tar.gz (17.0 kB), uploaded Apr 17, 2026, Source

Built Distribution
- pdf2html_engine-0.1.1-py3-none-any.whl (17.1 kB), uploaded Apr 17, 2026, Python 3

File details

Details for the file pdf2html_engine-0.1.1.tar.gz.

File metadata

Download URL: pdf2html_engine-0.1.1.tar.gz
Upload date: Apr 17, 2026
Size: 17.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pdf2html_engine-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`e6052fb68bdfa48bad26316e022b1fb060ffc6eacc83193ab06037f16be45f91`
MD5	`ecd5bac452fb579039f2c657f74130f2`
BLAKE2b-256	`d2b3943d5603d88e843741b327f27be434fd492d3acce2cfe91addca4ef2c425`

File details

Details for the file pdf2html_engine-0.1.1-py3-none-any.whl.

File metadata

Download URL: pdf2html_engine-0.1.1-py3-none-any.whl
Upload date: Apr 17, 2026
Size: 17.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pdf2html_engine-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9cd2e7204bbcbe7e87f4de4d18b6979971ad769c9724ec6450a58c768e7b750a`
MD5	`61e81bb20a3e24f3fe6a9deab95b99b9`
BLAKE2b-256	`c8ae4abcec5a5913930291cc9dc398dfa0531c9c51ffcd8cc057e94da92fb3fd`

pdf2html-engine 0.1.1
