Intelligent PDF to Markdown converter using glmocr SDK

markitdown-glmocr

An intelligent PDF-to-Markdown plugin, with image and table extraction powered by the glmocr SDK (Zhipu GLM-OCR).

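Features

🔍 Smart detection: automatically identifies each page's content type (plain text vs. images/tables)
📄 Default parsing: plain-text pages are extracted with pdfplumber/pdfminer, which is fast and cheap
🤖 AI enhancement: complex pages (images, tables) are converted to Markdown with the glmocr SDK
⚡ One-line call: glmocr.parse("document.pdf") runs OCR, with no manual screenshotting or encoding
📊 Structured output: returns Markdown plus a JSON structure (with region labels and bounding boxes)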

Installation

# Basic install
pip install markitdown-glmocr

# Install with AI features (quoted so that zsh does not expand the brackets)
pip install "markitdown-glmocr[glmocr]"

Configuration

Environment variables (recommended)

# Required: Zhipu API key
export ZHIPU_API_KEY="your-zhipu-api-key"

# Optional
export GLMOCR_MODEL="glm-ocr"          # Model name
export GLMOCR_TIMEOUT="600"             # Request timeout (seconds)
export GLMOCR_ENABLE_LAYOUT="true"      # Enable layout detection
export GLMOCR_LOG_LEVEL="INFO"          # Log level

Configuration precedence

Constructor arguments > environment variables > .env file > config.yaml > built-in defaults
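The precedence chain above can be pictured as a layered lookup in which the first source that defines a key wins. The following is a minimal sketch, not the plugin's actual loader: the `resolve` helper and the dictionaries standing in for each source are hypothetical, and the values are made up for illustration.

```python
from collections import ChainMap


def resolve(constructor_args, env_vars, dotenv_file, config_yaml, defaults):
    """Merge config layers so that earlier layers win (hypothetical helper)."""
    # Drop unset (None) values so a missing key falls through to the next layer.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (constructor_args, env_vars, dotenv_file, config_yaml, defaults)
    ]
    # ChainMap looks keys up in the first mapping that contains them.
    return dict(ChainMap(*layers))


# Illustrative stand-ins for each configuration source.
settings = resolve(
    constructor_args={"timeout": 120, "api_key": None},
    env_vars={"api_key": "sk-from-env", "timeout": 600},
    dotenv_file={"api_key": "sk-from-dotenv", "model": "glm-ocr"},
    config_yaml={"enable_layout": True},
    defaults={"timeout": 1800, "enable_layout": False, "force_ai": False},
)

print(settings["timeout"])  # the constructor argument wins
print(settings["api_key"])  # falls through to the environment variable
```

Filtering out `None` before merging is a design choice in this sketch: it lets an omitted constructor argument defer to lower-priority sources instead of overriding them.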

Local secrets

# Create a .env file (read automatically)
echo "ZHIPU_API_KEY=your-api-key" > .env

Usage

Command line (recommended)

# 1. Set the API key
export ZHIPU_API_KEY="sk-xxx"

# 2. List installed plugins
markitdown --list-plugins

# 3. Convert a PDF with plugins enabled
markitdown -p document.pdf

# 4. Save to a file
markitdown -p document.pdf -o output.md

Python API

from markitdown import MarkItDown
from markitdown_glmocr import GlmOcrConverter

# Option 1: read ZHIPU_API_KEY from the environment automatically
converter = GlmOcrConverter()
md = MarkItDown(enable_plugins=False)
md.register_converter(converter, priority=-1.0)
result = md.convert("document.pdf")
print(result.markdown)

# Option 2: pass the API key explicitly
converter = GlmOcrConverter(api_key="sk-xxx")
md = MarkItDown(enable_plugins=False)
md.register_converter(converter, priority=-1.0)
result = md.convert("document.pdf")
print(result.markdown)

# Option 3: use the glmocr SDK directly (simpler)
import glmocr
result = glmocr.parse("document.pdf")
print(result.markdown_result)  # Markdown output
print(result.json_result)      # Structured JSON (region labels, bounding boxes)

Working with results

import glmocr

result = glmocr.parse("report.pdf")

# Get the Markdown
print(result.markdown_result)

# Get the structured data (grouped by page)
for page_idx, page_regions in enumerate(result.json_result):
    print(f"Page {page_idx + 1}: {len(page_regions)} regions")
    for region in page_regions:
        print(f"  [{region['label']}] {region['content'][:60]}")

# Filter by label (first page only)
tables = [r for r in result.json_result[0] if r["label"] == "table"]
formulas = [r for r in result.json_result[0] if r["label"] == "formula"]

# Save to disk
result.save(output_dir="./output")
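The label filters above only look at the first page. To summarize a whole document, the regions can be counted per label across every page. This sketch assumes, as the example above suggests, that `json_result` is a list of pages, each a list of dicts with `label` and `content` keys. `count_labels` is a hypothetical helper, and `sample_json_result` is a hand-written stand-in, not real SDK output.

```python
from collections import Counter


def count_labels(json_result):
    """Count regions per label across all pages (hypothetical helper)."""
    return Counter(region["label"] for page in json_result for region in page)


# Hand-written stand-in for result.json_result; not real SDK output.
sample_json_result = [
    [
        {"label": "title", "content": "Annual Report"},
        {"label": "table", "content": "| Q1 | Q2 |"},
    ],
    [
        {"label": "text", "content": "Revenue grew."},
        {"label": "table", "content": "| Q3 | Q4 |"},
    ],
]

label_counts = count_labels(sample_json_result)
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count}")
```

A `Counter` returns 0 for labels that never occur, so a lookup such as `label_counts["formula"]` is safe even when the document has no formulas.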

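Configuration options

GlmOcrConverter parameters

Parameter	Type	Default	Description
`api_key`	str	`ZHIPU_API_KEY` environment variable	Zhipu API key
`timeout`	int	1800	Request timeout (seconds)
`enable_layout`	bool	False	Enable layout detection
`force_ai`	bool	False	Force AI processing for every page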

Environment variables

Variable	Description	Example
`ZHIPU_API_KEY`	API key (required)	`sk-abc123`
`GLMOCR_MODEL`	Model name	`glm-ocr`
`GLMOCR_TIMEOUT`	Request timeout (seconds)	`600`
`GLMOCR_ENABLE_LAYOUT`	Layout detection	`true`
`GLMOCR_LOG_LEVEL`	Log level	`INFO`
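Environment variables are always strings, so values such as `600` and `true` have to be converted before they can serve as the `int` and `bool` parameters listed above. The sketch below shows one way to do that. The `env_int` and `env_bool` helpers are hypothetical, and the set of accepted truthy spellings is an assumption; the plugin may parse these values differently.

```python
import os


def env_bool(name, default=False, environ=os.environ):
    """Read a boolean flag from a mapping of environment variables."""
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumed truthy spellings; anything else counts as False.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name, default, environ=os.environ):
    """Read an integer from a mapping of environment variables."""
    raw = environ.get(name)
    return int(raw) if raw not in (None, "") else default


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"GLMOCR_TIMEOUT": "600", "GLMOCR_ENABLE_LAYOUT": "true"}

timeout = env_int("GLMOCR_TIMEOUT", 1800, fake_env)
enable_layout = env_bool("GLMOCR_ENABLE_LAYOUT", False, fake_env)
print(timeout, enable_layout)
```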

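How it works

PDF input
    │
    ▼
Analyze each page's content type
    │
    ├─ Plain-text page ──► extract text with pdfplumber
    │
    └─ Complex page (images/tables)
          │
          └─► glmocr.parse() one-line call
                │
                ├─ Built-in page rendering
                ├─ Built-in base64 encoding
                └─ Built-in OCR
    │
    ▼
Merge into the complete Markdown output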
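The flow above amounts to a per-page router followed by a merge step. The sketch below illustrates that control flow only. `convert_pages` and the three callables are hypothetical stand-ins for the page classifier, the pdfplumber extraction, and the `glmocr.parse()` call, and the page dictionaries are invented for the example. The `force_ai` flag mirrors the converter parameter of the same name.

```python
def convert_pages(pages, is_complex, extract_text, run_ocr, force_ai=False):
    """Route each page to plain extraction or OCR, then merge the results."""
    parts = []
    for page in pages:
        if force_ai or is_complex(page):
            parts.append(run_ocr(page))       # complex page: OCR path
        else:
            parts.append(extract_text(page))  # plain-text page: cheap path
    # Join non-empty page outputs into a single Markdown document.
    return "\n\n".join(part.strip() for part in parts if part.strip())


# Invented page records; a real implementation would inspect PDF pages.
pages = [
    {"text": "Intro paragraph.", "images": 0, "tables": 0},
    {"text": "", "images": 1, "tables": 0},
]

markdown = convert_pages(
    pages,
    is_complex=lambda p: p["images"] > 0 or p["tables"] > 0,
    extract_text=lambda p: p["text"],
    run_ocr=lambda p: "<OCR markdown>",  # placeholder for the OCR result
)
print(markdown)
```

Passing the classifier and extractors as callables keeps the routing logic testable without a PDF file or an API key.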

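Region labels (json_result)

The structured data returned by the glmocr SDK supports the following labels:

Label	Description
`title`	Title
`text`	Body text
`table`	Table
`figure`	Figure
`formula`	Formula
`header`	Page header
`footer`	Page footer
`page_number`	Page number
`reference`	Reference
`seal`	Seal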
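One practical use of these labels is to drop page furniture (headers, footers, page numbers) before further processing. This is a small sketch under the assumption that each region is a dict with a `label` key, as in the results example. `body_regions` is a hypothetical helper and `sample_page` is hand-written, not SDK output.

```python
# Labels treated as page furniture in this sketch.
NON_BODY_LABELS = frozenset({"header", "footer", "page_number"})


def body_regions(page_regions, skip=NON_BODY_LABELS):
    """Return the regions of one page whose label is not in `skip`."""
    return [region for region in page_regions if region["label"] not in skip]


# Hand-written stand-in for one page of json_result.
sample_page = [
    {"label": "header", "content": "ACME Corp"},
    {"label": "title", "content": "Quarterly Summary"},
    {"label": "text", "content": "Sales rose in every region."},
    {"label": "page_number", "content": "3"},
]

body = body_regions(sample_page)
print([region["label"] for region in body])
```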

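Architecture

glmocr: Zhipu OCR SDK; parses PDFs and images in one line of code
pdfplumber: PDF page analysis and plain-text extraction
pdfminer: fallback for plain-text page extraction

Dependencies

markitdown>=0.1.0 - base framework
pdfplumber>=0.11.9 - PDF parsing and page rendering
pdfminer.six>=20251230 - fallback text extraction
Pillow>=9.0.0 - image processing
glmocr - Zhipu OCR SDK (optional; required for AI features)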
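Because glmocr is optional, calling code may want to check for it before enabling the AI path. The sketch below uses the standard library's `importlib.util.find_spec` for that check. `has_module` is a hypothetical helper, and whether the plugin itself detects the SDK this way is not stated here.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


ai_available = has_module("glmocr")
if ai_available:
    print("glmocr found: AI features can be enabled")
else:
    print('glmocr not found; install it with: pip install "markitdown-glmocr[glmocr]"')
```

`find_spec` locates the module without importing it, so the check stays cheap and has no import side effects.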

Publishing to PyPI

Prerequisites

Install the build tools:

pip install build twine hatch

Configure the PyPI API token (as a Windows user environment variable):

# PowerShell: set a user environment variable
[System.Environment]::SetEnvironmentVariable('PYPI_API_TOKEN', 'pypi-...', 'User')

Or in Bash/Zsh:

export PYPI_API_TOKEN="pypi-..."

Quick publish (recommended)

The project root provides upload scripts that publish both plugins in one step:

Bash / Git Bash:

# Build both plugins
cd packages/markitdown-glmocr && hatch build

cd ../markitdown-paddleocr && hatch build

# Upload (automatically uploads every built version)
cd ../..
./scripts/pypi-upload.sh

# Or specify a version number
./scripts/pypi-upload.sh 0.2.0

PowerShell:

# Build both plugins
cd packages/markitdown-glmocr; hatch build
cd ../markitdown-paddleocr; hatch build

# Upload
cd ../..
.\scripts\pypi-upload.ps1

# Or specify a version number
.\scripts\pypi-upload.ps1 -Version "0.2.0"

Manual publish

# 1. Enter the package directory
cd packages/markitdown-glmocr

# 2. Build
hatch build

# 3. Check
twine check dist/*

# 4. Upload
twine upload --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar dist/*

Publishing to TestPyPI (testing)

twine upload --repository testpypi --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar dist/*

# Install from TestPyPI to verify
pip install --index-url https://test.pypi.org/simple/ markitdown-glmocr

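Notes

Before publishing, make sure the version number in src/markitdown_glmocr/__about__.py has been updated
The same version number cannot be uploaded twice; to fix a release you must bump the version
Never commit PYPI_API_TOKEN to the repository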
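The first two notes can be turned into a pre-release check. This is a hedged sketch: it assumes `__about__.py` contains a line of the form `__version__ = "..."` (a common Hatch convention, not confirmed here), and `read_version` and `is_new_release` are hypothetical helpers. The file content and the version `0.2.4` are examples only.

```python
import re


def read_version(about_source):
    """Extract the version string from the text of an __about__.py file."""
    match = re.search(
        r"""^__version__\s*=\s*["']([^"']+)["']""", about_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def is_new_release(version, published):
    """Return True if this version has not been uploaded before."""
    return version not in published


# Example file content; not the real __about__.py.
about_source = '__version__ = "0.2.4"\n'
# Versions already published, supplied by the caller.
published = {"0.1.0", "0.2.0", "0.2.2", "0.2.3"}

version = read_version(about_source)
print(version, is_new_release(version, published))
```

Running a check like this before `twine upload` catches a forgotten version bump locally instead of as a rejected upload.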

License

MIT

Release history

Version	Release date
0.2.3 (this version)	Jun 2, 2026
0.2.2	Jun 2, 2026
0.2.0	May 21, 2026
0.1.0	May 13, 2026

Download files

Source Distribution

markitdown_glmocr-0.2.3.tar.gz (11.7 kB view details)

Uploaded Jun 2, 2026 Source

Built Distribution

markitdown_glmocr-0.2.3-py3-none-any.whl (11.6 kB view details)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file markitdown_glmocr-0.2.3.tar.gz.

File metadata

Download URL: markitdown_glmocr-0.2.3.tar.gz
Upload date: Jun 2, 2026
Size: 11.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for markitdown_glmocr-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`00d7f26dd7c2d96f25f24a42f0174b0e61081689a3918c7f2fc0386ab51a3bdc`
MD5	`06c4ca7fb6187c97a276ec8c3f81ed55`
BLAKE2b-256	`7e9620b5915fa2b08fd01964fe5c40a5b1c8a2bdf4fc392b46f3cae8e743fdf9`

File details

Details for the file markitdown_glmocr-0.2.3-py3-none-any.whl.

File metadata

Download URL: markitdown_glmocr-0.2.3-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 11.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for markitdown_glmocr-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b263c2021bf29f91e43478628f1172782fb1f504725405274e416327c0f7082f`
MD5	`0b23860ae3acf7efd2acdf6efb1eeae5`
BLAKE2b-256	`2490c63fa2a09937c9a45cafd70b450aea1f7511c7d4e6905f61eefe7cc63609`
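The digests listed above can be used to check a downloaded file before installing it. The sketch below computes a SHA256 digest with the standard library's `hashlib` and compares it with an expected value. `sha256_of` and `matches` are hypothetical helpers, and the demo runs on a temporary file rather than on the real distribution files.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file so the example runs without any download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_digest = sha256_of(demo_path)
demo_ok = matches(demo_path, demo_digest.upper())
os.remove(demo_path)
print(demo_digest, demo_ok)
```

To check a real download, pass the path of the wheel or sdist and the SHA256 value from the corresponding table to `matches`.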
