A compact Python package for PDF OCR, built on the Mistral AI OCR API

PDF2TXT

A compact Python package for PDF OCR that uses the Mistral AI OCR API to convert PDF files to text.

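Features

🚀 Easy to use: a concise API that takes only a few lines of code
💾 Smart caching: a file-hash-based cache avoids reprocessing the same document
🔄 Automatic retries: built-in retry logic improves the processing success rate
📄 Multiple inputs: accepts file paths, byte streams, and file objects
🎯 Flexible configuration: table format, header and footer extraction, and other options are configurable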

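Installation

Install from PyPI (recommended)

pip install mistral-pdf2txt

⚠️ Note: the distribution is published on PyPI as mistral-pdf2txt, while the import name is pdf2txt. If installation from PyPI fails, use one of the methods below.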

Install from the Git repository

pip install git+https://github.com/yourusername/pdf2txt.git

Install from a local path

# Option 1: install from a local directory
pip install /path/to/pdf2txt/

# Option 2: install a distribution file (if you already have a .whl or .tar.gz)
pip install mistral_pdf2txt-1.0.0-py3-none-any.whl

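Install from source (development mode)

# Clone or download the source
git clone https://github.com/yourusername/pdf2txt.git
cd pdf2txt

# Install in editable mode so that source changes take effect immediately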
pip install -e .

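Build a distribution for other machines

# Build on the current machine
cd pdf2txt
./scripts/build_for_distribution.sh

# Copy the dist/ directory to the other machine,
# then install the wheel there
pip install dist/mistral_pdf2txt-*-py3-none-any.whl

See INSTALL_OPTIONS.md for detailed installation instructions.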

Quick start

Basic usage

from pdf2txt import PDFOCRProcessor

# Initialize the processor (with no api_key argument, the key is read from MISTRAL_API_KEY)
processor = PDFOCRProcessor()

# Process from a file path
text = processor.process_from_path("document.pdf")
print(text)

# Process from a byte stream
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
text = processor.process_from_bytes(pdf_bytes, filename="document.pdf")

Using the cache

from pdf2txt import PDFOCRProcessor, PDFCache

processor = PDFOCRProcessor()
cache = PDFCache(cache_dir="my_cache")

# Read the PDF
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Check the cache (get() returns None on a miss)
cached_text = cache.get(pdf_bytes, filename="document.pdf")
if cached_text is not None:
    print("Using cached result")
    text = cached_text
else:
    # Process the PDF
    text = processor.process_from_bytes(pdf_bytes, filename="document.pdf")
    # Save the result to the cache
    cache.set(pdf_bytes, text, filename="document.pdf")
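
The check-then-process pattern above can be folded into one helper. The following is a minimal sketch, not part of pdf2txt: `get_or_process` is a hypothetical name, and it relies only on the `get`, `set`, and `process_from_bytes` signatures documented in this README, so any objects with those methods will work.

```python
from typing import Any, Optional


def get_or_process(processor: Any, cache: Any, pdf_bytes: bytes,
                   filename: Optional[str] = None, **options: Any) -> str:
    """Hypothetical helper: return the cached text, or run OCR and cache it."""
    cached = cache.get(pdf_bytes, filename=filename)
    if cached is not None:  # an empty string is still a valid cached result
        return cached
    text = processor.process_from_bytes(pdf_bytes, filename=filename, **options)
    cache.set(pdf_bytes, text, filename=filename)
    return text
```

With the real classes this would be called as `get_or_process(PDFOCRProcessor(), PDFCache(), pdf_bytes, filename="document.pdf")`. One caveat: if the cache key depends only on the file content, as the feature list suggests, a result cached with one set of options would also be returned for a later call with different options.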

Configuration options

processor = PDFOCRProcessor()

# Custom processing options
text = processor.process_from_path(
    "document.pdf",
    model="mistral-ocr-latest",           # OCR model
    table_format="markdown",              # table format: "html" or "markdown"
    extract_header=True,                  # extract page headers
    extract_footer=True,                  # extract page footers
    include_image_base64=False,           # include images as base64
    include_page_separator=True,          # insert separators between pages
    save_result=True,                     # save the result to a file
    output_path="result.md"               # output file path
)
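
When the same options are reused across many calls, it can help to keep them in one place. The sketch below is an assumption-laden illustration: `OCROptions` is a hypothetical class, not something pdf2txt provides. Its field names and defaults mirror the `process()` parameters listed in the API section of this README.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OCROptions:
    """Hypothetical options holder; defaults copied from the documented process() parameters."""
    model: str = "mistral-ocr-latest"
    table_format: str = "html"
    extract_header: bool = True
    extract_footer: bool = True
    include_image_base64: bool = True
    include_page_separator: bool = True

    def __post_init__(self) -> None:
        # The README documents exactly two table formats.
        if self.table_format not in ("html", "markdown"):
            raise ValueError(f"unsupported table_format: {self.table_format!r}")

    def as_kwargs(self) -> dict:
        """Return the options as keyword arguments for a process*() call."""
        return asdict(self)


text_only = OCROptions(table_format="markdown", include_image_base64=False)
print(text_only.as_kwargs())
```

The resulting dictionary could then be unpacked into a call such as `processor.process_from_path("document.pdf", **text_only.as_kwargs())`.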

Error handling

from pdf2txt import PDFOCRProcessor

try:
    processor = PDFOCRProcessor()
    text = processor.process_from_path("document.pdf")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Processing failed: {e}")
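
A missing API key is one configuration problem that can be caught before any PDF is read. The sketch below shows one way to do that. `resolve_api_key` is a hypothetical helper, not part of pdf2txt, and the README does not specify how the library itself reports a missing key.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: prefer an explicit key, else fall back to MISTRAL_API_KEY."""
    key = (explicit or env.get("MISTRAL_API_KEY", "")).strip()
    if not key:
        raise ValueError("no API key: pass api_key=... or set MISTRAL_API_KEY")
    return key
```

The returned value can be passed on as `PDFOCRProcessor(api_key=resolve_api_key())`, which keeps the failure message under the caller's control.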

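Command-line tool (CLI)

After installation, the mistral_pdf_to_txt command processes PDF files directly from the shell.

Basic usage

# Process a single PDF file (writes to standard output)
mistral_pdf_to_txt --pdf_path document.pdf

# Write to a specific output file
mistral_pdf_to_txt --pdf_path document.pdf --output_path result.txt

# Use the cache (avoids reprocessing the same file)
mistral_pdf_to_txt --pdf_path document.pdf --output_path result.txt --use_cache

Batch processing

# Process multiple PDF files
mistral_pdf_to_txt --pdf_path "*.pdf" --output_dir results/

# Or target a specific directory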
mistral_pdf_to_txt --pdf_path "/path/to/pdfs/*.pdf" --output_dir results/
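
The same kind of batch run can be planned from Python. The sketch below expands a wildcard and maps each PDF to an output file. The naming rule (input stem plus `.txt`) and the `plan_outputs` helper are assumptions for illustration; the README does not document how the CLI names files inside `--output_dir`.

```python
import glob
from pathlib import Path
from typing import Iterable, List, Tuple


def plan_outputs(pdf_paths: Iterable[str], output_dir: str,
                 suffix: str = ".txt") -> List[Tuple[Path, Path]]:
    """Hypothetical helper: pair each input PDF with an output path in output_dir."""
    out_dir = Path(output_dir)
    plan = []
    for raw in sorted(pdf_paths):
        src = Path(raw)
        # Assumed naming rule: keep the stem, swap the extension.
        plan.append((src, out_dir / (src.stem + suffix)))
    return plan


if __name__ == "__main__":
    # glob.glob expands the pattern the way a quoted "*.pdf" argument would need to be expanded.
    for src, dst in plan_outputs(glob.glob("*.pdf"), "results"):
        print(f"{src} -> {dst}")
```

Under this naming rule, two inputs with the same stem in different directories would map to the same output file, so a real batch job would need to detect or disambiguate such collisions.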

Advanced options

# Custom table format, without page separators
mistral_pdf_to_txt --pdf_path document.pdf --table_format markdown --no-page-separator

# Skip header and footer extraction
mistral_pdf_to_txt --pdf_path document.pdf --no-header --no-footer

# Pass the API key explicitly (if not using the environment variable)
mistral_pdf_to_txt --pdf_path document.pdf --api-key your_api_key

# Verbose mode (prints more information)
mistral_pdf_to_txt --pdf_path document.pdf --verbose

# Quiet mode (reduces output)
mistral_pdf_to_txt --pdf_path document.pdf --quiet

Full parameter list

mistral_pdf_to_txt --help

Main parameters:

--pdf_path: path to the PDF file (required; wildcards supported)
--output_path: output file path (single file)
--output_dir: output directory (batch processing)
--use_cache: enable the cache
--cache_dir: cache directory (default: cache/pdf_ocr)
--model: OCR model (default: mistral-ocr-latest)
--table_format: table format (html/markdown)
--no-header: do not extract page headers
--no-footer: do not extract page footers
--no-page-separator: do not insert page separators
--api-key: Mistral API key
--verbose: verbose mode
--quiet: quiet mode
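
When the CLI is driven from a script, building the argument list in code avoids quoting mistakes. The sketch below uses only the flags listed above. `build_cli_args` is a hypothetical helper, and it deliberately leaves out `--api-key`, since command-line arguments are often visible to other users in process listings.

```python
import shlex
from typing import List, Optional


def build_cli_args(
    pdf_path: str,
    output_path: Optional[str] = None,
    output_dir: Optional[str] = None,
    use_cache: bool = False,
    table_format: Optional[str] = None,
    header: bool = True,
    footer: bool = True,
    page_separator: bool = True,
    verbose: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble an argv list from the documented flags."""
    args = ["mistral_pdf_to_txt", "--pdf_path", pdf_path]
    if output_path is not None:
        args += ["--output_path", output_path]
    if output_dir is not None:
        args += ["--output_dir", output_dir]
    if use_cache:
        args.append("--use_cache")
    if table_format is not None:
        args += ["--table_format", table_format]
    if not header:
        args.append("--no-header")
    if not footer:
        args.append("--no-footer")
    if not page_separator:
        args.append("--no-page-separator")
    if verbose:
        args.append("--verbose")
    return args


cmd = build_cli_args("document.pdf", output_path="result.txt",
                     use_cache=True, header=False)
print(shlex.join(cmd))  # shell-safe rendering of the command
```

The list can be passed directly to `subprocess.run(cmd, check=True)`, which bypasses the shell, so a wildcard such as `*.pdf` reaches the tool unexpanded.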

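API reference

PDFOCRProcessor

The main PDF OCR processor class.

Initialization

PDFOCRProcessor(api_key: Optional[str] = None)

api_key: the Mistral API key; if None, it is read from the MISTRAL_API_KEY environment variable

Methods

`process(pdf_bytes, filename=None, **kwargs) -> str`

Processes a PDF byte stream and returns the extracted text.

Parameters:

pdf_bytes (bytes): the PDF file contents as bytes
filename (str, optional): file name (used for debugging)
model (str, optional): OCR model name, default "mistral-ocr-latest"
table_format (str): table format, "html" or "markdown", default "html"
extract_header (bool): whether to extract page headers, default True
extract_footer (bool): whether to extract page footers, default True
include_image_base64 (bool): whether to include base64-encoded images, default True
include_page_separator (bool): whether to insert a separator between pages, default True

Returns: the extracted text (Markdown format)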

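`process_from_path(pdf_path, save_result=False, output_path=None, **kwargs) -> str`

Processes a PDF from a file path.

Parameters:

pdf_path (str): path to the PDF file
save_result (bool): whether to save the result to a file, default False
output_path (str, optional): output file path; generated automatically if None
**kwargs: additional arguments passed through to process()

Returns: the extracted text

`process_from_bytes(pdf_bytes, filename=None, **kwargs) -> str`

Processes a PDF from a byte stream.

Parameters:

pdf_bytes (bytes): the PDF file contents as bytes
filename (str, optional): file name
**kwargs: additional arguments passed through to process()

Returns: the extracted text

`process_from_file(file_obj, filename=None, **kwargs) -> str`

Processes a PDF from a file object.

Parameters:

file_obj (IO[bytes]): a file object (an already opened file)
filename (str, optional): file name
**kwargs: additional arguments passed through to process()

Returns: the extracted text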
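
Because `process_from_file` expects a binary file object, a quick sanity check before the OCR call can catch files opened in text mode or uploads that are not PDFs. The sketch below is illustrative only: `read_pdf_bytes` is a hypothetical helper, and the `%PDF-` test is a cheap header check, not full validation.

```python
import io
from typing import IO


def read_pdf_bytes(file_obj: IO[bytes]) -> bytes:
    """Hypothetical helper: read a binary file object and check the PDF header."""
    data = file_obj.read()
    if not isinstance(data, bytes):
        raise TypeError("file must be opened in binary mode ('rb')")
    if not data.startswith(b"%PDF-"):
        raise ValueError("input does not start with the %PDF- header")
    return data


# In-memory stand-in for an uploaded file; the content is a placeholder, not a complete PDF.
sample = io.BytesIO(b"%PDF-1.7\n% placeholder\n")
pdf_bytes = read_pdf_bytes(sample)
print(len(pdf_bytes))
```

The validated bytes can then go to `process_from_bytes`, or the stream can be rewound with `sample.seek(0)` and handed to `process_from_file`.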

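PDFCache

Cache manager for PDF OCR results.

Initialization

PDFCache(cache_dir: str = "cache/pdf_ocr")

Methods

`get(pdf_bytes, filename=None) -> Optional[str]`

Retrieves a PDF OCR result from the cache.

Parameters:

pdf_bytes (bytes): the PDF file contents as bytes
filename (str, optional): file name

Returns: the cached text, or None if no entry exists

`set(pdf_bytes, text, filename=None) -> str`

Saves a PDF OCR result to the cache.

Parameters:

pdf_bytes (bytes): the PDF file contents as bytes
text (str): the text extracted by OCR
filename (str, optional): file name

Returns: the file hash

`clear(older_than_days=None)`

Clears the cache.

Parameters:

older_than_days (int, optional): if given, only entries older than this many days are removed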
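
To make the `older_than_days` semantics concrete, here is a minimal sketch of age-based cleanup over a directory of files. It is not PDFCache's implementation: the on-disk layout of the cache is not documented here, so `clear_cache_dir` is a hypothetical function that assumes one file per entry and uses the file modification time as the entry's age.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 86400


def clear_cache_dir(cache_dir: str, older_than_days: Optional[int] = None) -> int:
    """Hypothetical sketch: delete files in cache_dir and return how many were removed."""
    root = Path(cache_dir)
    if not root.is_dir():
        return 0
    # With no age limit, every file qualifies for removal.
    cutoff = None if older_than_days is None else time.time() - older_than_days * SECONDS_PER_DAY
    removed = 0
    for entry in root.iterdir():
        if entry.is_file() and (cutoff is None or entry.stat().st_mtime < cutoff):
            entry.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "old.txt"), Path(tmp, "new.txt")
    old.write_text("stale")
    new.write_text("fresh")
    ten_days_ago = time.time() - 10 * SECONDS_PER_DAY
    os.utime(old, (ten_days_ago, ten_days_ago))  # backdate old.txt
    print(clear_cache_dir(tmp, older_than_days=7))  # removes only old.txt
    print(sorted(p.name for p in Path(tmp).iterdir()))
```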

`get_stats() -> dict`

Returns cache statistics.

Returns: a dictionary of cache statistics

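Utilities

`FileHash`

Utility class for computing file hashes.

calculate_file_hash(file_path, algorithm='sha256') -> Optional[str]: computes the hash of a file
calculate_bytes_hash(data, algorithm='sha256') -> str: computes the hash of a bytes object
get_file_info(file_path) -> dict: returns detailed information about a file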
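
The two hashing signatures above map naturally onto the standard library. The sketch below shows how such functions could be written with `hashlib`. It is an illustration, not the package's source: the function names are stand-ins, and returning None for a missing file is an assumption based on the `Optional[str]` return type.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def bytes_hash(data: bytes, algorithm: str = "sha256") -> str:
    """Hex digest of an in-memory bytes object."""
    return hashlib.new(algorithm, data).hexdigest()


def file_hash(file_path: str, algorithm: str = "sha256",
              chunk_size: int = 1 << 20) -> Optional[str]:
    """Hex digest of a file, read in chunks; None if the file is missing (assumed behavior)."""
    path = Path(file_path)
    if not path.is_file():
        return None
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "sample.pdf")
    sample.write_bytes(b"%PDF-1.7 placeholder")
    # Hashing the file and hashing its bytes give the same digest.
    assert file_hash(str(sample)) == bytes_hash(b"%PDF-1.7 placeholder")
```

Reading in chunks keeps memory use flat for large PDFs. The same `file_hash` approach can be used to compare a downloaded archive against a published SHA256 digest.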

`retry`

Retry decorator.

@retry(max_attempts=3, delay=2.0, backoff=2.0, exceptions=(Exception,))
def my_function():
    ...  # function body
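
For readers unfamiliar with the pattern, the sketch below shows how a decorator with these four parameters typically behaves: it retries on the listed exceptions, waits `delay` seconds before the first retry, and multiplies the wait by `backoff` after each failure. `retry_sketch` is a stand-in written for illustration; the package's own `retry` may differ in details.

```python
import functools
import time
from typing import Callable, Tuple, Type


def retry_sketch(max_attempts: int = 3, delay: float = 2.0, backoff: float = 2.0,
                 exceptions: Tuple[Type[BaseException], ...] = (Exception,)) -> Callable:
    """Illustrative retry decorator with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: propagate the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


calls = {"n": 0}


@retry_sketch(max_attempts=3, delay=0.0, exceptions=(ConnectionError,))
def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(flaky(), calls["n"])
```

Restricting `exceptions` to transient errors, as in the example, avoids retrying failures that will never succeed, such as an invalid API key.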

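`ErrorHandler`

Error handler class.

handle_ocr_error(error, filename="") -> str: handles an OCR error

Environment variables

MISTRAL_API_KEY: the Mistral API key (required unless api_key is passed to PDFOCRProcessor or --api-key is given on the command line)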

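Release information

This package is published on PyPI and can be installed with:

pip install mistral-pdf2txt

Development version

To use the latest development version: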

pip install git+https://github.com/yourusername/pdf2txt.git

Changelog

See CHANGELOG.md for the version history.

Examples

See the examples/ directory for complete examples.

License

MIT License

Contributing

Issues and pull requests are welcome!

Release history

1.0.3 (Jan 27, 2026)
1.0.2 (Jan 27, 2026)
1.0.1 (Jan 27, 2026)
1.0.0 (Jan 27, 2026, this version)

Download files

Source Distribution

mistral_pdf2txt-1.0.0.tar.gz (18.4 kB)

Uploaded Jan 27, 2026 Source

Built Distribution

mistral_pdf2txt-1.0.0-py3-none-any.whl (20.4 kB)

Uploaded Jan 27, 2026 Python 3

File details

Details for the file mistral_pdf2txt-1.0.0.tar.gz.

File metadata

Download URL: mistral_pdf2txt-1.0.0.tar.gz
Upload date: Jan 27, 2026
Size: 18.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for mistral_pdf2txt-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`9824b0f3326be51383725c2572c03296f027b77addf3fef7006c19d11ac98497`
MD5	`60bbfba88b465992b69689caa208ec83`
BLAKE2b-256	`7224e4a0b51f5f0847a61b62b15a6b467031d4a96cee79ba28ebfa0a7d6e6ef7`

File details

Details for the file mistral_pdf2txt-1.0.0-py3-none-any.whl.

File metadata

Download URL: mistral_pdf2txt-1.0.0-py3-none-any.whl
Upload date: Jan 27, 2026
Size: 20.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for mistral_pdf2txt-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d613704c0a91f3bc6e8630c0b38f6d07c31a29c961d7e9c6cf0692b2ef8f23ad`
MD5	`9083da5440dfc7ed365ea94069f80561`
BLAKE2b-256	`8db543bbfca42f649f1c606fbffa5f4c79dcb271b5af54d8eb2dd3aa92d92c39`

mistral-pdf2txt 1.0.0
