Python wrapper for antiword with bundled binary and data files
doc2txt
A Python package for extracting text from Microsoft Word documents. It is built on the antiword tool and bundles cross-platform binaries and data files.
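Features
- Extracts plain text from Microsoft Word documents in .doc format
- Cross-platform support (Windows, Linux, macOS ARM64)
- Bundles the antiword binary, so no separate installation is needed
- Optional text format optimization that handles line breaks and tables automatically
- Simple Python API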
Supported platforms
- Windows (AMD64)
- Linux (AMD64)
- macOS (ARM64/Apple Silicon)
Note: macOS Intel (x86_64) is not currently supported.
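The platform list above can be checked before calling the package. The following is a minimal sketch, not the package's own detection logic: the function name `platform_key` and the alias table are assumptions made for illustration, and the returned strings reuse the bin/ directory names shown in the project structure.

```python
import platform

# (system, architecture) pairs this README lists as supported, mapped to the
# bin/ directory names shown in the project structure.
SUPPORTED = {
    ("windows", "amd64"): "win-amd64",
    ("linux", "amd64"): "linux-amd64",
    ("darwin", "arm64"): "darwin-arm64",
}

# platform.machine() reports the same architecture under different names
# depending on the operating system (assumed alias table, not exhaustive).
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def platform_key(system=None, machine=None):
    """Return the bin/ directory name for a platform, or None if unsupported."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = ARCH_ALIASES.get(machine, machine)
    return SUPPORTED.get((system, arch))


if __name__ == "__main__":
    key = platform_key()
    print(key if key else "This platform is not in the supported list")
```

Called with no arguments, the function inspects the running interpreter. Passing explicit values, for example `platform_key("Darwin", "x86_64")`, returns None because macOS Intel is not in the supported list.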
Installation
pip install doc2txt
Quick start
Basic usage
from doc2txt import extract_text
# Extract text from a Word document
text = extract_text('document.doc')
print(text)
Enabling text format optimization
from doc2txt import extract_text
# Extract text and optimize its format (merge broken lines, handle tables)
text = extract_text('document.doc', optimize_format=True)
print(text)
Using the text optimizer directly
from doc2txt import extract_text, optimize_text
# Extract the raw text first
raw_text = extract_text('document.doc')
# Then optimize the text format manually
optimized_text = optimize_text(raw_text)
print(optimized_text)
API reference
extract_text(doc_path, optimize_format=False)
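Extracts text from a Microsoft Word document.
Parameters:
- doc_path (str): Path to the .doc file
- optimize_format (bool): Whether to optimize the text format. Defaults to False
Returns:
- str: The text extracted from the document
Raises:
- FileNotFoundError: The file does not exist
- ValueError: The file format is not supported (only .doc is supported)
- RuntimeError: The platform is not supported, the binary is missing, or the document could not be parsed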
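The three documented exceptions can be handled separately. The sketch below shows one possible pattern; `safe_extract` is a hypothetical helper, not part of doc2txt. It takes the extraction function as an argument so that it stays self-contained, and it assumes that function accepts the `optimize_format` keyword as documented above.

```python
from typing import Callable, Optional


def safe_extract(
    doc_path: str,
    extractor: Callable[..., str],
    optimize_format: bool = False,
) -> Optional[str]:
    """Call extractor and return its text, or None if a documented error occurs.

    extractor is expected to behave like doc2txt.extract_text.
    """
    try:
        return extractor(doc_path, optimize_format=optimize_format)
    except FileNotFoundError:
        # The path does not point to an existing file.
        print(f"File not found: {doc_path}")
    except ValueError:
        # Only .doc input is supported according to this README.
        print(f"Unsupported file format: {doc_path}")
    except RuntimeError as exc:
        # Unsupported platform, missing binary, or a parse failure.
        print(f"Extraction failed for {doc_path}: {exc}")
    return None
```

With the package installed, a call would look like `safe_extract('document.doc', extract_text)` after `from doc2txt import extract_text`.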
optimize_text(text)
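Optimizes the format of text extracted from a document.
Parameters:
- text (str): The raw text extracted from the document
Returns:
- str: The text after format optimization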
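Text optimization
Text optimization fixes common formatting problems in text extracted from Word documents:
- Language detection: uses fast-langdetect to identify Chinese, Japanese, and Korean (CJK) text automatically
- Line merging: joins consecutive unindented lines so that paragraphs stay intact
- CJK handling: joins CJK lines without inserting a space, and inserts a space for other languages
- Table handling: recognizes table rows (lines containing the | separator) and preserves the table layout
- Paragraph detection: recognizes paragraph starts, list items, headings, and similar structures
- Whitespace handling: strips extra leading spaces to keep the output clean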
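To make the merging rules above concrete, here is a simplified sketch. It is not the package's implementation: it replaces fast-langdetect with a plain Unicode-range check at each line boundary, it skips list and heading detection, and the names `is_cjk` and `merge_lines` are invented for this example.

```python
import re

# Rough CJK test: CJK punctuation, kana, CJK ideographs, Hangul, fullwidth
# forms. This stands in for real language detection.
_CJK = re.compile(
    r"[\u3000-\u303f\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"
    r"\uac00-\ud7af\uff00-\uffef]"
)


def is_cjk(ch: str) -> bool:
    return bool(_CJK.match(ch))


def merge_lines(text: str) -> str:
    """Join unindented continuation lines; keep table rows and paragraphs."""
    out = []
    for raw in text.splitlines():
        line = raw.rstrip()
        starts_new = (
            not out                  # first line of the text
            or not line.strip()      # blank line separates paragraphs
            or not out[-1]           # previous line was blank
            or raw[:1].isspace()     # indentation marks a new paragraph
            or "|" in line           # table row stays on its own line
            or "|" in out[-1]        # never append to a table row
        )
        if starts_new:
            out.append(line.strip())  # also strips extra leading spaces
            continue
        prev = out[-1]
        # No space when either side of the join is a CJK character.
        joiner = "" if is_cjk(prev[-1]) or is_cjk(line[0]) else " "
        out[-1] = prev + joiner + line
    return "\n".join(out)
```

Checking only the characters at the join is cheaper than detecting the language of the whole text, but it can pick the wrong separator in mixed-language documents, which is one reason a real detector is the more robust choice.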
Project structure
doc2txt/
├── __init__.py # Package entry point
├── antiword_wrapper.py # Python wrapper around the antiword tool
├── text_optimizer.py # Text format optimization utilities
├── bin/ # Cross-platform binaries
│ ├── darwin-arm64/
│ ├── linux-amd64/
│ └── win-amd64/
└── antiword_share/ # antiword data files
    ├── fontnames
    └── *.txt # Character encoding mapping files
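Given this layout, a wrapper has to find the binary for the current platform and point antiword at its data files. The sketch below shows one way that could work and does not describe what antiword_wrapper.py actually does. The executable file names, the choice of the UTF-8.txt mapping file, and the function name `build_antiword_command` are assumptions. The `-m` option and the `ANTIWORDHOME` environment variable are standard antiword features.

```python
import os
from pathlib import Path


def build_antiword_command(package_dir, platform_dir, doc_path):
    """Build the argv list and environment for one antiword run.

    package_dir:  directory containing bin/ and antiword_share/
    platform_dir: one of the bin/ subdirectory names, e.g. "linux-amd64"
    doc_path:     path to the .doc file to convert
    """
    root = Path(package_dir)
    # Assumed executable names; the README does not list them.
    exe = "antiword.exe" if platform_dir.startswith("win") else "antiword"
    binary = root / "bin" / platform_dir / exe

    # ANTIWORDHOME tells antiword where its mapping and font files live.
    env = dict(os.environ, ANTIWORDHOME=str(root / "antiword_share"))

    # -m selects a character mapping file; UTF-8.txt is assumed to be among
    # the bundled *.txt files.
    cmd = [str(binary), "-m", "UTF-8.txt", str(doc_path)]
    return cmd, env
```

The result could then be passed to `subprocess.run(cmd, env=env, capture_output=True)`. Building the command separately from running it keeps the path logic easy to test without a real binary.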
Dependencies
- chardet>=5.2.0: character encoding detection
- fast-langdetect>=0.4.6: fast language detection
License
MIT License
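Contributing
Issues and pull requests that improve the project are welcome.
Changelog
1.0.7
- Replaced langdetect with fast-langdetect for an 80x performance improvement
- Improved CJK language detection accuracy
- Strengthened error handling and input validation
- Improved character encoding detection
- Added a complete test suite
1.0.6
- Bumped the version number for the PyPI release
1.0.5
- Improved the text optimization logic
1.0.0
- Initial release
- Text extraction from .doc files
- Bundled cross-platform antiword binaries
- Text format optimization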