Python wrapper for antiword with bundled binary and data files

doc2txt

A Python package for extracting text from Microsoft Word documents. It is built on the antiword tool and bundles cross-platform binaries and data files.

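Features

  • Extracts plain text from Microsoft Word documents in .doc format
  • Cross-platform support (Windows, Linux, macOS ARM64)
  • Bundles the antiword binary, so no separate installation is required
  • Optional text-format optimization that handles line breaks and tables automatically
  • Simple Python API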

Supported platforms

  • Windows (AMD64)
  • Linux (AMD64)
  • macOS (ARM64/Apple Silicon)

Note: macOS Intel (x86_64) is not currently supported.
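
To check ahead of time whether the current machine falls into one of the supported combinations, something like the following sketch may help. It is not part of doc2txt: the function name platform_dir and the mapping table are hypothetical, and the directory names are assumed from the bin/ layout shown under the project structure.

```python
import platform
from typing import Optional

# Assumed mapping from (system, machine) to the bundled bin/ subdirectory.
# The directory names follow the project structure in this README; the
# lookup itself is illustrative, not the package's own code.
_PLATFORM_DIRS = {
    ("windows", "amd64"): "win-amd64",
    ("windows", "x86_64"): "win-amd64",
    ("linux", "x86_64"): "linux-amd64",
    ("linux", "amd64"): "linux-amd64",
    ("darwin", "arm64"): "darwin-arm64",
}


def platform_dir(system: Optional[str] = None,
                 machine: Optional[str] = None) -> Optional[str]:
    """Return the assumed bin/ subdirectory name, or None if unsupported."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    return _PLATFORM_DIRS.get((system, machine))


if __name__ == "__main__":
    key = platform_dir()
    print("supported" if key else "not supported", key)
```

Calling platform_dir() with no arguments inspects the running interpreter; passing explicit values makes it easy to test combinations such as macOS Intel, which returns None here.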

Installation

pip install doc2txt

Quick start

Basic usage

from doc2txt import extract_text

# Extract text from a Word document
text = extract_text('document.doc')
print(text)

Enabling text-format optimization

from doc2txt import extract_text

# Extract text and optimize its format (merge broken lines, handle tables)
text = extract_text('document.doc', optimize_format=True)
print(text)

Using the text optimizer directly

from doc2txt import extract_text, optimize_text

# Extract the raw text first
raw_text = extract_text('document.doc')

# Optimize the text format manually
optimized_text = optimize_text(raw_text)
print(optimized_text)

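API reference

extract_text(doc_path, optimize_format=False)

Extracts text from a Microsoft Word document.

Parameters:

  • doc_path (str): path to the .doc file
  • optimize_format (bool): whether to optimize the text format; defaults to False

Returns:

  • str: the text extracted from the document

Raises:

  • RuntimeError: the platform is not supported or the binary is missing
  • subprocess.CalledProcessError: antiword failed to run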
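
The two exceptions listed above can be handled separately by the caller. The sketch below is one possible pattern, not part of the package: safe_extract is a hypothetical helper, and it takes the extraction function as an argument so it can be exercised without a real .doc file.

```python
import subprocess
from typing import Callable, Optional


def safe_extract(doc_path: str,
                 extractor: Callable[..., str],
                 optimize_format: bool = False) -> Optional[str]:
    """Call extractor and return its text, or None if extraction fails.

    extractor is expected to behave like doc2txt.extract_text as
    documented above (same parameters, same exceptions).
    """
    try:
        return extractor(doc_path, optimize_format=optimize_format)
    except RuntimeError as exc:
        # Documented case: unsupported platform or missing binary.
        print(f"doc2txt cannot run here: {exc}")
    except subprocess.CalledProcessError as exc:
        # Documented case: antiword itself exited with an error.
        print(f"antiword failed on {doc_path} (exit code {exc.returncode})")
    return None
```

With the package installed, usage would look like safe_extract('document.doc', extract_text) after from doc2txt import extract_text.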

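optimize_text(text)

Optimizes the format of text extracted from a document.

Parameters:

  • text (str): the raw text extracted from the document

Returns:

  • str: the text after format optimization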

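Text optimization

Text optimization addresses formatting problems that commonly appear when extracting text from Word documents:

  • Line merging: consecutive lines without indentation are merged automatically, keeping paragraphs intact
  • Table handling: table rows (lines containing the | separator) are detected and their layout is preserved
  • Whitespace handling: extra leading spaces are removed to keep the output clean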
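
The three rules above can be illustrated with a small standalone sketch. This is not the package's optimize_text implementation; optimize_sketch is a hypothetical function, and details such as joining merged lines with a single space and keeping blank lines are assumptions.

```python
def optimize_sketch(text: str) -> str:
    """Illustrative version of the merging rules described above."""
    out = []        # finished output lines
    paragraph = []  # pieces of the paragraph currently being merged

    def flush() -> None:
        # Emit the pending paragraph as one line (assumed: space-joined).
        if paragraph:
            out.append(" ".join(paragraph))
            paragraph.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line ends the current paragraph and is kept.
            flush()
            out.append("")
        elif "|" in stripped:
            # Table row: never merged, leading spaces removed.
            flush()
            out.append(stripped)
        elif line[0].isspace():
            # An indented line starts a new paragraph.
            flush()
            paragraph.append(stripped)
        else:
            # An unindented line continues the current paragraph.
            paragraph.append(stripped)
    flush()
    return "\n".join(out)
```

For example, the two lines "First line" and "continues here" come out as the single line "First line continues here", while rows such as "|a|b|" stay on their own lines.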

Project structure

doc2txt/
├── __init__.py              # Package entry point
├── antiword_wrapper.py      # Python wrapper around the antiword tool
├── text_optimizer.py        # Text-format optimization utilities
├── bin/                     # Cross-platform binaries
│   ├── darwin-arm64/
│   ├── linux-amd64/
│   └── win-amd64/
└── antiword_share/          # antiword data files
    ├── fontnames
    └── *.txt                # Character-encoding mapping files
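
Given this layout, a wrapper could locate the bundled binary and point antiword at the bundled data files roughly as sketched below. This is a guess at the general approach, not the code in antiword_wrapper.py: build_antiword_call is a hypothetical helper, the executable names antiword and antiword.exe are assumed, and the sketch relies on antiword's ANTIWORDHOME environment variable and its -m mapping-file option.

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple


def build_antiword_call(package_dir: Path,
                        platform_key: str,
                        doc_path: str) -> Tuple[List[str], Dict[str, str]]:
    """Build the command line and environment for a bundled antiword.

    platform_key is one of the bin/ subdirectory names shown above,
    for example "linux-amd64".
    """
    # Assumed executable names; the real package may differ.
    exe = "antiword.exe" if platform_key.startswith("win") else "antiword"
    binary = package_dir / "bin" / platform_key / exe

    # Point antiword at the bundled mapping files instead of system paths.
    env = dict(os.environ)
    env["ANTIWORDHOME"] = str(package_dir / "antiword_share")

    # -m selects a character mapping file; UTF-8.txt is assumed to be bundled.
    cmd = [str(binary), "-m", "UTF-8.txt", doc_path]
    return cmd, env
```

The result could then be passed to subprocess.run(cmd, env=env, capture_output=True, check=True), where check=True raises subprocess.CalledProcessError on a non-zero exit code.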

Dependencies

This package has no external dependencies; all required tools and data files are bundled.

License

MIT License

Contributing

Issues and pull requests that improve this project are welcome.

Changelog

1.0.0

  • Initial release
  • Text extraction from .doc files
  • Bundled cross-platform antiword binaries
  • Text-format optimization
