
Turn any SKILL.md into a runnable AI Agent

Project description

PDF Type Detection and Text Extraction Tool

Quick Start

1️⃣ Determine the PDF type

python check_pdf_type.py scan_report.pdf

Sample output:

==================================================
📄 Analyzing: scan_report.pdf
==================================================

  Page 1 | Text:    0 chars | Images: 1 🖼️
  Page 2 | Text:    0 chars | Images: 1 🖼️
  Page 3 | Text:    0 chars | Images: 1 🖼️

──────────────────────────────────────────────────
🔍 Result
──────────────────────────────────────────────────

🖼️ Conclusion: this is a [scanned PDF]
   Reason: almost no text (0 chars); the pages contain 3 scanned images

📌 Recommended extraction approach:
   Tool: OCR (optical character recognition)
   Option 1 (pytesseract): ...
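
The check above can be approximated in a few lines. The following is a minimal sketch, not the actual contents of check_pdf_type.py: the function names, the "unknown" fallback, and the 100-character threshold (borrowed from the quick-reference table further down) are assumptions. It counts extracted characters and embedded images per page with pdfplumber and classifies the file from the totals.

```python
from typing import List, Tuple

# Assumed threshold, taken from the quick-reference table ("text > 100 chars").
MIN_NATIVE_CHARS = 100


def classify_pages(pages: List[Tuple[int, int]]) -> str:
    """Classify a PDF from per-page (text_chars, image_count) pairs."""
    total_chars = sum(chars for chars, _ in pages)
    total_images = sum(images for _, images in pages)
    if total_chars > MIN_NATIVE_CHARS:
        return "native"
    if total_images > 0:
        return "scanned"
    return "unknown"  # little text and no images, e.g. an empty PDF


def analyze_pdf(path: str) -> str:
    """Collect per-page statistics with pdfplumber, then classify."""
    import pdfplumber  # imported lazily so classify_pages works without it

    stats = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # guard against a None result
            stats.append((len(text.strip()), len(page.images)))
    return classify_pages(stats)
```

Calling analyze_pdf("scan_report.pdf") on a file like the one in the sample output (three pages, no text, one image each) would return "scanned" under these assumptions.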

2️⃣ Extract text

Auto mode (detects the PDF type automatically):

python extract_pdf_text.py scan_report.pdf

Force OCR mode:

python extract_pdf_text.py scan_report.pdf --ocr

Specify the language (e.g. English only):

python extract_pdf_text.py scan_report.pdf --lang eng
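
To illustrate how the three modes could fit together, here is a hedged sketch of a command-line front end. It is not the real extract_pdf_text.py: the function names, the default language "chi_sim+eng", and the 100-character fallback threshold are assumptions. Only the flags --ocr and --lang come from the commands above.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF")
    parser.add_argument("pdf", help="path to the PDF file")
    parser.add_argument("--ocr", action="store_true", help="force OCR mode")
    # The default language below is an assumption, not taken from the tool.
    parser.add_argument("--lang", default="chi_sim+eng",
                        help="Tesseract language code(s)")
    return parser.parse_args(argv)


def extract_native(path: str) -> str:
    """Read the embedded text layer with pdfplumber."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_ocr(path: str, lang: str) -> str:
    """Render each page to an image (needs poppler) and OCR it with Tesseract."""
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img, lang=lang) for img in images)


def extract(path: str, force_ocr: bool = False, lang: str = "chi_sim+eng",
            min_chars: int = 100) -> str:
    """Auto mode: try the text layer first and fall back to OCR if it is nearly empty."""
    if not force_ocr:
        text = extract_native(path)
        if len(text.strip()) > min_chars:
            return text
    return extract_ocr(path, lang)
```

A script built on this sketch would call args = parse_args() and then print(extract(args.pdf, args.ocr, args.lang)). Trying the text layer first keeps native PDFs fast, since OCR is only run when little or no text is found.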

3️⃣ Install dependencies

# Basics
pip install pypdf pdfplumber

# OCR (required for scanned PDFs)
pip install pdf2image pytesseract easyocr

# System dependencies
# Ubuntu: sudo apt install tesseract-ocr tesseract-ocr-chi-sim poppler-utils
# macOS:  brew install tesseract poppler
# Windows: download and install Tesseract OCR + poppler
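
Before running the OCR path, it can help to confirm that both the Python packages and the system tools are actually available. The snippet below is an illustrative sketch rather than part of the tool; the function name check_dependencies is hypothetical, and it assumes that pdftoppm (shipped with poppler) is the binary pdf2image relies on.

```python
import importlib.util
import shutil
from typing import Dict

PYTHON_MODULES = ["pypdf", "pdfplumber", "pdf2image", "pytesseract", "easyocr"]
SYSTEM_TOOLS = ["tesseract", "pdftoppm"]  # pdftoppm ships with poppler


def check_dependencies() -> Dict[str, bool]:
    """Return a name -> available mapping without importing the packages."""
    status = {}
    for name in PYTHON_MODULES:
        status[name] = importlib.util.find_spec(name) is not None
    for tool in SYSTEM_TOOLS:
        status[tool] = shutil.which(tool) is not None
    return status


for name, ok in check_dependencies().items():
    print(f"{name:12s} {'OK' if ok else 'missing'}")
```

Anything reported as missing can be installed with the matching pip or system package command above.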

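Quick reference for detection methods

Method        Command / code                       Native PDF                         Scanned PDF
Python check  check_pdf_type.py                    Text > 100 chars                   Text ≈ 0, many images
Command line  pdftotext file.pdf - | head          Prints normal text                 Prints nothing or garbled text
Manual        Open the PDF and try to select text  ✅ Text can be selected and copied  ❌ Text cannot be selected
Manual        Search with Ctrl+F                   ✅ Matches found                    ❌ No matches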
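
The command-line row of the table can also be scripted. This is a rough sketch that assumes pdftotext from poppler-utils is on the PATH; the helper names and the three-page limit are hypothetical. It counts visible characters only, so it does not detect the garbled-output case on its own.

```python
import subprocess


def count_visible_chars(output: str) -> int:
    """Count non-whitespace characters in extracted text."""
    return sum(1 for ch in output if not ch.isspace())


def pdftotext_char_count(path: str, max_pages: int = 3) -> int:
    """Run `pdftotext -l N file.pdf -` and count the characters it prints."""
    result = subprocess.run(
        ["pdftotext", "-l", str(max_pages), path, "-"],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate undecodable bytes in the output
        check=True,
    )
    return count_visible_chars(result.stdout)
```

A count near zero suggests a scanned PDF, while a count well above the 100-character mark from the table suggests a native one.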

Download files

Source distribution: agenthatch-0.8.1.tar.gz (186.9 kB)

Built distribution: agenthatch-0.8.1-py3-none-any.whl (230.3 kB, Python 3)
