Turn any SKILL.md into a runnable AI Agent
PDF Type Detection and Text Extraction Tool
Quick Start
1️⃣ Determine the PDF type
python check_pdf_type.py scan_report.pdf
Example output:
==================================================
📄 Analyzing: scan_report.pdf
==================================================
Page 1 | Text: 0 chars | Images: 1 🖼️
Page 2 | Text: 0 chars | Images: 1 🖼️
Page 3 | Text: 0 chars | Images: 1 🖼️
──────────────────────────────────────────────────
🔍 Result
──────────────────────────────────────────────────
🖼️ Conclusion: this is a [scanned PDF]
Reason: very little text (0 chars); the pages contain 3 scanned images
📌 Recommended extraction approach:
Tool: OCR (optical character recognition)
Option 1 (pytesseract): ...
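The decision shown in the sample output can be sketched roughly as follows. This is a minimal illustration, not the actual contents of check_pdf_type.py: the function names `classify_pages` and `page_stats` are hypothetical, the 100-character threshold is borrowed from the cheat sheet in this README, and summing over all pages (rather than checking each page) is an assumption.

```python
from typing import List, Tuple

# Assumed threshold, taken from the cheat-sheet row "Text > 100 characters".
MIN_NATIVE_CHARS = 100


def classify_pages(stats: List[Tuple[int, int]]) -> str:
    """Classify a PDF from per-page (text_chars, image_count) pairs."""
    total_chars = sum(chars for chars, _ in stats)
    total_images = sum(images for _, images in stats)
    if total_chars > MIN_NATIVE_CHARS:
        return "native"   # enough extractable text: treat as a native PDF
    if total_images > 0:
        return "scanned"  # almost no text but embedded images: likely a scan
    return "unknown"      # neither text nor images: cannot decide


def page_stats(path: str) -> List[Tuple[int, int]]:
    """Collect (text_chars, image_count) for every page using pypdf."""
    # Imported lazily so classify_pages can be used without pypdf installed.
    from pypdf import PdfReader

    stats = []
    for page in PdfReader(path).pages:
        text = (page.extract_text() or "").strip()
        stats.append((len(text), len(page.images)))
    return stats
```

Calling `classify_pages(page_stats("scan_report.pdf"))` on a three-page scan like the one above would be expected to yield `"scanned"`, since each page reports no text and one image.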
2️⃣ Extract text
Automatic mode (detects the PDF type for you):
python extract_pdf_text.py scan_report.pdf
Force OCR mode:
python extract_pdf_text.py scan_report.pdf --ocr
Specify the language (e.g. English only):
python extract_pdf_text.py scan_report.pdf --lang eng
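To make the three invocations above concrete, here is a hedged sketch of how such a command line could be parsed and how the automatic mode might pick a strategy. Only the `--ocr` and `--lang` flags come from this README; `build_parser`, `choose_mode`, the default language `chi_sim+eng`, and the 100-character cutoff are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags of extract_pdf_text.py."""
    parser = argparse.ArgumentParser(prog="extract_pdf_text.py")
    parser.add_argument("pdf", help="path to the PDF file")
    parser.add_argument("--ocr", action="store_true",
                        help="force OCR even if a text layer exists")
    # The default below is an assumption; the real script may use another value.
    parser.add_argument("--lang", default="chi_sim+eng",
                        help="Tesseract language code(s), e.g. eng")
    return parser


def choose_mode(force_ocr: bool, text_chars: int, min_native_chars: int = 100) -> str:
    """Return "ocr" or "text" depending on the flag and the amount of text found."""
    if force_ocr or text_chars <= min_native_chars:
        return "ocr"
    return "text"
```

For example, `build_parser().parse_args(["scan_report.pdf", "--lang", "eng"])` produces a namespace with `lang="eng"` and `ocr=False`, and `choose_mode(False, 0)` then falls back to OCR because no text layer was found.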
3️⃣ Install dependencies
# Basic
pip install pypdf pdfplumber
# OCR (required for scanned PDFs)
pip install pdf2image pytesseract easyocr
# System dependencies
# Ubuntu: sudo apt install tesseract-ocr tesseract-ocr-chi-sim poppler-utils
# macOS: brew install tesseract poppler
# Windows: download and install Tesseract OCR + poppler
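Before running the OCR path, it can help to confirm that the packages and system tools listed above are actually visible. The following is a small sketch using only the standard library; `check_dependencies` is a hypothetical helper, and checking for `pdftoppm` assumes poppler's command-line tools are what pdf2image will look for on your PATH.

```python
import importlib.util
import shutil
from typing import Dict

# Import names of the pip packages listed above.
PYTHON_MODULES = ["pypdf", "pdfplumber", "pdf2image", "pytesseract", "easyocr"]
# Executables provided by the system packages (pdftoppm ships with poppler).
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]


def check_dependencies() -> Dict[str, bool]:
    """Map each dependency name to True if it can be found, else False."""
    status = {
        name: importlib.util.find_spec(name) is not None
        for name in PYTHON_MODULES
    }
    status.update(
        {name: shutil.which(name) is not None for name in SYSTEM_BINARIES}
    )
    return status
```

Printing the missing entries, for instance with `[name for name, ok in check_dependencies().items() if not ok]`, gives a quick list of what still needs to be installed.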
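Detection method cheat sheet
| Method | Command/Code | Native PDF | Scanned PDF |
|---|---|---|---|
| Python check | `check_pdf_type.py` | Text > 100 characters | Text ≈ 0, many images |
| Command line | `pdftotext file.pdf - \| head` | Prints normal text | Prints nothing or garbled text |
| Manual | Open the PDF and try to select text | ✅ Can be selected and copied | ❌ Cannot be selected |
| Manual | Search with Ctrl+F | ✅ Matches found | ❌ No matches |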
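The command-line check in the table can also be scripted. The sketch below runs `pdftotext` (from poppler) and applies a rough readability heuristic to its output; the helper names, the 0.5 ratio, and the idea of measuring "garbled" as a low share of alphanumeric characters are all assumptions, not part of the tool described here.

```python
import subprocess


def readable_ratio(sample: str) -> float:
    """Share of non-whitespace characters that are letters or digits (CJK included)."""
    chars = [c for c in sample if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def has_text_layer(sample: str, min_ratio: float = 0.5) -> bool:
    """Heuristic: empty or mostly non-alphanumeric output suggests a scan."""
    return readable_ratio(sample) >= min_ratio


def pdftotext_sample(path: str, max_chars: int = 2000) -> str:
    """Run `pdftotext path -` and return the beginning of its output."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")[:max_chars]
```

With poppler installed, `has_text_layer(pdftotext_sample("file.pdf"))` gives a yes/no answer comparable to eyeballing the output of the piped command in the table.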