Turn any SKILL.md into a runnable AI Agent
Project description
PDF Type Detection and Text Extraction Tool
Quick Start
1️⃣ Determine the PDF type
python check_pdf_type.py scan_report.pdf
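Sample output (shown here in English translation):
==================================================
📄 Analyzing: scan_report.pdf
==================================================
Page 1 | Text: 0 chars | Images: 1 🖼️
Page 2 | Text: 0 chars | Images: 1 🖼️
Page 3 | Text: 0 chars | Images: 1 🖼️
──────────────────────────────────────────────────
🔍 Result
──────────────────────────────────────────────────
🖼️ Conclusion: this is a [scanned PDF]
Reason: almost no text (0 chars); the pages contain 3 scanned images
📌 Recommended extraction approach:
Tool: OCR (optical character recognition)
Option 1 (pytesseract): ...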
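The source of check_pdf_type.py is not shown here, so the following is only a minimal sketch of how such a check could work, assuming it counts extracted characters and embedded images per page. The names `PageStats`, `classify_pdf`, and `collect_stats` are hypothetical, and the 100-character threshold is borrowed from the cheat sheet on this page and applied to the whole document, which is itself an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Per-page counts, mirroring the columns in the sample output."""
    text_chars: int
    image_count: int


def classify_pdf(pages: List[PageStats], min_text_chars: int = 100) -> str:
    """Return "native", "scanned", or "unknown" from per-page statistics.

    Assumption: the threshold applies to the total text of the document.
    """
    total_text = sum(p.text_chars for p in pages)
    total_images = sum(p.image_count for p in pages)
    if total_text > min_text_chars:
        return "native"
    if total_images > 0:
        return "scanned"
    return "unknown"


def collect_stats(path: str) -> List[PageStats]:
    """Gather statistics with pdfplumber (third-party, imported lazily)."""
    import pdfplumber

    stats = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # may be None or empty on image-only pages
            stats.append(PageStats(len(text.strip()), len(page.images)))
    return stats


# Three image-only pages, as in the sample output above.
print(classify_pdf([PageStats(0, 1), PageStats(0, 1), PageStats(0, 1)]))
```

With pdfplumber installed, `classify_pdf(collect_stats("scan_report.pdf"))` would run the same decision on a real file.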
2️⃣ Extract text
Automatic mode (detects the PDF type automatically):
python extract_pdf_text.py scan_report.pdf
Force OCR mode:
python extract_pdf_text.py scan_report.pdf --ocr
Specify the language (e.g. English only):
python extract_pdf_text.py scan_report.pdf --lang eng
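The three invocations above suggest a command-line interface with one positional argument and two options. The sketch below shows one way to model that with `argparse`; it is not the actual extract_pdf_text.py. The helper names `build_parser` and `choose_mode`, the default language `chi_sim+eng`, and the 100-character threshold are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser matching the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF")
    parser.add_argument("pdf", help="path to the PDF file")
    parser.add_argument("--ocr", action="store_true", help="force OCR mode")
    # Assumed default: the real script's default language is not documented here.
    parser.add_argument("--lang", default="chi_sim+eng", help="Tesseract language code(s)")
    return parser


def choose_mode(text_chars: int, force_ocr: bool, min_text_chars: int = 100) -> str:
    """Pick "ocr" when forced or when the PDF has almost no embedded text."""
    if force_ocr or text_chars <= min_text_chars:
        return "ocr"
    return "text"


args = build_parser().parse_args(["scan_report.pdf", "--lang", "eng"])
print(args.pdf, args.ocr, args.lang)
print(choose_mode(text_chars=0, force_ocr=args.ocr))
```

Keeping the mode decision in a small pure function makes the automatic mode easy to test without touching any PDF file.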
3️⃣ Install dependencies
# Basic
pip install pypdf pdfplumber
# OCR (required for scanned PDFs)
pip install pdf2image pytesseract easyocr
# System dependencies
# Ubuntu: sudo apt install tesseract-ocr tesseract-ocr-chi-sim poppler-utils
# macOS: brew install tesseract poppler
# Windows: download and install Tesseract OCR + poppler
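Because OCR depends on both Python packages and system binaries, it can help to verify the environment before running the scripts. The following is a small sketch using only the standard library. The helper name `missing_dependencies` is hypothetical, and it assumes the system tools are exposed on `PATH` as `tesseract` and `pdftoppm` (the poppler utility that pdf2image calls).

```python
import importlib.util
import shutil
from typing import List, Tuple

# Import names of the packages installed by the pip commands above.
PYTHON_MODULES = ["pypdf", "pdfplumber", "pdf2image", "pytesseract", "easyocr"]
# Assumed executable names for Tesseract OCR and poppler.
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]


def missing_dependencies(
    modules: List[str] = PYTHON_MODULES,
    binaries: List[str] = SYSTEM_BINARIES,
) -> Tuple[List[str], List[str]]:
    """Return (missing Python modules, missing executables)."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return missing_modules, missing_binaries


mods, bins = missing_dependencies()
print("Missing Python packages:", mods or "none")
print("Missing system tools:", bins or "none")
```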
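Detection method cheat sheet
| Method | Command/code | Native PDF characteristics | Scanned PDF characteristics |
|---|---|---|---|
| Python check | check_pdf_type.py | Text > 100 chars | Text ≈ 0, many images |
| Command line | pdftotext file.pdf - \| head | Outputs normal text | Outputs nothing or garbled text |
| Manual | Open the PDF and try to select text | ✅ Text can be selected and copied | ❌ Text cannot be selected |
| Manual | Search with Ctrl+F | ✅ Matches are found | ❌ Nothing is found |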
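The command-line check in the table can also be scripted. The sketch below wraps `pdftotext file.pdf -` with `subprocess` and applies a rough heuristic for "empty or garbled" output. The function names, the 100-character minimum, and the 0.5 readable-character ratio are assumptions for illustration, not part of the tool.

```python
import subprocess


def run_pdftotext(path: str) -> str:
    """Run poppler's pdftotext and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def looks_like_native_text(
    output: str, min_chars: int = 100, min_readable_ratio: float = 0.5
) -> bool:
    """Heuristic: enough visible characters, and most are letters or digits."""
    visible = [c for c in output if not c.isspace()]
    if len(visible) <= min_chars:
        return False  # empty or near-empty output suggests a scanned PDF
    readable = sum(1 for c in visible if c.isalnum())
    return readable / len(visible) >= min_readable_ratio


print(looks_like_native_text("Quarterly report 2024 " * 20))
print(looks_like_native_text(""))
```

With poppler installed, `looks_like_native_text(run_pdftotext("file.pdf"))` would apply the same check to a real file.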