Turn any SKILL.md into a runnable AI Agent

Project description

PDF Type Detection and Text Extraction Tool

Quick Start

1️⃣ Determine the PDF type

python check_pdf_type.py scan_report.pdf

Example output:

==================================================
📄 Analyzing: scan_report.pdf
==================================================

  Page 1 | Text:    0 chars | Images: 1 🖼️
  Page 2 | Text:    0 chars | Images: 1 🖼️
  Page 3 | Text:    0 chars | Images: 1 🖼️

──────────────────────────────────────────────────
🔍 Result
──────────────────────────────────────────────────

🖼️ Conclusion: this is a [scanned PDF]
   Reason: very little text (0 chars); the pages contain 3 scanned images

📌 Recommended extraction approach:
   Tool: OCR (optical character recognition)
   Option 1 (pytesseract): ...
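
The per-page text and image counts in the report suggest a simple heuristic. The following is a minimal sketch of how such a check could work, not the actual code of check_pdf_type.py. The names `classify_pages`, `collect_page_stats`, and `MIN_TEXT_CHARS` are hypothetical, and applying the 100-character threshold to the whole document (rather than per page) is an assumption.

```python
from typing import List, Tuple

# Assumed threshold, borrowed from the "text > 100 characters" rule of thumb.
MIN_TEXT_CHARS = 100


def classify_pages(page_stats: List[Tuple[int, int]]) -> str:
    """Classify a PDF from per-page (text_chars, image_count) pairs."""
    total_chars = sum(chars for chars, _ in page_stats)
    total_images = sum(images for _, images in page_stats)
    if total_chars > MIN_TEXT_CHARS:
        return "native"
    if total_images > 0:
        return "scanned"
    return "unknown"


def collect_page_stats(path: str) -> List[Tuple[int, int]]:
    """Gather (text_chars, image_count) for every page using pdfplumber."""
    import pdfplumber  # imported lazily so classify_pages works without it

    stats = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            stats.append((len(text.strip()), len(page.images)))
    return stats
```

With the three pages shown in the example (0 characters and 1 image each), `classify_pages([(0, 1), (0, 1), (0, 1)])` falls into the "scanned" branch.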

2️⃣ Extract text

Automatic mode (detects the PDF type automatically):

python extract_pdf_text.py scan_report.pdf

Force OCR mode:

python extract_pdf_text.py scan_report.pdf --ocr

Specify the language (e.g. English only):

python extract_pdf_text.py scan_report.pdf --lang eng
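
To illustrate how the three invocations relate, here is a hedged sketch of a command-line interface with the same flags. It is not the source of extract_pdf_text.py; `build_parser` and `choose_mode` are hypothetical names, and the default language `chi_sim+eng` is an assumption (the tool's real default is not stated here).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (--ocr, --lang)."""
    parser = argparse.ArgumentParser(prog="extract_pdf_text.py")
    parser.add_argument("pdf", help="path to the PDF file")
    parser.add_argument("--ocr", action="store_true",
                        help="force OCR even if the PDF has a text layer")
    # Assumed default: Simplified Chinese plus English.
    parser.add_argument("--lang", default="chi_sim+eng",
                        help="Tesseract language code(s), e.g. eng")
    return parser


def choose_mode(args: argparse.Namespace, detected_type: str) -> str:
    """Return "ocr" when forced or when the PDF was detected as scanned."""
    if args.ocr or detected_type == "scanned":
        return "ocr"
    return "text"


if __name__ == "__main__":
    demo = build_parser().parse_args(["scan_report.pdf", "--lang", "eng"])
    print(demo.pdf, demo.lang, choose_mode(demo, "scanned"))
```

In this sketch, automatic mode simply defers to the detected type, while `--ocr` overrides it.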

3️⃣ Install dependencies

# Basics
pip install pypdf pdfplumber

# OCR (required for scanned PDFs)
pip install pdf2image pytesseract easyocr

# System dependencies
# Ubuntu: sudo apt install tesseract-ocr tesseract-ocr-chi-sim poppler-utils
# macOS:  brew install tesseract poppler
# Windows: download and install Tesseract OCR + poppler
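
Before running the scripts, it can help to confirm that the packages and system tools listed above are actually available. The sketch below is one possible check; `check_dependencies` is a hypothetical helper, and using `pdftoppm` as the marker for a working poppler install is an assumption (it is one of the poppler binaries that pdf2image calls).

```python
import importlib.util
import shutil
from typing import Dict

PYTHON_MODULES = ["pypdf", "pdfplumber", "pdf2image", "pytesseract", "easyocr"]
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]  # pdftoppm ships with poppler


def check_dependencies() -> Dict[str, bool]:
    """Map each dependency name to True if it can be found on this machine."""
    status: Dict[str, bool] = {}
    for name in PYTHON_MODULES:
        # find_spec returns None when the top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for name in SYSTEM_BINARIES:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for dep, found in check_dependencies().items():
        print(f"{dep}: {'ok' if found else 'missing'}")
```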

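Detection cheat sheet

Method        Command/code                         Native PDF                          Scanned PDF
Python check  check_pdf_type.py                    Text > 100 characters               Text ≈ 0, many images
Command line  pdftotext file.pdf - | head          Normal text output                  Empty or garbled output
Manual        Open the PDF and try selecting text  ✅ Text can be selected and copied  ❌ Text cannot be selected
Manual        Ctrl+F search                        ✅ Search finds matches             ❌ Search finds nothing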
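
The command-line check can also be scripted. The following sketch wraps `pdftotext file.pdf -` and applies a rough "empty or garbled" test to its output. `run_pdftotext` and `looks_native` are hypothetical names, and the definition of "garbled" used here (a high share of replacement or non-printable characters, with an assumed 30% cutoff) is an illustration only.

```python
import subprocess


def run_pdftotext(path: str) -> str:
    """Return the text that `pdftotext <path> -` writes to stdout."""
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def looks_native(text: str, min_chars: int = 100,
                 max_bad_ratio: float = 0.3) -> bool:
    """Heuristic: enough visible characters and few undecodable ones."""
    visible = [ch for ch in text if not ch.isspace()]
    if len(visible) <= min_chars:
        return False  # empty or nearly empty output suggests a scan
    bad = sum(1 for ch in visible
              if ch == "\ufffd" or not ch.isprintable())
    return bad / len(visible) <= max_bad_ratio
```

A call such as `looks_native(run_pdftotext("scan_report.pdf"))` would then give a quick yes/no answer, provided poppler's `pdftotext` is installed.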

Download files

Source Distribution

agenthatch-0.8.0.tar.gz (185.0 kB)

Uploaded Source

Built Distribution

agenthatch-0.8.0-py3-none-any.whl (228.4 kB)

Uploaded Python 3

File details

Details for the file agenthatch-0.8.0.tar.gz.

File metadata

  • Download URL: agenthatch-0.8.0.tar.gz
  • Size: 185.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for agenthatch-0.8.0.tar.gz
Algorithm Hash digest
SHA256 d63adc9c83108843e82aa9e1a11924a6fbd65d12ca969ab1b085ce1eb517fe3d
MD5 2f0aa2cb39826ae7bbda7435d36e1958
BLAKE2b-256 cb9f7ff7542aabbc9e4ed0dce84a89900d1c91bb7c9d34085b6222fc66cd9837

File details

Details for the file agenthatch-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: agenthatch-0.8.0-py3-none-any.whl
  • Size: 228.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for agenthatch-0.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3e9874e8cae8da36c1a4ffd9b816328ce33278a8174578607fff1703498afe3d
MD5 94351c031bc2d6470a9679393b550727
BLAKE2b-256 c267cd3de2c9b6dc36e5e4e25371cd6c4bdb084b933771cc0d8de989dee6c946
