Fetch web pages, extract their text and links, and save the result locally as Markdown.

page-extract

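page-extract is a simple command-line tool that fetches web pages, extracts their text, links, and email addresses, and saves the result locally as clean Markdown. It is aimed at tasks such as collecting research material, archiving content, and preparing data for analysis.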

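Features

Content extraction: identifies the main content of a page and strips navigation bars, footers, scripts, and other noise
HTML to Markdown: converts the page HTML into clean, readable Markdown
Link summary: collects every unique link on the page, converted to an absolute URL
Email extraction: collects email addresses from the page's mailto links
Batch processing: handles multiple URLs in one run, with a progress bar
File input: reads a list of URLs from a text file (comment lines supported)
Colored output: shows a progress bar and status messages in the terminal
Fault tolerance: skips a URL whose request fails and carries on with the rest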

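Installation and setup

Prerequisites

Python 3.10 or later
pip

Installation steps

Install directly from PyPI (once published)

pip install page-extract

Verify the installation

After installing, run the following command in a terminal:

page --help

If the help text appears, the installation succeeded.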

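Usage examples and code snippets

Command-line usage

1. Fetch a single page

page https://example.com

What this does:

Fetches https://example.com
Extracts the title, body, links, and email addresses
Saves a .md file in the output/ directory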

2. Fetch multiple pages

page https://example.com https://example.org https://example.net

3. Specify the output directory

page https://example.com -o my_articles

This saves the Markdown files to my_articles/, creating the directory if it does not exist.

4. Read a URL list from a file

Create a text file urls.txt with one URL per line:

# This is a comment and is ignored
https://example.com
https://example.org

# The URL below is processed as well
https://example.net

Then run:

page -f urls.txt
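
The file format described above (one URL per line, lines starting with `#` ignored) can be sketched in a few lines of Python. This is an illustration only, not page-extract's actual loader: the function name `read_url_lines` is hypothetical, and skipping blank lines is an assumption inferred from the sample file.

```python
def read_url_lines(text: str) -> list[str]:
    """Return the URLs in `text`, skipping blank lines and `#` comment lines."""
    urls: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        urls.append(line)
    return urls


sample = """# a comment
https://example.com
https://example.org

# another comment
https://example.net
"""
print(read_url_lines(sample))
```

To apply the same idea to a file on disk, the text could be read with `Path("urls.txt").read_text(encoding="utf-8")` and passed to the function.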

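5. Combine command-line URLs with a file

page https://example.com -f urls.txt -o results

This processes the URLs given on the command line together with every URL in the file.

6. Set a custom timeout

On a slow network, raise the timeout (default: 30 seconds):

page https://example.com -t 60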

7. Full example with all options

page https://example.com https://example.org \
  -f urls.txt \
  -o my_output \
  -t 45

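Command-line options

Option	Short	Type	Default	Description
`urls`	none	list of URLs	none	URLs to fetch; more than one may be given
`--file`	`-f`	file path	none	Read URLs from a file (one per line; lines starting with `#` are comments)
`--output`	`-o`	directory path	`output`	Directory where Markdown files are saved
`--timeout`	`-t`	number (seconds)	`30.0`	Request timeout
`--help`	none	none	none	Show the help text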
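
To make the defaults in the table concrete, the same options can be mirrored with the standard library's `argparse`. This is only a stand-in for illustration: the dependency list says page-extract is built on Typer, and the `build_parser` helper below is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented options and defaults (illustrative only)."""
    parser = argparse.ArgumentParser(prog="page")
    parser.add_argument("urls", nargs="*", help="URLs to fetch")
    parser.add_argument("-f", "--file", default=None, help="file with one URL per line")
    parser.add_argument("-o", "--output", default="output", help="output directory")
    parser.add_argument("-t", "--timeout", type=float, default=30.0, help="timeout in seconds")
    return parser


args = build_parser().parse_args(["https://example.com", "-f", "urls.txt", "-t", "45"])
print(args.urls, args.file, args.output, args.timeout)
```

Options that are not passed, such as `-o` here, fall back to the defaults listed in the table.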

Sample output

After the command runs, the terminal shows output similar to the following:

page-extract  共 3 个URL待处理

... 抓取 https://example.com
OK https://example.com -> output/example_com.md

... 抓取 https://example.org
OK https://example.org -> output/example_org.md

... 抓取 https://invalid-url
FAIL https://invalid-url (HTTP 404)

完成: 成功 2，失败 1，输出目录: /path/to/output

Structure of the generated Markdown file

Each generated .md file contains the following:

# 页面标题

> 来源: https://example.com

---

## 页面邮箱汇总

1. contact@example.com
2. support@example.com

---

(the page body, converted to Markdown, goes here...)

---

## 页面链接汇总

1. https://example.com/page1
2. https://example.com/page2
3. https://example.org
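
The layout above (title, source line, email summary, body, link summary, separated by horizontal rules) could be assembled roughly as follows. This is a sketch, not the package's own code: `render_document` and `numbered` are hypothetical helpers, the section headings are copied from the sample, and how the real tool handles an empty email or link list is not stated in the source.

```python
def numbered(items: list[str]) -> str:
    """Format items as a 1-based numbered Markdown list."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))


def render_document(title: str, url: str, body: str,
                    links: list[str], emails: list[str]) -> str:
    """Join the sections in the order shown in the sample file."""
    parts = [
        f"# {title}",
        f"> 来源: {url}",
        "---",
        "## 页面邮箱汇总",
        numbered(emails),
        "---",
        body,
        "---",
        "## 页面链接汇总",
        numbered(links),
    ]
    return "\n\n".join(parts) + "\n"


doc = render_document(
    title="Example",
    url="https://example.com",
    body="Body text.",
    links=["https://example.com/page1", "https://example.org"],
    emails=["contact@example.com"],
)
print(doc)
```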

Calling from Python

To use page-extract from a Python script:

Example 1: Basic usage

from pathlib import Path

from page_extract.extractor import PageExtractor

# Create an extractor instance
with PageExtractor() as extractor:
    # Fetch a single page
    result = extractor.extract("https://example.com")

    # Check whether the fetch succeeded
    if result.success:
        print(f"Title: {result.title}")
        print(f"Links: {len(result.links)}")
        print(f"Emails: {len(result.emails)}")

        # Save as a Markdown file
        filepath = extractor.save_markdown(result, Path("output"))
        print(f"Saved: {filepath}")
    else:
        print(f"Fetch failed: {result.error}")

Example 2: Custom configuration

from pathlib import Path

from page_extract.extractor import PageExtractor

# Custom timeout, with SSL verification enabled
with PageExtractor(timeout=60.0, verify_ssl=True) as extractor:
    result = extractor.extract("https://example.com")

    if result.success:
        # Save to a specific directory
        output_dir = Path("my_articles")
        filepath = extractor.save_markdown(result, output_dir)
        print(f"Saved: {filepath}")

Example 3: Batch processing

from pathlib import Path

from page_extract.extractor import PageExtractor

urls = [
    "https://example.com",
    "https://example.org",
    "https://example.net",
]

output_dir = Path("batch_output")

with PageExtractor(timeout=45.0) as extractor:
    for url in urls:
        print(f"Fetching: {url}")
        result = extractor.extract(url)

        if result.success:
            filepath = extractor.save_markdown(result, output_dir)
            print(f"  ✓ OK -> {filepath}")
        else:
            print(f"  ✗ Failed: {result.error}")

Example 4: Managing resources manually

Without a with statement, the extractor must be closed explicitly:

from page_extract.extractor import PageExtractor

extractor = PageExtractor()
try:
    result = extractor.extract("https://example.com")
    print(result.markdown)
finally:
    extractor.close()  # must be called explicitly to release connections

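API reference

Core class: `PageExtractor`

The page extractor. It handles the HTTP request, HTML parsing, and Markdown conversion.

Constructor parameters

PageExtractor(
    timeout: float = 30.0,         # request timeout in seconds
    follow_redirects: bool = True, # follow redirects automatically
    verify_ssl: bool = False       # verify SSL certificates
)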

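Main methods

`extract(url: str) -> ExtractResult`

Fetches a single URL and returns the extraction result.

Parameters:

url (str): address of the page to fetch

Returns: an ExtractResult object with the following attributes:

url (str): the URL that was fetched
title (str): page title
markdown (str): the converted Markdown content
links (list[str]): links found on the page
emails (list[str]): email addresses found on the page
success (bool): whether the fetch succeeded
error (str): error message on failure

Example:

result = extractor.extract("https://example.com")
print(result.title)
print(result.links)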

`save_markdown(result: ExtractResult, output_dir: Path) -> Path`

Saves an extraction result as a Markdown file.

Parameters:

result (ExtractResult): the result object returned by extract()
output_dir (Path): path of the output directory

Returns: the path of the saved file (a Path object)

Example:

from pathlib import Path
filepath = extractor.save_markdown(result, Path("output"))

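`fetch_html(url: str) -> str`

Requests the URL and returns the HTML text.

Parameters:

url (str): page address

Returns: the HTML as a string

Raises: httpx.HTTPError on failure

`close() -> None`

Closes the HTTP client and releases its connections.

Note: a with statement calls this automatically, so no manual close is needed.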
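
The relationship between `with` and `close()` follows Python's context-manager protocol. A minimal sketch with a hypothetical stand-in class (`ClosingExtractor` is not part of page-extract and makes no HTTP requests) shows the pattern such a class typically follows:

```python
class ClosingExtractor:
    """Hypothetical stand-in that only tracks whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "ClosingExtractor":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs when the with block exits, even if it raised an exception.
        self.close()


with ClosingExtractor() as ex:
    inside = ex.closed  # still open inside the block

print(inside, ex.closed)
```

Because `__exit__` runs on both normal and exceptional exit, the `with` form is equivalent to the try/finally pattern used when closing by hand.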

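Data class: `ExtractResult`

A container for the result of extracting a single page.

from dataclasses import dataclass, field

@dataclass
class ExtractResult:
    url: str                                          # target URL
    title: str = ""                                   # page title
    markdown: str = ""                                # Markdown content
    links: list[str] = field(default_factory=list)    # links
    emails: list[str] = field(default_factory=list)   # email addresses
    success: bool = True                              # whether the fetch succeeded
    error: str = ""                                   # error message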
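
List-valued dataclass fields such as `links` and `emails` need `field(default_factory=list)` so that each result owns its own list; Python rejects a bare `[]` default when the class is defined. A short sketch with a hypothetical two-field class `Result` (not the package's own class) demonstrates both points:

```python
from dataclasses import dataclass, field


@dataclass
class Result:  # hypothetical; mirrors two fields of ExtractResult
    url: str
    links: list[str] = field(default_factory=list)


a = Result("https://example.com")
b = Result("https://example.org")
a.links.append("https://example.com/page1")
print(a.links, b.links)  # b keeps its own, still empty, list

rejected = False
try:
    @dataclass
    class Broken:
        links: list[str] = []  # mutable default: not allowed
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```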

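Static utility methods

`extract_links(soup, base_url) -> list[str]`

Extracts all unique links from a BeautifulSoup object.

Filtering rules:

Skips empty links
Skips the javascript:, mailto:, and tel: schemes
Skips in-page anchors that start with #
Converts links to absolute URLs
Strips the URL fragment (the part after #)
Removes duplicates while preserving order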
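
The filtering rules above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the package's implementation: `normalize_links` is a hypothetical helper that takes raw `href` strings rather than a BeautifulSoup object, and the scheme check is assumed to be case-insensitive.

```python
from urllib.parse import urldefrag, urljoin

SKIPPED_SCHEMES = ("javascript:", "mailto:", "tel:")


def normalize_links(hrefs: list[str], base_url: str) -> list[str]:
    """Apply the listed filters and return unique absolute URLs in order."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # empty link or in-page anchor
        if href.lower().startswith(SKIPPED_SCHEMES):
            continue  # non-navigable scheme
        # Resolve against the base URL, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = ["/page1", "", "#top", "mailto:a@example.com",
         "page2#section", "/page1", "https://example.org"]
print(normalize_links(hrefs, "https://example.com/docs/"))
```

Tracking seen URLs in a set while appending to a list is what keeps the first occurrence of each link in its original position.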

`extract_emails(soup) -> list[str]`

Extracts the email addresses found in mailto: links of a BeautifulSoup object.
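
As a rough illustration of mailto-only extraction, the sketch below uses the standard library's `html.parser` instead of BeautifulSoup. The class `MailtoCollector` is hypothetical, and dropping a `?subject=...` query and removing duplicates are assumptions rather than documented behavior.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> tags only."""

    def __init__(self) -> None:
        super().__init__()
        self.emails: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Keep the address part; drop any ?subject=... query.
            address = unquote(href[len("mailto:"):].split("?", 1)[0]).strip()
            if address and address not in self.emails:
                self.emails.append(address)


html = (
    '<a href="mailto:contact@example.com?subject=Hi">Mail</a> '
    '<a href="/about">About</a> '
    '<a href="mailto:contact@example.com">Again</a> '
    "support@example.com"
)
collector = MailtoCollector()
collector.feed(html)
print(collector.emails)
```

The plain-text address at the end of the sample is not collected, since only `mailto:` links are examined.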

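`html_to_markdown(html, base_url) -> str`

Converts HTML to Markdown and turns relative links into absolute ones.

Processing steps:

Converts all relative links to absolute links
Uses ATX-style headings (# Heading)
Uses - as the list marker
Collapses three or more consecutive blank lines into two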
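
The final clean-up step can be illustrated with a regular expression. The sketch assumes the rule means collapsing runs of three or more newline characters into two; the package may count blank lines differently, and `collapse_blank_lines` is a hypothetical name.

```python
import re


def collapse_blank_lines(markdown: str) -> str:
    """Collapse runs of three or more newline characters into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", markdown)


text = "# Title\n\n\n\nFirst paragraph.\n\n\nSecond paragraph.\n"
print(repr(collapse_blank_lines(text)))
```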

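Dependencies

Core dependencies

Package	Minimum version	Purpose
httpx	>= 0.27.0	HTTP client used to request pages
beautifulsoup4	>= 4.12.0	HTML parser
markdownify	>= 0.14.1	HTML-to-Markdown converter
typer	>= 0.12.0	Command-line framework
rich	>= 13.0.0	Terminal styling and progress bars

Python version

Minimum: Python 3.10
Recommended: Python 3.11+

Automatic installation

Running pip install -e . or pip install page-extract installs all dependencies automatically.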

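Contributing and license

Contributing

Issues and pull requests are welcome.

Development setup

# 1. Clone the repository
git clone https://github.com/your-username/page-extract.git
cd page-extract

# 2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows

# 3. Install the package and its dependencies in editable mode
pip install -e .

# 4. Check that the command runs
page --help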

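Submitting changes

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push the branch (git push origin feature/amazing-feature)
Open a pull request

Code style

Follow PEP 8
Use type annotations
Add docstrings where needed
Keep the code simple and readable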

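Known limitations

No JavaScript rendering: pages are requested directly with httpx and JavaScript is not executed, so single-page applications (SPAs) may yield only an empty shell
SSL verification is off by default: verify_ssl=False; enabling it is recommended in production
Noise-tag removal: content and links inside <nav> and <footer> are discarded entirely
Email extraction: addresses are taken only from <a> tags with a mailto: href; plain text is not scanned

License

This project is released under the MIT License. You are free to use, modify, and distribute the software.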

Project details

This version

0.1.0

May 28, 2026

Download files

Source Distribution

page_extract-0.1.0.tar.gz (9.4 kB view details)

Uploaded May 28, 2026 Source

Built Distribution

page_extract-0.1.0-py3-none-any.whl (10.4 kB view details)

Uploaded May 28, 2026 Python 3

File details

Details for the file page_extract-0.1.0.tar.gz.

File metadata

Download URL: page_extract-0.1.0.tar.gz
Upload date: May 28, 2026
Size: 9.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for page_extract-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`668fd152115d8fbd781da1f988417a871c5066a0f8eff094dfefbeef4146c7f6`
MD5	`71afb0627b8a10b4456e7dff276ddd73`
BLAKE2b-256	`d8bc193797161c6fae01866373c85b73b7b368bbead673cc1b0b82b55db1b5ea`

File details

Details for the file page_extract-0.1.0-py3-none-any.whl.

File metadata

Download URL: page_extract-0.1.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 10.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for page_extract-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`074705dca670fec932cf94398ec8bb3dfdbfea1480218a8bc9d84b27e1f07803`
MD5	`9b687f28c1b654ecbd7b0350373e54ef`
BLAKE2b-256	`24957028ef800017f4a0066f5bc624028d7106caed76199c36c28f324d40f6e6`
