Smart web crawling and search for OpenClaw

Crawl4AI Skill

Smart search and crawling tool | Supports logged-in crawling of Twitter/X and Xiaohongshu

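Background

When I use AI assistants to process information, I often need to crawl web page content. After trying many approaches, I came across crawl4ai, a crawler engine designed for LLMs. Its Fit Markdown output is practically tailor-made for AI: it strips out redundant content and keeps only the core information.

In practice, though, I ran into a pain point: a lot of valuable content is only accessible after logging in.

Tweets on Twitter/X, notes on Xiaohongshu... These platforms have strict anti-scraping measures, and an ordinary crawler cannot reach content behind the login. crawl4ai's storage_state parameter supports cookie injection in theory, but on platforms such as Twitter it is blocked by anti-automation detection.

After repeated attempts, I found a workable approach: a Playwright persistent context plus the crawl4ai Markdown generator. Playwright loads the page while getting past anti-automation detection, and crawl4ai then converts the result into clean Markdown.

This project is the result of those explorations. I hope it helps others with the same need.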
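The combination described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function names `profile_dir` and `fetch_fit_markdown` are hypothetical, the profile path follows the browser data location listed under credential storage, and the stealth patching the project relies on is left out for brevity.

```python
from pathlib import Path


def profile_dir(platform: str) -> Path:
    """Per-platform browser profile directory (layout assumed from this README)."""
    return Path.home() / ".crawl4ai-skill" / "browser_data" / platform


def fetch_fit_markdown(url: str, platform: str) -> str:
    """Load `url` in a persistent Chromium profile and return pruned Markdown."""
    # Third-party imports are kept local so profile_dir() works without them.
    from playwright.sync_api import sync_playwright
    from crawl4ai.content_filter_strategy import PruningContentFilter
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

    with sync_playwright() as p:
        # A persistent context reuses the cookies and local storage saved in
        # the profile directory, so an earlier login carries over.
        context = p.chromium.launch_persistent_context(
            user_data_dir=str(profile_dir(platform)),
            headless=True,
        )
        try:
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            html = page.content()
        finally:
            context.close()

    # Hand the rendered HTML to crawl4ai's Markdown generator with pruning.
    generator = DefaultMarkdownGenerator(content_filter=PruningContentFilter())
    result = generator.generate_markdown(html, base_url=url)
    # fit_markdown is the filtered variant; fall back to the raw conversion.
    return result.fit_markdown or result.raw_markdown
```

Calling `fetch_fit_markdown("https://x.com/elonmusk", "twitter")` would need Playwright, its Chromium build, and crawl4ai installed, plus a profile that already holds a logged-in session.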

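Features

🔍 DuckDuckGo search - no API key required, fast searches
🕷️ Smart crawling - automatic sitemap detection and recursive crawling
📝 LLM-optimized output - Fit Markdown, which saves tokens
🔐 Logged-in crawling - supports Twitter/X and Xiaohongshu
🐦 Tweet extraction - supports quote tweets
🛡️ Anti-detection - Playwright Stealth mode

Installation

Install from source (lets you inspect the code)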

git clone https://github.com/lancelin111/crawl4ai-skill.git
cd crawl4ai-skill
pip install -e .
python -m playwright install chromium

Quick start

Search

crawl4ai-skill search "python web scraping"

Crawl a web page

crawl4ai-skill crawl https://example.com -o page.md

Crawl an entire site

crawl4ai-skill crawl-site https://docs.example.com --max-pages 50
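How crawl-site discovers pages is not documented here beyond "automatic sitemap detection", so the following is only a sketch of the general idea: reading page URLs from a standard sitemaps.org XML document and stopping at a page limit. The function name `urls_from_sitemap` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, max_pages: int = 50) -> list[str]:
    """Return up to `max_pages` page URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if len(urls) >= max_pages:
            break
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/</loc></url>
  <url><loc>https://docs.example.com/install</loc></url>
  <url><loc>https://docs.example.com/usage</loc></url>
</urlset>"""

print(urls_from_sitemap(sample, max_pages=2))
# ['https://docs.example.com/', 'https://docs.example.com/install']
```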

Search and crawl

crawl4ai-skill search-and-crawl "AI tutorials" --crawl-top 3

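Logged-in crawling

This is the project's core feature: crawling pages that require a login.

Step 1: Log in

Getting the Twitter cookies:

Log in to Twitter in your browser
Open the developer tools (F12) → Application → Cookies
Copy the values of auth_token and ct0

Login methods (ordered from most to least secure):

# Method 1: environment variable (recommended; not recorded in shell history)
export TWITTER_COOKIES="auth_token=xxx; ct0=yyy"
crawl4ai-skill login twitter

# Method 2: interactive prompt (input is not echoed)
crawl4ai-skill login twitter --interactive

# Method 3: read from a file (requires chmod 600)
echo "auth_token=xxx; ct0=yyy" > ~/.twitter-cookies
chmod 600 ~/.twitter-cookies
crawl4ai-skill login twitter --cookies-file ~/.twitter-cookies

# Method 4: command-line argument (not recommended; recorded in shell history)
crawl4ai-skill login twitter --cookies "auth_token=xxx; ct0=yyy"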
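All four methods pass the same `name=value; name=value` string. As a rough illustration of what a tool might do with it, the sketch below splits that string into the dict shape that Playwright's `add_cookies` accepts and checks that both required names are present. The helper names are hypothetical, and the `.x.com` default domain is an assumption taken from the FAQ in this README.

```python
def parse_cookie_string(raw: str, domain: str = ".x.com") -> list[dict]:
    """Turn 'name=value; name2=value2' into Playwright-style cookie dicts."""
    cookies = []
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if not sep or not name:
            continue  # skip empty or malformed fragments
        cookies.append({"name": name, "value": value, "domain": domain, "path": "/"})
    return cookies


def missing_required(cookies: list[dict], required=("auth_token", "ct0")) -> list[str]:
    """Return the required cookie names that are absent from `cookies`."""
    names = {c["name"] for c in cookies}
    return [r for r in required if r not in names]


cookies = parse_cookie_string("auth_token=xxx; ct0=yyy")
print([c["name"] for c in cookies])  # ['auth_token', 'ct0']
print(missing_required(cookies))     # []
```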

Xiaohongshu (QR code login):

crawl4ai-skill login xiaohongshu
# Opens a browser; scan the QR code with the app to log in

Step 2: Crawl

# Crawl a Twitter user page
crawl4ai-skill crawl-with-login https://x.com/elonmusk -p twitter

# Extract tweets (including quote tweets)
crawl4ai-skill crawl-with-login https://x.com/elonmusk -p twitter --extract-tweets

# Crawl a Xiaohongshu note
crawl4ai-skill crawl-with-login https://www.xiaohongshu.com/explore/xxx -p xiaohongshu

Check login status

crawl4ai-skill session-status

Clear login data

crawl4ai-skill session-clear twitter
crawl4ai-skill session-clear --all

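Command reference

Command	Description
`search <query>`	Search the web
`crawl <url>`	Crawl a single page
`crawl-site <url>`	Crawl an entire site
`search-and-crawl <query>`	Search, then crawl the results
`login <platform>`	Log in to a platform
`crawl-with-login <url>`	Crawl with a logged-in session
`session-status`	Show login status
`session-clear [platform]`	Clear login data

Output formats

Format	Description
`fit_markdown`	Optimized Markdown with redundant content removed (recommended)
`markdown_with_citations`	Includes a citation list for tracing sources
`raw_markdown`	Raw Markdown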

FAQ

Twitter crawl reports that you are not logged in?

Make sure you use x.com rather than twitter.com; the cookies are bound to the .x.com domain.
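If URLs come from older links or bookmarks, a small normalization step can avoid this mismatch. The helper below is a hypothetical sketch (it is not part of the crawl4ai-skill CLI) that rewrites twitter.com hosts to x.com and leaves every other URL untouched.

```python
from urllib.parse import urlsplit, urlunsplit

# Host names treated as legacy aliases of x.com (an assumption for this sketch).
LEGACY_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def to_x_domain(url: str) -> str:
    """Rewrite twitter.com URLs to x.com so cookies bound to .x.com apply."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in LEGACY_HOSTS:
        parts = parts._replace(netloc="x.com")
    return urlunsplit(parts)


print(to_x_domain("https://twitter.com/elonmusk"))  # https://x.com/elonmusk
print(to_x_domain("https://x.com/elonmusk"))        # https://x.com/elonmusk
```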

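No response after scanning the Xiaohongshu QR code?

After scanning, you also need to tap confirm in the app to complete the login.

Playwright browser problems?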

python -m playwright install chromium --with-deps

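Security notes

Code transparency

✅ This project is fully open source; all code can be reviewed on GitHub
✅ Cloning the repository and reviewing the code before installing is recommended
✅ The bandit tool can be used to scan for security issues

Credential storage locations

Data	Storage path	Notes
Session cookies	`~/.crawl4ai-skill/sessions/<platform>_session.enc`	Stored with AES-128 encryption
Browser data	`~/.crawl4ai-skill/browser_data/<platform>/`	Playwright persistent context
Encryption key	Derived from machine identifiers	Cannot be decrypted on another machine

All data is stored locally only and is never sent to any external server.

Encrypted storage

Since v0.2.0, session cookies are stored encrypted with AES-128-CBC by default:

The key is derived from machine identifiers (MAC address, hostname, and so on)
Encrypted session files cannot be decrypted on another machine
File permissions are set to 600 (readable and writable by the owner only)
Check the encryption status with: crawl4ai-skill session-status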
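The README does not say how the key is derived from the machine identifiers. As one possible shape, the sketch below uses only the standard library to stretch the MAC address and hostname into a 16-byte key with PBKDF2. The function name, the salt, and the iteration count are all assumptions for illustration; the project's real derivation may differ.

```python
import hashlib
import socket
import uuid


def derive_machine_key(salt: bytes = b"crawl4ai-skill", length: int = 16) -> bytes:
    """Derive a 16-byte (AES-128 sized) key from identifiers of this machine."""
    # uuid.getnode() returns the MAC address as an integer; if no hardware
    # address can be read, Python substitutes a random value.
    machine_id = f"{uuid.getnode():012x}|{socket.gethostname()}".encode("utf-8")
    # Salt and iteration count are illustrative, not the project's values.
    return hashlib.pbkdf2_hmac("sha256", machine_id, salt, 100_000, dklen=length)


key = derive_machine_key()
print(len(key))  # 16
```

The 16 bytes match the key size AES-128 expects. The cipher step itself needs a third-party library and is left out here.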

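Credential input methods

Method	Security	Notes
Environment variable	⭐⭐⭐	Recommended; not recorded in shell history
Interactive prompt	⭐⭐⭐	Input is not echoed
File	⭐⭐	Requires chmod 600
Command-line argument	⭐	Not recommended; recorded in shell history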
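To make the differences between these methods concrete, here is a minimal sketch of a loader for the three safer ones. The function name `load_cookies`, the order in which sources are checked, and the decision to reject files readable by group or others are assumptions for illustration, not documented behaviour of the CLI. The permission check applies to POSIX systems.

```python
import getpass
import os
import stat
from pathlib import Path


def load_cookies(cookies_file=None, interactive=False, env_var="TWITTER_COOKIES"):
    """Return a cookie string from a file, an interactive prompt, or the environment."""
    if cookies_file is not None:
        path = Path(cookies_file).expanduser()
        mode = stat.S_IMODE(path.stat().st_mode)
        # Refuse files that group or others can access (anything beyond 0o600).
        if mode & 0o077:
            raise PermissionError(f"{path} has mode {mode:o}; run chmod 600 on it")
        return path.read_text(encoding="utf-8").strip()
    if interactive:
        # getpass does not echo what the user types.
        return getpass.getpass("Cookies: ").strip()
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise ValueError(f"{env_var} is not set")
    return value
```

For example, after `export TWITTER_COOKIES="auth_token=xxx; ct0=yyy"`, calling `load_cookies()` returns that string, while `load_cookies(cookies_file="~/.twitter-cookies")` reads the file only if its mode is 600 or stricter.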

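Security recommendations

Review the code - clone the repository and inspect the source before installing
Use a test account - log in with an account other than your main one
Use environment variables - pass credentials via export TWITTER_COOKIES=...
Clean up promptly - run crawl4ai-skill session-clear --all when you are done
Restrict permissions - make sure the ~/.crawl4ai-skill directory has safe permissions

Disclaimer

This tool is for learning and research purposes only
Users bear their own legal responsibility
The author is not responsible for data security issues

Choosing an installation method

# Recommended: install from PyPI
pip install crawl4ai-skill
python -m playwright install chromium

# Or install from source (lets you review the code)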
git clone https://github.com/lancelin111/crawl4ai-skill.git
cd crawl4ai-skill
pip install -e .
python -m playwright install chromium

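Acknowledgements

This project would not exist without the following excellent open source projects:

crawl4ai ⭐

A crawler engine truly designed for LLMs.

The first time I saw crawl4ai's Fit Markdown output, I was stunned. It does not simply convert HTML to Markdown; it intelligently extracts the core content and removes noise such as navigation, ads, and sidebars. This is exactly the input format AI needs: clean, concise, and to the point.

crawl4ai's PruningContentFilter and DefaultMarkdownGenerator are the core of this project's Markdown generation. Thanks to @unclecode for creating this powerful tool.

Playwright

A browser automation tool from Microsoft. This project uses Playwright's persistent context to maintain login state and get past anti-automation detection. Its stability and cross-platform support are the foundation that lets the project run reliably.

playwright-stealth

A key component that helps Playwright get past anti-automation detection. Without it, logged-in crawling would simply not be possible.

duckduckgo-search

The API-key-free search capability comes from this project. Simple, reliable, and no registration required.

If this project helps you, please give the projects above a Star ⭐

They are the real heroes.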

License

MIT License

Author

@lancelin

Built with ❤️ and open source

Release history

0.2.0 (this version): Mar 10, 2026
0.1.0: Mar 10, 2026
