Extract Xiaohongshu (RED) notes as faithful Markdown — fetch + vision OCR + heuristic page stitching, no LLM rewriting.
Reason this release was yanked:
Contained hardcoded invite codes; superseded by 0.4.24
Project description
XHSExtractor
小红书(RED)链接 → 逐字还原的 Markdown 文章。
抓取 + 图片 OCR + 启发式跨页拼接,一条命令搞定。
✨ 特性
- 🔗 支持
xhslink.com短链和xiaohongshu.com/.../<note_id>长链 - 🖼 自动下载所有图片,并发 OCR(通过 OpenAI 兼容端点调
gpt-4o) - 🧷 逐字还原:不总结、不改写、不润色 —— 图里是啥,输出就是啥
- 🪄 启发式跨页拼接:识别句子在页底被截断的情况自动粘接,列表/段落边界保留换行
- 📝 输出结构化 Markdown(H1 标题 / 作者 / 摘要 / 正文 / 标签 / 原图画廊)
- 📦 也可输出 JSON,方便下游程序处理
🚀 快速开始
pip install xhsx
playwright install chromium # 首次需要装 headless 浏览器
# 设置 OpenAI 兼容端点(任选一个)
export COPILOTX_BASE_URL="https://api.openai.com/v1"
export COPILOTX_API_KEY="sk-..."
xhsx "https://xhslink.com/xxxxx" # 终端预览
xhsx "https://xhslink.com/xxxxx" -o note.md # 写入 Markdown
xhsx "https://xhslink.com/xxxxx" --json > out.json
xhsx "https://xhslink.com/xxxxx" --headed # 调试时显示浏览器
⚙️ 环境变量
| 变量 | 默认 | 说明 |
|---|---|---|
COPILOTX_BASE_URL |
https://api.polly.wang/v1 |
OpenAI 兼容 endpoint |
COPILOTX_API_KEY |
copilotx |
API key |
XHSX_VISION_MODEL |
gpt-4o |
图片 OCR 模型 |
也可在项目根目录放 .env 文件,首次导入时会自动加载。
🧱 架构
URL ─► fetcher.py ─► Playwright headless chromium
└─ window.__INITIAL_STATE__ (JSON)
└─ Note(title, desc, images, author, tags)
images ─► llm.ocr_images() ─► 下载(带 Referer 头)
└─ base64 内联为 data URL
└─ 并发调用 vision LLM
└─ List[text] (每页一段)
Note + OCR ─► pipeline._assemble_markdown()
├─ 启发式跨页拼接(末字未收尾 + 下一页非新段 → 粘起来)
└─ H1 / 作者 / 摘要 / 正文 / 标签 / 原文链接
🧪 作为库调用
import asyncio
from xhsx.pipeline import extract
async def main():
result = await extract("https://xhslink.com/xxxxx")
print(result.merged_article)
asyncio.run(main())
返回值是 ExtractResult:
note: 结构化笔记(标题、作者、图片 URL 列表、标签……)ocr_texts: 每页的 OCR 原文merged_article: 拼好的 Markdownelapsed_sec: 耗时
🗺 Roadmap
- Phase 1: CLI / Python 库(当前)
- Phase 2: HTTP API(FastAPI)
- Phase 3: Web UI(粘贴链接即用 + 历史记录)
⚠️ 注意
- 小红书页面结构会变化,
fetcher.py的__INITIAL_STATE__路径可能需要跟随调整 - 登录墙内容(部分私密笔记)抓不到
- OCR 偶有错字(模型能力所限),不做后处理 —— 避免"改对 A 字,改错 B 字"
- 默认并发 4,过高容易触发风控
- 请遵守小红书服务条款与目标内容版权,仅用于个人内容整理
📝 License
MIT © Polly (Baoli Wang)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file xhsx-0.2.0.tar.gz.
File metadata
- Download URL: xhsx-0.2.0.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
343b86e3d66d639919f934bba3147c5835789b2b403b6aeb9fccf8bb292d6684
|
|
| MD5 |
57af8a225dd761890ac72d2454ec3dcb
|
|
| BLAKE2b-256 |
f9875b00078d92209ae49bb6bd425f716099dadd93fbb67f504c3e18167af572
|
File details
Details for the file xhsx-0.2.0-py3-none-any.whl.
File metadata
- Download URL: xhsx-0.2.0-py3-none-any.whl
- Upload date:
- Size: 12.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
66de50c6b0c504b0d9d7e9ff2142bc04b580db5406272a7d1b615b3f847a2c7a
|
|
| MD5 |
860768ee02cc76f8f4eb042992ae908f
|
|
| BLAKE2b-256 |
ecfdcc073d9b703db12de911750ff470da70f8a8c079cdb6da5fad334730034c
|