Extract Xiaohongshu (RED) notes as faithful Markdown — fetch + vision OCR + heuristic page stitching, no LLM rewriting.
XHSExtractor
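Xiaohongshu (RED) link → a verbatim Markdown article.
Fetching, image OCR, and heuristic cross-page stitching, all in one command.
✨ Features
- 🔗 Supports `xhslink.com` short links and `xiaohongshu.com/.../<note_id>` long links
- 🖼 Downloads every image automatically and runs OCR concurrently (calling `gpt-4o` through an OpenAI-compatible endpoint)
- 🧷 Verbatim reproduction: no summarizing, no rewriting, no polishing; the output is exactly what the images contain
- 🪄 Heuristic cross-page stitching: detects sentences cut off at the bottom of a page and joins them automatically, while list and paragraph boundaries keep their line breaks
- 📝 Outputs structured Markdown (H1 title / author / summary / body / tags / original image gallery)
- 📦 Can also output JSON for downstream programs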
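The two supported link shapes can be told apart by hostname. The sketch below is illustrative only and is not taken from the xhsx source: `classify_link` is a hypothetical helper, and treating the last path segment as the short code or note ID is an assumption based on the `<note_id>` position shown above.

```python
from urllib.parse import urlparse


def classify_link(url: str) -> tuple[str, str]:
    """Return (kind, ident), where kind is "short" or "long".

    Hypothetical helper: assumes the last non-empty path segment is the
    short code (xhslink.com) or the note ID (xiaohongshu.com).
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        raise ValueError(f"URL has no path segment: {url!r}")
    if host == "xhslink.com" or host.endswith(".xhslink.com"):
        return "short", segments[-1]
    if host == "xiaohongshu.com" or host.endswith(".xiaohongshu.com"):
        return "long", segments[-1]
    raise ValueError(f"not a Xiaohongshu link: {url!r}")
```

A short link would still have to be resolved (for example by following its redirect in the browser) before a note ID is available; this sketch only classifies the input.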
🚀 Quick start
pip install xhsx
playwright install chromium  # first run only: installs the headless browser
# Configure an OpenAI-compatible endpoint (any provider works)
export COPILOTX_BASE_URL="https://api.openai.com/v1"
export COPILOTX_API_KEY="sk-..."
xhsx "https://xhslink.com/xxxxx"  # preview in the terminal
xhsx "https://xhslink.com/xxxxx" -o note.md  # write Markdown to a file
xhsx "https://xhslink.com/xxxxx" --json > out.json
xhsx "https://xhslink.com/xxxxx" --headed  # show the browser while debugging
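⚙️ Environment variables
| Variable | Default | Description |
|---|---|---|
| `COPILOTX_BASE_URL` | `https://api.polly.wang/v1` | OpenAI-compatible endpoint |
| `COPILOTX_API_KEY` | `copilotx` | API key |
| `XHSX_VISION_MODEL` | `gpt-4o` | Vision model used for image OCR |
| `XHSX_INVITE_CODES` | (empty) | `xhsx serve` only: comma-separated allowlist of invite codes |
| `XHSX_INVITE_CODES_FILE` | (empty) | `xhsx serve` only: path to an invite-code file, one code per line; edits to the file are hot-reloaded immediately |
| `XHSX_DAILY_QUOTA` | `50` | `xhsx serve` only: daily quota per invite code |
| `XHSX_CONCURRENCY` | `4` | Number of concurrent OCR requests |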
You can also place a `.env` file in the project root; it is loaded automatically on first import.
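To make the defaults concrete, here is a minimal sketch of how these variables could be read with their documented fallbacks. `Settings` and `load_settings` are hypothetical names, not the package's actual configuration code, and `.env` loading is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    base_url: str
    api_key: str
    vision_model: str
    concurrency: int
    daily_quota: int
    invite_codes: tuple[str, ...]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_codes = env.get("XHSX_INVITE_CODES", "")
    return Settings(
        base_url=env.get("COPILOTX_BASE_URL", "https://api.polly.wang/v1"),
        api_key=env.get("COPILOTX_API_KEY", "copilotx"),
        vision_model=env.get("XHSX_VISION_MODEL", "gpt-4o"),
        concurrency=int(env.get("XHSX_CONCURRENCY", "4")),
        daily_quota=int(env.get("XHSX_DAILY_QUOTA", "50")),
        # Comma-separated allowlist; blank entries are dropped.
        invite_codes=tuple(c.strip() for c in raw_codes.split(",") if c.strip()),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in isolation, for example `load_settings({"XHSX_CONCURRENCY": "2"})`.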
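🧱 Architecture
URL ─► fetcher.py ─► Playwright headless chromium
  └─ window.__INITIAL_STATE__ (JSON)
  └─ Note(title, desc, images, author, tags)
images ─► llm.ocr_images() ─► download (with a Referer header)
  └─ inline as a base64 data URL
  └─ call the vision LLM concurrently
  └─ List[text] (one entry per page)
Note + OCR ─► pipeline._assemble_markdown()
  ├─ heuristic cross-page stitching (last character does not end a sentence + next page does not start a new paragraph → join)
  └─ H1 / author / summary / body / tags / original link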
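The stitching rule in the diagram (previous page ends mid-sentence and the next page does not open a new block, so the two are joined) can be sketched as below. This is not the code in `pipeline._assemble_markdown()`: the names `SENTENCE_END`, `NEW_BLOCK`, and `stitch_pages`, and the exact character sets, are assumptions chosen for illustration.

```python
import re

# Assumption: a page whose last character is one of these ended a sentence.
SENTENCE_END = "。！？!?.…”’）)】》"

# Assumption: a page that starts with a bullet, a numbered item, or a
# Markdown heading opens a new block and must not be glued to the previous page.
NEW_BLOCK = re.compile(r"^\s*(?:[-*•·]\s|\d+[.、)）]|#+\s)")


def stitch_pages(pages: list[str]) -> str:
    """Join per-page OCR texts, gluing sentences that were cut at a page break."""
    merged = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip pages where OCR returned nothing
        if not merged:
            merged = text
            continue
        cut_off = merged[-1] not in SENTENCE_END
        starts_block = bool(NEW_BLOCK.match(text))
        if cut_off and not starts_block:
            merged += text  # glue directly; suits Chinese text with no word spaces
        else:
            merged += "\n\n" + text  # keep the paragraph or list boundary
    return merged
```

For example, `stitch_pages(["今天我们来聊聊如何", "高效学习。", "- 第一点"])` glues the first two pages into one sentence and keeps the list item as its own block. Text in a space-delimited language would need a space inserted at the glue point.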
🧪 Use as a library
import asyncio
from xhsx.pipeline import extract

async def main():
    # extract() is awaited: it fetches the note, runs OCR, and stitches the pages
    result = await extract("https://xhslink.com/xxxxx")
    print(result.merged_article)

asyncio.run(main())
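The return value is an ExtractResult:
- note: the structured note (title, author, list of image URLs, tags, and so on)
- ocr_texts: the raw OCR text of each page
- merged_article: the stitched Markdown
- elapsed_sec: elapsed time in seconds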
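The shape of the result can be pictured with plain dataclasses. The classes below are stand-ins that mirror the field names listed above and the `Note(title, desc, images, author, tags)` fields from the architecture diagram; they are not the real xhsx classes, and the field types are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NoteSketch:
    """Stand-in for the structured note; field types are assumed."""
    title: str
    desc: str
    author: str
    images: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


@dataclass
class ExtractResultSketch:
    """Stand-in mirroring the four documented ExtractResult fields."""
    note: NoteSketch
    ocr_texts: list[str]
    merged_article: str
    elapsed_sec: float


def to_json(result: ExtractResultSketch) -> str:
    # ensure_ascii=False keeps Chinese text readable in the output
    return json.dumps(asdict(result), ensure_ascii=False, indent=2)
```

Serializing this way gives roughly the kind of payload a downstream program could consume; the actual layout of the `--json` output may differ.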
🗺 Roadmap
- Phase 1: CLI / Python library (current)
- Phase 2: HTTP API (FastAPI)
- Phase 3: Web UI (paste a link and go, plus history)
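⚠️ Notes
- Xiaohongshu's page structure changes over time, so the `__INITIAL_STATE__` path in `fetcher.py` may need to be adjusted to match
- Content behind the login wall (some private notes) cannot be fetched
- OCR occasionally misreads characters (a limit of the model); no post-processing is applied, to avoid "fixing character A while breaking character B"
- The default concurrency is 4; higher values are likely to trigger the platform's risk controls
- Please comply with Xiaohongshu's terms of service and the copyright of the target content; use this tool for personal content organization only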
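The concurrency cap mentioned above is commonly implemented with a semaphore. The following is a minimal sketch of that general technique, not the code in `llm.ocr_images()`; `ocr_all` and the `ocr_one` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def ocr_all(
    images: Sequence[str],
    ocr_one: Callable[[str], Awaitable[str]],
    limit: int = 4,
) -> list[str]:
    """Run ocr_one over every image with at most `limit` calls in flight."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def guarded(image: str) -> str:
        async with semaphore:  # waits while `limit` calls are already running
            return await ocr_one(image)

    # gather() returns results in input order, so page order is preserved
    return list(await asyncio.gather(*(guarded(image) for image in images)))
```

Because results come back in input order, the per-page texts can be passed straight to a stitching step regardless of which request finished first.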
📝 License
MIT © Polly (Baoli Wang)