Extract Xiaohongshu (RED) notes as faithful Markdown — fetch + vision OCR + heuristic page stitching, no LLM rewriting.

XHSExtractor

Xiaohongshu (RED) link → Markdown article, reproduced verbatim.
Fetching, image OCR, and heuristic cross-page stitching in a single command.

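✨ Features

  • 🔗 Supports xhslink.com short links and xiaohongshu.com/.../<note_id> long links
  • 🖼 Downloads every image automatically and runs OCR concurrently (calling gpt-4o through an OpenAI-compatible endpoint)
  • 🧷 Verbatim output: no summarizing, rewriting, or polishing; the output is whatever the image says
  • 🪄 Heuristic cross-page stitching: detects sentences cut off at the bottom of a page and joins them automatically, while list and paragraph boundaries keep their line breaks
  • 📝 Outputs structured Markdown (H1 title / author / summary / body / tags / original image gallery)
  • 📦 Can also output JSON for downstream programs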
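
The two accepted link forms can be told apart before any network request is made. The following is a minimal sketch, not the package's own parser: the function name `classify_link` and the regular expression are assumptions made for illustration, and the real note ID format may differ.

```python
from __future__ import annotations

import re
from urllib.parse import urlparse

# Assumption: in a long link the note ID is the last path segment and is
# purely alphanumeric. The pattern xhsx actually uses may differ.
NOTE_ID_RE = re.compile(r"/([0-9a-zA-Z]+)/?$")


def classify_link(url: str) -> tuple[str, str | None]:
    """Return ("short", None), ("long", note_id), or ("unknown", None)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "xhslink.com" or host.endswith(".xhslink.com"):
        # A short link carries no note ID; it has to be resolved by
        # following its redirect first.
        return ("short", None)
    if host == "xiaohongshu.com" or host.endswith(".xiaohongshu.com"):
        match = NOTE_ID_RE.search(parsed.path)
        return ("long", match.group(1)) if match else ("unknown", None)
    return ("unknown", None)
```

Query strings are ignored because only `parsed.path` is searched, so tracking parameters appended to a shared link do not affect the extracted ID.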

🚀 Quick start

pip install xhsx
playwright install chromium          # headless browser, needed on first install

# Point at an OpenAI-compatible endpoint (any provider works)
export COPILOTX_BASE_URL="https://api.openai.com/v1"
export COPILOTX_API_KEY="sk-..."

xhsx "https://xhslink.com/xxxxx"                  # preview in the terminal
xhsx "https://xhslink.com/xxxxx" -o note.md       # write Markdown to a file
xhsx "https://xhslink.com/xxxxx" --json > out.json
xhsx "https://xhslink.com/xxxxx" --headed         # show the browser while debugging

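⚙️ Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| COPILOTX_BASE_URL | https://api.polly.wang/v1 | OpenAI-compatible endpoint |
| COPILOTX_API_KEY | | copilotx API key |
| XHSX_VISION_MODEL | gpt-4o | Vision model used for image OCR |
| XHSX_INVITE_CODES | (empty) | xhsx serve: comma-separated allowlist of invite codes |
| XHSX_INVITE_CODES_FILE | (empty) | xhsx serve: path to an invite-code file, one code per line; edits to the file are hot-reloaded immediately |
| XHSX_DAILY_QUOTA | 50 | xhsx serve: daily quota per invite code |
| XHSX_CONCURRENCY | 4 | OCR concurrency |

You can also place a .env file in the project root; it is loaded automatically on first import.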
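
To make the defaults concrete, here is a rough sketch of how such variables could be read in Python. It covers only a subset of the variables, and `Settings` and `load_settings` are hypothetical names invented for this example; the package's own configuration code may look quite different.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; not a class exported by xhsx."""

    base_url: str
    vision_model: str
    concurrency: int
    daily_quota: int
    invite_codes: tuple[str, ...]


def load_settings(env: Mapping[str, str] | None = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_codes = env.get("XHSX_INVITE_CODES", "")
    # Comma-separated allowlist; blank entries are dropped.
    codes = tuple(c.strip() for c in raw_codes.split(",") if c.strip())
    return Settings(
        base_url=env.get("COPILOTX_BASE_URL", "https://api.polly.wang/v1"),
        vision_model=env.get("XHSX_VISION_MODEL", "gpt-4o"),
        concurrency=int(env.get("XHSX_CONCURRENCY", "4")),
        daily_quota=int(env.get("XHSX_DAILY_QUOTA", "50")),
        invite_codes=codes,
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.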

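🧱 Architecture

URL ─► fetcher.py ─► Playwright headless chromium
                       └─ window.__INITIAL_STATE__ (JSON)
                             └─ Note(title, desc, images, author, tags)

images ─► llm.ocr_images()  ─► download (with a Referer header)
                               └─ inline as a base64 data URL
                                 └─ call the vision LLM concurrently
                                    └─ List[text] (one entry per page)

Note + OCR ─► pipeline._assemble_markdown()
                ├─ heuristic cross-page stitching (last character leaves the sentence open + next page does not start a new paragraph → join)
                └─ H1 / author / summary / body / tags / link to the original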
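
The stitching rule in the last stage can be illustrated with a short sketch. This is not the code inside `pipeline._assemble_markdown()`: the function name `stitch_pages`, the set of sentence-ending characters, and the "new block" pattern are all assumptions chosen to show the idea.

```python
import re

# Assumption: characters that mark a finished sentence or clause.
SENTENCE_END = set("。!?!?.…”\"」』))::;;")

# Assumption: a page that opens like a list item, heading, or quote
# starts a new block and must not be glued to the previous page.
NEW_BLOCK_RE = re.compile(
    r"^\s*(?:[-*•·#>]|\d+[.、)]|[((]\d+[))]|[一二三四五六七八九十]+、)"
)


def stitch_pages(pages: list[str]) -> str:
    """Join per-page OCR text, gluing pages that were cut mid-sentence."""
    merged = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip pages where OCR found nothing
        if not merged:
            merged = text
            continue
        cut_mid_sentence = merged[-1] not in SENTENCE_END
        starts_new_block = bool(NEW_BLOCK_RE.match(text))
        if cut_mid_sentence and not starts_new_block:
            merged += text  # same sentence continues: glue directly
        else:
            merged += "\n\n" + text  # keep the paragraph or list boundary
    return merged
```

Gluing without a separator suits Chinese text; for notes written in a space-delimited language, a variant would need to insert a space at the join.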

🧪 Use as a library

import asyncio
from xhsx.pipeline import extract

async def main():
    result = await extract("https://xhslink.com/xxxxx")
    print(result.merged_article)

asyncio.run(main())

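The return value is an ExtractResult:

  • note: the structured note (title, author, list of image URLs, tags, ...)
  • ocr_texts: the raw OCR text of each page
  • merged_article: the stitched Markdown
  • elapsed_sec: elapsed time in seconds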
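
When several links need processing from library code, it may help to cap how many extractions run at once. The sketch below is one possible approach; `extract_many` is a hypothetical helper, not part of xhsx, and it accepts any coroutine function with the shape `async (url) -> result`, such as the `extract` shown above.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def extract_many(
    urls: Sequence[str],
    extract_one: Callable[[str], Awaitable[T]],
    limit: int = 2,
) -> list[T | BaseException]:
    """Run extract_one over urls with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(url: str) -> T:
        async with sem:  # wait here while `limit` calls are already running
            return await extract_one(url)

    # return_exceptions=True keeps one failed link from cancelling the rest;
    # failures come back as exception objects in the matching position.
    return await asyncio.gather(
        *(guarded(u) for u in urls), return_exceptions=True
    )
```

Assuming `extract` is imported from `xhsx.pipeline`, a call would look like `asyncio.run(extract_many(urls, extract))`. Results keep the order of `urls`, so each entry can be checked with `isinstance(item, BaseException)` before reading `merged_article`.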

🗺 Roadmap

  • Phase 1: CLI / Python library (current)
  • Phase 2: HTTP API (FastAPI)
  • Phase 3: Web UI (paste a link and go, plus history)

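⚠️ Notes

  • Xiaohongshu's page structure changes over time, so the __INITIAL_STATE__ path in fetcher.py may need to be adjusted to match
  • Content behind the login wall (some private notes) cannot be fetched
  • OCR occasionally produces wrong characters (a limit of the model); no post-processing is applied, to avoid "fixing character A while breaking character B"
  • The default concurrency is 4; higher values tend to trigger the platform's anti-abuse controls
  • Please comply with Xiaohongshu's terms of service and the copyright of the target content; use this only for organizing personal content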

📝 License

MIT © Polly (Baoli Wang)
