Extract Xiaohongshu (RED) notes as faithful Markdown — fetch + vision OCR + heuristic page stitching, no LLM rewriting.

Reason this release was yanked:

Contained hardcoded invite codes; superseded by 0.4.24

Project description

XHSExtractor

Xiaohongshu (RED) link → verbatim Markdown article.
Fetch + image OCR + heuristic cross-page stitching, all in one command.

PyPI License: MIT

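✨ Features

  • 🔗 Supports xhslink.com short links and xiaohongshu.com/.../<note_id> long links
  • 🖼 Downloads all images automatically and runs OCR concurrently (calling gpt-4o through an OpenAI-compatible endpoint)
  • 🧷 Verbatim output: no summarizing, no rewriting, no polishing. Whatever is in the image is what you get
  • 🪄 Heuristic cross-page stitching: detects sentences cut off at the bottom of a page and joins them automatically, while keeping line breaks at list and paragraph boundaries
  • 📝 Outputs structured Markdown (H1 title / author / summary / body / tags / original image gallery)
  • 📦 Can also output JSON for downstream programs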
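
To illustrate the two link forms listed above, here is a minimal sketch of how an input URL could be classified. It is not taken from the xhsx source: the function name `classify_xhs_url` is hypothetical, and treating the last path segment as the note id is an assumption based on the `xiaohongshu.com/.../<note_id>` pattern.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def classify_xhs_url(url: str) -> Tuple[str, Optional[str]]:
    """Return ("short", None) for xhslink.com links or ("long", note_id) for xiaohongshu.com links."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "xhslink.com" or host.endswith(".xhslink.com"):
        # A short link carries no note id; it has to be resolved by following redirects.
        return "short", None
    if host == "xiaohongshu.com" or host.endswith(".xiaohongshu.com"):
        # Assumption: the note id is the last non-empty path segment.
        segments = [s for s in parsed.path.split("/") if s]
        if segments:
            return "long", segments[-1]
    raise ValueError(f"not a recognised Xiaohongshu URL: {url!r}")
```

A short link would still need a network round trip (for example in the headless browser) before the note id is known, which this sketch deliberately leaves out.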

🚀 Quick start

pip install xhsx
playwright install chromium          # first run only: installs the headless browser

# Configure an OpenAI-compatible endpoint (any one will do)
export COPILOTX_BASE_URL="https://api.openai.com/v1"
export COPILOTX_API_KEY="sk-..."

xhsx "https://xhslink.com/xxxxx"                  # preview in the terminal
xhsx "https://xhslink.com/xxxxx" -o note.md       # write Markdown to a file
xhsx "https://xhslink.com/xxxxx" --json > out.json
xhsx "https://xhslink.com/xxxxx" --headed         # show the browser while debugging

⚙️ Environment variables

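| Variable | Default | Description |
| --- | --- | --- |
| COPILOTX_BASE_URL | https://api.polly.wang/v1 | OpenAI-compatible endpoint |
| COPILOTX_API_KEY | copilotx | API key |
| XHSX_VISION_MODEL | gpt-4o | Model used for image OCR |
| XHSX_INVITE_CODES | (empty) | xhsx serve: comma-separated allowlist of invite codes |
| XHSX_INVITE_CODES_FILE | (empty) | xhsx serve: path to an invite-code file, one code per line; edits to the file are hot-reloaded immediately |
| XHSX_DAILY_QUOTA | 50 | xhsx serve: daily quota per invite code |
| XHSX_CONCURRENCY | 4 | Number of concurrent OCR requests |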

You can also place a .env file in the project root; it is loaded automatically on first import.
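
As an illustration of how the variables in the table above could be read with their documented defaults, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical names, not the package's actual configuration code, and `copilotx` as the API key default is an assumption read from the table row.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    base_url: str
    api_key: str
    vision_model: str
    concurrency: int
    daily_quota: int
    invite_codes: FrozenSet[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    codes = env.get("XHSX_INVITE_CODES", "")
    return Settings(
        base_url=env.get("COPILOTX_BASE_URL", "https://api.polly.wang/v1"),
        # Assumption: "copilotx" is the default value for the API key.
        api_key=env.get("COPILOTX_API_KEY", "copilotx"),
        vision_model=env.get("XHSX_VISION_MODEL", "gpt-4o"),
        concurrency=int(env.get("XHSX_CONCURRENCY", "4")),
        daily_quota=int(env.get("XHSX_DAILY_QUOTA", "50")),
        # Comma-separated allowlist; blank entries are dropped.
        invite_codes=frozenset(c.strip() for c in codes.split(",") if c.strip()),
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests. Reading invite codes from the environment or a file, rather than embedding them in the package, is also what avoids shipping secrets inside a release.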

🧱 Architecture

URL ─► fetcher.py ─► Playwright headless chromium
                       └─ window.__INITIAL_STATE__ (JSON)
                             └─ Note(title, desc, images, author, tags)

images ─► llm.ocr_images()  ─► download (with Referer header)
                               └─ inline as base64 data URL
                                 └─ concurrent calls to the vision LLM
                                    └─ List[text] (one segment per page)

Note + OCR ─► pipeline._assemble_markdown()
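                ├─ heuristic cross-page stitching (last character does not end a sentence + next page does not start a new paragraph → glue together)
                └─ H1 / author / summary / body / tags / original link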
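
The stitching rule in the diagram above (previous page does not end a sentence, next page does not open a new paragraph) can be sketched roughly as follows. This is an illustration only, not the body of `pipeline._assemble_markdown()`: the function name `stitch_pages`, the terminator set, and the new-block pattern are all assumptions.

```python
import re
from typing import List

# Assumption: characters that mark a finished sentence at the bottom of a page.
_TERMINATORS = frozenset("。!?…;:.!?;:”」』")
# Assumption: a page that opens with a bullet, heading, quote, or numbered item starts a new block.
_NEW_BLOCK = re.compile(r"^\s*(?:[-*•#>]|\d+[.、)])")


def stitch_pages(pages: List[str]) -> str:
    """Join per-page OCR text, gluing pages whose boundary falls mid-sentence."""
    merged = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip pages with no recognised text
        if not merged:
            merged = text
        elif merged[-1] not in _TERMINATORS and not _NEW_BLOCK.match(text):
            # Sentence was cut off at the page bottom: glue without a separator
            # (suits Chinese text; space-delimited languages would need a space).
            merged += text
        else:
            # Paragraph or list boundary: keep the break.
            merged += "\n\n" + text
    return merged
```

For example, `stitch_pages(["今天我们去了公", "园散步。", "1. 第一点。"])` glues the first two pages and keeps the numbered item on its own paragraph. A heuristic like this will occasionally glue or split incorrectly; it trades that risk for never rewriting the recognised text.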

🧪 Use as a library

import asyncio
from xhsx.pipeline import extract

async def main():
    result = await extract("https://xhslink.com/xxxxx")
    print(result.merged_article)

asyncio.run(main())

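The return value is an ExtractResult:

  • note: the structured note (title, author, list of image URLs, tags, ...)
  • ocr_texts: the raw OCR text for each page
  • merged_article: the stitched Markdown
  • elapsed_sec: elapsed time in seconds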
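
As a usage illustration for the fields listed above, the following sketch writes the stitched article and the per-page OCR text to disk. `save_result` is a hypothetical helper, not part of xhsx; it relies only on the `merged_article` and `ocr_texts` attributes named above and assumes both hold plain strings.

```python
from pathlib import Path


def save_result(result, out_dir) -> Path:
    """Write merged_article to note.md and each OCR page to pages/page_NN.txt."""
    out = Path(out_dir)
    pages_dir = out / "pages"
    pages_dir.mkdir(parents=True, exist_ok=True)

    article = out / "note.md"
    article.write_text(result.merged_article, encoding="utf-8")

    # Keeping the raw per-page text makes it easy to compare against the stitched output.
    for index, text in enumerate(result.ocr_texts, start=1):
        (pages_dir / f"page_{index:02d}.txt").write_text(text, encoding="utf-8")
    return article
```

Called as `save_result(result, "out")` after `extract(...)` returns, this leaves the stitched Markdown next to the unmodified page texts, which helps when checking a suspicious page join.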

🗺 Roadmap

  • Phase 1: CLI / Python library (current)
  • Phase 2: HTTP API (FastAPI)
  • Phase 3: Web UI (paste a link and go, plus history)

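⚠️ Notes

  • Xiaohongshu's page structure changes over time, so the __INITIAL_STATE__ path used in fetcher.py may need to be adjusted to follow it
  • Content behind the login wall (some private notes) cannot be fetched
  • OCR occasionally produces wrong characters (a limit of the model); no post-processing is applied, to avoid "fixing character A while breaking character B"
  • The default concurrency is 4; setting it too high can easily trigger anti-abuse controls
  • Please respect Xiaohongshu's terms of service and the copyright of the target content; use this only for organizing personal content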
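
The concurrency limit mentioned above can be pictured as a bounded fan-out. Here is a minimal sketch using `asyncio.Semaphore`; `map_bounded` is a hypothetical helper and not necessarily how `llm.ocr_images()` is written.

```python
import asyncio


async def map_bounded(func, items, limit=4):
    """Await func(item) for every item with at most `limit` calls in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await func(item)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(item) for item in items))
```

With `limit=4` this matches the documented default of `XHSX_CONCURRENCY`; lowering it slows extraction down but sends fewer simultaneous requests.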

📝 License

MIT © Polly (Baoli Wang)

Download files

Source Distribution

xhsx-0.4.14.tar.gz (26.1 kB)

Uploaded Source

Built Distribution

xhsx-0.4.14-py3-none-any.whl (29.8 kB)

Uploaded Python 3

File details

Details for the file xhsx-0.4.14.tar.gz.

File metadata

  • Download URL: xhsx-0.4.14.tar.gz
  • Upload date:
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for xhsx-0.4.14.tar.gz
Algorithm Hash digest
SHA256 b818de26892f4c037f93d47b1e188064ab9d797d7b09d5d9a605f645a408a21e
MD5 eaf1615ec4298f7b50065e6a6c054901
BLAKE2b-256 70883e9de9a29058b2e7a3122f9b0db7941dd3dc8cb475330591547a2167b7fb

File details

Details for the file xhsx-0.4.14-py3-none-any.whl.

File metadata

  • Download URL: xhsx-0.4.14-py3-none-any.whl
  • Upload date:
  • Size: 29.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for xhsx-0.4.14-py3-none-any.whl
Algorithm Hash digest
SHA256 08a79ce611fa81456af7172fbd607f466be34f337796896a9aaeb85c1c2fc301
MD5 493ad51d40d046839d13bd14b13d3f5b
BLAKE2b-256 05990b14049a779eb7f605cc99ef91be267574db56b8f1815617845ab06728e4
