
Extract Xiaohongshu (RED) notes as faithful Markdown — fetch + vision OCR + heuristic page stitching, no LLM rewriting.

Reason this release was yanked:

Contained hardcoded invite codes; superseded by 0.4.24

Project description

XHSExtractor

Xiaohongshu (RED) link → a verbatim Markdown article.
Fetch + image OCR + heuristic cross-page stitching, all in one command.

PyPI License: MIT

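✨ Features

  • 🔗 Supports xhslink.com short links and xiaohongshu.com/.../<note_id> long links
  • 🖼 Downloads all images automatically and runs OCR concurrently (calling gpt-4o through an OpenAI-compatible endpoint)
  • 🧷 Verbatim reproduction: no summarizing, rewriting, or polishing; the output is exactly what the images show
  • 🪄 Heuristic cross-page stitching: detects sentences cut off at the bottom of a page and joins them automatically, while keeping line breaks at list and paragraph boundaries
  • 📝 Outputs structured Markdown (H1 title / author / summary / body / tags / gallery of original images)
  • 📦 Can also output JSON for downstream programs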
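
As an illustration of the two link forms listed above, here is a minimal sketch of how a caller might tell them apart before handing a URL to the tool. The function name `classify_link` is hypothetical and not part of xhsx, and treating the last path segment as the note id is an assumption based only on the `xiaohongshu.com/.../<note_id>` pattern.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def classify_link(url: str) -> Tuple[str, Optional[str]]:
    """Return (kind, note_id); kind is 'short', 'long', or 'unknown'."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "xhslink.com" or host.endswith(".xhslink.com"):
        # A short link carries no note id in its path.
        return "short", None
    if host == "xiaohongshu.com" or host.endswith(".xiaohongshu.com"):
        segments = [s for s in parsed.path.split("/") if s]
        # Assumption: the note id is the last path segment.
        return "long", (segments[-1] if segments else None)
    return "unknown", None
```

The query string is ignored, so tracking parameters appended to a shared link do not affect the extracted id.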

🚀 Quick start

pip install xhsx
playwright install chromium          # first run only: install the headless browser

# Configure an OpenAI-compatible endpoint (any one will do)
export COPILOTX_BASE_URL="https://api.openai.com/v1"
export COPILOTX_API_KEY="sk-..."

xhsx "https://xhslink.com/xxxxx"                  # preview in the terminal
xhsx "https://xhslink.com/xxxxx" -o note.md       # write Markdown to a file
xhsx "https://xhslink.com/xxxxx" --json > out.json
xhsx "https://xhslink.com/xxxxx" --headed         # show the browser window for debugging

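⚙️ Environment variables

Variable            Default                     Description
COPILOTX_BASE_URL   https://api.polly.wang/v1   OpenAI-compatible endpoint
COPILOTX_API_KEY    (none)                      copilotx API key
XHSX_VISION_MODEL   gpt-4o                      Image OCR model

You can also put a .env file in the project root; it is loaded automatically on first import.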
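
To make the defaults concrete, the following sketch resolves the three variables the way the table describes. It is not the package's actual configuration code: `Settings` and `load_settings` are hypothetical names, and this version reads only the process environment, not a .env file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    base_url: str
    api_key: Optional[str]
    vision_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve settings from env, falling back to the documented defaults."""
    return Settings(
        base_url=env.get("COPILOTX_BASE_URL", "https://api.polly.wang/v1"),
        api_key=env.get("COPILOTX_API_KEY"),  # no default: None when unset
        vision_model=env.get("XHSX_VISION_MODEL", "gpt-4o"),
    )
```

Passing the environment as a mapping keeps the function easy to test with a plain dict.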

🧱 Architecture

URL ─► fetcher.py ─► Playwright headless chromium
                       └─ window.__INITIAL_STATE__ (JSON)
                             └─ Note(title, desc, images, author, tags)

images ─► llm.ocr_images()  ─► download (with Referer header)
                               └─ base64-inline as a data URL
                                 └─ call the vision LLM concurrently
                                    └─ List[text] (one entry per page)

Note + OCR ─► pipeline._assemble_markdown()
                ├─ heuristic cross-page stitching (page ends mid-sentence + next page does not start a new paragraph → join)
                └─ H1 / author / summary / body / tags / original link
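
The stitching rule in the last diagram can be sketched roughly as follows. This is an illustration of the idea, not the code in `pipeline._assemble_markdown()`: `stitch_pages`, the set of sentence-ending characters, and the list/heading prefixes are all assumptions chosen for the example.

```python
from typing import List

# Assumption: characters that mark a finished sentence (CJK and ASCII).
SENTENCE_END = set("。!?…」』”’)】.!?\"')")
# Assumption: prefixes that signal a new list item, heading, or quote.
NEW_BLOCK_PREFIXES = ("-", "*", "•", "#", ">")


def _starts_new_block(text: str) -> bool:
    head = text.lstrip()
    if head.startswith(NEW_BLOCK_PREFIXES):
        return True
    # Numbered items such as "1." or "2、"
    digits = len(head) - len(head.lstrip("0123456789"))
    return digits > 0 and head[digits:digits + 1] in (".", "、", ")")


def stitch_pages(pages: List[str]) -> str:
    """Join per-page OCR text, gluing sentences cut at a page bottom."""
    merged = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue
        if not merged:
            merged = page
        elif merged[-1] not in SENTENCE_END and not _starts_new_block(page):
            merged += page            # unfinished sentence: glue directly
        else:
            merged += "\n\n" + page   # keep the paragraph or list boundary
    return merged
```

Pages are glued without a separator, which suits Chinese text; text in a space-delimited language would need a space inserted at the join.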

🧪 Use as a library

import asyncio
from xhsx.pipeline import extract

async def main():
    result = await extract("https://xhslink.com/xxxxx")
    print(result.merged_article)

asyncio.run(main())

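The return value is an ExtractResult:

  • note: the structured note (title, author, list of image URLs, tags, ...)
  • ocr_texts: the raw OCR text of each page
  • merged_article: the stitched Markdown
  • elapsed_sec: elapsed time in seconds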
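
As a small usage sketch, the helper below builds a one-line report from a result object. It relies only on the field names listed above; the assumed types (a list of strings, a string, and a float) and the name `summarize` are not confirmed by the package, and a `SimpleNamespace` stands in for a real `ExtractResult`.

```python
import json
from types import SimpleNamespace


def summarize(result) -> str:
    """Return a compact JSON report for a result-like object."""
    report = {
        "pages": len(result.ocr_texts),
        "article_chars": len(result.merged_article),
        "elapsed_sec": round(result.elapsed_sec, 2),
    }
    return json.dumps(report, ensure_ascii=False)


# Stand-in object with the documented field names (not a real ExtractResult).
fake_result = SimpleNamespace(
    ocr_texts=["page one", "page two"],
    merged_article="page one\n\npage two",
    elapsed_sec=3.14159,
)
print(summarize(fake_result))
```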

🗺 Roadmap

  • Phase 1: CLI / Python library (current)
  • Phase 2: HTTP API (FastAPI)
  • Phase 3: Web UI (paste a link and go, plus history)

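⚠️ Notes

  • Xiaohongshu's page structure changes over time, so the __INITIAL_STATE__ path in fetcher.py may need to be adjusted to match
  • Content behind the login wall (some private notes) cannot be fetched
  • OCR occasionally produces wrong characters (a limit of the model); no post-processing is applied, to avoid fixing one character while breaking another
  • Default concurrency is 4; higher values tend to trigger anti-abuse controls
  • Please comply with Xiaohongshu's terms of service and the copyright of the target content; use this only for organizing personal content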
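
The concurrency cap mentioned above, combined with the data-URL step from the architecture, could look something like this sketch. It is not the body of `llm.ocr_images()`: `to_data_url`, `ocr_all`, and the injected `ocr_one` callable are hypothetical, and the JPEG MIME type is an assumption.

```python
import asyncio
import base64
from typing import Awaitable, Callable, List


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Inline raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


async def ocr_all(
    images: List[bytes],
    ocr_one: Callable[[str], Awaitable[str]],
    concurrency: int = 4,
) -> List[str]:
    """Run ocr_one on every image with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(image: bytes) -> str:
        async with semaphore:
            return await ocr_one(to_data_url(image))

    # gather preserves input order, so result i belongs to page i.
    return await asyncio.gather(*(worker(img) for img in images))
```

Here `ocr_one` is whatever coroutine sends a single data URL to the vision model; keeping it as a parameter lets the limiter be exercised without network access.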

📝 License

MIT © Polly (Baoli Wang)


Download files


Source Distribution

xhsx-0.3.0.tar.gz (19.8 kB view details)

Uploaded Source

Built Distribution


xhsx-0.3.0-py3-none-any.whl (23.5 kB view details)

Uploaded Python 3

File details

Details for the file xhsx-0.3.0.tar.gz.

File metadata

  • Download URL: xhsx-0.3.0.tar.gz
  • Upload date:
  • Size: 19.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for xhsx-0.3.0.tar.gz
Algorithm     Hash digest
SHA256        5151d8047e39bf742518b7b137ade9675a130176d5825e94bd7e1f52fb9d9abd
MD5           1336bd887859ac12ebbde3ccb98615b4
BLAKE2b-256   11d62cf4c709d8f53e560e42f97de30fa2d3a7028cf2303dd4a3e3568695e5ba


File details

Details for the file xhsx-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: xhsx-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 23.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for xhsx-0.3.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        6b0730ab31e3eedb0ca6b1eaba470698764748fc825421d213dbd514b6fb9016
MD5           4a3063dd03673aaab57567e190877a50
BLAKE2b-256   8c2143ef0efd9d30449ee707b46ae3872a1044d98aeadd5d1e5beed1abb1b23b
