Extract Xiaohongshu (RED) notes as faithful Markdown — fetch + vision OCR + heuristic page stitching, no LLM rewriting.

Reason this release was yanked:

Contained hardcoded invite codes; superseded by 0.4.24

Project description

XHSExtractor

Xiaohongshu (RED) link → a Markdown article reproduced word for word.
Fetching + image OCR + heuristic cross-page stitching, all in one command.

PyPI License: MIT

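✨ Features

  • 🔗 Supports xhslink.com short links and xiaohongshu.com/.../<note_id> long links
  • 🖼 Downloads all images automatically and runs OCR concurrently (calls gpt-4o through an OpenAI-compatible endpoint)
  • 🧷 Verbatim reproduction: no summarizing, no rewriting, no polishing. Whatever is in the image is what comes out
  • 🪄 Heuristic cross-page stitching: detects sentences cut off at the bottom of a page and joins them automatically, while keeping line breaks at list and paragraph boundaries
  • 📝 Structured Markdown output (H1 title / author / summary / body / tags / original image gallery)
  • 📦 Can also output JSON for downstream programs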
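
The two link shapes named in the first feature can be told apart with the standard library alone. The following is a minimal sketch, not the package's own code: `classify_link` is a hypothetical helper, and treating the last path segment as the note id is an assumption based on the `xiaohongshu.com/.../<note_id>` pattern above.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def classify_link(url: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: classify a URL as a short link or a long link.

    Returns ("short", None) for xhslink.com, ("long", note_id) for
    xiaohongshu.com, and ("unknown", None) for anything else.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "xhslink.com" or host.endswith(".xhslink.com"):
        # Short links carry no note id; they have to be resolved by a redirect.
        return "short", None

    if host == "xiaohongshu.com" or host.endswith(".xiaohongshu.com"):
        # Assumption: the note id is the last non-empty path segment.
        segments = [s for s in parsed.path.split("/") if s]
        return "long", (segments[-1] if segments else None)

    return "unknown", None
```

A short link still needs one network round trip before the note id is known, which is why the sketch returns `None` for it instead of guessing.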

🚀 Quick start

pip install xhsx
playwright install chromium          # first run only: install the headless browser

# Set an OpenAI-compatible endpoint (any one will do)
export COPILOTX_BASE_URL="https://api.openai.com/v1"
export COPILOTX_API_KEY="sk-..."

xhsx "https://xhslink.com/xxxxx"                  # preview in the terminal
xhsx "https://xhslink.com/xxxxx" -o note.md       # write Markdown to a file
xhsx "https://xhslink.com/xxxxx" --json > out.json
xhsx "https://xhslink.com/xxxxx" --headed         # show the browser while debugging

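⚙️ Environment variables

Variable            Default                     Description
COPILOTX_BASE_URL   https://api.polly.wang/v1   OpenAI-compatible endpoint
COPILOTX_API_KEY    (none)                      copilotx API key
XHSX_VISION_MODEL   gpt-4o                      Vision model used for image OCR

You can also put a .env file in the project root; it is loaded automatically on first import.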
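
To make the defaults above concrete, here is a minimal sketch of how the three variables could be resolved. `Settings` and `load_settings` are hypothetical names used only for illustration; the package's real configuration code may differ, and .env loading is not covered here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    base_url: str
    api_key: Optional[str]
    vision_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Defaults are the ones listed in the environment variable table.
    return Settings(
        base_url=env.get("COPILOTX_BASE_URL", "https://api.polly.wang/v1"),
        api_key=env.get("COPILOTX_API_KEY"),  # no documented default
        vision_model=env.get("XHSX_VISION_MODEL", "gpt-4o"),
    )
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.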

🧱 Architecture

URL ─► fetcher.py ─► Playwright headless chromium
                       └─ window.__INITIAL_STATE__ (JSON)
                             └─ Note(title, desc, images, author, tags)

images ─► llm.ocr_images()  ─► download (with Referer header)
                               └─ inline as base64 data URL
                                 └─ concurrent calls to the vision LLM
                                    └─ List[text] (one entry per page)

Note + OCR ─► pipeline._assemble_markdown()
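                ├─ heuristic cross-page stitching (last character does not end a sentence + next page does not start a new paragraph → join)
                └─ H1 / author / summary / body / tags / source link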
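
The stitching rule in the last diagram can be sketched in a few lines. This is an illustration of the stated rule only, not the code in `pipeline._assemble_markdown()`: the set of sentence-ending characters, the list and heading markers, and the names `stitch_pages` and `starts_new_block` are all assumptions.

```python
import re
from typing import List

# Assumption: characters that count as "the sentence is finished".
SENTENCE_END = set("。！？!?.…”」』）)：:；;")
# Assumption: prefixes that mark a new list item, heading, or quote.
NEW_BLOCK_PREFIXES = ("-", "*", "•", "#", ">")
NUMBERED_ITEM = re.compile(r"\d+[.、)]")


def starts_new_block(text: str) -> bool:
    """True if the page opens with a list item, heading, or numbered item."""
    stripped = text.lstrip()
    if not stripped:
        return True
    return stripped.startswith(NEW_BLOCK_PREFIXES) or bool(NUMBERED_ITEM.match(stripped))


def stitch_pages(pages: List[str]) -> str:
    """Join per-page OCR text, gluing pages where a sentence was cut off."""
    merged = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue
        if not merged:
            merged = page
        elif merged[-1] not in SENTENCE_END and not starts_new_block(page):
            # Unfinished sentence and no new block: glue without a separator
            # (suits Chinese text; space-separated languages would need a space).
            merged += page
        else:
            # Otherwise keep a paragraph break between the pages.
            merged += "\n\n" + page
    return merged
```

For example, `stitch_pages(["今天我们来聊聊如何", "提高效率。", "- 第一点"])` glues the first two pages into one sentence and keeps the list item on its own paragraph.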

🧪 Use as a library

import asyncio
from xhsx.pipeline import extract

async def main():
    result = await extract("https://xhslink.com/xxxxx")
    print(result.merged_article)

asyncio.run(main())

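The return value is an ExtractResult:

  • note: the structured note (title, author, list of image URLs, tags, ...)
  • ocr_texts: the raw OCR text of each page
  • merged_article: the stitched Markdown
  • elapsed_sec: elapsed time in seconds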
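
The shape described by these fields can be pictured roughly as follows. This is a hedged sketch for orientation only: `NoteSketch` and `ExtractResultSketch` are stand-in classes, the field types are assumptions, and the real classes in the package may carry more fields or different types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NoteSketch:
    """Stand-in for Note(title, desc, images, author, tags); types assumed."""
    title: str = ""
    desc: str = ""
    images: List[str] = field(default_factory=list)
    author: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class ExtractResultSketch:
    """Stand-in for ExtractResult; mirrors the four documented fields."""
    note: NoteSketch
    ocr_texts: List[str]
    merged_article: str
    elapsed_sec: float


def page_report(result) -> List[str]:
    """One line per page with the length of its OCR text."""
    return [
        f"page {i}: {len(text)} chars"
        for i, text in enumerate(result.ocr_texts, start=1)
    ]
```

A helper like `page_report` only reads `ocr_texts`, so it should work on any object exposing that attribute; an unusually short page is a quick hint that OCR missed something.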

🗺 Roadmap

  • Phase 1: CLI / Python library (current)
  • Phase 2: HTTP API (FastAPI)
  • Phase 3: Web UI (paste a link and go, plus history)

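⚠️ Notes

  • Xiaohongshu's page structure changes over time, so the __INITIAL_STATE__ path in fetcher.py may need to be adjusted accordingly
  • Content behind the login wall (some private notes) cannot be fetched
  • OCR occasionally produces wrong characters (a limit of the model); no post-processing is applied, to avoid fixing character A while breaking character B
  • Default concurrency is 4; higher values tend to trigger anti-abuse controls
  • Please comply with Xiaohongshu's terms of service and the copyright of the target content; use this only for organizing personal content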

📝 License

MIT © Polly (Baoli Wang)

Download files

Source Distribution

xhsx-0.2.0.tar.gz (10.4 kB view details)

Uploaded Source

Built Distribution

xhsx-0.2.0-py3-none-any.whl (12.6 kB view details)

Uploaded Python 3

File details

Details for the file xhsx-0.2.0.tar.gz.

File metadata

  • Download URL: xhsx-0.2.0.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for xhsx-0.2.0.tar.gz
Algorithm Hash digest
SHA256 343b86e3d66d639919f934bba3147c5835789b2b403b6aeb9fccf8bb292d6684
MD5 57af8a225dd761890ac72d2454ec3dcb
BLAKE2b-256 f9875b00078d92209ae49bb6bd425f716099dadd93fbb67f504c3e18167af572

File details

Details for the file xhsx-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: xhsx-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 12.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for xhsx-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 66de50c6b0c504b0d9d7e9ff2142bc04b580db5406272a7d1b615b3f847a2c7a
MD5 860768ee02cc76f8f4eb042992ae908f
BLAKE2b-256 ecfdcc073d9b703db12de911750ff470da70f8a8c079cdb6da5fad334730034c
