A commercial-grade MCP Server built on FastMCP, offering robust capabilities to read, extract, and localize (into Markdown) content from web pages and PDFs with both text and images. It is purpose-built for long-term deployment in enterprise environments.

Maintainers

ThreeFish-AI


Project description

Negentropy Perceives

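Commercial-grade MCP Server: give your AI agent a pair of eyes that can read web pages and PDFs, and that can stay hidden while doing so.

14 specialized tools · 5-engine PDF processing · anti-detection scraping · LLM-driven orchestration

✨ Why Negentropy Perceives?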

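🧠 Smart mode: LLM-orchestrated parallel multi-engine processing. It analyzes document features, dispatches Docling / PyMuPDF in parallel, then selects and merges the best output. Academic papers, financial reports, and technical manuals are all handled with a single `method="smart"`.
🥷 Anti-detection scraping: Selenium + Playwright dual-engine stealth, with random User-Agent rotation, browser fingerprint hiding, and human behavior simulation (mouse trajectories, scroll delays). Bypasses mainstream anti-bot systems such as Cloudflare and reCAPTCHA.
⚡ Five-engine fallback: the Docling → MinerU → Marker → PyMuPDF → PyPDF automatic fallback chain is designed for zero downtime. Engines that are not installed are skipped automatically, so a minimal dependency set is enough to run. GPU acceleration (CUDA / MPS / XPU) is optional.

📖 More enterprise features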

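🔒 Compliance first: the built-in check_robots_txt tool checks crawler rules before scraping
🚀 Concurrent batch processing: scrape_multiple_webpages / batch_convert_pdfs_to_markdown run concurrently via asyncio
📊 Observability: built-in request counting, execution timing, and error classification (get_server_metrics)
🔄 Resilience: three layers of protection (exponential-backoff retries, rate limiting, in-memory caching)
🎯 Structured extraction: CSS selector mapping + 6 data-type templates (including contact / social / content / products / addresses)
🖼️ Deep content extraction: table recognition, LaTeX formula preservation, base64 image embedding
⚙️ Four-layer YAML configuration: built-in defaults → user YAML → environment variables → explicit -c (highest), with a clear priority order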
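
The four configuration layers can be pictured as a simple override chain. The sketch below is only an illustration under stated assumptions: the key names (`transport`, `port`, `cache_enabled`) and the `resolve_config` helper are hypothetical and not taken from the project, and the merge is shallow (nested sections are not merged).

```python
from collections import ChainMap

# Hypothetical settings; these key names are illustrative, not the project's schema.
BUILTIN_DEFAULTS = {"transport": "stdio", "port": 8000, "cache_enabled": True}


def resolve_config(user_yaml: dict, env_vars: dict, explicit: dict) -> dict:
    """Merge four layers so that higher-priority layers override lower ones."""
    # ChainMap looks keys up left to right, so the highest-priority layer goes first.
    return dict(ChainMap(explicit, env_vars, user_yaml, BUILTIN_DEFAULTS))


config = resolve_config(
    user_yaml={"transport": "http", "port": 9000},  # values from a user YAML file
    env_vars={"port": 9100},                        # values from environment variables
    explicit={"port": 9200},                        # values from an explicit -c file
)
print(config)
```

Each key takes its value from the highest-priority layer that defines it, which is why `port` comes from the explicit layer while `cache_enabled` falls through to the defaults.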

🚀 Quick start

Installation

uv add negentropy-perceives

Requires the uv package manager and Python >= 3.13.

Hello World

import asyncio
from negentropy.perceives.sdk import NegentropyPerceivesClient

async def main():
    async with NegentropyPerceivesClient() as client:
        markdown = await client.convert_webpage_to_markdown("https://example.com")

asyncio.run(main())  # `async with` must run inside a coroutine

Start the MCP Server

negentropy-perceives   # STDIO mode by default; switch to HTTP / SSE via environment variables

⌨️ More examples: PDF conversion · CSS selector extraction · anti-detection scraping

PDF to Markdown

async with NegentropyPerceivesClient() as client:
    result = await client.call_tool("convert_pdf_to_markdown", {
        "pdf_source": "report.pdf",
        "method": "smart",           # auto / pymupdf / pypdf / docling / smart
        "page_range": "1-10",
    })

Precise extraction with CSS selectors

async with NegentropyPerceivesClient() as client:
    result = await client.scrape_webpage(
        url="https://shop.example.com/product/123",
        extract_config={
            "title":  {"selector": "h1",              "attr": "text"},
            "price":  {"selector": ".price",          "attr": "text"},
            "images": {"selector": ".gallery img",    "attr": "src", "multiple": True},
        },
    )

Anti-detection scraping

async with NegentropyPerceivesClient() as client:
    result = await client.call_tool("scrape_with_stealth", {
        "url": "https://protected-site.com",
        "method": "selenium",         # selenium / playwright
        "scroll_page": True,
    })

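See the user guide for the full API reference and advanced usage.

🛠️ Tool overview (14 specialized MCP tools)

🕷️ Web scraping (10 tools)

Tool	In short	Core capability
`scrape_webpage`	Single-page scraping	auto / simple / selenium methods, selected automatically
`scrape_multiple_webpages`	Concurrent batch	Processes a URL list concurrently with asyncio.gather
`scrape_with_stealth`	Anti-detection stealth	Selenium / Playwright + UA rotation + behavior simulation
`fill_and_submit_form`	Form automation	Automatic fill-in and submission; supports all form elements
`extract_links`	Link extraction	Domain filtering; internal / external link classification
`extract_structured_data`	Structured data	contact / social / content / products / addresses
`get_page_info`	Page reconnaissance	Title, status code, and metadata in one call
`check_robots_txt`	Compliance check	robots.txt parsing + crawl permission check
`convert_webpage_to_markdown`	Page → MD	Main-content extraction + formatting options + image embedding
`batch_convert_webpages_to_markdown`	Batch to MD	Concurrent conversion of multiple URLs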
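
The batch tools are described as running concurrently with asyncio.gather. A minimal sketch of that pattern follows, assuming a stand-in `fake_fetch` coroutine instead of a real HTTP request; the `scrape_many` helper, the semaphore cap, and the result shape are all hypothetical and do not reflect the project's internals.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real page fetch; fails for URLs containing 'bad'."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"


async def scrape_many(urls: list, max_concurrency: int = 5) -> list:
    """Fetch all URLs concurrently, capping in-flight requests with a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # return_exceptions=True hands failures back as values instead of raising
    # the first one, so every URL gets an entry in the result list.
    outcomes = await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
    return [
        {"url": u, "ok": not isinstance(o, Exception), "result": str(o)}
        for u, o in zip(urls, outcomes)
    ]


results = asyncio.run(scrape_many(["https://a.example", "https://bad.example"]))
```

Because asyncio.gather preserves input order, each result lines up with the URL that produced it.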

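📄 PDF processing (2 tools)

Tool	In short	Core capability
`convert_pdf_to_markdown`	PDF → MD	5-engine fallback chain + image / table / formula extraction + Smart mode
`batch_convert_pdfs_to_markdown`	Batch PDF	Concurrent multi-document conversion + summary statistics

🔧 PDF engine fallback chain details

Docling (MIT, best overall quality)
  └─→ MinerU (Apache 2.0, best LaTeX formulas)
       └─→ Marker (GPL-3.0, highest accuracy 95.67%)
            └─→ PyMuPDF (fast plain text)
                 └─→ PyPDF (basic fallback)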

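Every engine is an optional dependency: an engine that is not installed is skipped automatically, so the tool still runs with a minimal dependency set.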
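
The skip-if-missing behavior of the fallback chain can be sketched as a loop over engines in priority order. This is an illustration under stated assumptions, not the project's implementation: `convert_with_fallback`, the module names, and the converter callables are hypothetical, and the converters here are stubs rather than real engine calls.

```python
from importlib.util import find_spec


def available(module_name: str) -> bool:
    """True when a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


def convert_with_fallback(pdf_path, engines, is_installed=available):
    """Try engines in priority order; skip missing ones, fall through on errors."""
    errors = {}
    for name, module_name, convert in engines:
        if not is_installed(module_name):
            continue  # optional dependency not installed: skip silently
        try:
            return name, convert(pdf_path)
        except Exception as exc:
            errors[name] = exc  # remember the failure and try the next engine
    raise RuntimeError(f"all engines failed or are missing: {errors}")


def broken(path):
    raise ValueError("layout model unavailable")


# Stub converters standing in for real engines (assumed module names).
engines = [
    ("docling", "docling", broken),                   # installed, but fails
    ("mineru", "mineru", lambda p: "unused"),         # treated as not installed
    ("pypdf", "pypdf", lambda p: f"# text of {p}"),   # basic fallback succeeds
]
installed = {"docling", "pypdf"}
engine, markdown = convert_with_fallback(
    "report.pdf", engines, is_installed=installed.__contains__
)
```

Passing `is_installed` explicitly keeps the demo deterministic; the default `available` check relies on `importlib.util.find_spec`.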

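Smart mode (method="smart"): three-stage LLM orchestration that analyzes document features → dispatches multiple engines in parallel → selects and merges the best output. Requires litellm to be installed and an API key to be configured.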
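
The three stages can be illustrated with a toy version that swaps the LLM for a simple heuristic. Everything in this sketch is an assumption made for illustration: `run_engine` returns canned Markdown instead of calling an engine, `score` is a stand-in for LLM judgment, and the final stage only selects the best output rather than merging several.

```python
import asyncio


async def run_engine(name: str, pdf_path: str) -> tuple:
    """Stand-in for an engine call; returns canned Markdown per engine."""
    await asyncio.sleep(0)
    canned = {
        "docling": "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n",
        "pymupdf": "Title\n\na b\n1 2\n",
    }
    return name, canned[name]


def score(markdown: str, features: dict) -> int:
    """Toy quality score: reward structure the document is known to contain."""
    points = markdown.count("#")
    if features.get("has_tables"):
        points += markdown.count("|")
    return points


async def smart_convert(pdf_path: str, features: dict) -> tuple:
    # Stage 1 (analysis) is represented by the `features` argument.
    # Stage 2: run the candidate engines in parallel.
    outputs = await asyncio.gather(
        *(run_engine(name, pdf_path) for name in ("docling", "pymupdf"))
    )
    # Stage 3: keep the highest-scoring output.
    return max(outputs, key=lambda pair: score(pair[1], features))


best_engine, best_markdown = asyncio.run(
    smart_convert("paper.pdf", {"has_tables": True})
)
```

With a table-heavy document the structured output wins, which is the kind of decision the analysis stage is meant to inform.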

📡 Service management (2 tools)

Tool	Function
`get_server_metrics`	Request statistics, performance metrics, cache hit rate
`clear_cache`	Clears the in-memory cache in one call
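
To make "cache hit rate" concrete, here is a small sketch of an in-memory cache that counts hits and misses. The `MemoryCache` class and its method names are hypothetical; the project's actual cache and the fields returned by `get_server_metrics` may differ.

```python
from typing import Optional


class MemoryCache:
    """Tiny in-memory cache that counts hits and misses (illustrative only)."""

    def __init__(self) -> None:
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def set(self, key: str, value: str) -> None:
        self._store[key] = value

    def clear(self) -> None:
        # Drop cached entries but keep the counters, so metrics survive a clear.
        self._store.clear()

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = MemoryCache()
cache.get("https://example.com")               # miss: nothing cached yet
cache.set("https://example.com", "# Example")
cache.get("https://example.com")               # hit
cache.clear()
cache.get("https://example.com")               # miss again after the clear
```

Keeping the counters across `clear()` is a design choice in this sketch, so the hit rate reflects the whole session.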

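🔄 Transport modes

Mode	Use case	Recommendation
STDIO (default)	Local development, Claude Desktop	⭐⭐⭐
HTTP	Production, remote access, multiple clients	⭐⭐⭐⭐⭐
SSE	Legacy system compatibility	⭐⭐

For detailed configuration (host / port / CORS / authentication), see User guide > MCP Server configuration.

🏗️ Architecture at a glance

Tool collaboration pipeline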

graph TD
    %% ─── Data extraction layer ───
    subgraph EXTRACT["Data extraction"]
        T7["check_robots_txt<br/>Crawler compliance check"]
        T5["get_page_info<br/>Page metadata"]
        T4["extract_links<br/>Link extraction"]
        T6["extract_structured_data<br/>Structured data"]
    end

    subgraph SCRAPE["Web scraping"]
        T1["scrape_webpage<br/>Single-page scraping"]
        T2["scrape_multiple_webpages<br/>Concurrent batch"]
        T3["scrape_with_stealth<br/>Anti-detection scraping"]
    end

    subgraph FORM["Form interaction"]
        T8["fill_and_submit_form<br/>Form automation"]
    end

    %% ─── Content conversion layer ───
    subgraph MD["Markdown conversion"]
        T9["convert_webpage_to_markdown<br/>Page → MD"]
        T10["batch_convert_webpages_to_markdown<br/>Batch conversion"]
    end

    subgraph PDF["PDF conversion"]
        T11["convert_pdf_to_markdown<br/>PDF → MD"]
        T12["batch_convert_pdfs_to_markdown<br/>Batch conversion"]
    end

    %% ─── Production collaboration links ───
    T7 -->|"Compliance gate"| T1
    T7 -->|"Compliance gate"| T3
    T5 -->|"Site reconnaissance"| T4
    T4 -->|"URL list"| T2
    T1 -->|"Raw content"| T6
    T3 -->|"Stealth content"| T9
    T2 -->|"Batch content"| T10
    T8 -->|"Form response"| T6

    %% ─── Knowledge aggregation ───
    T10 --> KB(["Knowledge · Fact"])
    T11 --> KB
    T12 --> KB

Typical collaboration scenarios:

Scenario	Tool chain
Compliance-first scraping	`check_robots_txt` → `scrape_webpage` → `extract_structured_data`
Stealth collection	`check_robots_txt` → `scrape_with_stealth` → `convert_webpage_to_markdown`
Deep site exploration	`get_page_info` → `extract_links` → `scrape_multiple_webpages` → `batch_convert_webpages_to_markdown`
Form data collection	`fill_and_submit_form` → `extract_structured_data`
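
The compliance gate at the head of the first two chains can be approximated with Python's standard urllib.robotparser. This sketch shows the general idea only; it does not call the `check_robots_txt` tool, and the `is_allowed` helper and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules; a real workflow would download the site's robots.txt first.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

urls = [
    "https://example.com/products",
    "https://example.com/private/report",
]
# Only URLs that pass the gate move on to the scraping step.
to_scrape = [u for u in urls if is_allowed(ROBOTS, u)]
```

Filtering before scraping keeps disallowed paths out of every later step in the chain.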

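See the architecture design document for the complete architecture (5-layer breakdown, module dependencies, data flow).

🎯 Typical use cases

📰 News monitoring & knowledge archiving

Batch-scrape multiple news sources → extract headlines / body text / timestamps → convert to Markdown for archiving:

# Batch-scrape structured content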
result = await client.call_tool("scrape_multiple_webpages", {
    "urls": ["https://news.ycombinator.com", "https://techcrunch.com"],
    "extract_config": {"headlines": {"selector": "h1, h2", "multiple": True}},
})

# Batch-convert to Markdown for archiving
await client.call_tool("batch_convert_webpages_to_markdown", {
    "urls": ["https://news.ycombinator.com", "https://techcrunch.com"],
    "extract_main_content": True,
})

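🎓 Smart processing of academic papers

Use Smart mode to automatically process complex academic PDFs that contain formulas, tables, code, and images:

result = await client.call_tool("convert_pdf_to_markdown", {
    "pdf_source": "arxiv_paper.pdf",
    "method": "smart",              # LLM orchestrates multiple engines
})
# Returns high-quality output containing LaTeX formulas, Markdown tables, and code blocks

🛒 Structured e-commerce data collection

CSS selector mapping → scrape the product list → follow up on the detail pages in batch: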

products = await client.scrape_webpage(
    url="https://shop.example.com/products",
    extract_config={
        "names":   {"selector": ".product-name", "multiple": True},
        "prices":  {"selector": ".price",       "multiple": True},
        "links":   {"selector": ".product-card a[href]", "attr": "href", "multiple": True},
    },
)
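
Before the detail pages can be fetched in batch, the extracted `href` values usually need to be resolved to absolute URLs and de-duplicated. The sketch below covers only that step with the standard library; the `detail_urls` helper and the sample `hrefs` list are hypothetical, and the actual shape of the `scrape_webpage` result is not shown in the source.

```python
from urllib.parse import urljoin


def detail_urls(base_url: str, hrefs: list) -> list:
    """Resolve hrefs against the listing URL and drop duplicates, keeping order."""
    resolved = (urljoin(base_url, href) for href in hrefs)
    # dict.fromkeys preserves first-seen order while removing repeats.
    return list(dict.fromkeys(resolved))


# Sample hrefs as they might appear in a product listing (illustrative values).
hrefs = ["/product/1", "product/2", "https://shop.example.com/product/1"]
urls = detail_urls("https://shop.example.com/products", hrefs)
```

The resulting list could then be passed as the `urls` argument of a batch tool such as `scrape_multiple_webpages`.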

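📚 Documentation

Document	Audience	Summary
User guide	All users	MCP configuration, details of all 14 tools, API reference, FAQ
Architecture design	Architects / contributors	5-layer architecture, engine design, module dependencies
Development guide	Developers / QA	Environment setup, testing, coding conventions, release process
User guide > MCP Server configuration	Operations / developers	Four-layer YAML configuration, environment variable cheat sheet
Version milestones	All users	Version history and change log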

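🤝 Contributing

Feedback via Issues and improvements via Pull Requests are welcome.

Before contributing, please read the development guide to learn the code conventions and submission process.

📄 License

⚠️ Ethics reminder: technology itself is neutral, but the choices of its users define its value. Please use this tool responsibly: follow each site's robots.txt rules, respect intellectual property, and keep request rates reasonable.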


Release history

0.2.0a3 pre-release, Apr 20, 2026

0.2.0a1 pre-release, Apr 11, 2026 (this version)

Download files


Source Distribution

negentropy_perceives-0.2.0a1.tar.gz (189.7 kB)

Uploaded Apr 11, 2026 Source

Built Distribution


negentropy_perceives-0.2.0a1-py3-none-any.whl (233.4 kB)

Uploaded Apr 11, 2026 Python 3

File details

Details for the file negentropy_perceives-0.2.0a1.tar.gz.

File metadata

Download URL: negentropy_perceives-0.2.0a1.tar.gz
Upload date: Apr 11, 2026
Size: 189.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for negentropy_perceives-0.2.0a1.tar.gz
Algorithm	Hash digest
SHA256	`11a77a332a45e546c5494b465d58e59ca29e9bb57c7e992afefc286ed60f9778`
MD5	`055fd4eaf525362959f8b46629fa8bf6`
BLAKE2b-256	`c6880a4c483f3f83239ad04edee29720f64a75b4fca1697f744599a72abbf6dd`


Provenance

The following attestation bundles were made for negentropy_perceives-0.2.0a1.tar.gz:

Publisher: release.yml on ThreeFish-AI/negentropy-perceives

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: negentropy_perceives-0.2.0a1.tar.gz
- Subject digest: 11a77a332a45e546c5494b465d58e59ca29e9bb57c7e992afefc286ed60f9778
- Sigstore transparency entry: 1276945814
- Sigstore integration time: Apr 11, 2026
Source repository:
- Permalink: ThreeFish-AI/negentropy-perceives@5509a6b600b21651b69343f560a00cbaa32d93b6
- Branch / Tag: refs/tags/v0.2.0a1
- Owner: https://github.com/ThreeFish-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@5509a6b600b21651b69343f560a00cbaa32d93b6
- Trigger Event: push

File details

Details for the file negentropy_perceives-0.2.0a1-py3-none-any.whl.

File metadata

Download URL: negentropy_perceives-0.2.0a1-py3-none-any.whl
Upload date: Apr 11, 2026
Size: 233.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for negentropy_perceives-0.2.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ecf28a0edcd7002e0583632081cc429c35a145137f6d5f6b114be8f36a477a6a`
MD5	`79855ab7c50d3e56ca3516e97c22b703`
BLAKE2b-256	`f9b1d8611e07f253782c5681adc34d90b3b788c463fac5fc8f399bb4c4633456`


Provenance

The following attestation bundles were made for negentropy_perceives-0.2.0a1-py3-none-any.whl:

Publisher: release.yml on ThreeFish-AI/negentropy-perceives

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: negentropy_perceives-0.2.0a1-py3-none-any.whl
- Subject digest: ecf28a0edcd7002e0583632081cc429c35a145137f6d5f6b114be8f36a477a6a
- Sigstore transparency entry: 1276945827
- Sigstore integration time: Apr 11, 2026
Source repository:
- Permalink: ThreeFish-AI/negentropy-perceives@5509a6b600b21651b69343f560a00cbaa32d93b6
- Branch / Tag: refs/tags/v0.2.0a1
- Owner: https://github.com/ThreeFish-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@5509a6b600b21651b69343f560a00cbaa32d93b6
- Trigger Event: push
