Search the web, rank results, fetch any page content.

SkySearch

A lightweight search engine built on Bing search and BM25 ranking, with support for fetching dynamic pages.

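Features

  • Bing search: SessionPage with low-level TLS fingerprint spoofing to get past anti-scraping measures
  • Dynamic pages: parallel fetching across multiple Chromium tabs
  • BM25 ranking: jieba Chinese word segmentation + rank-bm25 relevance ranking
  • Multiple output modes: text / info / raw
  • CLI tool: accepts command-line arguments, and can also be used as a Python library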
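
To make the BM25 ranking step concrete, here is a minimal, dependency-free sketch. It is not SkySearch's own code: the package lists jieba for tokenization and rank-bm25 (BM25Okapi) for scoring, whereas this sketch substitutes a whitespace tokenizer and a hand-written Okapi BM25 with a common smoothed IDF and the conventional defaults k1=1.5, b=0.75. The function names `tokenize`, `bm25_scores`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for jieba (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, docs):
    # Return (score, doc) pairs, highest score first.
    pairs = zip(bm25_scores(query, docs), docs)
    return sorted(pairs, key=lambda p: p[0], reverse=True)
```

Calling `rank("python", docs)` on a list of snippets returns them ordered by relevance; documents that contain none of the query terms score 0.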

Installation

pip install skysearch

Command-line usage

Search mode

# Interactive input
skysearch

# Search for a given keyword
skysearch "深度学习框架"

# Set the number of results
skysearch -n 20 "Python教程"

# Keep the browser open
skysearch -n 20 "关键词" --keep

URL fetch mode

# Default: output plain text
skysearch --url https://example.com

# Choose the output mode
skysearch --url https://example.com --mode text   # plain text (default)
skysearch --url https://example.com --mode info    # structured information
skysearch --url https://example.com --mode raw     # raw HTML

# Keep the browser open
skysearch --url https://example.com --keep
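
The flags shown above could be parsed along the following lines. This is only a sketch of an equivalent argparse setup, not the contents of the package's cli.py; `build_parser` is a hypothetical name, and the default of 10 for `-n` is an assumption borrowed from the `search()` signature.

```python
import argparse


def build_parser():
    """Build a parser that accepts the documented skysearch flags."""
    parser = argparse.ArgumentParser(prog="skysearch")
    # Optional positional: when omitted, the tool would prompt interactively.
    parser.add_argument("query", nargs="?", help="search keywords")
    parser.add_argument("-n", dest="num", type=int, default=10,
                        help="number of results (default is an assumption)")
    parser.add_argument("--url", help="fetch this URL instead of searching")
    parser.add_argument("--mode", choices=["text", "info", "raw"],
                        default="text", help="output mode for --url")
    parser.add_argument("--keep", action="store_true",
                        help="keep the browser open")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Making `query` optional with `nargs="?"` lets one parser cover the interactive, keyword, and `--url` invocations.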

Library usage

import skysearch

# Search
results = skysearch.search("深度学习", num=10)
# [{'title': ..., 'url': ..., 'score': 12.5, 'snippet': ...}, ...]

# Fetch a URL
text = skysearch.fetch("https://example.com")
info = skysearch.fetch("https://example.com", mode='info')
raw = skysearch.fetch("https://example.com", mode='raw')

# Standalone functions
links = skysearch.fetch_links("https://example.com")
info_dict = skysearch.fetch_info("https://example.com")
raw_dict = skysearch.fetch_raw("https://example.com")

# Search and fetch in one call
results = skysearch.search_and_fetch("关键词", mode='info')
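
As a small illustration of consuming the search results, the sketch below filters and formats a list of dictionaries in the shape shown in the comment above (title, url, score, snippet). It runs on hand-written sample data rather than a live call, and `format_results` and `min_score` are hypothetical names, not part of the skysearch API.

```python
def format_results(results, min_score=0.0):
    """Render search results as numbered lines, best score first."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [
        f"{i}. [{r['score']:.1f}] {r['title']} <{r['url']}>"
        for i, r in enumerate(kept, 1)
    ]


# Sample data in the documented shape; a real call would be skysearch.search(...).
sample = [
    {"title": "A", "url": "https://example.com/a", "score": 3.2, "snippet": "..."},
    {"title": "B", "url": "https://example.com/b", "score": 12.5, "snippet": "..."},
]
for line in format_results(sample, min_score=1.0):
    print(line)
```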

API parameters

search(query, num=10, verbose=False, keep=False, tuple_format=False)

Parameter     Description               Default
query         Search keywords           -
num           Number of results         10
verbose       Print detailed progress   False
keep          Keep the browser open     False
tuple_format  Return results as tuples  False

fetch(url, mode='text', keep=False, timeout=10, retry=2)

Parameter  Description                         Default
url        Page URL                            -
mode       Output mode: text, info, or raw     text
keep       Keep the browser open               False
timeout    Request timeout in seconds          10
retry      Number of retries                   2
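
One way to picture how `timeout` and `retry` interact is the sketch below. It is not the library's implementation. The source does not say whether `retry=2` means two extra attempts or two attempts in total; this sketch assumes the former (up to 1 + retry calls). `fetch_with_retry`, `fetch_once`, and `delay` are hypothetical names.

```python
import time


def fetch_with_retry(fetch_once, url, timeout=10, retry=2, delay=0.0):
    """Call fetch_once(url, timeout) up to 1 + retry times (assumed semantics)."""
    retry = max(retry, 0)  # guard against negative values
    last_error = None
    for attempt in range(1 + retry):
        try:
            return fetch_once(url, timeout)
        except Exception as exc:  # a real implementation would catch narrower errors
            last_error = exc
            if attempt < retry:
                time.sleep(delay)  # optional pause before the next attempt
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Passing the single-attempt function in as `fetch_once` keeps the retry policy separate from the transport, so the same wrapper works for both session-based and browser-based fetching.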

search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False)

Combined search and fetch: returns a list whose entries contain both the result information and the fetched content.

Output modes

Mode  Description                                      Use case
text  Plain-text body                                  Human reading
info  Structured JSON (url, title, text, links, meta)  Data analysis / agents
raw   Raw HTML                                         Deep parsing
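
The difference between the three modes can be sketched with the standard library alone. This is an illustration only: SkySearch lists BeautifulSoup4, lxml, and readability-lxml for parsing and body extraction, while the sketch below uses `html.parser` and does no readability-style body detection. `render` and `_PageParser` are hypothetical names, and the exact return types of the real `fetch()` are not confirmed by the source.

```python
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collect title, visible text, links, and <meta> name/content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self.meta = {}
        self._stack = []  # currently open title/script/style tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "script", "style"):
            self._stack.append(tag)
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current == "title":
            self.title += data.strip()
        elif current is None and data.strip():
            # Skip script/style content; keep everything else as visible text.
            self.text_parts.append(data.strip())


def render(url, html, mode="text"):
    """Return the page in one of the three documented modes."""
    if mode == "raw":
        return html
    parser = _PageParser()
    parser.feed(html)
    text = "\n".join(parser.text_parts)
    if mode == "text":
        return text
    if mode == "info":
        return {"url": url, "title": parser.title, "text": text,
                "links": parser.links, "meta": parser.meta}
    raise ValueError(f"unknown mode: {mode!r}")
```

The `info` dictionary uses the same five keys the table lists, which makes it straightforward to serialize with `json.dumps` for downstream tools.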

Tech stack

Module                     Technology
HTTP requests              DrissionPage (SessionPage)
Dynamic rendering          DrissionPage (ChromiumPage)
HTML parsing               BeautifulSoup4 + lxml
Body extraction            readability-lxml
Chinese word segmentation  jieba
Ranking algorithm          rank-bm25 (BM25Okapi)

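Project structure

src/skysearch/
├── __init__.py       # Library entry point; exports the full API
├── cli.py            # Command-line entry point
├── search.py         # Bing search
├── ranker.py         # BM25 ranking
├── api.py            # Simplified API interface
└── fetcher/          # Page-fetching package
    ├── __init__.py
    ├── core.py       # Core functions
    ├── session.py    # Session management
    └── parser.py     # HTML parsing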

License

MIT
