quickly build your crawler
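Introduction
Bricks is a modular, event-driven Python crawler framework designed to make building crawlers as simple and fun as stacking building blocks. It offers several levels of development experience, from pure code to zero-code configuration, so beginners can get started quickly and experts keep fine-grained control.
Whether the job is a simple single-page scrape, a multi-step chain of requests, or large-scale distributed crawling, Bricks handles it with one consistent programming model.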
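✨ Core Features
| Feature | Description |
|---|---|
| Event-driven architecture | Register event hooks at lifecycle points such as before/after a request and before/after storage; extend behavior without modifying core logic |
| Three spider base classes | air (pure code), form (custom-flow configuration), template (fixed-flow configuration); choose by complexity |
| Rich parsers | Built-in json / xpath / jsonpath / regex parsing; declaring rules is enough to extract data |
| Multiple downloaders | curl-cffi by default; requests / httpx / playwright / pycurl and others are optional; custom downloaders are supported |
| Elastic scheduler | Scalable thread pool; sync and async tasks are scheduled uniformly; the number of Workers adjusts automatically to the task load |
| Multiple task queues | Built-in Local (single-machine) and Redis (distributed) queues behind a unified interface; custom extensions are supported |
| Spider as an API | Built-in rpc mode turns a spider into a remotely callable API in one step |
| Proxy management | Built-in proxy managers such as ApiProxy / RedisProxy / ClashProxy, with automatic rotation and threshold-based recycling |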
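The event-driven row above can be illustrated with a small, framework-independent sketch. It does not use Bricks itself: `EventRegistry`, the event names, and the dict-based context are all hypothetical stand-ins, chosen only to show how hooks can wrap core logic without changing it.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical event names; Bricks defines its own constants in `bricks.const`.
BEFORE_REQUEST = "before_request"
AFTER_REQUEST = "after_request"


class EventRegistry:
    """Toy registry mapping an event name to an ordered list of hooks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Callable[[dict], Any]]] = defaultdict(list)

    def on(self, event: str) -> Callable:
        """Decorator that registers a function as a hook for `event`."""
        def decorator(func):
            self._hooks[event].append(func)
            return func
        return decorator

    def trigger(self, event: str, context: dict) -> None:
        """Call every hook registered for `event`, in registration order."""
        for hook in self._hooks[event]:
            hook(context)


registry = EventRegistry()


@registry.on(BEFORE_REQUEST)
def add_user_agent(context: dict) -> None:
    context["headers"]["User-Agent"] = "demo-agent"


@registry.on(AFTER_REQUEST)
def record_status(context: dict) -> None:
    context["log"].append(context["status"])


def fetch(context: dict) -> None:
    """Core logic: it knows nothing about the hooks attached around it."""
    registry.trigger(BEFORE_REQUEST, context)
    context["status"] = 200  # stand-in for a real HTTP call
    registry.trigger(AFTER_REQUEST, context)


ctx = {"headers": {}, "log": []}
fetch(ctx)
print(ctx)
```

New behavior is added by registering another hook; `fetch` stays untouched.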
🚀 Quick Start
Installation
# Install the stable release
pip install -U bricks-py
# Install the latest development version
pip install -U git+https://github.com/KKKKKKKEM/bricks.git
# Install the test (beta) release
pip install -i https://test.pypi.org/simple/ -U bricks-py
Optional downloader dependencies:
pip install "bricks-py[requests]" # requests downloader
pip install "bricks-py[httpx]" # httpx downloader
pip install "bricks-py[playwright]" # playwright downloader
Minimal example (air spider)
from bricks import Request, const
from bricks.core import events, signals
from bricks.spider import air
from bricks.spider.air import Context


class MySpider(air.Spider):
    def make_seeds(self, context: Context, **kwargs):
        # Return the list of seeds to crawl
        return [{"page": 1}, {"page": 2}, {"page": 3}]

    def make_request(self, context: Context) -> Request:
        seeds = context.seeds
        return Request(
            url="https://api.example.com/list",
            params={"page": seeds["page"]},
        )

    def parse(self, context: Context):
        return context.response.extract(
            engine="json",
            rules={"data.list": {"id": "id", "name": "name"}},
        )

    def item_pipeline(self, context: Context):
        print(context.items)
        context.success()  # Mark the seed as finished

    @staticmethod
    @events.on(const.AFTER_REQUEST)
    def check_response(context: Context):
        if context.response.get("code") != 0:
            raise signals.Retry  # Trigger a retry


if __name__ == "__main__":
    spider = MySpider()
    spider.run()
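The `rules` argument in `parse` reads as a mapping from a list path to the fields wanted from each element of that list. A minimal plain-Python sketch of that idea follows, assuming dotted paths and flat field names. `get_path` and `extract` are hypothetical helpers written for this illustration; they are not part of Bricks, whose json engine may accept richer expressions.

```python
from typing import Any, Dict, List


def get_path(data: Any, path: str) -> Any:
    """Follow a dotted path such as "data.list" through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def extract(payload: dict, rules: Dict[str, Dict[str, str]]) -> List[dict]:
    """Apply {list_path: {output_field: source_path}} rules to a payload."""
    items = []
    for list_path, fields in rules.items():
        for row in get_path(payload, list_path):
            items.append({out: get_path(row, src) for out, src in fields.items()})
    return items


# Made-up response body shaped like the one the example above expects.
payload = {
    "code": 0,
    "data": {
        "list": [
            {"id": 1, "name": "alpha", "extra": "ignored"},
            {"id": 2, "name": "beta", "extra": "ignored"},
        ]
    },
}

items = extract(payload, {"data.list": {"id": "id", "name": "name"}})
print(items)  # [{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]
```

Only the declared fields survive, which is why the `extra` key is absent from the result.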
Configuration-based example (form spider)
from bricks.spider import form


class MySpider(form.Spider):
    @property
    def config(self) -> form.Config:
        return form.Config(
            init=[form.Init(func=lambda: {"page": 1})],
            spider=[
                form.Download(
                    url="https://api.example.com/list",
                    params={"page": "{page}"},
                ),
                form.Parse(
                    func="json",
                    kwargs={"rules": {"data.list": {"id": "id", "name": "name"}}},
                ),
                form.Pipeline(
                    func=lambda context: print(context.items),
                    success=True,
                ),
            ],
        )


if __name__ == "__main__":
    MySpider().run()
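In the form example, `params={"page": "{page}"}` suggests that placeholders are filled in from the current seed. The sketch below shows one way such substitution could work with `str.format`. `render` is a hypothetical helper, not a Bricks function, and how Bricks actually renders values (for example, whether it preserves the original type) is not confirmed here.

```python
from typing import Any


def render(template: Any, seed: dict) -> Any:
    """Recursively fill "{name}" placeholders in strings from the seed."""
    if isinstance(template, str):
        return template.format(**seed)
    if isinstance(template, dict):
        return {key: render(value, seed) for key, value in template.items()}
    if isinstance(template, list):
        return [render(value, seed) for value in template]
    return template  # numbers, None, etc. pass through unchanged


# Request description mirroring the Download step above.
download_config = {
    "url": "https://api.example.com/list",
    "params": {"page": "{page}"},
}

rendered = render(download_config, {"page": 1})
print(rendered)
```

With plain `str.format`, the rendered page is the string `"1"` rather than the integer `1`, and strings containing literal braces would need escaping; a real implementation would have to decide how to handle both cases.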
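📖 Documentation
| Document | Description |
|---|---|
| Quick start | Learn the core concepts and usage of Bricks in 5 minutes |
| Spider base classes | Detailed guide to the three spider types: air / form / template |
| Event system | Lifecycle hooks, event registration, and the triggering mechanism |
| Parsers | JSON / XPath / JSONPath / Regex parsing rules in detail |
| Downloaders | Using the available downloaders and writing custom ones |
| Task queues | Local / Redis queues and distributed crawling |
| Proxy management | Proxy pool configuration and automatic rotation strategies |
| Signals | Control signals such as Retry / Success / Failure |
| RPC mode | Exposing a spider as a remote API |
| Storage plugins | Built-in SQLite / MongoDB / Redis / CSV storage |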
🏗️ Architecture Overview
bricks/
├── spider/ # Spider base classes
│ ├── air.py # Pure-code spider
│ ├── form.py # Custom-flow configuration spider
│ └── template.py # Fixed-flow configuration spider
├── core/ # Core mechanisms
│ ├── context.py # Context / flow control
│ ├── events.py # Event manager
│ ├── genesis.py # Base classes Chaos / Pangu
│ ├── dispatch.py # Scheduler
│ └── signals.py # Control signals
├── downloader/ # Downloaders
├── lib/ # Base library (Request / Response / Queue / Proxy, etc.)
├── plugins/ # Built-in plugins (storage / scripts, etc.)
└── rpc/ # RPC mode
Spider lifecycle:
make_seeds → [BEFORE_PUT_SEEDS] → put_seeds → [AFTER_PUT_SEEDS]
↓
[BEFORE_GET_SEEDS] → get_seeds → [AFTER_GET_SEEDS]
↓
[BEFORE_MAKE_REQUEST] → make_request → [AFTER_MAKE_REQUEST]
↓
[BEFORE_REQUEST] → on_request → [AFTER_REQUEST]
↓
on_response (parse)
↓
[BEFORE_PIPELINE] → on_pipeline → [AFTER_PIPELINE]
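The lifecycle above, combined with a retry signal, can be approximated in a few lines of plain Python. This is a simplified sketch and not Bricks's scheduler: the `Retry` class, the `run` function, and the `max_retries` parameter are hypothetical, and the stages are reduced to make_request, download, a post-request check, and a pipeline.

```python
from collections import deque


class Retry(Exception):
    """Hypothetical control signal: put the current seed back for another try."""


def run(seeds, make_request, download, check, pipeline, max_retries=3):
    """Process seeds one by one, re-queueing a seed when `check` raises Retry."""
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        seed, attempt = queue.popleft()
        request = make_request(seed)
        response = download(request)
        try:
            check(response)  # plays the role of an AFTER_REQUEST hook
        except Retry:
            if attempt + 1 < max_retries:
                queue.append((seed, attempt + 1))
            continue  # skip the pipeline for this attempt
        pipeline(seed, response)


calls = {}


def download(request):
    """Fake downloader: page 2 returns an error code on its first attempt only."""
    page = request["page"]
    calls[page] = calls.get(page, 0) + 1
    failed = page == 2 and calls[page] == 1
    return {"code": 1 if failed else 0, "page": page}


def check(response):
    if response["code"] != 0:
        raise Retry


done = []
run(
    seeds=[{"page": 1}, {"page": 2}, {"page": 3}],
    make_request=lambda seed: {"page": seed["page"]},
    download=download,
    check=check,
    pipeline=lambda seed, response: done.append(seed["page"]),
)
print(done)  # [1, 3, 2]
```

Page 2 finishes last because its first attempt is rejected and the seed is appended to the back of the queue.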
🤝 Contributing
Issues and pull requests are welcome.
📄 License
MIT © Kem