Best-effort crawler monitor SDK for monitor-system-v2 ingest events


crawler-monitor-sdk

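crawler-monitor-sdk is the monitor-system-v2 reporting SDK for crawler programs.

It is designed to report on a best-effort basis without affecting the crawler's business logic. The SDK generates the execution_id for the caller, builds events, validates them locally, and sends them to the monitor v2 ingest API. When a report fails, times out, receives a non-2xx response, or fails payload validation, the SDK only logs the problem and drops the event; it never raises an exception back into the crawler.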

Installation

Once the package is published to PyPI or a private Python package index:

uv add crawler-monitor-sdk

For a local pilot, before the package is published:

uv add ../crawler-monitor-sdk

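Environment variables

Name	Required	Default	Description
`CRAWLER_MONITOR_PLATFORM`	Yes	None	Platform identifier in monitor v2, for example `mercari`
`CRAWLER_MONITOR_INGEST_URL`	No	None	Full ingest URL, for example `http://crawler-monitor-v2/v2/ingest/events`
`CRAWLER_MONITOR_TIMEOUT_SECONDS`	No	`3`	Timeout for each HTTP report, in seconds

If CRAWLER_MONITOR_INGEST_URL is not set, the SDK still generates and returns an execution_id, but it does not send any HTTP request.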
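
As a rough illustration of how these three variables might be read without ever raising, here is a minimal sketch. The names `MonitorConfig` and `load_config` are hypothetical and not part of the SDK, and falling back to the documented default of 3 seconds when the timeout is malformed is an assumption; the README only states that a malformed timeout does not raise.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 3.0  # documented default for CRAWLER_MONITOR_TIMEOUT_SECONDS


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical container for the documented environment variables."""
    platform: Optional[str]
    ingest_url: Optional[str]
    timeout_seconds: float


def load_config(env: Mapping[str, str] = os.environ) -> MonitorConfig:
    """Read the configuration; never raises on missing or malformed values."""
    raw_timeout = env.get("CRAWLER_MONITOR_TIMEOUT_SECONDS", "")
    try:
        timeout = float(raw_timeout)
        # Reject zero, negative, NaN, and infinite values.
        if not (0 < timeout < float("inf")):
            raise ValueError("timeout must be a positive finite number")
    except ValueError:
        # Assumed behaviour: fall back to the default instead of raising.
        timeout = DEFAULT_TIMEOUT_SECONDS
    return MonitorConfig(
        platform=env.get("CRAWLER_MONITOR_PLATFORM") or None,
        # A missing ingest URL means "generate IDs but send nothing".
        ingest_url=env.get("CRAWLER_MONITOR_INGEST_URL") or None,
        timeout_seconds=timeout,
    )
```

Passing the mapping as a parameter keeps the sketch easy to exercise without touching the real process environment.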

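Design principles

Best-effort reporting

The SDK must never block the crawler.

None of the following raises an exception to the caller:

Local payload validation fails
The HTTP request fails
The HTTP request times out
The monitor backend returns a non-2xx response
The timeout environment variable is malformed

If the caller passes a logger, the SDK makes a best effort to write a warning or exception log entry so the problem can be investigated.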
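
The failure cases listed above can be sketched as a single send function that swallows every error. This is an illustration under stated assumptions, not the SDK's actual code: `send_best_effort` and `http_post` are hypothetical names, and the use of `urllib.request` as the transport is assumed.

```python
import json
import logging
import urllib.request
from typing import Any, Callable, Dict, Optional


def http_post(url: str, body: bytes, timeout: float) -> int:
    """POST JSON bytes and return the HTTP status code (assumed transport)."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def send_best_effort(
    event: Dict[str, Any],
    url: Optional[str],
    timeout: float = 3.0,
    logger: Optional[logging.Logger] = None,
    post: Callable[[str, bytes, float], int] = http_post,
) -> bool:
    """Try to deliver one event; return False instead of raising on any failure."""
    if not url:
        return False  # no ingest URL configured: nothing is sent
    try:
        body = json.dumps(event).encode("utf-8")
        status = post(url, body, timeout)
    except Exception:
        # Covers serialization errors, connection failures, and timeouts.
        if logger is not None:
            logger.exception("monitor event dropped: send failed")
        return False
    if not 200 <= status < 300:
        if logger is not None:
            logger.warning("monitor event dropped: HTTP status %s", status)
        return False
    return True
```

The `post` parameter exists only so the sketch can be exercised without a network; the caller sees a boolean at most and never an exception.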

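Generate the execution_id first

start_run() generates the execution_id first and only then attempts to report run_start; the ID is returned either way.

Even if the run_start report fails, the caller can keep using the same execution_id to report later error or run_complete events. Once the monitor backend recovers, it still has a chance to correlate the events that belong to the same crawler run.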
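
The ordering described above might look like the following sketch. `start_run_sketch` and its `send` parameter are hypothetical, and generating the execution_id as a UUID4 string is an assumption; the README only states that event_id is a UUID.

```python
import uuid
from typing import Any, Callable, Dict


def start_run_sketch(service_name: str, send: Callable[[Dict[str, Any]], None]) -> str:
    """Create the run ID before any network I/O, then try to report run_start."""
    execution_id = str(uuid.uuid4())  # assumed ID format
    try:
        send(
            {
                "event_type": "run_start",
                "service_name": service_name,
                "execution_id": execution_id,
            }
        )
    except Exception:
        # A failed run_start must not cost the caller its execution_id.
        pass
    return execution_id
```

Because the ID exists before the send is attempted, a transport failure cannot leave the caller without an identifier for later events.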

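The SDK does not cache run state

The SDK keeps no in-process cache such as service_name -> execution_id.

The caller must store the execution_id returned by start_run() and pass it explicitly to heartbeat(), report_error(), and complete_run(). This makes the ownership of each event visible in the code, and it prevents concurrent runs of the same service name within one process from overwriting each other.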
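
One way a caller could hold on to the ID is a small immutable handle per run. This is a caller-side sketch only: `RunHandle` and `begin_run` are hypothetical names, and the `start_run` argument stands in for `crawler_monitor_sdk.start_run`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunHandle:
    """Caller-owned record of one run; hypothetical, not part of the SDK."""
    service_name: str
    execution_id: str


def begin_run(service_name: str, start_run: Callable[[str], str]) -> RunHandle:
    """Start a run and bind its execution_id to the service name."""
    # start_run stands in for crawler_monitor_sdk.start_run.
    return RunHandle(service_name, start_run(service_name))
```

Two concurrent runs of the same service then carry two distinct handles, so neither can overwrite the other's execution_id.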

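The SDK does not judge business success or failure

The SDK only accepts and validates the final status value:

success
failed
partial_success

Exactly when a run counts as success, failure, or partial success is decided by the caller or the crawler adapter from its business context. For example, the Mercari pilot computes the final status from Scrapy's close reason, the number of items scraped, the number of items stored, and whether an empty result is allowed.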
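
A caller-side decision function could be sketched as follows. The rules here are purely illustrative and are not the Mercari pilot's actual logic; `decide_status` and its `allow_empty` parameter are hypothetical. The only external fact relied on is that Scrapy reports the close reason `finished` for a normal shutdown.

```python
def decide_status(
    reason: str,
    raw_count: int,
    storage_count: int,
    allow_empty: bool = False,
) -> str:
    """Map a run summary to success / failed / partial_success (assumed rules)."""
    if reason != "finished":
        return "failed"  # the spider was stopped abnormally
    if raw_count == 0:
        # An empty crawl is acceptable only when the caller says so.
        return "success" if allow_empty else "failed"
    if storage_count == 0:
        return "failed"  # scraped items but stored none
    if storage_count < raw_count:
        return "partial_success"  # some items were lost before storage
    return "success"
```

The returned string can be passed directly as the status argument of complete_run().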

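API

Start a run

import logging

import crawler_monitor_sdk

logger = logging.getLogger(__name__)

# start_run() returns the execution_id even if the run_start report fails.
execution_id = crawler_monitor_sdk.start_run(
    "pokemon.crawler.mercari.sold",
    logger=logger,
)

This builds and sends a run_start event and returns the execution_id for this run.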

Complete a run

crawler_monitor_sdk.complete_run(
    "pokemon.crawler.mercari.sold",
    execution_id,
    "success",
    raw_count=100,
    storage_count=100,
    duration_seconds=92.4,
    details={"reason": "finished"},
    logger=logger,
)

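This builds and sends a run_complete event.

Parameters:

Parameter	Description
`service_name`	Crawler service name, for example `pokemon.crawler.mercari.sold`
`execution_id`	ID of this run; must be the value returned by `start_run()` or the same run ID generated by the caller
`status`	Final status; must be one of `success`, `failed`, `partial_success`
`raw_count`	Number of raw items scraped; non-negative integer
`storage_count`	Number of items successfully stored; non-negative integer
`duration_seconds`	Duration of this run; non-negative number
`details`	Additional crawler-side information; must be JSON-serializable
`logger`	Optional logger used to record the SDK's internal warnings and exceptions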

Report an error

crawler_monitor_sdk.report_error(
    "pokemon.crawler.mercari.sold",
    execution_id,
    "request timed out",
    target_url="https://api.mercari.jp/items/get?id=123",
    details={"operation": "fetch_detail", "item_id": "123"},
    logger=logger,
)

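This builds and sends an error event.

An error event does not end the run automatically. The caller should still call complete_run() when the crawler finishes and decide the final status from the overall result of the run.

Manual heartbeat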

crawler_monitor_sdk.heartbeat(
    "pokemon.crawler.mercari.sold",
    execution_id,
    details={"page": 12},
    logger=logger,
)

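heartbeat() sends a run_heartbeat event.

The current Mercari MVP does not enable scheduled heartbeats. Whether the SDK, the Scrapy adapter, the scheduler, or a standalone watcher should own periodic heartbeats remains an open decision tracked in Todo.md.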

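Event fields

The SDK fills in these fields automatically:

Field	Source
`event_id`	UUID generated by the SDK
`execution_id`	Generated by `start_run()`, or passed explicitly by the caller
`service_name`	Passed by the caller
`platform`	`CRAWLER_MONITOR_PLATFORM`
`event_type`	Set by the SDK according to the API that was called
`host`	Hostname of the current machine
`pid`	PID of the current process
`timestamp`	UTC ISO time, such as `2026-06-10T00:00:00.000000Z`

Each event type also carries its own fields:

Event type	Extra fields
`run_start`	None
`run_heartbeat`	`details`
`run_complete`	`status`, `raw_count`, `storage_count`, `duration_seconds`, `details`
`error`	`error_message`, `target_url`, `details`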
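
The two tables above can be sketched as a small event builder. `build_event` and `utc_timestamp` are hypothetical helper names, and omitting extra fields whose value is None is an assumption made for the sketch; the field names and the timestamp shape come from the tables.

```python
import os
import socket
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def utc_timestamp() -> str:
    """UTC time with microseconds and a trailing Z, e.g. 2026-06-10T00:00:00.000000Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def build_event(
    event_type: str,
    service_name: str,
    execution_id: str,
    platform: str,
    **extra: Any,
) -> Dict[str, Any]:
    """Assemble the common fields, then add the per-event-type extras."""
    event: Dict[str, Any] = {
        "event_id": str(uuid.uuid4()),
        "execution_id": execution_id,
        "service_name": service_name,
        "platform": platform,
        "event_type": event_type,
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "timestamp": utc_timestamp(),
    }
    # Assumed behaviour: extras left as None are not sent at all.
    event.update({key: value for key, value in extra.items() if value is not None})
    return event
```

For instance, an error event would be built with `build_event("error", name, execution_id, platform, error_message=..., target_url=..., details=...)`.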

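Local validation

The SDK validates each event locally before sending it:

Required strings must be non-empty: event_id, execution_id, service_name, platform, event_type, timestamp
run_start and run_heartbeat must not carry a status
The status of run_complete must be one of success, failed, partial_success
error must carry a non-empty error_message
raw_count, storage_count, and pid must be non-negative integers
duration_seconds must be a non-negative number
target_url and host must be strings
details must be JSON-serializable

When validation fails, the event is dropped; if a logger was passed, the SDK writes a warning.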
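
The rules above could be expressed as a validator that returns a list of problems rather than raising. This is a sketch, not the SDK's implementation: `validate_event` is a hypothetical name, and rejecting booleans where integers or numbers are expected is an assumption.

```python
import json
from numbers import Real
from typing import Any, Dict, List

REQUIRED_STRINGS = (
    "event_id", "execution_id", "service_name", "platform", "event_type", "timestamp",
)
FINAL_STATUSES = {"success", "failed", "partial_success"}


def _non_negative_int(value: Any) -> bool:
    # bool is a subclass of int; excluding it is an assumption of this sketch.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event may be sent."""
    problems: List[str] = []
    for name in REQUIRED_STRINGS:
        value = event.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name} must be a non-empty string")
    event_type = event.get("event_type")
    if event_type in ("run_start", "run_heartbeat") and "status" in event:
        problems.append(f"{event_type} must not carry status")
    if event_type == "run_complete" and event.get("status") not in FINAL_STATUSES:
        problems.append("run_complete status is not a valid final status")
    if event_type == "error":
        message = event.get("error_message")
        if not isinstance(message, str) or not message.strip():
            problems.append("error must carry a non-empty error_message")
    for name in ("raw_count", "storage_count", "pid"):
        if name in event and not _non_negative_int(event[name]):
            problems.append(f"{name} must be a non-negative integer")
    if "duration_seconds" in event:
        duration = event["duration_seconds"]
        is_number = isinstance(duration, Real) and not isinstance(duration, bool)
        if not (is_number and duration >= 0):
            problems.append("duration_seconds must be a non-negative number")
    for name in ("target_url", "host"):
        if name in event and not isinstance(event[name], str):
            problems.append(f"{name} must be a string")
    if "details" in event:
        try:
            json.dumps(event["details"])
        except (TypeError, ValueError):
            problems.append("details must be JSON-serializable")
    return problems
```

A caller of this sketch would drop the event and log a warning whenever the returned list is non-empty.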

Using details

monitor-system-v2 has a fixed schema for top-level event fields. Do not add new top-level fields for crawler-specific context; put it in details instead.

For example:

details={
    "reason": "finished",
    "mode": "full",
    "item_id": "m123",
}

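Mercari pilot recommendations

For now, Mercari can keep using its existing heartbeat.py as an adapter that calls the SDK:

Call start_run() when the spider starts and save the execution_id
Call report_error() for local errors such as failed detail requests, reusing the same execution_id
When the spider closes, compute success / failed / partial_success from the run summary
Call complete_run() to report the final result

This keeps Mercari's existing entry points largely unchanged while pushing the new events to the monitor v2 ingest API.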
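
The four steps above can be sketched as a thin adapter class. `MonitorAdapter` and its method names are hypothetical and do not describe the real heartbeat.py; the `sdk` argument is assumed to be any object exposing start_run, report_error, and complete_run with the call shapes shown in the API section, such as the imported crawler_monitor_sdk module.

```python
import logging
from typing import Any, Dict, Optional


class MonitorAdapter:
    """Hypothetical adapter that keeps one execution_id for one spider run."""

    def __init__(self, sdk: Any, service_name: str,
                 logger: Optional[logging.Logger] = None) -> None:
        self.sdk = sdk
        self.service_name = service_name
        self.logger = logger
        self.execution_id: Optional[str] = None

    def on_open(self) -> None:
        # Step 1: start the run and keep the returned execution_id.
        self.execution_id = self.sdk.start_run(self.service_name, logger=self.logger)

    def on_error(self, message: str, target_url: Optional[str] = None,
                 details: Optional[Dict[str, Any]] = None) -> None:
        # Step 2: report a local error under the same execution_id.
        kwargs: Dict[str, Any] = {"logger": self.logger}
        if target_url is not None:
            kwargs["target_url"] = target_url
        if details is not None:
            kwargs["details"] = details
        self.sdk.report_error(self.service_name, self.execution_id, message, **kwargs)

    def on_close(self, status: str, raw_count: int, storage_count: int,
                 duration_seconds: float, reason: str) -> None:
        # Steps 3 and 4: the caller computes the status; the adapter reports it.
        self.sdk.complete_run(
            self.service_name,
            self.execution_id,
            status,
            raw_count=raw_count,
            storage_count=storage_count,
            duration_seconds=duration_seconds,
            details={"reason": reason},
            logger=self.logger,
        )
```

Injecting the SDK as a parameter is a design choice for this sketch: it lets the adapter be exercised with a fake object, and the status passed to on_close stays entirely under the caller's control.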

Version 0.1.0, released Jun 11, 2026
