Evaluation-Driven Development toolkit for OpenClaw agents
openclaw-edd
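A zero-friction EDD toolkit. It does not modify OpenClaw itself; log files are its only data source.
Features
- ✅ Zero config - pip install openclaw-edd && edd watch and you are running
- ✅ Zero intrusion - no changes to the OpenClaw configuration, no Gateway restart
- ✅ Zero dependencies - core features need no external libraries (PyYAML is optional)
- ✅ Full loop - watch → run → suggest → apply → diff → mine → export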
Quick start
# Install
pip install openclaw-edd
# Live observation
edd watch
# Standard EDD loop
edd run --cases cases.yaml --output-json report_v1.json
edd edd suggest --report report_v1.json > suggestion.txt
edd edd apply --suggestion-file suggestion.txt
edd run --cases cases.yaml --output-json report_v2.json
edd edd diff --before report_v1.json --after report_v2.json
# Mine golden cases from production logs
edd edd mine --output mined_cases.yaml
edd run --cases mined_cases.yaml --output-json regression.json
# Export the golden dataset
edd edd export --output golden.jsonl
edd edd export --format csv --output review.csv
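Command list
Core commands
- watch - follow the log in real time and print the tool event stream
- trace - replay a historical event chain
- state - view or modify session state
- artifacts - manage tool output files
- sessions - list or inspect past sessions
- run - run an evaluation case set
- gen-cases - generate a case template
EDD loop commands (invoked as edd edd <subcommand>)
- edd suggest - generate change suggestions from failed cases
- edd apply - apply suggestions to the workspace
- edd diff - compare the results of two runs
- edd mine - mine golden cases from historical logs
- edd judge - score tool selection and output quality with an LLM
- edd export - export the golden dataset (JSONL/CSV)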
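Detailed usage
watch - live monitoring
# Basic usage
edd watch
# Filter to a specific session
edd watch --session <session_id_prefix>
# Read from the start of the file (replays today's history)
edd watch --from-start
# Run in the background
edd watch --daemon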
kill $(cat /tmp/openclaw_edd_watch.pid)
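To illustrate what following a log file in real time involves, here is a minimal Python sketch of a tail-style reader. It is not the toolkit's implementation: the function name follow and the parameters poll_s and max_idle_polls are hypothetical, and the sketch ignores log rotation when the date changes.

```python
import os
import time


def follow(path, from_start=False, poll_s=0.2, max_idle_polls=None):
    """Yield complete lines appended to `path`, similar to `tail -f`.

    Hypothetical helper: `max_idle_polls` exists only so the loop can stop
    in scripts and tests; with None it keeps polling forever.
    """
    with open(path, encoding="utf-8") as fh:
        if not from_start:
            fh.seek(0, os.SEEK_END)  # skip existing content, wait for new lines
        idle = 0
        buffer = ""
        while max_idle_polls is None or idle < max_idle_polls:
            chunk = fh.readline()
            if not chunk:
                idle += 1
                time.sleep(poll_s)
                continue
            idle = 0
            buffer += chunk
            # A writer may flush half a line; only emit once the newline arrives.
            if buffer.endswith("\n"):
                yield buffer.rstrip("\n")
                buffer = ""
```

Passing from_start=True mirrors the idea behind --from-start: existing content is replayed before new lines are awaited.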
run - run evaluations
# Use the built-in cases
edd run
# Use custom cases
edd run --cases cases.yaml
# Filter by tags
edd run --cases cases.yaml --tags smoke,mysql
# Single case (specified on the command line)
edd run --case "Shanghai weather today" --expect-tools get_weather
# Show the detailed tool-call trace
edd run --cases cases.yaml --show-trace
# Use --local mode (ensures logs are written locally)
edd run --cases cases.yaml --agent main --local
# Write reports
edd run --output-json report.json
edd run --output-html report.html
# Dry run (sends no messages, only parses logs)
edd run --dry-run
edd suggest - generate suggestions
edd edd suggest --report report.json
edd edd suggest --report report.json --workspace ~/.openclaw/workspace
edd edd suggest --report report.json > suggestion.txt
edd diff - compare runs
edd edd diff --before report_v1.json --after report_v2.json
edd edd diff --before report_v1.json --after report_v2.json --format json
edd mine - mine cases
edd edd mine
edd edd mine --output mined_cases.yaml
edd edd mine --min-tools 2
edd judge - LLM evaluation
# Use an LLM to evaluate test results
export ANTHROPIC_API_KEY=your_key
edd edd judge --report report.json
edd edd judge --report report.json --output judged_report.json
edd edd judge --report report.json --model claude-opus-4-6
# Read the detailed documentation
cat docs/JUDGE_COMMAND.md
edd export - export the dataset
# Export JSONL
edd edd export --output golden.jsonl
# Merge a run report for a more accurate golden_output
edd run --cases cases.yaml --output-json report.json
edd edd export --merge-report report.json --output golden.jsonl
# Export CSV for manual review by domain experts
edd edd export --format csv --output review.csv
Case format
cases:
  - id: mysql_slow_query
    message: "Any slow MySQL queries recently?"
    eval_type: regression  # "regression" (guards against regressions) | "capability" (capability climb); default: regression
    expect_tools:
      - query_metrics
      - get_alerts
    expect_tools_ordered:
      - query_metrics
      - get_alerts
    expect_output_contains:
      - "slow query"
    forbidden_tools:
      - execute_sql
    expect_tool_args:  # tool argument assertions (white-box evaluation)
      query_metrics:
        time_range: "1h"  # exact match: the actual call must include this argument with an equal value
        metric: "p99_latency"  # arguments not listed here are not checked
    agent: openclaw_agent
    timeout_s: 30
    tags: [mysql, sre]
    description: "Basic check for MySQL slow-query troubleshooting"
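The assertion fields in a case can be read as checks against the list of tool calls observed for one message. The following Python sketch shows one way such checks could be evaluated. It is an illustration only, not the toolkit's code: ToolCall, is_subsequence and check_case are hypothetical names, and treating expect_tools_ordered as "in this order, gaps allowed" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One observed tool invocation (hypothetical shape, not the toolkit's type)."""
    name: str
    args: dict = field(default_factory=dict)


def is_subsequence(expected, actual):
    """True if `expected` appears in `actual` in order; other calls may sit between."""
    it = iter(actual)
    return all(name in it for name in expected)


def check_case(case, calls, output):
    """Return a list of failure messages; an empty list means the case passed."""
    names = [c.name for c in calls]
    failures = []

    for tool in case.get("expect_tools", []):
        if tool not in names:
            failures.append(f"missing tool: {tool}")

    ordered = case.get("expect_tools_ordered", [])
    if ordered and not is_subsequence(ordered, names):
        failures.append(f"tools out of order, expected: {ordered}")

    for tool in case.get("forbidden_tools", []):
        if tool in names:
            failures.append(f"forbidden tool called: {tool}")

    for text in case.get("expect_output_contains", []):
        if text not in output:
            failures.append(f"output missing: {text!r}")

    # Only the listed arguments are compared; extra arguments on a call are ignored.
    for tool, expected in case.get("expect_tool_args", {}).items():
        matched = any(
            c.name == tool
            and all(k in c.args and c.args[k] == v for k, v in expected.items())
            for c in calls
        )
        if not matched:
            failures.append(f"no {tool} call with args {expected}")

    return failures
```

With this shape, a query_metrics call that carries an extra argument still passes, while a call with a different time_range fails the argument check.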
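Eval type
- regression: guards against regressions. The pass rate starts near 100%, and any drop is an alarm signal.
- capability: tracks capability growth. The pass rate starts low, and the cases test things the agent cannot do yet.
The run report groups results by eval_type:
📊 Regression Eval (regression guard)
Passed: 8/10 (80%) ← below 100%, needs attention
FAIL: mysql_slow_query, mysql_alert_check
📈 Capability Eval (capability climb)
Passed: 3/8 (37.5%) ← expected, this is a climbing metric
PASS: mysql_basic_query ...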
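The grouping shown in the report can be sketched in a few lines of Python. The record shape used here (dicts with id, eval_type and passed) is hypothetical and the real report schema may differ; the point is only that the two groups are summarized separately and that only the regression group raises an alarm.

```python
from collections import defaultdict


def summarize(results):
    """Group per-case results by eval_type and compute pass rates.

    `results` is a hypothetical list of dicts with "id", "eval_type", "passed".
    """
    groups = defaultdict(list)
    for r in results:
        # Cases without an explicit eval_type count as "regression" (the default).
        groups[r.get("eval_type", "regression")].append(r)

    summary = {}
    for eval_type, items in groups.items():
        passed = [r["id"] for r in items if r["passed"]]
        failed = [r["id"] for r in items if not r["passed"]]
        summary[eval_type] = {
            "passed": len(passed),
            "total": len(items),
            "rate": len(passed) / len(items),
            "failed_ids": failed,
        }
    return summary


def needs_attention(summary):
    """Any regression failure is an alarm; a low capability rate is not."""
    reg = summary.get("regression")
    return reg is not None and reg["passed"] < reg["total"]
```

Feeding in 10 regression cases with 8 passes and 8 capability cases with 3 passes gives rates of 0.8 and 0.375, matching the sample report.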
Golden Dataset format
{
  "id": "50a359b5_1",
  "description": "Extracted from session 50a359b5, 2026-02-28",
  "source": "mined",
  "tags": ["mined"],
  "conversation": [
    {
      "turn": 1,
      "user": "Any slow MySQL queries recently?",
      "golden_tool_sequence": [
        {
          "name": "query_metrics",
          "args": {"metric": "p99_latency", "time_range": "1h"},
          "output_summary": "P99 latency 120ms, above threshold"
        }
      ],
      "golden_output": "MySQL slow query detected, P99 latency 120ms",
      "assert": [
        {"type": "tool_called", "value": "query_metrics"},
        {"type": "tool_args", "tool": "query_metrics", "args": {"metric": "p99_latency", "time_range": "1h"}},
        {"type": "contains", "value": "slow query"}
      ]
    }
  ],
  "metadata": {
    "session_id": "50a359b5-184f-4c73-913d-3b53ebbdf109",
    "agent": "openclaw_agent",
    "extracted_at": "2026-02-28T16:00:00",
    "skill_triggered": "skills/mysql_sre.md"
  }
}
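One practical use of this format is a self-consistency check before a dataset goes to review: every assertion in a record should hold against that record's own golden_tool_sequence and golden_output. The sketch below assumes only the three assertion types that appear in the example (tool_called, tool_args, contains); the function names are hypothetical and the toolkit may support further types.

```python
import json


def check_assertion(assertion, tool_calls, output):
    """Evaluate one assertion dict against tool calls and the final output."""
    kind = assertion["type"]
    if kind == "tool_called":
        return any(c["name"] == assertion["value"] for c in tool_calls)
    if kind == "tool_args":
        wanted = assertion["args"]
        return any(
            c["name"] == assertion["tool"]
            and all(k in c.get("args", {}) and c["args"][k] == v
                    for k, v in wanted.items())
            for c in tool_calls
        )
    if kind == "contains":
        return assertion["value"] in output
    raise ValueError(f"unknown assertion type: {kind}")


def check_record(line):
    """Check one JSONL line; return (turn, assertion type, ok) tuples."""
    record = json.loads(line)
    results = []
    for turn in record["conversation"]:
        for a in turn["assert"]:
            ok = check_assertion(a, turn["golden_tool_sequence"], turn["golden_output"])
            results.append((turn["turn"], a["type"], ok))
    return results
```

Any tuple with ok set to False points at a record whose assertions disagree with its own golden data, which is worth fixing before the record is used for regression runs.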
Data sources
- Log location: /tmp/openclaw/openclaw-YYYY-MM-DD.log
- Format: JSON Lines
- State: ~/.openclaw_eval/state/<session_id>.json
- Artifacts: ~/.openclaw_eval/artifacts/<session_id>/
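Because the log is plain JSON Lines in a dated file, it can be read with the standard library alone. The sketch below builds the dated path and yields one dict per valid line. The helper names are hypothetical, and nothing is assumed about the fields inside each event, since the log schema is not described here.

```python
import json
from datetime import date
from pathlib import Path


def log_path_for(day, log_dir="/tmp/openclaw"):
    """Build the dated log path, e.g. openclaw-2026-02-28.log."""
    return Path(log_dir) / f"openclaw-{day.isoformat()}.log"


def read_events(path):
    """Yield one dict per valid JSON line, skipping blank and malformed lines."""
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            try:
                event = json.loads(raw)
            except json.JSONDecodeError:
                continue  # e.g. a line that is still being written
            if isinstance(event, dict):
                yield event
```

A call such as read_events(log_path_for(date.today())) would iterate over today's events, provided the file exists.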
Workspace path resolution
Priority:
1. The --workspace argument
2. ~/.openclaw/openclaw.json → agents.defaults.workspace
3. Fallback: ~/.openclaw/workspace
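The priority order above can be written as a small lookup function. This is a sketch under stated assumptions, not the toolkit's code: resolve_workspace and its home parameter are hypothetical, and the config file is assumed to be plain JSON that the json module can parse.

```python
import json
from pathlib import Path


def resolve_workspace(cli_workspace=None, home=None):
    """Resolve the workspace directory using the documented priority order."""
    home = Path(home) if home is not None else Path.home()

    # 1. An explicit --workspace value wins.
    if cli_workspace:
        return Path(cli_workspace).expanduser()

    # 2. agents.defaults.workspace from ~/.openclaw/openclaw.json, if present.
    config_file = home / ".openclaw" / "openclaw.json"
    try:
        config = json.loads(config_file.read_text(encoding="utf-8"))
        configured = config["agents"]["defaults"]["workspace"]
        if isinstance(configured, str) and configured:
            return Path(configured).expanduser()
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing file, invalid JSON, or missing key: fall through

    # 3. Fallback location.
    return home / ".openclaw" / "workspace"
```

The home parameter exists only to make the function easy to exercise against a temporary directory; in normal use it defaults to the current user's home.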
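Dependencies
- No required dependencies - core features need no external libraries
- Optional dependencies:
  - PyYAML (only needed when using --cases)
  - anthropic (only needed when using edd judge)
pip install openclaw-edd[yaml] # install YAML support
pip install openclaw-edd # includes the anthropic SDK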
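Platform support
- Linux/macOS - fully supported (including daemon mode)
- Windows - everything except daemon mode is supported
CI integration
# Run the evaluation
edd run --cases cases.yaml --output-json report.json
# Check the exit code
if [ $? -ne 0 ]; then
  echo "Evaluation failed"
  exit 1
fi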
License
MIT